Introduce yourself (the charm of Zoom)
Wednesday class will be online too. Stay safe and dry!
Course GitHub organization invitation
Lab sessions
There will be recordings (when needed) in the future.
No need to submit the lab “work”.
There will be “solutions” posted (after the following Monday lecture) for future lab sessions (when there are questions).
Using R is a course objective
Try to use R as much as possible for lab sessions and homework assignments.
You are free to use any language for the course project.
Project and homework submission via GitHub
Homework assignments start in week 2. The 1st assignment is due in week 4. Expected frequency: bi-weekly.
Optional reading material will be provided on the course webpage. (No poll for Python; Python material will be provided through this mechanism.)
The GitHub page contains the most up-to-date materials.
Apologies for sending multiple locations for posting GitHub IDs.
Find a dataset of interest to you.
Turn in a brief one-page description by the end of week 3. (3/30)
Submit a mid-term report (2 - 4 pages, no more than 4 please) by the end of week 12. (7/30)
Present your work to your peers in weeks 15 and 16. (10/30)
Submit a final report (4 - 8 pages, no more than 8 please) by the end of the semester (December 5).
Submit code to your own private GitHub repository on the course GitHub organization by December 5. (Report + Code, 10/30)
(Optional) make a GitHub page for your project.
Amazon data http://jmcauley.ucsd.edu/data/amazon/, https://nijianmo.github.io/amazon/index.html, https://cseweb.ucsd.edu/~jmcauley/datasets.html
Sports/eSports prediction
Hurricane prediction!
1000 Genomes Project (human genome data)
Reproduce findings of a paper in your field (could be hard).
Google “data science projects” to get more ideas
Include the brief description with modifications if needed
Give an abstract on your plan
Current progress and future plan
Introduce the dataset. Explain why you chose it. Explain what questions you want to ask and explore using the dataset.
Analysis. Explain the statistical methods that you use for analyzing the dataset. Explain what you have done to generate the results (make your analysis reproducible).
Results. Illustrate your results. Use figures and tables to improve readability.
Discussions. This is the place to put in almost whatever you want to share: for example, difficulties you met in the analysis, what you learned from the analysis, and some future directions.
Be memory conscious when dealing with big data. E.g., the human genome has about \(3 \times 10^9\) bases, each of which belongs to {A, C, G, T}. How much storage is needed to store \(10^6\) SNPs (single nucleotide polymorphisms) of \(1000\) individuals, i.e. \(10^9\) values? As single: 4GB; double: 8GB; int32: 4GB; int16: 2GB; int8: 1GB; PLINK library format at 2 bits per SNP: 250MB.
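The storage figures above follow from multiplying the number of entries by the bytes per entry. A minimal sketch of that arithmetic in R; the variable names are illustrative, and gigabytes are taken as \(10^9\) bytes:

```r
# Storage needed for a genotype matrix: 10^6 SNPs x 1000 individuals.
n_entries <- 1e6 * 1000   # 10^9 genotype values

# Bytes per entry for each storage type (2 bits = 0.25 bytes for the packed format).
bytes_per_entry <- c(double = 8, single = 4, int32 = 4,
                     int16 = 2, int8 = 1, packed_2bit = 0.25)

# Total storage in decimal gigabytes (1 GB = 10^9 bytes).
storage_gb <- n_entries * bytes_per_entry / 1e9
print(storage_gb)

# R's own accounting on a small case: an integer vector uses about 4 bytes
# per element and a double vector about 8 (plus a small fixed header).
bytes_int <- as.numeric(object.size(integer(1e6))) / 1e6
bytes_dbl <- as.numeric(object.size(numeric(1e6))) / 1e6
print(c(integer = bytes_int, double = bytes_dbl))
```

Note that base R has no single, int16, or int8 vector types, which is one reason packed formats are handled by specialized libraries.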
Know the limit. Overflow and Underflow. For double precision, \(\pm 10 ^{\pm 308}\). In most situations, underflow is “preferred” over overflow. Overflow often causes crashes. Underflow yields zeros (which however could lead to \(0 / 0\) situations).
Example 1: in logistic regression, \(p_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)} = \frac{1}{1 + \exp(-x_i^T \beta)}\). For large \(x_i^T \beta\), the former expression can easily lead to \(\infty / \infty = \text{NaN}\), while the latter expression leads to graceful underflow.
Example 2: calculating the joint probability (likelihood) of a large number of iid (independent and identically distributed) random variables (r.v.). The product of many small numbers underflows, so work in log-space and sum the log-probabilities instead.
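Both examples can be checked directly in R. The sketch below uses made-up inputs (a linear predictor of 800, and 2000 observations all equal to 0) chosen only to trigger the numerical problems; `plogis()` and `dnorm()` are base R functions.

```r
# Limits of double precision
.Machine$double.xmax   # largest finite double, about 1.8e308
.Machine$double.xmin   # smallest normalized positive double, about 2.2e-308
exp(710)               # overflow: Inf
exp(-750)              # underflow: 0

# Example 1: two algebraically equivalent forms of the logistic function
eta <- 800                               # a large linear predictor x_i^T beta
p_naive   <- exp(eta) / (1 + exp(eta))   # Inf / Inf gives NaN
p_stable  <- 1 / (1 + exp(-eta))         # exp(-800) underflows to 0, result is 1
p_builtin <- plogis(eta)                 # base R logistic function, also 1

# Example 2: joint density of many iid standard normal observations
x <- rep(0, 2000)                        # all zeros, for reproducibility
lik_direct <- prod(dnorm(x))             # product of 2000 small numbers underflows to 0
loglik     <- sum(dnorm(x, log = TRUE))  # finite: 2000 * log(dnorm(0))

print(c(naive = p_naive, stable = p_stable, builtin = p_builtin))
print(c(direct = lik_direct, log_space = loglik))
```

Once the direct product has underflowed to 0, ratios of likelihoods become \(0 / 0\), whereas differences of log-likelihoods remain well defined.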
Compiled versus interpreted languages.
Compiled languages: C/C++, Fortran, … directly compiled to machine code that is executed by the CPU. Advantage: fast, takes less memory. Disadvantage: relatively longer development time, hard to debug.
Interpreted languages: R, MATLAB, SAS IML, … executed line by line by an interpreter. Advantage: fast for prototyping. Disadvantage: excruciatingly slow for loops.
Mixed (compiled and then interpreted by a virtual machine): Python, Java. Advantage: extremely convenient for data preprocessing and manipulation; relatively short development time. Disadvantage: not as fast as compiled languages.
Scripting: Unix/Linux scripts, Perl, Python. Extremely useful for data preprocessing and manipulation.
Database languages and tools: SQL, Hadoop. Data analysis never happens if we do not know how to retrieve data from databases.
More about computer languages
To improve the efficiency of code in interpreted languages such as R, avoid loops as much as possible. That is, vectorize the code.
For some tasks where looping is necessary (the code cannot be vectorized), consider coding in C/C++ or Fortran. It is convenient to incorporate compiled code into R.
To be versatile in dealing with big data, master at least one language in each category.
Don’t reinvent wheels. Make good use of libraries such as BLAS, LAPACK, Boost, SciPy, NumPy, …
The distinction between compiled and interpreted languages is getting blurred. For example, the compiler package in R provides JIT (just-in-time) compilation.
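As an illustration of vectorization and of the compiler package, the sketch below computes a sum of squares three ways. The function names are made up for this example, and timings will vary by machine. Recent versions of R enable JIT compilation by default, so the explicit `cmpfun()` call may make little difference.

```r
library(compiler)   # ships with R; provides cmpfun() for byte-compiling functions

# Loop version: one interpreted iteration per element.
sum_squares_loop <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi^2
  total
}

# Vectorized version: the loop runs inside compiled code behind `^` and sum().
sum_squares_vec <- function(x) sum(x^2)

# Explicitly byte-compiled copy of the loop version.
sum_squares_cmp <- cmpfun(sum_squares_loop)

x <- as.numeric(1:1e6)
print(system.time(r_loop <- sum_squares_loop(x)))
print(system.time(r_vec  <- sum_squares_vec(x)))
print(system.time(r_cmp  <- sum_squares_cmp(x)))

# All three versions should agree up to floating-point rounding.
print(all.equal(r_loop, r_vec))
print(all.equal(r_cmp, r_vec))
```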
(reading assignment)
Check out Google’s R Style Guide, the style guide in Advanced R, and the tidyverse style guide.
R can do any basic mathematical computations.
symbol | use |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ | power |
%% | modulus |
exp() | exponential function |
log() | natural logarithm |
sqrt() | square root |
round() | rounding |
floor() | flooring |
ceiling() | ceiling |
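A few of the operators and functions from the table, tried on small inputs chosen for illustration:

```r
r_mod   <- 7 %% 3              # remainder of 7 divided by 3: 1
r_pow   <- 2^10                # 1024
r_exp   <- exp(1)              # Euler's number, about 2.718282
r_log   <- log(exp(2))         # natural log undoes exp(): 2
r_sqrt  <- sqrt(16)            # 4
r_round <- round(3.14159, 2)   # round to 2 decimal places: 3.14
r_floor <- floor(-2.5)         # largest integer not above -2.5: -3
r_ceil  <- ceiling(-2.5)       # smallest integer not below -2.5: -2

print(c(r_mod, r_pow, r_exp, r_log, r_sqrt, r_round, r_floor, r_ceil))
```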
You can create an R object to save results of a computation or other command.
Example 1
x <- 3 + 5
x
## [1] 8
Example 2 (a space inside “<-” turns the assignment into the comparison x < -3 + 5, so x is unchanged)
x < - 3 + 5
## [1] FALSE
x
## [1] 8
Example 3
sum.result <- x + 5
sum.result
## [1] 13
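Example 4 (assigning to the name of an existing function: a call such as sum(5:8) still finds the function, but this practice is confusing and best avoided)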
sum(2:5)
## [1] 14
sum
## function (..., na.rm = FALSE) .Primitive("sum")
sum <- 3 + 4 + 5
sum(5:8)
## [1] 26
sum
## [1] 12
The commands “objects()” and “ls()” will provide a list of every object that you’ve created in a session.
objects()
## [1] "sum" "sum.result" "x"
ls()
## [1] "sum" "sum.result" "x"
The “rm()” and “remove()” commands let you delete objects (tip: always clean up your workspace as the first command)
rm(list=ls()) # clean up workspace
Many commands in R generate a vector of output, rather than a single number.
The “c()” command: creates a vector containing a list of specific elements.
Example 1
c(7, 3, 6, 0)
## [1] 7 3 6 0
c(73:60)
## [1] 73 72 71 70 69 68 67 66 65 64 63 62 61 60
c(7:3, 6:0)
## [1] 7 6 5 4 3 6 5 4 3 2 1 0
c(rep(7:3, 6), 0)
## [1] 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 0
Example 2 The command “seq()” creates a sequence of numbers.
seq(7)
## [1] 1 2 3 4 5 6 7
seq(3, 70, by = 6)
## [1] 3 9 15 21 27 33 39 45 51 57 63 69
seq(3, 70, length.out = 6)
## [1] 3.0 16.4 29.8 43.2 56.6 70.0
Use brackets to select elements of a vector.
x <- 73:60
x[2]
## [1] 72
x[2:5]
## [1] 72 71 70 69
x[-(2:5)]
## [1] 73 68 67 66 65 64 63 62 61 60
Elements can also be accessed by “name” (safe when the column/row order changes)
y <- 1:3
names(y) <- c("do", "re", "mi")
y[3]
## mi
## 3
y["mi"]
## mi
## 3
R commands on vectors
command | usage |
---|---|
sum() | sum over elements in vector |
mean() | compute average value |
sort() | sort elements in a vector |
min(), max() | min and max values of a vector |
length() | length of a vector |
summary() | returns the min, Q1, median, mean, Q3, and max values of a vector |
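The commands in the table can be tried on a small vector; the vector `v` below is an arbitrary example:

```r
v <- c(7, 3, 6, 0)

v_sum    <- sum(v)      # 16
v_mean   <- mean(v)     # 4
v_sorted <- sort(v)     # 0 3 6 7 (increasing order by default)
v_min    <- min(v)      # 0
v_max    <- max(v)      # 7
v_len    <- length(v)   # 4

print(v_sorted)
print(summary(v))       # min, Q1, median, mean, Q3, max in one call
```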
Exercise Write a command to generate a random permutation of the numbers between 1 and 5 and save it to an object.