R Basics (cont.)

Announcement
Course Project
Consequences of computer storage / arithmetic
Programming Languages
R basics

Announcement

Introduce yourself (the charm of zoom)
Wednesday class will be online too. Stay safe and dry!
Course GitHub organization invitation
Lab sessions
- There will be recordings (when needed) in the future.
- No need to submit the lab “work”.
- There will be “solutions” posted (after the following Monday lecture) for future lab sessions (when there are questions).
Using R is a course objective
- try to use R as much as possible for lab sessions and homework assignments
- free to use any language for course project
Project and homework submission via GitHub
Homework assignment starts week 2. 1st assignment due on week 4. Expected frequency: bi-weekly.
Will provide optional reading material on Course Webpage. (No poll for python, it will be provided through this mechanism)
Github page contains the most up-to-date materials.
Regret on sending multiple locations for posting GitHub ids.

Course Project

Find a dataset of interest to you.
Turn in a brief one-page description by the end of week 3. (3/30)
Submit a mid-term report (2 - 4 pages, no more than 4 please) by the end of week 12. (7/30)
Present your work to your peers week 15 and 16. (10/30)
Submit a final report (4 - 8 pages, no more than 8 please) by the end of the semester by December 5.
Submit code to your own private GitHub repository on the course GitHub organization by December 5. (Report + Code, 10/30)
(Optional) make a GitHub page for your project.

Project ideas/Dataset resources

Amazon data http://jmcauley.ucsd.edu/data/amazon/, https://nijianmo.github.io/amazon/index.html, https://cseweb.ucsd.edu/~jmcauley/datasets.html
Netflix challenge
Sports/eSports prediction
Hurricane prediction!
1000 human genome project
Reproduce findings of a paper in your field (could be hard).
Google “data science projects” to get more ideas

Brief Description components

Introduce the dataset (data type, origin, etc). Explain why you choose the dataset. List some questions you want to explore with the dataset.

Mid-term report components

Include the brief description with modifications if needed
Give an abstract on your plan
- What analyses you want to perform for answering your questions
Current progress and future plan

Final report components

Introduce the dataset. Explain why you choose it. Explain what questions you want to ask and explore using the dataset.
Analysis. Explain the statistical methods that you use for analyzing the dataset. Explain what you have done to generate the results (make your analysis reproducible).
Results. Illustrate your results. Use figures and tables to imiprove readability.
Discussions. This is the place to put in almost whatever you want to share. Some difficulties you met in the analysis, what you learned from the analysis, some future directions.

Consequences of computer storage / arithmetic

Be memory conscious when dealing with big data. E.g., human genome has about \(3 \times 10^9\) bases, each of which belongs to {A, C, G, T}. How much storage if we store \(10^6\) SNPs (single nucleotide polymorphisms) of \(1000\) individuals as single (4GB), double (8GB), int32 (4GB), int16 (2GB), int8 (1GB), PLINK library format 2bit/SNP (250MB)?
Know the limit. Overflow and Underflow. For double precision, \(\pm 10 ^{\pm 308}\). In most situations, underflow is “preferred” over overflow. Overflow often causes crashes. Underflow yields zeros (which however could lead to \(0 / 0\) situations).
- Example 1, in logistic regression, \(p_i = \frac{\exp{(x_i^T \beta})}{1 + \exp{(x_i^T \beta})} = \frac{1}{1 + \exp{(-x_i^T \beta})}\). The former expression can easily lead to \(\infty / \infty = NaN\), while the latter expression leads to graceful underflow.
- Example 2, calculation of the probability of large amount of iid (independent and identically distributed) random variables (r.v.). Consider operation in log-space.

Programming Languages

Compiled versus interpreted languages.

Compiled languages: C/C++, Fortran, … directly compiled to machine code that is executed by CPU. Advantage: fast, take less memory. Disadvantage: relatively longer development time, hard to debug.
Interpreted language: R, Matlab, SAS IML, … Interpreted by interpreter. Advantage: fast for prototyping. Disadvantage: excruciatingly slow for loops.
Mixed (compiled and then interpreted by virtual machine): Python, JAVA. Advantage: extremely convenient for data preprocessing and manipulation; relatively short development time. Disadvantage: not as fast as compiled language.
Scripting: Unix/Linux scripts, Perl, Python. Extremely useful for data preprocessing and manipulation.
Database language: SQL, Hadoop. Data analysis never happens if we do not know how to retrieve data from databases.

More about computer languages

To improve efficiency of interpreted languages such as R code, avoid loops as much as possible. Aka, vectorize code.
For some tasks where looping is necessary (cannot vectorize code), consider conding in C/C++ or Fortran. It is convenient to incorporate compiled code into R.
To be versatile in dealing with big data, master at least on language in each category.
Don’t reinvent wheels. Make good use of libraries BLAS, LAPACK, Boost, Scipy, Numpy, …
Distinction between compiled language and interpreted language is getting blurred. The compiler package in R for JIT (just-in-time) compilation technology.

R basics

styles

(reading assignment)

Checkout Google’s R style Guide, Style guide in Advanced R and the tidyverse style guide.

Arithmetic

R can do any basic mathematical computations.

symbol	use
+	addition
-	subtraction
*	multiplication
/	division
^	power
%%	modulus
exp()	exponent
log()	natural logarithm
sqrt()	square root
round()	rounding
floor()	flooring
ceiling()	ceiling

Objects

You can create an R object to save results of a computation or other command.

Example 1

x <- 3 + 5
x

## [1] 8

In most languages, the direction of passing through the value into the object goes from right to left (e.g. with “=”). However, R allows both directions (which is actually bad!). In this course, we encourage the use of “<-” or “=”. There are people liking “=” over “<-” for the reason that “<-” sometimes break into two operators “< -”.

Example 2

x < - 3 + 5

## [1] FALSE

## [1] 8

For naming conventions, stick with either “.” or "_" (refer to the style guide).

Example 3

sum.result <- x + 5
sum.result

## [1] 13

important: many names are already taken for built-in R functions. Make sure that you don’t override them.

Example 4

sum(2:5)

## [1] 14

sum

## function (..., na.rm = FALSE)  .Primitive("sum")

sum <- 3 + 4 + 5
sum(5:8)

## [1] 26

sum

## [1] 12

R is case-sensitive. “Math.7360” is different from “math.7360”.

Locating and deleting objects:

The commands “objects()” and “ls()” will provide a list of every object that you’ve created in a session.

objects()

## [1] "sum"        "sum.result" "x"

ls()

## [1] "sum"        "sum.result" "x"

The “rm()” and “remove()” commands let you delete objects (tip: always clearn-up your workspace as the first command)

rm(list=ls())  # clean up workspace

Vectors

Many commands in R generate a vector of output, rather than a single number.

The “c()” command: creates a vector containing a list of specific elements.

Example 1

c(7, 3, 6, 0)

## [1] 7 3 6 0

c(73:60)

##  [1] 73 72 71 70 69 68 67 66 65 64 63 62 61 60

c(7:3, 6:0)

##  [1] 7 6 5 4 3 6 5 4 3 2 1 0

c(rep(7:3, 6), 0)

##  [1] 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 0

Example 2 The command “seq()” creates a sequence of numbers.

seq(7)

## [1] 1 2 3 4 5 6 7

seq(3, 70, by = 6)

##  [1]  3  9 15 21 27 33 39 45 51 57 63 69

seq(3, 70, length = 6)

## [1]  3.0 16.4 29.8 43.2 56.6 70.0

Operations on vectors

Use brackets to select element of a vector.

x <- 73:60
x[2]

## [1] 72

x[2:5]

## [1] 72 71 70 69

x[-(2:5)]

##  [1] 73 68 67 66 65 64 63 62 61 60

Can access by “name” (safe with column/row order changes)

y <- 1:3
names(y) <- c("do", "re", "mi")
y[3]

## mi 
##  3

y["mi"]

## mi 
##  3

R commands on vectors

command	usage
sum()	sum over elements in vector
mean()	compute average value
sort()	sort elements in a vector
min(), max()	min and max values of a vector
length()	length of a vector
summary()	returns the min, Q1, median, mean, Q3, and max values of a vector

Exercise Write a command to generate a random permutation of the numbers between 1 and 5 and save it to an object.