reference R4DS
R uses formulas for most modelling functions.
A simple example: y ~ x
translates to: \(y = \beta_0 + x\beta_1\).
Below is a summary for some popular uses
Use | Example | Translation |
---|---|---|
Simple example | y ~ x |
\(y = \beta_0 + x\beta_1\) |
Exclude intercept | y ~ x - 1 or y ~ x + 0 |
\(y = x\beta\) |
Add a continuous variable | y ~ x1 + x2 |
\(y = \beta_0 + x_1 \beta_1 + x_2 \beta_2\) |
Add a categorical variable | y ~ x1 + sex |
\(y = \beta_0 + x_1 \beta_1 + \mbox{sex_male} \beta_2\) |
Interactions (continuous and categorical) | y ~ x1 * sex |
\(y = \beta_0 + x_1 \beta1 + \mbox{sex_male} \beta_2 + x_1 \mbox{sex_male} \beta_{12}\) |
Interactions (continuous and continuous) | y ~ x1 * x_2 |
\(y = \beta_0 + x_1 \beta1 + x_2 \beta_2 + x_1 x_2 \beta_{12}\) |
You can specify transformations inside the model formula too (besides creating new transformed variables in your data frame).
For example, log(y) ~ sqrt(x1) + x2
is transformed to \(\log(y) = \beta_0 + \sqrt{x_1} \beta_1 + x_2 \beta_2\).
One thing to note is that if your transformation involves +
, *
, ^
or -
, you will need to wrap it in I()
so R doesn’t treat it like part of the model specification.
For example, y ~ x + I(x ^ 2)
is translated to \(y = \beta_0 + x \beta_1 + x^2 \beta_2\).
And remember, R automatically drops redundant variables so x + x
become x
.
y ~ x ^ 2 + x
y ~ x + x ^ 2 + x ^ 3
y ~ poly(x, 3)
y ~ (x1 + x2 + x3) ^2 + (x2 + x3) * (x4 + x5)
In the class, we used stepwise regression. Now we visit the other two sequential methods:
Forward selection:
Determine a shreshold significance level for entering predictors into the model
Begin with a model without any predictors, i.e., \(y = \beta_0 + \epsilon\)
For each predictor not in the model, fit a model that adds that predictor to the working model. Record the t-statistic and p-value for the added predictor.
Do any of the predictors considered in step 3 have a p-value less than the shreshold specified in step 1?
Yes: Add the predictor with the smallest p-value to the working model and return to step 3
No: Stop. The working model is the final model.
Backwards elimination
Determine a threshold significance level for removing predictors from the model.
Begin with a model that includes all predictors.
Examine the t-statistic and p-value of the partial regression coefficients associated with each predictor.
Do any of the predictors have a p-value greater than the threshold specified in step 1?
Yes: Remove the predictor with the largest p-value from the working model and return to step 3.
No: Stop. The working model is the final model.
Stepwise selection
Combines forward selection and backwards elimination steps.
Now perform Model selection on the cheddar dataset from the faraway package.
Use taste
as the response variable.
taste ~ 1
library(faraway)
(small.lm <- lm(taste ~ 1, data = cheddar))
##
## Call:
## lm(formula = taste ~ 1, data = cheddar)
##
## Coefficients:
## (Intercept)
## 24.53
finish the code below
# start from the large model that include all
biglm <- lm(taste ~ (Acetic + H2S + Lactic) ^ 2, data = cheddar)
Acetic = 4.5
, H2S = 3.8
and Lactic = 1.00
?What is the prediction confidence interval? What causes the difference between them?