Goals

After working through this handout, you should:

Fitting linear models in R

Linear models describe the relationship between one or more independent variables (covariates) and a dependent (response) variable. As we discussed, several methods exist for fitting linear models and these models have many extensions beyond simple linear regression. Still, we will begin with simple linear regression with continuous covariates and fit the model by minimizing the sum-of-squares deviation. We will use the lm function for this, which uses a new formula format.

#Let's use one of the built-in R data sets. The data give the speed 
#of cars and the distances taken to stop.
data(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
#First plot dist as a function of speed
plot(cars$speed, cars$dist, pch = 19, xlab = "Speed", ylab = "Distance")

#Looks like the relationship is approx. linear
#Let's fit a linear model, note the formula syntax
lfit <- lm(cars$dist ~ cars$speed)

#Summarize the fit
summary(lfit)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
#You can access components of lfit with $ notation
lfit$coefficients
## (Intercept)  cars$speed 
##  -17.579095    3.932409
lfit$residuals
##          1          2          3          4          5          6          7 
##   3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584  -3.744993 
##          8          9         10         11         12         13         14 
##   4.255007  12.255007  -8.677401   2.322599 -15.609810  -9.609810  -5.609810 
##         15         16         17         18         19         20         21 
##  -1.609810  -7.542219   0.457781   0.457781  12.457781 -11.474628  -1.474628 
##         22         23         24         25         26         27         28 
##  22.525372  42.525372 -21.407036 -15.407036  12.592964 -13.339445  -5.339445 
##         29         30         31         32         33         34         35 
## -17.271854  -9.271854   0.728146 -11.204263   2.795737  22.795737  30.795737 
##         36         37         38         39         40         41         42 
## -21.136672 -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
##         43         44         45         46         47         48         49 
##   2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285  43.201285 
##         50 
##   4.268876
#Add best fit line 
plot(cars$speed, cars$dist, pch = 19, xlab = "Speed", ylab = "Distance")
abline(lfit$coefficients)

#Another way to fit the same model, just different syntax
lfit <- lm(dist ~ speed, data = cars)
summary(lfit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
#Units for the regression coefficients are same as covariates 
#can standardize to put in units of SDs
sspeed <- (cars$speed - mean(cars$speed))/sd(cars$speed)
lfit <- lm(cars$dist ~ sspeed)
summary(lfit)
## 
## Call:
## lm(formula = cars$dist ~ sspeed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.980      2.175  19.761  < 2e-16 ***
## sspeed        20.793      2.197   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Let’s look at another example, this time with more than one covariate. We will also allow for an interaction, that is for the effect of one covariate to depend on another covariate.

data(trees)
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
#First fit model of volume ~ girth and height
lfit <- lm(Volume ~ Girth + Height, data = trees)
summary(lfit)
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
#Add interaction
lfit <- lm(Volume ~ Girth + Height + Girth:Height, data = trees)
summary(lfit)
## 
## Call:
## lm(formula = Volume ~ Girth + Height + Girth:Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16
#This is a shortcut to include main effects and interaction
lfit <- lm(Volume ~ Girth * Height, data = trees)
summary(lfit)
## 
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16
#Let's see how standardizing coefficients changes things with an interaction
sG <- (trees$Girth - mean(trees$Girth))/sd(trees$Girth)
sH <- (trees$Height - mean(trees$Height))/sd(trees$Height)
#Compare coefficients, p-values and r2
lfit <- lm(trees$Volume ~ sG * sH)
summary(lfit)
## 
## Call:
## lm(formula = trees$Volume ~ sG * sH)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.8179     0.5447  52.910  < 2e-16 ***
## sG           13.7384     0.6083  22.585  < 2e-16 ***
## sH            3.1022     0.6032   5.143 2.07e-05 ***
## sG:sH         2.6925     0.4874   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16

Lastly, let’s look at an example with continuous and discrete independent variables, which is analogous to an ANOVA or ANCOVA. We will also examine alternative summaries of the model fit here. Back to the teeth growth example for this one.

data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
#Standard dose
sDose <- (ToothGrowth$dose - mean(ToothGrowth$dose))/sd(ToothGrowth$dose)
#Fit without then with an interaction
lfit <- lm(ToothGrowth$len ~ ToothGrowth$supp + sDose)
summary(lfit)
## 
## Call:
## lm(formula = ToothGrowth$len ~ ToothGrowth$supp + sDose)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         20.6633     0.7733  26.721  < 2e-16 ***
## ToothGrowth$suppVC  -3.7000     1.0936  -3.383   0.0013 ** 
## sDose                6.1400     0.5514  11.135 6.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16
anova(lfit)
## Analysis of Variance Table
## 
## Response: ToothGrowth$len
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## ToothGrowth$supp  1  205.35  205.35  11.447  0.001301 ** 
## sDose             1 2224.30 2224.30 123.989 6.314e-16 ***
## Residuals        57 1022.56   17.94                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lfit <- lm(ToothGrowth$len ~ ToothGrowth$supp * sDose)
summary(lfit)
## 
## Call:
## lm(formula = ToothGrowth$len ~ ToothGrowth$supp * sDose)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2264 -2.8462  0.0504  2.2893  7.9386 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               20.6633     0.7455  27.718  < 2e-16 ***
## ToothGrowth$suppVC        -3.7000     1.0543  -3.510 0.000894 ***
## sDose                      4.9124     0.7518   6.534 2.03e-08 ***
## ToothGrowth$suppVC:sDose   2.4553     1.0632   2.309 0.024631 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7151 
## F-statistic: 50.36 on 3 and 56 DF,  p-value: 6.521e-16
anova(lfit)
## Analysis of Variance Table
## 
## Response: ToothGrowth$len
##                        Df  Sum Sq Mean Sq  F value    Pr(>F)    
## ToothGrowth$supp        1  205.35  205.35  12.3170 0.0008936 ***
## sDose                   1 2224.30 2224.30 133.4151 < 2.2e-16 ***
## ToothGrowth$supp:sDose  1   88.92   88.92   5.3335 0.0246314 *  
## Residuals              56  933.63   16.67                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Let's visualize the results
lfit$coefficients
##              (Intercept)       ToothGrowth$suppVC                    sDose 
##                20.663333                -3.700000                 4.912390 
## ToothGrowth$suppVC:sDose 
##                 2.455297
plot(sDose, ToothGrowth$len, pch = 19, col = c("orange", "cadetblue")[as.numeric(ToothGrowth$sup)],
  xlab = "Stand. dose", ylab = "Length")
abline(a = lfit$coefficients[1], b = lfit$coefficients[3], col = "orange",
  lwd = 2)
abline(a = lfit$coefficients[1] + lfit$coefficients[2], b = lfit$coefficients[3] +
  lfit$coefficients[4], col = "cadetblue", lwd = 2)

legend(-1.1, 32, c("OJ", "VC"), fill = c("orange", "cadetblue"))

\[ \] \[ \] \[ \]