After working through this handout, you should:
Be able to fit linear models in R
Be able to summarize and interpret linear model fits
Linear models describe the relationship between one or more
independent variables (covariates) and a dependent (response) variable.
As we discussed, several methods exist for fitting linear models and
these models have many extensions beyond simple linear regression.
Still, we will begin with simple linear regression with continuous
covariates and fit the model by minimizing the sum-of-squares deviation.
We will use the lm
function for this, which uses a new
formula format.
#Let's use one of the built-in R data sets. The data give the speed
#of cars and the distances taken to stop.
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
#First plot dist as a function of speed
plot(cars$speed, cars$dist, pch = 19, xlab = "Speed", ylab = "Distance")
#Looks like the relationship is approx. linear
#Let's fit a linear model, note the formula syntax
lfit <- lm(cars$dist ~ cars$speed)
#Summarize the fit
summary(lfit)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
#You can access components of lfit with $ notation
lfit$coefficients
## (Intercept) cars$speed
## -17.579095 3.932409
lfit$residuals
## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
## 8 9 10 11 12 13 14
## 4.255007 12.255007 -8.677401 2.322599 -15.609810 -9.609810 -5.609810
## 15 16 17 18 19 20 21
## -1.609810 -7.542219 0.457781 0.457781 12.457781 -11.474628 -1.474628
## 22 23 24 25 26 27 28
## 22.525372 42.525372 -21.407036 -15.407036 12.592964 -13.339445 -5.339445
## 29 30 31 32 33 34 35
## -17.271854 -9.271854 0.728146 -11.204263 2.795737 22.795737 30.795737
## 36 37 38 39 40 41 42
## -21.136672 -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
## 43 44 45 46 47 48 49
## 2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285 43.201285
## 50
## 4.268876
#Add best fit line
plot(cars$speed, cars$dist, pch = 19, xlab = "Speed", ylab = "Distance")
abline(lfit$coefficients)
#Another way to fit the same model, just different syntax
lfit <- lm(dist ~ speed, data = cars)
summary(lfit)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
#Units for the regression coefficients are same as covariates
#can standardize to put in units of SDs
sspeed <- (cars$speed - mean(cars$speed))/sd(cars$speed)
lfit <- lm(cars$dist ~ sspeed)
summary(lfit)
##
## Call:
## lm(formula = cars$dist ~ sspeed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.980 2.175 19.761 < 2e-16 ***
## sspeed 20.793 2.197 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Let’s look at another example, this time with more than one covariate. We will also allow for an interaction, that is for the effect of one covariate to depend on another covariate.
data(trees)
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
#First fit model of volume ~ girth and height
lfit <- lm(Volume ~ Girth + Height, data = trees)
summary(lfit)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
#Add interaction
lfit <- lm(Volume ~ Girth + Height + Girth:Height, data = trees)
summary(lfit)
##
## Call:
## lm(formula = Volume ~ Girth + Height + Girth:Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
#This is a shortcut to include main effects and interaction
lfit <- lm(Volume ~ Girth * Height, data = trees)
summary(lfit)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
#Let's see how standardizing coefficients changes things with an interaction
sG <- (trees$Girth - mean(trees$Girth))/sd(trees$Girth)
sH <- (trees$Height - mean(trees$Height))/sd(trees$Height)
#Compare coefficients, p-values and r2
lfit <- lm(trees$Volume ~ sG * sH)
summary(lfit)
##
## Call:
## lm(formula = trees$Volume ~ sG * sH)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.8179 0.5447 52.910 < 2e-16 ***
## sG 13.7384 0.6083 22.585 < 2e-16 ***
## sH 3.1022 0.6032 5.143 2.07e-05 ***
## sG:sH 2.6925 0.4874 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
Lastly, let’s look at an example with continuous and discrete independent variables, which is analogous to an ANOVA or ANCOVA. We will also examine alternative summaries of the model fit here. Back to the teeth growth example for this one.
data(ToothGrowth)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
#Standard dose
sDose <- (ToothGrowth$dose - mean(ToothGrowth$dose))/sd(ToothGrowth$dose)
#Fit without then with an interaction
lfit <- lm(ToothGrowth$len ~ ToothGrowth$supp + sDose)
summary(lfit)
##
## Call:
## lm(formula = ToothGrowth$len ~ ToothGrowth$supp + sDose)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.6633 0.7733 26.721 < 2e-16 ***
## ToothGrowth$suppVC -3.7000 1.0936 -3.383 0.0013 **
## sDose 6.1400 0.5514 11.135 6.31e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
anova(lfit)
## Analysis of Variance Table
##
## Response: ToothGrowth$len
## Df Sum Sq Mean Sq F value Pr(>F)
## ToothGrowth$supp 1 205.35 205.35 11.447 0.001301 **
## sDose 1 2224.30 2224.30 123.989 6.314e-16 ***
## Residuals 57 1022.56 17.94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lfit <- lm(ToothGrowth$len ~ ToothGrowth$supp * sDose)
summary(lfit)
##
## Call:
## lm(formula = ToothGrowth$len ~ ToothGrowth$supp * sDose)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2264 -2.8462 0.0504 2.2893 7.9386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.6633 0.7455 27.718 < 2e-16 ***
## ToothGrowth$suppVC -3.7000 1.0543 -3.510 0.000894 ***
## sDose 4.9124 0.7518 6.534 2.03e-08 ***
## ToothGrowth$suppVC:sDose 2.4553 1.0632 2.309 0.024631 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7151
## F-statistic: 50.36 on 3 and 56 DF, p-value: 6.521e-16
anova(lfit)
## Analysis of Variance Table
##
## Response: ToothGrowth$len
## Df Sum Sq Mean Sq F value Pr(>F)
## ToothGrowth$supp 1 205.35 205.35 12.3170 0.0008936 ***
## sDose 1 2224.30 2224.30 133.4151 < 2.2e-16 ***
## ToothGrowth$supp:sDose 1 88.92 88.92 5.3335 0.0246314 *
## Residuals 56 933.63 16.67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Let's visualize the results
lfit$coefficients
## (Intercept) ToothGrowth$suppVC sDose
## 20.663333 -3.700000 4.912390
## ToothGrowth$suppVC:sDose
## 2.455297
plot(sDose, ToothGrowth$len, pch = 19, col = c("orange", "cadetblue")[as.numeric(ToothGrowth$sup)],
xlab = "Stand. dose", ylab = "Length")
abline(a = lfit$coefficients[1], b = lfit$coefficients[3], col = "orange",
lwd = 2)
abline(a = lfit$coefficients[1] + lfit$coefficients[2], b = lfit$coefficients[3] +
lfit$coefficients[4], col = "cadetblue", lwd = 2)
legend(-1.1, 32, c("OJ", "VC"), fill = c("orange", "cadetblue"))
\[ \] \[ \] \[ \]