After working through this handout, you should:
- know what bootstrapping is, and why it is useful
- be able to conduct simple permutation tests and bootstrapping in R
Bootstrapping is a computational procedure that uses random sampling with replacement to assess confidence and uncertainty in a sample estimate. The basic idea is that inference about a parameter from sample data can be modeled by resampling the sample data and performing inference on the resampled data. Bootstrapping is popular because of its simplicity and flexibility. Let’s look at two examples of applying bootstrapping. In the first, we will use bootstrapping to measure the uncertainty in our estimate of a population mean.
#First simulate data: here we 'measure' the height of 25
#individuals (e.g., rabbits) sampled randomly from a population and
#assume height is normally distributed with mean = 2.5 and SD = 0.5
hgt <- rnorm(25, mean = 2.5, sd = 0.5)
#Sample mean provides a point estimate of the mean
est <- mean(hgt)
est
## [1] 2.491418
#We can compute the standard error and 95% confidence interval. This
#assumes the data come from a normal distribution, which here happens to be true
#SE = SD/sqrt(N)
se <- sd(hgt)/sqrt(length(hgt))
se
## [1] 0.1147851
#Mean +- SE
est + se
## [1] 2.606204
est - se
## [1] 2.376633
# 95% CI = est +- 1.96 * SE
est + 1.96 * se
## [1] 2.716397
est - 1.96 * se
## [1] 2.26644
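As a sanity check, R’s built-in t.test computes a confidence interval for a mean directly. It uses the t distribution rather than the 1.96 normal approximation, so for a sample of 25 its interval will be slightly wider; a minimal sketch (output omitted, since hgt is random):
#Built-in 95% CI for the mean, based on the t distribution
t.test(hgt)$conf.int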
#Now let's do the same thing with bootstrapping. First, we will
#generate 1000 bootstrap replicate data sets and corresponding
#means by resampling our sample
est_bs <- rep(NA, 1000)
for (i in 1:1000) {
    hgt_bs <- sample(hgt, 25, replace = TRUE) #Resample with replacement
    est_bs[i] <- mean(hgt_bs) #Store the mean of this bootstrap replicate
}
#The SD of the bootstrap means is an estimate of the SE
se_bs <- sd(est_bs)
#Mean +- SE
est + se_bs
## [1] 2.607214
est - se_bs
## [1] 2.375623
#The 2.5th and 97.5th percentiles of the bootstrap distribution of the
#mean define the bootstrap 95% CI. We will use a new function, quantile, for this
quantile(est_bs, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 2.272000 2.723979
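As an aside, the resampling loop can also be delegated to R’s boot package (shipped with R, but assumed loaded here); the statistic you pass must accept the data and a vector of resampled indices. A minimal sketch:
library(boot)
#Statistic: mean of the values selected by the bootstrap indices
mean_fun <- function(d, idx) mean(d[idx])
b <- boot(hgt, mean_fun, R = 1000)
sd(b$t) #Bootstrap SE, as above
boot.ci(b, type = "perc") #Percentile 95% CI, comparable to quantile()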
Notice that in this case bootstrapping gave results quite similar to those based on parametric assumptions. This will not always be the case, and bootstrapping can be used in cases where parametric assumptions do not hold.
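For instance, there is no simple parametric SE formula for a median. A minimal sketch of a bootstrap 95% CI for the median of a skewed sample, using made-up exponential data:
#Hypothetical skewed sample; no simple parametric CI for its median
x <- rexp(25, rate = 1)
med_bs <- rep(NA, 1000)
for (i in 1:1000) {
    med_bs[i] <- median(sample(x, 25, replace = TRUE))
}
quantile(med_bs, probs = c(0.025, 0.975)) #Bootstrap 95% CI for the median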
Let’s look at another example. Here, we will use a data set available from R. The details are not important for us; just note that the data comprise CO2 uptake rates for a grass species along with ambient CO2 levels. We want to know how strongly uptake is correlated with ambient CO2. We can use the function cor to compute the correlation. The function cor.test computes the correlation, a 95% CI on the correlation, and a P-value for the null hypothesis that the correlation is equal to zero. We will perform this test and compare it to bootstrap estimates of the same quantities.
#First load the data
data(CO2)
#Here are the first few rows of data
head(CO2)
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
## standard, parametric correlation test
cor.test(CO2$conc, CO2$uptake)
##
## Pearson's product-moment correlation
##
## data: CO2$conc and CO2$uptake
## t = 5.0245, df = 82, p-value = 2.906e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3022189 0.6336595
## sample estimates:
## cor
## 0.4851774
#Now let's get bootstrap estimates
est <- cor(CO2$conc, CO2$uptake)
N <- nrow(CO2) #Number of rows = sample size
#1000 bootstrap estimates of the correlation
est_bs <- rep(NA, 1000)
for (i in 1:1000) {
    #Resample row indices so each conc-uptake pair stays together
    index_bs <- sample(1:N, N, replace = TRUE)
    est_bs[i] <- cor(CO2$conc[index_bs], CO2$uptake[index_bs])
}
#Bootstrap CIs and P-value analogue; note the factor of 2 for a two-sided test
est
## [1] 0.4851774
quantile(est_bs, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.3040534 0.6431026
2 * min(mean(est_bs < 0), mean(est_bs > 0))
## [1] 0
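A bootstrap P-value of exactly 0 just means that none of the 1000 bootstrap correlations fell below zero; with 1000 replicates, all we can really claim at this resolution is P < 0.002. As an aside, the same resampling can be written more compactly with replicate; a sketch equivalent to the loop above:
#Compact version of the bootstrap loop using replicate
est_bs2 <- replicate(1000, {
    idx <- sample(N, N, replace = TRUE)
    cor(CO2$conc[idx], CO2$uptake[idx])
})
quantile(est_bs2, probs = c(0.025, 0.975))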
Permutation tests are computational procedures for null hypothesis testing. The idea is to randomize, or permute, observations among treatments or covariates. Unlike bootstrapping, this generally involves some form of sampling without replacement: all observations are used, and each is used exactly once. In many cases, permutation tests are valid alternatives to regression models or t-tests. They share the same advantages as bootstrapping, namely that they do not rely on parametric (distribution-based) assumptions, and thus are quite flexible. Moreover, the details of the permutations can be tuned to incorporate structure in the data that might not be readily handled by standard parametric procedures. For now, let’s work through a simple example where we use a permutation procedure to determine whether the difference between the means of two treatments is greater than expected by chance, that is, greater than expected under a null hypothesis of no true effect.
#First load the data
data(ToothGrowth)
#This data set has tooth length (growth) in 60 guinea pigs that
#were given vitamin C doses as orange juice (OJ) or ascorbic acid (VC)
mnOJ <- mean(ToothGrowth$len[ToothGrowth$supp == "OJ"])
mnVC <- mean(ToothGrowth$len[ToothGrowth$supp == "VC"])
adif <- abs(mnOJ - mnVC) #Absolute difference as the test statistic
#Now randomize treatment 1000 times and compute the null distribution
#of differences
adif_perm <- rep(NA, 1000)
N <- length(ToothGrowth$supp)
for (i in 1:1000) {
    #Shuffle treatment labels without replacement (each label used exactly once)
    ranSupp <- sample(ToothGrowth$supp, N, replace = FALSE)
    mnOJ <- mean(ToothGrowth$len[ranSupp == "OJ"])
    mnVC <- mean(ToothGrowth$len[ranSupp == "VC"])
    adif_perm[i] <- abs(mnOJ - mnVC)
}
#Compute P-value: proportion of permuted differences at least as extreme
mean(adif_perm >= adif)
## [1] 0.056
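For comparison, the standard parametric counterpart here is a two-sample t-test, which for these data gives a P-value of about 0.06, close to the permutation result:
#Parametric counterpart to the permutation test
t.test(len ~ supp, data = ToothGrowth)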
#Visualize the null distribution and the observed difference
hist(adif_perm, main = "Null distribution", xlab = "Absolute difference",
cex.lab = 1.2)
abline(v = adif, lwd = 2, col = "darkred")