This tutorial covers models and simulations in R, with a focus on bootstrap, permutation and simulation models.
A bootstrap model gets its name from the old 19th century expression to “lift yourself up by your own bootstraps”, as a metaphor for accomplishing a task without external forces. In the context of data analysis, the term bootstrap refers to the generation of ‘new’ data from existing data.
A very common use of a bootstrap model is to generate a null distribution from the data, to test whether the observed mean or variance is different than what we would expect by chance.
You may not realize it, but you have already taken the first step towards a bootstrap model in R by using the sample()
function.
Let’s look at the data from the Global Garlic Mustard Field Survey. The data are available here. Details on this dataset are covered in the previous tutorial introducting climate models: link:
GMdat<-read.csv("../EcologyTutorials/GGMFS_Teaching_Data.csv",header=T)
GMsubDat<-GMdat[,c("Longitude","Latitude","Region","Pop_Size")] # Subset to columns of interest
GMsubDat<-GMsubDat[complete.cases(GMsubDat),] # Remove missing data
Let’s calculate the Mean and 95% CI for Pop_Size
. In classic frequentist statistics, we might assume a normal distribution, and then calculate the Mean and Standard Error (SE). You may recall from intro stats that the mean of a random sample drawn from a Gaussian distribution will fall within ~1.96x the standard error of the mean, 95% of the time. This is a complicated way of saying that the parametric 95% CI of a random sample is Mean +/- 1.96*SE. You may also recall that the SE of a sample mean is Standard Deviation (SD) divided by the square root of the sample size: \[\frac{SD}{\sqrt{(N-1)}}\].
# Calculate mean, SE and CI from parametric model
X<-mean(GMsubDat$Pop_Size)
SE<-sd(GMsubDat$Pop_Size)/sqrt(nrow(GMsubDat)-1)
CI<-c(X-SE*1.96,X+SE*1.96)
# Round and print values
X<-round(X)
CI<-round(CI)
paste("Raw mean is:",X)
## [1] "Raw mean is: 4601"
paste("Raw 95% CI is:",CI[1],"to",CI[2])
## [1] "Raw 95% CI is: 561 to 8640"
This assumes a normal distribution Let’s take a look at the distribution of Pop_Size
, and then add the mean and 95% CI
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
qplot(Pop_Size,data=GMsubDat)+theme_bw()+geom_vline(aes(xintercept=X),colour="red")+geom_vline(aes(xintercept=CI[1]),colour="blue")+geom_vline(aes(xintercept=CI[2]),colour="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data highly right-skewed, ranging from near-zero to almost 1,000,000. They are clearly not normal, violating the assumption of the parametric estimate.
How can we reduce the skew and improve assumptions of normality in the data?
We can try to apply a transformation to make the data look more like a normal distribution:
# Calculate mean and 95% CI on transformed data
GMsubDat$log_Size<-log(GMsubDat$Pop_Size)
Xlog<-mean(GMsubDat$log_Size)
SElog<-sd(GMsubDat$log_Size)/sqrt(nrow(GMsubDat)-1)
CIlog<-c(Xlog-SElog*1.96,Xlog+SElog*1.96)
# Plot data + mean + 95% CI
qplot(log_Size,data=GMsubDat)+theme_bw()+geom_vline(aes(xintercept=Xlog),colour="red")+geom_vline(aes(xintercept=CIlog[1]),colour="blue")+geom_vline(aes(xintercept=CIlog[2]),colour="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Much better, but still a bit skewed. Let’s back-transform the mean and CI calculated on the log-transformed data, and compare with the original
# Original
cat("Raw mean is:",X,"\n",
"Raw 95% CI is:",CI[1],"to",CI[2],"\n")
## Raw mean is: 4601
## Raw 95% CI is: 561 to 8640
# Back-Transform from log-scale to original measurement scale
Xlog<-exp(Xlog)
CIlog<-exp(CIlog)
# Round
Xlog<-round(Xlog)
CIlog<-round(CIlog)
cat("Transformed mean is:",Xlog,"\n",
"Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n")
## Transformed mean is: 124
## Transformed 95% CI is: 100 to 153
Compare the raw vs. log-transformed values. That’s more than an order of magnitude difference in the mean, and a much smaller 95% CI!
We can also use a bootstrap model estimate the mean and 95% CI directly from the data. The beauty of this approach is that it is not biased by the distribution of the data. To do this we:
We can do this fairly easily in R using sample()
in a for
loop and saving the output in a vector
Iters<-999 # Number of iterations == the number ot times the mean is resampled
BootOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
for (i in 1:Iters){
# Resample the data, calculate the mean, and save it in a vector; repeat
BootOut[i]<-mean(sample(GMsubDat$Pop_Size,nrow(GMsubDat),replace=T))
}
head(BootOut)
## [1] 2744.159 6655.851 5189.162 4463.876 5039.405 8465.112
Now calculate the mean and CI from the sorted data (steps 4-6). Note that the values are drawn directly from the data.
Xboot<-mean(BootOut)
CIlow<-sort(BootOut)[floor(length(BootOut)*0.025)] # Lower 2.5% CI
CIhigh<-sort(BootOut)[ceiling(length(BootOut)*0.975)] # Upper 2.5% CI
CIboot<-c(CIlow,CIhigh)
Plot for comparison.
# Plot
qplot(BootOut)+theme_bw()+geom_vline(aes(xintercept=Xboot),colour="red")+geom_vline(aes(xintercept=CIlow),colour="blue")+geom_vline(aes(xintercept=CIhigh),colour="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Why does the 95% CI seem so much wider on this plot than the previous two?
This is a plot of estimated population means from the bootstrap iterations, whereas the previous graphs plot the raw data. Therefore the x-axis scales are quite different.
Let’s look at all of them together:
# Transform bootstrap values
Xboot<-round(Xboot)
CIboot<-round(CIboot)
cat("Raw mean is:",X,"\n",
"Raw 95% CI is:",CI[1],"to",CI[2],"\n",
"Transformed mean is:",Xlog,"\n",
"Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n",
"Bootstrap mean is:",Xboot,"\n",
"Bootstrap 95% CI is:",CIboot[1],"to",CIboot[2],"\n")
## Raw mean is: 4601
## Raw 95% CI is: 561 to 8640
## Transformed mean is: 124
## Transformed 95% CI is: 100 to 153
## Bootstrap mean is: 4563
## Bootstrap 95% CI is: 1579 to 9099
Which method is most accurate and why?
We can also use a bootstrap simulation as a non-parametric statistic. In the dataset on Alliaria petiolata (garlic mustard) from the Global Garlic Mustard Field Survey there is a factor called Region, which tells us which continent the population is on. Let’s test if the North American populations are larger than populations in Europe.
# Raw data
RawMod<-lm(Pop_Size~Region,data=GMsubDat)
(RawCoef<-summary(RawMod)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 453.8455 3168.477 0.1432377 0.88617623
## RegionNorthAm 7147.6219 4159.867 1.7182332 0.08654573
# Log-transformed
LogMod<-lm(log_Size~Region,data=GMsubDat)
(LogCoef<-summary(LogMod)$coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.016636 0.1600753 25.092168 1.768767e-83
## RegionNorthAm 1.382294 0.2101615 6.577295 1.538992e-10
Recall: The ‘RegionNorthAm’ parameter is the estimated difference in the mean of North American population size relative to Europe. The 2nd model is on a log-scale so it needs to be back-transformed.
# Calculate effect size and parametric 95%CI, and round
RawES<-round(RawCoef[2])
RawCI<-round(c(RawCoef[2]-1.96*RawCoef[4],RawCoef[2]+1.96*RawCoef[4]))
# Calculate for log-transformed
LogES<-exp(LogCoef[2])
LogES<-round(LogES)
LogCI<-exp(c(LogCoef[2]-1.96*LogCoef[4],LogCoef[2]+1.96*LogCoef[4]))
LogCI<-round(LogCI)
cat("The effect size based on raw data is:",RawES,"\n",
"with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
"The effect size based on transformed data is:","\n",
LogES,"with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n")
## The effect size based on raw data is: 7148
## with parametric 95% CI of -1006 to 15301
## The effect size based on transformed data is:
## 4 with parametric 95% CI of 3 to 6
What does the range of CI values tell you about the hypothesis that introduced populations (North America) are larger than populations in the native range (Europe)?
We sample, with replacement, just as we did with the above bootstrap model. However, this time we sample separately for each region and calculate the difference in the mean each time
Iters<-999 # Number of iterations == the number ot times the mean is resampled
BootESOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
# Make separate EU/NA groups for resampling
EUpops<-GMsubDat$Pop_Size[grep("Europe",GMsubDat$Region)]
NApops<-GMsubDat$Pop_Size[grep("NorthAm",GMsubDat$Region)]
for (i in 1:Iters){
# Resample the data, calculate the mean, and save it in a vector; repeat
BootESOut[i]<-
mean(sample(NApops,length(NApops),replace=T)) -
mean(sample(EUpops,length(EUpops),replace=T))
}
head(BootESOut)
## [1] 4449.814 3912.270 6682.209 3375.779 11315.238 7300.743
Look at the frequency histogram.
qplot(BootESOut)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Compare the results
# Calculate mean & 95% CI
XbootES<-round(mean(BootESOut))
CIlowES<-sort(BootESOut)[floor(length(BootESOut)*0.025)] # Lower 2.5% CI
CIhighES<-sort(BootESOut)[ceiling(length(BootESOut)*0.975)] # Upper 2.5% CI
CIbootES<-round(c(CIlow,CIhigh))
# Compare results
cat("The effect size based on raw data is:",RawES,"\n",
"with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
"The effect size based on transformed data is:",LogES,"\n",
"with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n",
"The effect size of the bootstrap model is:",XbootES,"\n",
"with 95% CI of",CIbootES[1],"to",RawCI[2],"\n")
## The effect size based on raw data is: 7148
## with parametric 95% CI of -1006 to 15301
## The effect size based on transformed data is: 4
## with parametric 95% CI of 3 to 6
## The effect size of the bootstrap model is: 7174
## with 95% CI of 1579 to 15301
What do the x- and y-axis represent, and why does the distribution have this shape?
In the above model, we know our effect size is significant – i.e. North American populations are significantly bigger than native populations. But, there are two problems:
Here’s where a null model comes in handy. We first define and simulate a null model, then compare our observed value(s) to the null model.
Defining the null model is a little more tricky – you have to think in terms of the null hypothesis. In our example, the null hypothesis is that there is no difference in the size of European and North American populations. To generate a null distribution:
Iters<-999 # Number of iterations == the number ot times the mean is resampled
PermOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
SimDat<-GMsubDat # Make a new dataset for resampling
for (i in 1:Iters){
# 1. Reshuffle Region labels (sample, without replacement)
SimDat$Region<-sample(SimDat$Region,replace=F)
# 2. Calculate effect size and save in output vector
PermOut[i]<-
mean(SimDat$Pop_Size[grep("NorthAm",SimDat$Region)]) -
mean(SimDat$Pop_Size[grep("Europe",SimDat$Region)])
}
head(PermOut)
## [1] 4547.470 -4492.450 1739.325 1769.074 -3308.059 -2086.433
# Calculate mean difference of raw data
(RawX<-aggregate(GMsubDat$Pop_Size,list(GMsubDat$Region),mean))
RawDiff<-RawX$x[2]-RawX$x[1]
qplot(PermOut)+theme_bw()+geom_vline(aes(xintercept=RawDiff),colour="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The p-value is the probability of obtaining the observed value by chance. In this case the observed value is the observed difference in the mean RawDiff
, above. The null model simulates ‘observed values obtained by chance’. So the estimated p-value is the proportion of values less than the observed value for a 1-tailed test.
p<-sum(PermOut>RawDiff)/length(PermOut)
paste("The estimated p-value is:",p)
## [1] "The estimated p-value is: 0"
If the observed value is lower than all of the simulated values, then the best we can say is: \[p < \frac{1}{N_{iters}}\]
where Niters
is the total number of iterations.
If we did not have an a priori prediction about the direction of the effect, then it is a 2-tailed test. This is easy to do. We just divide alpha by two, or multiply the p-value by 2.
In our case, our null hypothesis was no difference between European vs. North American populations (rather than testing whether North American populations are bigger).
So p < 0.01 for 1001 iterations or p < 0.001 if there is no overlap in 2001 iterations.
It’s important to think carefully about your bootstrap and null models and the implicit assumptions you make when you run them. For example, we assume that each observation has the same probability of being chosen in each iteration. This wouldn’t be true if there was some sort of inherent correlation among data points. For example, the data we are working with were sampled non-randomly.
qplot(Longitude,Latitude,data=GMdat,alpha=I(0.3))+theme_bw()
We may also have two or more factors in a model that are correlated. For example, multiple traits measured on the same individuals or the same traits measured on the same individuals but at different time points. In this kind of ‘repeated measures’ experiments, should run the bootstrap and null model tests at the level of individual, not observation. That is, reshuffle ID tags for a null model, leaving the rest of the row/column data untouched.
Once you understand the basic coding for bootraps and permutation tests, you can look into available packages to reduce the amount of code you need to write and have them run faster.
The lmPerm
package is the permutation equivalent of lm()
The boot
package has tools for writing bootstrap models
The foreach
and doParallel
packages allow you to use multithreading in your functions and for-loops.