Introduction

This tutorial covers models and simulations in R, with a focus on bootstrap, permutation and simulation models.

1. Bootstrap Basics

A bootstrap model gets its name from the old 19th century expression to “lift yourself up by your own bootstraps”, as a metaphor for accomplishing a task without external forces. In the context of data analysis, the term bootstrap refers to the generation of ‘new’ data from existing data.

A very common use of a bootstrap model is to generate a null distribution from the data, to test whether the observed mean or variance is different than what we would expect by chance.

You may not realize it, but you have already taken the first step towards a bootstrap model in R by using the sample() function.

Let’s look at the data from the Global Garlic Mustard Field Survey. The data are available here. Details on this dataset are covered in the previous tutorial introducting climate models: link:

GMdat<-read.csv("../EcologyTutorials/GGMFS_Teaching_Data.csv",header=T)
GMsubDat<-GMdat[,c("Longitude","Latitude","Region","Pop_Size")] # Subset to columns of interest
GMsubDat<-GMsubDat[complete.cases(GMsubDat),] # Remove missing data

Parametric Assumptions

Let’s calculate the Mean and 95% CI for Pop_Size. In classic frequentist statistics, we might assume a normal distribution, and then calculate the Mean and Standard Error (SE). You may recall from intro stats that the mean of a random sample drawn from a Gaussian distribution will fall within ~1.96x the standard error of the mean, 95% of the time. This is a complicated way of saying that the parametric 95% CI of a random sample is Mean +/- 1.96*SE. You may also recall that the SE of a sample mean is Standard Deviation (SD) divided by the square root of the sample size: \[\frac{SD}{\sqrt{(N-1)}}\].

# Calculate mean, SE and CI from parametric model
X<-mean(GMsubDat$Pop_Size)
SE<-sd(GMsubDat$Pop_Size)/sqrt(nrow(GMsubDat)-1)
CI<-c(X-SE*1.96,X+SE*1.96)

# Round and print values
X<-round(X)
CI<-round(CI)
paste("Raw mean is:",X)

## [1] "Raw mean is: 4601"

paste("Raw 95% CI is:",CI[1],"to",CI[2])

## [1] "Raw 95% CI is: 561 to 8640"

This assumes a normal distribution Let’s take a look at the distribution of Pop_Size, and then add the mean and 95% CI

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

qplot(Pop_Size,data=GMsubDat)+theme_bw()+geom_vline(aes(xintercept=X),colour="red")+geom_vline(aes(xintercept=CI[1]),colour="blue")+geom_vline(aes(xintercept=CI[2]),colour="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data highly right-skewed, ranging from near-zero to almost 1,000,000. They are clearly not normal, violating the assumption of the parametric estimate.

How can we reduce the skew and improve assumptions of normality in the data?

Transform

We can try to apply a transformation to make the data look more like a normal distribution:

# Calculate mean and 95% CI on transformed data
GMsubDat$log_Size<-log(GMsubDat$Pop_Size)
Xlog<-mean(GMsubDat$log_Size)
SElog<-sd(GMsubDat$log_Size)/sqrt(nrow(GMsubDat)-1)
CIlog<-c(Xlog-SElog*1.96,Xlog+SElog*1.96)

# Plot data + mean + 95% CI
qplot(log_Size,data=GMsubDat)+theme_bw()+geom_vline(aes(xintercept=Xlog),colour="red")+geom_vline(aes(xintercept=CIlog[1]),colour="blue")+geom_vline(aes(xintercept=CIlog[2]),colour="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Much better, but still a bit skewed. Let’s back-transform the mean and CI calculated on the log-transformed data, and compare with the original

# Original
cat("Raw mean is:",X,"\n",
    "Raw 95% CI is:",CI[1],"to",CI[2],"\n")

## Raw mean is: 4601 
##  Raw 95% CI is: 561 to 8640

# Back-Transform from log-scale to original measurement scale
Xlog<-exp(Xlog)
CIlog<-exp(CIlog)

# Round 
Xlog<-round(Xlog)
CIlog<-round(CIlog)
cat("Transformed mean is:",Xlog,"\n",
    "Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n")

## Transformed mean is: 124 
##  Transformed 95% CI is: 100 to 153

Compare the raw vs. log-transformed values. That’s more than an order of magnitude difference in the mean, and a much smaller 95% CI!

1. Bootstrap Mean

We can also use a bootstrap model estimate the mean and 95% CI directly from the data. The beauty of this approach is that it is not biased by the distribution of the data. To do this we:

Resample the data, with replacement
Calculate the mean
Repeat N times
Calculate the mean of the N replicated sample means. This is the bootstrap mean.
The 0.025*Nth smallest mean is the lower 95% CI
The 0.975*Nth smallest mean is the upper 95% CI

We can do this fairly easily in R using sample() in a for loop and saving the output in a vector

Iters<-999 # Number of iterations == the number ot times the mean is resampled 
BootOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
for (i in 1:Iters){
  # Resample the data, calculate the mean, and save it in a vector; repeat
  BootOut[i]<-mean(sample(GMsubDat$Pop_Size,nrow(GMsubDat),replace=T)) 
}
head(BootOut)

## [1] 2744.159 6655.851 5189.162 4463.876 5039.405 8465.112

Now calculate the mean and CI from the sorted data (steps 4-6). Note that the values are drawn directly from the data.

Xboot<-mean(BootOut)
CIlow<-sort(BootOut)[floor(length(BootOut)*0.025)] # Lower 2.5% CI
CIhigh<-sort(BootOut)[ceiling(length(BootOut)*0.975)] # Upper 2.5% CI
CIboot<-c(CIlow,CIhigh)

Plot for comparison.

# Plot
qplot(BootOut)+theme_bw()+geom_vline(aes(xintercept=Xboot),colour="red")+geom_vline(aes(xintercept=CIlow),colour="blue")+geom_vline(aes(xintercept=CIhigh),colour="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Why does the 95% CI seem so much wider on this plot than the previous two?

This is a plot of estimated population means from the bootstrap iterations, whereas the previous graphs plot the raw data. Therefore the x-axis scales are quite different.

Compare methods

Let’s look at all of them together:

# Transform bootstrap values
Xboot<-round(Xboot)
CIboot<-round(CIboot)

cat("Raw mean is:",X,"\n",
    "Raw 95% CI is:",CI[1],"to",CI[2],"\n",
    "Transformed mean is:",Xlog,"\n",
    "Transformed 95% CI is:",CIlog[1],"to",CIlog[2],"\n",
    "Bootstrap mean is:",Xboot,"\n",
    "Bootstrap 95% CI is:",CIboot[1],"to",CIboot[2],"\n")

## Raw mean is: 4601 
##  Raw 95% CI is: 561 to 8640 
##  Transformed mean is: 124 
##  Transformed 95% CI is: 100 to 153 
##  Bootstrap mean is: 4563 
##  Bootstrap 95% CI is: 1579 to 9099

Which method is most accurate and why?

2. Bootstrap: Effect Size

We can also use a bootstrap simulation as a non-parametric statistic. In the dataset on Alliaria petiolata (garlic mustard) from the Global Garlic Mustard Field Survey there is a factor called Region, which tells us which continent the population is on. Let’s test if the North American populations are larger than populations in Europe.

Parametric Models

# Raw data
RawMod<-lm(Pop_Size~Region,data=GMsubDat)
(RawCoef<-summary(RawMod)$coefficients)

##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)    453.8455   3168.477 0.1432377 0.88617623
## RegionNorthAm 7147.6219   4159.867 1.7182332 0.08654573

# Log-transformed
LogMod<-lm(log_Size~Region,data=GMsubDat)
(LogCoef<-summary(LogMod)$coefficients)

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)   4.016636  0.1600753 25.092168 1.768767e-83
## RegionNorthAm 1.382294  0.2101615  6.577295 1.538992e-10

Recall: The ‘RegionNorthAm’ parameter is the estimated difference in the mean of North American population size relative to Europe. The 2nd model is on a log-scale so it needs to be back-transformed.

# Calculate effect size and parametric 95%CI, and round
RawES<-round(RawCoef[2])
RawCI<-round(c(RawCoef[2]-1.96*RawCoef[4],RawCoef[2]+1.96*RawCoef[4]))

# Calculate for log-transformed
LogES<-exp(LogCoef[2])
LogES<-round(LogES)
LogCI<-exp(c(LogCoef[2]-1.96*LogCoef[4],LogCoef[2]+1.96*LogCoef[4]))
LogCI<-round(LogCI)

cat("The effect size based on raw data is:",RawES,"\n",
    "with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
    "The effect size based on transformed data is:","\n",
    LogES,"with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n")

## The effect size based on raw data is: 7148 
##  with parametric 95% CI of -1006 to 15301 
##  The effect size based on transformed data is: 
##  4 with parametric 95% CI of 3 to 6

What does the range of CI values tell you about the hypothesis that introduced populations (North America) are larger than populations in the native range (Europe)?

Bootstrap model

We sample, with replacement, just as we did with the above bootstrap model. However, this time we sample separately for each region and calculate the difference in the mean each time

Resample the data from each region, with replacement
Calculate the difference between means
Repeat N times
Calculate the mean of the N replicated samples. This is the bootstrap effect size.
The 0.025*Nth smallest mean is the lower 95% CI
The 0.975*Nth smallest mean is the upper 95% CI

Iters<-999 # Number of iterations == the number ot times the mean is resampled 
BootESOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
# Make separate EU/NA groups for resampling
EUpops<-GMsubDat$Pop_Size[grep("Europe",GMsubDat$Region)]
NApops<-GMsubDat$Pop_Size[grep("NorthAm",GMsubDat$Region)]
for (i in 1:Iters){
  # Resample the data, calculate the mean, and save it in a vector; repeat
  BootESOut[i]<-
    mean(sample(NApops,length(NApops),replace=T)) -
    mean(sample(EUpops,length(EUpops),replace=T))
}
head(BootESOut)

## [1]  4449.814  3912.270  6682.209  3375.779 11315.238  7300.743

Look at the frequency histogram.

qplot(BootESOut)+theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Compare the results

# Calculate mean & 95% CI
XbootES<-round(mean(BootESOut))
CIlowES<-sort(BootESOut)[floor(length(BootESOut)*0.025)] # Lower 2.5% CI
CIhighES<-sort(BootESOut)[ceiling(length(BootESOut)*0.975)] # Upper 2.5% CI
CIbootES<-round(c(CIlow,CIhigh))

# Compare results
cat("The effect size based on raw data is:",RawES,"\n",
    "with parametric 95% CI of",RawCI[1],"to",RawCI[2],"\n",
    "The effect size based on transformed data is:",LogES,"\n",
    "with parametric 95% CI of",LogCI[1],"to",LogCI[2],"\n",
    "The effect size of the bootstrap model is:",XbootES,"\n",
    "with 95% CI of",CIbootES[1],"to",RawCI[2],"\n")

## The effect size based on raw data is: 7148 
##  with parametric 95% CI of -1006 to 15301 
##  The effect size based on transformed data is: 4 
##  with parametric 95% CI of 3 to 6 
##  The effect size of the bootstrap model is: 7174 
##  with 95% CI of 1579 to 15301

What do the x- and y-axis represent, and why does the distribution have this shape?

3. Bootstrap Null Models

In the above model, we know our effect size is significant – i.e. North American populations are significantly bigger than native populations. But, there are two problems:

We don’t know the exact P-value.
The above model wouldn’t work for constrained data. For example, what if we wanted to test the significance of eigenvalues from a principal components model? Eigenvalues are constrained to be greater than zero. Since we calculate 95% CI from resampled data, rather than a parametric model, we can’t just check if our 95% CI overlap zero.

Here’s where a null model comes in handy. We first define and simulate a null model, then compare our observed value(s) to the null model.

Defining the null model is a little more tricky – you have to think in terms of the null hypothesis. In our example, the null hypothesis is that there is no difference in the size of European and North American populations. To generate a null distribution:

Resample the region codes, with replacement
Calculate the difference between means
Repeat N times
Compare the observed effect size to the null distribution
The p-value is proportion of values in the null model greater/less than the observed effect size (or divide by 2 for a 2-tailed test)

Iters<-999 # Number of iterations == the number ot times the mean is resampled 
PermOut<-rep(NA,Iters) # Make an empty vector with # cells = number of iterations
SimDat<-GMsubDat # Make a new dataset for resampling
for (i in 1:Iters){
  # 1. Reshuffle Region labels (sample, without replacement)
  SimDat$Region<-sample(SimDat$Region,replace=F)
  # 2. Calculate effect size and save in output vector
  PermOut[i]<-
    mean(SimDat$Pop_Size[grep("NorthAm",SimDat$Region)]) -
    mean(SimDat$Pop_Size[grep("Europe",SimDat$Region)])
}
head(PermOut)

## [1]  4547.470 -4492.450  1739.325  1769.074 -3308.059 -2086.433

Plot Null

# Calculate mean difference of raw data
(RawX<-aggregate(GMsubDat$Pop_Size,list(GMsubDat$Region),mean))

RawDiff<-RawX$x[2]-RawX$x[1]
qplot(PermOut)+theme_bw()+geom_vline(aes(xintercept=RawDiff),colour="red")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Calculate p-value

The p-value is the probability of obtaining the observed value by chance. In this case the observed value is the observed difference in the mean RawDiff, above. The null model simulates ‘observed values obtained by chance’. So the estimated p-value is the proportion of values less than the observed value for a 1-tailed test.

p<-sum(PermOut>RawDiff)/length(PermOut)
paste("The estimated p-value is:",p)

## [1] "The estimated p-value is: 0"

If the observed value is lower than all of the simulated values, then the best we can say is: \[p < \frac{1}{N_{iters}}\]

where Niters is the total number of iterations.

If we did not have an a priori prediction about the direction of the effect, then it is a 2-tailed test. This is easy to do. We just divide alpha by two, or multiply the p-value by 2.

In our case, our null hypothesis was no difference between European vs. North American populations (rather than testing whether North American populations are bigger).

So p < 0.01 for 1001 iterations or p < 0.001 if there is no overlap in 2001 iterations.

4. Bootstrap Caveats

It’s important to think carefully about your bootstrap and null models and the implicit assumptions you make when you run them. For example, we assume that each observation has the same probability of being chosen in each iteration. This wouldn’t be true if there was some sort of inherent correlation among data points. For example, the data we are working with were sampled non-randomly.

qplot(Longitude,Latitude,data=GMdat,alpha=I(0.3))+theme_bw()

We may also have two or more factors in a model that are correlated. For example, multiple traits measured on the same individuals or the same traits measured on the same individuals but at different time points. In this kind of ‘repeated measures’ experiments, should run the bootstrap and null model tests at the level of individual, not observation. That is, reshuffle ID tags for a null model, leaving the rest of the row/column data untouched.

5. Make it faster

Once you understand the basic coding for bootraps and permutation tests, you can look into available packages to reduce the amount of code you need to write and have them run faster.

The lmPerm package is the permutation equivalent of lm()

The boot package has tools for writing bootstrap models

The foreach and doParallel packages allow you to use multithreading in your functions and for-loops.

Bootstrap, Randomization
& Permutation

Rob Colautti

Introduction

1. Bootstrap Basics

Parametric Assumptions

Transform

1. Bootstrap Mean

Compare methods

2. Bootstrap: Effect Size

Parametric Models

Bootstrap model

3. Bootstrap Null Models

Plot Null

Calculate p-value

4. Bootstrap Caveats

5. Make it faster

Bootstrap, Randomization & Permutation

Rob Colautti

Introduction

1. Bootstrap Basics

Parametric Assumptions

Transform

1. Bootstrap Mean

Compare methods

2. Bootstrap: Effect Size

Parametric Models

Bootstrap model

3. Bootstrap Null Models

Plot Null

Calculate p-value

4. Bootstrap Caveats

5. Make it faster

Bootstrap, Randomization
& Permutation