1. Introduction

This tutorial is based on an analysis of the Global Garlic Mustard Field Survey. Details on the methods are available in our open-access article in Neobiota.

To follow along, you should download the teaching dataset into your current working directory. Note that this data is intended for teaching purposes only and is not the same as the official, final, published dataset.

2. Field Data

Briefly, the data are plot-level measurements from sample sites across North America and Europe. Let’s take a look at the data. Each row is a site, with plot-level data in columns, so there are a lot of columns. Let’s look at a subset.

GMdat<-read.csv("./GGMFS_Teaching_Data.csv",header=T)
names(GMdat)[c(1:54,353:373)]
##  [1] "RespondentID"     "StartDate"        "EndDate"         
##  [4] "IP_Address"       "Pop_Code"         "Region"          
##  [7] "Family_Name"      "Given_Name"       "email"           
## [10] "Collection_Date"  "Latitude"         "Longitude"       
## [13] "Altitude"         "Pop_Size"         "Pct_Canopy_Cover"
## [16] "EarthTrek_name"   "Population_Type"  "Type_Other"      
## [19] "Understory"       "Roadside"         "Hand_Removal"    
## [22] "Herbicide"        "Mowing"           "Biocontrol"      
## [25] "None_Known"       "Sampled_Before"   "NPlots"          
## [28] "P1Ros"            "P1Adult"          "P1Ros0"          
## [31] "P1Ros20"          "P1Ros40"          "P1Ros60"         
## [34] "P1Ros80"          "P1Adult0"         "P1Adult20"       
## [37] "P1Adult40"        "P1Adult60"        "P1Adult80"       
## [40] "P1Fruits0"        "P1Fruits20"       "P1Fruits40"      
## [43] "P1Fruits60"       "P1Fruits80"       "P1Leaves0"       
## [46] "P1Leaves20"       "P1Leaves40"       "P1Leaves60"      
## [49] "P1Leaves80"       "P1Dmg0"           "P1Dmg20"         
## [52] "P1Dmg40"          "P1Dmg60"          "P1Dmg80"         
## [55] "P8AdultFung"      "P8Fung0"          "P8Fung20"        
## [58] "P8Fung40"         "P8Fung60"         "P8Fung80"        
## [61] "P9RosFung"        "P9AdultFung"      "P9Fung0"         
## [64] "P9Fung20"         "P9Fung40"         "P9Fung60"        
## [67] "P9Fung80"         "P10RosFung"       "P10AdultFung"    
## [70] "P10Fung0"         "P10Fung20"        "P10Fung40"       
## [73] "P10Fung60"        "P10Fung80"        "CoverPic_Mean"

The first 26 columns define the site, while most of the remaining columns contain plot (P1-P10) and transect data (0,20,40,60).

3. Climate Data

In addition to these measurements, it would be nice to have some information about the climate at these locations. There are lots of ways to do this, but let’s take a look at the Climond dataset, which is an iteration of the WorldClim dataset. This is a huge dataset with many ‘layers’ representing different bioclimatic variables, which are explained on the WorldClim website and Climond paper. We’ll just focus on just a few bioclimatic variables available here.

The files are in an odd .adf format, which is a human-readable XML file. Unzip the data to a folder called Climond in your working folder, then take a look at the files

Files <- list.files("./Climond",pattern='w001001.adf',recursive=TRUE,full.names=TRUE)
Files
## [1] "./Climond/bio01/w001001.adf" "./Climond/bio05/w001001.adf"
## [3] "./Climond/bio06/w001001.adf" "./Climond/bio12/w001001.adf"
## [5] "./Climond/bio13/w001001.adf" "./Climond/bio14/w001001.adf"
library(raster)
## Loading required package: sp
Predictors <- stack(Files)
# Names based on Bioclim variables used (bio1,bio5,bio6,bio12,bio13,bio14)
# see: http://www.worldclim.org/bioclim
names(Predictors)<-c("MeanTemp","MaxTemp","MinTemp","TotalPrec","MaxPrec","MinPrec")

Raster Plots

The raster package makes it easy to plot

# Temperature
plot(Predictors$MeanTemp,xlim=c(-125,45),ylim=c(30,60),1,main="",xaxt="n",yaxt="n",legend=F)
points(cbind(GMdat$Longitude,GMdat$Latitude), col="#003F91FF", cex=1.5, pch=21)
points(cbind(GMdat$Longitude,GMdat$Latitude), col="#98B9F266", cex=1, pch=16)

#Moisture
plot(Predictors$TotalPrec,xlim=c(-125,45),ylim=c(30,60),1,main="",xaxt="n",yaxt="n",legend=F)
points(cbind(GMdat$Longitude,GMdat$Latitude), col="#003F91FF", cex=1.5, pch=21)
points(cbind(GMdat$Longitude,GMdat$Latitude), col="#98B9F266", cex=1, pch=16)

ggplot

Unfortunately, plotting in ggplot2() is not straight-forward yet, so we won’t cover it here. If you have a lot of time to spend figuring it out, you can start here: http://ggplot2.tidyverse.org/reference/geom_tile.html

4. Extract Data

Given that each point lies on top of a ‘tile’ or ‘pixel’ representing the climate data, we can extract the bioclim data for the locations of our sample sites. This is easily done with the ‘extract’ function.

BioDat <- extract(Predictors, GMdat[,c("Longitude","Latitude")])
head(BioDat)
##       MeanTemp MaxTemp   MinTemp TotalPrec MaxPrec MinPrec
## [1,]  9.846876 26.3355  -7.07419 1044.9996 24.3118 16.6692
## [2,]  9.846876 26.3355  -7.07419 1044.9996 24.3118 16.6692
## [3,]  6.770368 25.9194 -11.37100  805.9997 19.0638 11.5200
## [4,] 12.819939 30.5968  -4.69355 1111.0001 24.8095 16.8297
## [5,] 12.819939 30.5968  -4.69355 1111.0001 24.8095 16.8297
## [6,]  9.846117 25.5968  -4.78387  944.0002 25.0346 10.1243
class(BioDat)
## [1] "matrix"

This returns a matrix with one row for each location in GMdat, and one column for each bioclimatic variable in Predictors.

Take a look at the column averages.

colMeans(data.frame(BioDat))
##  MeanTemp   MaxTemp   MinTemp TotalPrec   MaxPrec   MinPrec 
##        NA        NA        NA        NA        NA        NA

Missing Data

We are missing some data. Let’s figure out which points, and plot them on a map.

Missing<-GMdat[is.na(rowSums(BioDat)),c("Longitude","Latitude")]
Missing
##     Longitude Latitude
## 54  -123.3061 48.49167
## 55   -73.2974 41.13036
## 59  -123.2451 49.25312
## 281 -123.2452 49.25307

Not too many missing data

xArea<-c(min(Missing$Longitude)-1,max(Missing$Longitude)+1)
yArea<-c(min(Missing$Latitude)-1,max(Missing$Latitude)+1)
plot(Predictors$MeanTemp,xlim=xArea,ylim=yArea,1,main="",xaxt="n",yaxt="n",legend=F)
points(cbind(Missing$Longitude,Missing$Latitude), col="#003F91FF", cex=1.5, pch=21)
points(cbind(Missing$Longitude,Missing$Latitude), col="#98B9F266", cex=1, pch=16)

Doesn’t look too bad. No points out in the ocean or anything. These samples were collected near large water bodies that don’t have climate stations. Since the bioclim variables are in a grid format, the specific location is on a ‘water’ tile. We can solve this problem by averaging the data of nearby tiles with the ‘buffer’ parameter. The scale of the buffer value is usually in meters, but you may have to play around with to find the ideal number

NoDat<-!complete.cases(BioDat)
# 1km radius
test1<-extract(Predictors,GMdat[NoDat,c("Longitude","Latitude")],buffer=1000,fun=mean)
## Error in apply(x, 2, fun2): dim(X) must have a positive length
# 10km radius
test2<-extract(Predictors,GMdat[NoDat,c("Longitude","Latitude")],buffer=10000,fun=mean)
## Error in apply(x, 2, fun2): dim(X) must have a positive length
# 100km radius
test3<-extract(Predictors,GMdat[NoDat,c("Longitude","Latitude")],buffer=100000,fun=mean)

Now add the interpolated values to the dataset

BioDat[NoDat,]<-test3

We can now merge these into a new dataset. For simplicity, we’ll just keep a few of the GMdat columns.

PlotData<-cbind(GMdat[c("Longitude","Latitude","Pop_Size")],data.frame(BioDat))
head(PlotData)
##   Longitude Latitude Pop_Size  MeanTemp MaxTemp   MinTemp TotalPrec
## 1 -80.52499 37.37326      300  9.846876 26.3355  -7.07419 1044.9996
## 2 -80.53314 37.36705      350  9.846876 26.3355  -7.07419 1044.9996
## 3 -79.53098 44.03024      400  6.770368 25.9194 -11.37100  805.9997
## 4 -76.28833 39.43528      140 12.819939 30.5968  -4.69355 1111.0001
## 5 -76.29972 39.43330     1000 12.819939 30.5968  -4.69355 1111.0001
## 6  15.86667 46.41667       10  9.846117 25.5968  -4.78387  944.0002
##   MaxPrec MinPrec
## 1 24.3118 16.6692
## 2 24.3118 16.6692
## 3 19.0638 11.5200
## 4 24.8095 16.8297
## 5 24.8095 16.8297
## 6 25.0346 10.1243

5. PCA

We may want to use Principal Components Analysis to reduce the number of climate variables and avoid problems of autocorrelation among them.

BioDatFrame<-as.data.frame(BioDat)
bioPCA<-princomp(BioDatFrame,cor=T)
loadings(bioPCA)
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## MeanTemp   0.266  0.728  0.147                0.614
## MaxTemp    0.380  0.226  0.673               -0.590
## MinTemp   -0.151  0.610 -0.578               -0.520
## TotalPrec  0.566 -0.137 -0.253        -0.768       
## MaxPrec    0.455        -0.236 -0.772  0.365       
## MinPrec    0.484 -0.142 -0.270  0.632  0.522       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
## Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000

Climate data is not as intercorrelated as it seems – even 5 of 6 PCs don’t account for 90% of (co)variation in the correlation matrix. Still, we might want to save these since they are by definition uncorrelated with each other.

PCs<-data.frame(bioPCA$scores)
FullData<-cbind(PlotData,data.frame(bioPCA$scores))
names(FullData)
##  [1] "Longitude" "Latitude"  "Pop_Size"  "MeanTemp"  "MaxTemp"  
##  [6] "MinTemp"   "TotalPrec" "MaxPrec"   "MinPrec"   "Comp.1"   
## [11] "Comp.2"    "Comp.3"    "Comp.4"    "Comp.5"    "Comp.6"

Use regular expressions to improve the names

gsub("Comp\\.","PC",names(FullData)) # Check to make sure we are changing the correct names
##  [1] "Longitude" "Latitude"  "Pop_Size"  "MeanTemp"  "MaxTemp"  
##  [6] "MinTemp"   "TotalPrec" "MaxPrec"   "MinPrec"   "PC1"      
## [11] "PC2"       "PC3"       "PC4"       "PC5"       "PC6"
# And combine back to original dataset
names(FullData)<-gsub("Comp\\.","PC",names(FullData))

Visualize the climate ‘space’

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
qplot(PC1,PC2,data=FullData,size=I(2),alpha=I(0.3))+theme_bw()

qplot(MeanTemp,TotalPrec,data=FullData,size=I(2),alpha=I(0.3))+theme_bw()

6. Spatial Models

The problem with analyzing these spatial data is that they are not independent points – points that are closer together in space are more likely to share similar climates. A simple way to visualize this is with a map, see the mapping tutorial for more mapping examples.

MyMap <- ggplot(aes(Longitude,Latitude),data=FullData) + borders("world", colour=NA, fill="black") + theme_bw() + scale_colour_gradient2() + coord_fixed(xlim=c(-100,50), ylim=c(30,75))
MyMap + geom_point(aes(colour=PC1),alpha=0.3,data=FullData) + ggtitle("PC1")

MyMap + geom_point(aes(colour=PC2),alpha=0.3,data=FullData) + ggtitle("PC2")

Dealing with models involving spatial data is tricky. One way is to use randomization, bootstraps, or simulations to generate a null model (see future lecture). Another is to incorporate spatial scale into the statistical analysis. For information on how to do this, see the paper and R code by Dormann et al.