You should now be familiar with linear models (LM), generalized linear models (GLM) and (generalized) linear mixed models (MM).
You should be familiar with the similarities and differences so that you understand how GLM and MM expand on LM in different ways. In this tutorial we add another tool to the statistical toolbox. Generalized Additive Models (GAM) are an extension of polynomial regression but allowing more complex curves and interactions.
A general function for a GAM is:
\[g(Y) = \beta_0 + s_1(X_1) + ... + s_k(X_k) + \epsilon\]
The terms g() and s() denote functions or transformations of the data. We have seen g() before: it is the link function, just as in generalized linear models, and it transforms the response variable based on its error distribution (e.g. binomial, Poisson, Gaussian). The s() functions are smoothing functions.
The smoothing functions are estimated directly from the data, rather than being fit as slopes like in standard linear regression. This involves some complex math that you won't need to know to apply and interpret GAM results, so we won't examine it here. Just remember that s() is some kind of function. The complexity (wiggliness) of the smoothing function is controlled by a parameter \(\lambda\).
\(\lambda\) is a parameter that is commonly used in machine learning models. For now, just know that a higher \(\lambda\) value produces a smoother curve.
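To build some intuition, here is a minimal sketch (on simulated data, not the dataset used below) of how \(\lambda\) changes the fit. The sp argument of gam() lets us fix the smoothing parameter by hand instead of letting mgcv estimate it; the object names here are just placeholders.
library(mgcv) # loaded properly below
library(ggplot2)
set.seed(123)
SimDat <- data.frame(x = seq(0, 10, length.out = 200))
SimDat$y <- sin(SimDat$x) + rnorm(200, sd = 0.3)
Wiggly <- gam(y ~ s(x), data=SimDat, sp=0.0001) # small lambda = wiggly curve
Smooth <- gam(y ~ s(x), data=SimDat, sp=1000)   # large lambda = nearly a straight line
qplot(x=x, y=y, data=SimDat, alpha=I(0.3)) +
  geom_line(aes(y=predict(Wiggly)), colour="red") +
  geom_line(aes(y=predict(Smooth)), colour="blue")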
We'll be using the mgcv library for Generalized Additive Models (GAM) and Generalized Additive Mixed Models (GAMM):
library(mgcv)
## Warning: package 'mgcv' was built under R version 4.0.5
## Loading required package: nlme
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
The usual setup for plotting, data management, and likelihood ratio tests (for model selection):
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
Import data from the 2013 Science Paper “Rapid adaptation to climate facilitates range expansion of an invasive plant”.
<-read.csv("https://colauttilab.github.io/Data/ColauttiBarrett2013Data.csv",header=T) LythrumDat
This data is from a published paper, and it's available for free from the Dryad repository: https://doi.org/10.5061/dryad.878m3
The Dryad Repository DataDryad.org is a great place to find datasets to practice your analytical skills or to investigate new ideas.
Like most data in the repository, the dataset we will be working with is a bit more complicated than the examples used in the previous self-tutorials. So we'll take some time to understand the data before moving on. Full detail is available in the Science paper Colautti & Barrett (2013), but here we'll focus on a few aspects of the data.
The main question we’ll look at today is:
How does natural selection act on Flowering time?
The paper reports on a reciprocal transplant experiment. This is a classic experimental design in which genotypes are collected from different geographic locations, and reared together in different environments.
In this case, seeds were collected from different maternal ‘families’ along a latitudinal gradient, and then grown at 3 locations along the same gradient. This is a common approach for natural populations. Since two individuals from the same maternal family share somewhere between 1/4 (half-sibs) and 1/2 (full-sibs) of their genes in common, we can use statistical methods to estimate the effects of genetic similarity on phenotype.
In this experiment, there were 4 growing seasons from 2007 to 2010. Traits measured in each year are Julian Day of first flower, height of the vegetative portion of the plant, inflorescence biomass and number of fruits. Genetic families were sampled from six locations and the entire experiment was replicated at three geographic locations along a latitudinal gradient: Timmons (ON), Newmarket (ON) and northern Virginia.
This map from the Supplementary Material of the paper shows the locations of the seed collections (arrows) and common garden locations (squares).
Looking at the structure of the data:
str(LythrumDat)
NOTE: XX in the column names above is a 2-digit number that refers to the year of data collection, from 2007 through 2010.
Fitness in this experiment can be measured as lifetime reproduction over the four growing seasons. For this we can simply sum the InfMassXX columns, but first we should replace NA with 0.
SelDat <- LythrumDat %>%
  mutate(InfMass07 = ifelse(is.na(InfMass07), 0, InfMass07),
         InfMass08 = ifelse(is.na(InfMass08), 0, InfMass08),
         InfMass09 = ifelse(is.na(InfMass09), 0, InfMass09),
         InfMass10 = ifelse(is.na(InfMass10), 0, InfMass10)) %>%
  mutate(Fitness = InfMass07 + InfMass08 + InfMass09 + InfMass10)
Now we can try to visualize the distribution of fitness. We should probably look at each site separately:
qplot(x=Fitness, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The main traits under selection are flowering time and vegetative size. These are measured in each year, but the measurements are fairly correlated across years:
qplot(x=Flwr08, y=Flwr09, data=SelDat)
## Warning: Removed 106 rows containing missing values (geom_point).
To make the analysis easier, we can average the flowering time and size across years.
Before we can do this, we have to adjust the FlwrXX columns, because the values include the year of measurement. You can see this by comparing the values on the x- and y-axes above: the y values are advanced by about 365 days.
To standardize day within year, we’ll take a shortcut, and just adjust each to the earliest value within each year.
SelDat <- SelDat %>%
  mutate(Flwr07 = Flwr07 - min(Flwr07, na.rm=T),
         Flwr08 = Flwr08 - min(Flwr08, na.rm=T),
         Flwr09 = Flwr09 - min(Flwr09, na.rm=T),
         Flwr10 = Flwr10 - min(Flwr10, na.rm=T))
Now we can calculate the average across years:
SelDat <- SelDat %>% rowwise() %>%
  mutate(FlwrAvg = mean(c(Flwr07, Flwr08, Flwr09, Flwr10), na.rm=T),
         FVegAvg = mean(c(FVeg07, FVeg08, FVeg09, FVeg10), na.rm=T))
Note the use of rowwise()! What would happen if we didn't include rowwise() %>%? Try it and then run the code below to see if you are correct.
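If you want to check your answer, here is a minimal sketch of what goes wrong: without rowwise(), mean() is calculated over the entire columns at once, so every row gets the same value (NoRow and FlwrAvgWrong are just placeholder names).
NoRow <- SelDat %>%
  ungroup() %>%  # drop the rowwise grouping first
  mutate(FlwrAvgWrong = mean(c(Flwr07, Flwr08, Flwr09, Flwr10), na.rm=T))
length(unique(NoRow$FlwrAvgWrong)) # a single grand mean repeated for every plant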
And we should also look at these distributions:
qplot(x=FlwrAvg, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 49 rows containing non-finite values (stat_bin).
qplot(x=FVegAvg, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 49 rows containing non-finite values (stat_bin).
We can measure natural selection on a trait by looking at the relationship between a trait and fitness. For example, we might predict that selection on flowering time will be stronger in the north than in the south:
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat)
## Warning: Removed 49 rows containing missing values (geom_point).
Before getting into the GAM, let’s review how we might analyze this using linear models.
In natural selection analyses, we are generally interested in two phenomena: directional selection, which appears as a linear relationship between a trait and fitness, and stabilizing or disruptive selection, which appears as curvature (a quadratic relationship).
Before we get into linear models, we must be careful to think about our missing data. In the case of fitness, we assume NA = 0, meaning that the plant either died or produced no viable seeds. But we also have NA for flowering time and vegetative size. In these cases, the plant may have died early in the experiment before reaching maturity, so we have no way of knowing what the phenotype would have been. These should be removed from our analysis:
SelDat <- SelDat %>%
  filter(!is.na(FlwrAvg) & !is.na(FVegAvg))
We can run a simple linear regression model with separate slopes for each Site:
LRModF <- lm(Fitness ~ FlwrAvg*Site, data=SelDat)
summary(LRModF)
##
## Call:
## lm(formula = Fitness ~ FlwrAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.74 -35.74 -13.10 23.20 261.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.2330 13.2146 2.288 0.02270 *
## FlwrAvg 1.3365 0.6122 2.183 0.02966 *
## Site2_KSR 57.7041 22.1604 2.604 0.00958 **
## Site3_Timmins 74.0692 31.7812 2.331 0.02030 *
## FlwrAvg:Site2_KSR -1.1806 0.7449 -1.585 0.11383
## FlwrAvg:Site3_Timmins -2.5168 0.7621 -3.302 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.18 on 377 degrees of freedom
## Multiple R-squared: 0.1928, Adjusted R-squared: 0.182
## F-statistic: 18 on 5 and 377 DF, p-value: 5.08e-16
We see that the slope in the southern site (FlwrAvg) is positive, but is close to zero at the mid-latitude site (FlwrAvg + FlwrAvg:Site2_KSR), with a strongly negative slope in the northern site (FlwrAvg + FlwrAvg:Site3_Timmins).
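To make that concrete, here is a quick sketch of how the site-specific slopes can be recovered from the fitted coefficients:
coef(LRModF)["FlwrAvg"]                                         # slope in 1_BEF (south)
coef(LRModF)["FlwrAvg"] + coef(LRModF)["FlwrAvg:Site2_KSR"]     # slope in 2_KSR (mid-latitude)
coef(LRModF)["FlwrAvg"] + coef(LRModF)["FlwrAvg:Site3_Timmins"] # slope in 3_Timmins (north)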
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
We can do the same for vegetative size:
LRModV <- lm(Fitness ~ FVegAvg*Site, data=SelDat)
summary(LRModV)
##
## Call:
## lm(formula = Fitness ~ FVegAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.458 -28.697 -9.094 19.136 244.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.4359 16.6691 -3.806 0.000165 ***
## FVegAvg 1.7012 0.2265 7.509 4.35e-13 ***
## Site2_KSR 76.2705 23.8233 3.202 0.001483 **
## Site3_Timmins 112.0152 27.0907 4.135 4.38e-05 ***
## FVegAvg:Site2_KSR -0.7170 0.3014 -2.379 0.017855 *
## FVegAvg:Site3_Timmins -1.9019 0.3233 -5.884 8.86e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.63 on 377 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3069
## F-statistic: 34.83 on 5 and 377 DF, p-value: < 2.2e-16
Again we get a shift from a positive to a negative slope going from the southern garden (1_BEF) to the northern garden (3_Timmins).
qplot(x=FVegAvg, y=Fitness, facets="Site", data=SelDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
But one problem with these separate analyses is that the traits are correlated:
qplot(x=FlwrAvg, y=FVegAvg, data=SelDat)
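The plot suggests a strong positive relationship; as a quick numeric check (a minimal sketch), we can calculate the correlation coefficient:
cor(SelDat$FlwrAvg, SelDat$FVegAvg, use="complete.obs")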
Plants that delay flowering generally get bigger than plants that flower early. This is the problem of collinearity, which in this case we can solve by including both in the model:
MRMod <- lm(Fitness ~ FlwrAvg*FVegAvg*Site, data=SelDat)
summary(MRMod)
##
## Call:
## lm(formula = Fitness ~ FlwrAvg * FVegAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.669 -25.110 -7.486 20.026 205.124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -113.74618 34.04146 -3.341 0.000918 ***
## FlwrAvg 1.44280 1.59714 0.903 0.366920
## FVegAvg 2.85601 0.52872 5.402 1.18e-07 ***
## Site2_KSR -227.70370 64.90667 -3.508 0.000507 ***
## Site3_Timmins 163.07499 96.23817 1.694 0.091011 .
## FlwrAvg:FVegAvg -0.04011 0.02170 -1.848 0.065368 .
## FlwrAvg:Site2_KSR 3.88728 2.13520 1.821 0.069477 .
## FlwrAvg:Site3_Timmins -1.93022 2.13471 -0.904 0.366473
## FVegAvg:Site2_KSR 4.00876 0.84895 4.722 3.32e-06 ***
## FVegAvg:Site3_Timmins -2.20459 1.08088 -2.040 0.042096 *
## FlwrAvg:FVegAvg:Site2_KSR -0.05796 0.02566 -2.259 0.024485 *
## FlwrAvg:FVegAvg:Site3_Timmins 0.03194 0.02584 1.236 0.217135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.29 on 371 degrees of freedom
## Multiple R-squared: 0.4792, Adjusted R-squared: 0.4638
## F-statistic: 31.04 on 11 and 371 DF, p-value: < 2.2e-16
Compare these coefficients with those above. We can also test the fit of the model using model selection. Since there are two simpler models (flowering time only, or vegetative size only), we should compare the full model against both.
lrtest(LRModF,MRMod)
## Likelihood ratio test
##
## Model 1: Fitness ~ FlwrAvg * Site
## Model 2: Fitness ~ FlwrAvg * FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -2090.1
## 2 13 -2006.2 6 167.89 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(LRModV,MRMod)
## Likelihood ratio test
##
## Model 1: Fitness ~ FVegAvg * Site
## Model 2: Fitness ~ FlwrAvg * FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -2058.4
## 2 13 -2006.2 6 104.46 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also do an LRT to figure out whether we need the three-way interaction and all of the two-way interactions.
Test the 3-way and each 2-way term in MRMod using LRT
Take the time to do this. If you have trouble, then you don’t know model selection as well as you think you do! Take time to review the Model Selection Tutorial if you need to.
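If you get stuck, here is a sketch of one way to start, using update() to drop just the three-way interaction (MRMod2 is a placeholder name); the two-way terms can be tested the same way:
MRMod2 <- update(MRMod, . ~ . - FlwrAvg:FVegAvg:Site) # drop only the 3-way term
lrtest(MRMod2, MRMod)                                 # is the 3-way interaction needed?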
Now let’s take a look at our model assumptions:
qplot(x=predict(MRMod),y=residuals(MRMod))
There is clearly a big problem with the assumption of homogeneous variance. Looking back at the graphs of fitness vs. FlwrAvg and FVegAvg, we can see that there may be some non-linearity in the relationships.
In the Linear Models Tutorial we saw how polynomial regression was a special case of a linear model.
In polynomial regression, we add different powers of the same predictor:
\[ Y \sim \beta_0 + \beta_1X_1 + \beta_2X_{1}^{2} + \beta_3X_{1}^{3} + ...\]
Technically, we should zero-centre each predictor (\(X_k\)) before raising it to a power, otherwise we will have collinear predictors.
We also want separate parameters for each common garden location, which means three slopes (one per site) for each \(\beta\) term in the model above.
Both of these steps are simplified by using the poly() function. Just make sure you understand that this function is a shortcut for writing out the longer polynomial.
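For example, a hand-rolled quadratic with a zero-centred predictor might look like the sketch below (FlwrC and ManualQuad are placeholder names). Note that poly() uses orthogonal polynomials by default, so the individual coefficients will differ, although the fitted values should be equivalent.
SelDat <- SelDat %>%
  mutate(FlwrC = FlwrAvg - mean(FlwrAvg, na.rm=T)) # zero-centre the predictor
ManualQuad <- lm(Fitness ~ (FlwrC + I(FlwrC^2))*Site, data=SelDat)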
For a selection analysis we would typically be interested in only the quadratic term, which tests for stabilizing or disruptive selection. However, we can also test for more complicated fitness relationships using model selection.
QuadModF <- lm(Fitness ~ poly(FlwrAvg, degree=2)*Site, data=SelDat)
CubModF <- lm(Fitness ~ poly(FlwrAvg, degree=3)*Site, data=SelDat)
lrtest(QuadModF, CubModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## Model 2: Fitness ~ poly(FlwrAvg, degree = 3) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2067.8
## 2 13 -2067.4 3 0.7794 0.8544
Here we can see that the cubic function does not fit better. Does the quadratic model fit better than the simpler linear model?
lrtest(QuadModF,LRModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## Model 2: Fitness ~ FlwrAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2067.8
## 2 7 -2090.1 -3 44.511 1.175e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, it’s a highly significant difference, so the quadratic model fits the data much better. This shows that there is curvature to the relationship. We can look at the summary and plot to better understand how:
summary(QuadModF)
##
## Call:
## lm(formula = Fitness ~ poly(FlwrAvg, degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108.693 -31.652 -8.849 20.444 233.147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.51 34.05 -0.338 0.73548
## poly(FlwrAvg, degree = 2)1 -2054.10 894.45 -2.297 0.02220
## poly(FlwrAvg, degree = 2)2 -1349.23 454.08 -2.971 0.00316
## Site2_KSR 54.91 35.34 1.554 0.12112
## Site3_Timmins 67.17 43.19 1.555 0.12076
## poly(FlwrAvg, degree = 2)1:Site2_KSR 2363.41 908.84 2.600 0.00968
## poly(FlwrAvg, degree = 2)2:Site2_KSR -263.41 525.35 -0.501 0.61638
## poly(FlwrAvg, degree = 2)1:Site3_Timmins 1634.58 1010.19 1.618 0.10649
## poly(FlwrAvg, degree = 2)2:Site3_Timmins 1330.42 503.01 2.645 0.00852
##
## (Intercept)
## poly(FlwrAvg, degree = 2)1 *
## poly(FlwrAvg, degree = 2)2 **
## Site2_KSR
## Site3_Timmins
## poly(FlwrAvg, degree = 2)1:Site2_KSR **
## poly(FlwrAvg, degree = 2)2:Site2_KSR
## poly(FlwrAvg, degree = 2)1:Site3_Timmins
## poly(FlwrAvg, degree = 2)2:Site3_Timmins **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.16 on 374 degrees of freedom
## Multiple R-squared: 0.2813, Adjusted R-squared: 0.266
## F-statistic: 18.3 on 8 and 374 DF, p-value: < 2.2e-16
From high-school calculus, remember that the sign of the squared term determines whether the curve is convex or concave. A positive quadratic coefficient represents disruptive selection, while a negative value represents stabilizing selection. In this case, the squared terms go from -1349 to almost zero, representing a shift from stabilizing selection to linear or slightly disruptive selection.
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat) +
geom_smooth(method="lm", formula=y ~ poly(x,degree=3))
We can do the same for vegetative size
QuadModV <- lm(Fitness ~ poly(FVegAvg, degree=2)*Site, data=SelDat)
lrtest(QuadModV,LRModV)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FVegAvg, degree = 2) * Site
## Model 2: Fitness ~ FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2038.6
## 2 7 -2058.4 -3 39.597 1.297e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(QuadModV)
##
## Call:
## lm(formula = Fitness ~ poly(FVegAvg, degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.248 -26.836 -5.335 18.067 224.208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.835 5.431 13.595 < 2e-16
## poly(FVegAvg, degree = 2)1 772.888 181.365 4.262 2.57e-05
## poly(FVegAvg, degree = 2)2 14.072 156.038 0.090 0.928191
## Site2_KSR 15.503 6.899 2.247 0.025206
## Site3_Timmins -44.154 7.800 -5.661 3.01e-08
## poly(FVegAvg, degree = 2)1:Site2_KSR -133.538 202.679 -0.659 0.510388
## poly(FVegAvg, degree = 2)2:Site2_KSR -619.472 183.940 -3.368 0.000836
## poly(FVegAvg, degree = 2)1:Site3_Timmins -755.281 219.104 -3.447 0.000631
## poly(FVegAvg, degree = 2)2:Site3_Timmins -173.994 191.251 -0.910 0.363532
##
## (Intercept) ***
## poly(FVegAvg, degree = 2)1 ***
## poly(FVegAvg, degree = 2)2
## Site2_KSR *
## Site3_Timmins ***
## poly(FVegAvg, degree = 2)1:Site2_KSR
## poly(FVegAvg, degree = 2)2:Site2_KSR ***
## poly(FVegAvg, degree = 2)1:Site3_Timmins ***
## poly(FVegAvg, degree = 2)2:Site3_Timmins
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.18 on 374 degrees of freedom
## Multiple R-squared: 0.3831, Adjusted R-squared: 0.37
## F-statistic: 29.04 on 8 and 374 DF, p-value: < 2.2e-16
Again, we find the quadratic is significant but this time selection is generally disruptive in the south but stabilizing in the north and intermediate at the mid-latitude site.
qplot(x=FVegAvg, y=Fitness, facets="Site", data=SelDat) +
geom_smooth(method="lm", formula=y ~ poly(x,degree=3))
Again, these traits are not independent, so we should include them both in the same model:
QuadMod <- lm(Fitness ~ poly(FlwrAvg, degree=2)*Site + poly(FVegAvg, degree=2)*Site, data=SelDat)
Does it fit better than each one individually?
lrtest(QuadMod,QuadModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg, degree = 2) *
## Site
## Model 2: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 16 -2002.5
## 2 10 -2067.8 -6 130.62 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(QuadMod,QuadModV)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg, degree = 2) *
## Site
## Model 2: Fitness ~ poly(FVegAvg, degree = 2) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 16 -2002.5
## 2 10 -2038.6 -6 72.104 1.512e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, highly significant. Now compare these coefficients to the individual models:
summary(QuadMod)
##
## Call:
## lm(formula = Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg,
## degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.08 -23.96 -9.12 18.28 198.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.647 30.836 0.118 0.905921
## poly(FlwrAvg, degree = 2)1 -1853.721 831.155 -2.230 0.026330
## poly(FlwrAvg, degree = 2)2 -697.995 425.608 -1.640 0.101861
## Site2_KSR 47.030 32.182 1.461 0.144764
## Site3_Timmins 57.072 38.388 1.487 0.137945
## poly(FVegAvg, degree = 2)1 1039.247 193.520 5.370 1.4e-07
## poly(FVegAvg, degree = 2)2 189.334 157.384 1.203 0.229746
## poly(FlwrAvg, degree = 2)1:Site2_KSR 587.104 861.118 0.682 0.495799
## poly(FlwrAvg, degree = 2)2:Site2_KSR -592.250 506.524 -1.169 0.243061
## poly(FlwrAvg, degree = 2)1:Site3_Timmins 1281.159 927.975 1.381 0.168241
## poly(FlwrAvg, degree = 2)2:Site3_Timmins 744.959 466.852 1.596 0.111412
## Site2_KSR:poly(FVegAvg, degree = 2)1 171.472 234.100 0.732 0.464345
## Site3_Timmins:poly(FVegAvg, degree = 2)1 -896.203 227.966 -3.931 0.000101
## Site2_KSR:poly(FVegAvg, degree = 2)2 -428.963 192.086 -2.233 0.026138
## Site3_Timmins:poly(FVegAvg, degree = 2)2 -333.839 189.678 -1.760 0.079233
##
## (Intercept)
## poly(FlwrAvg, degree = 2)1 *
## poly(FlwrAvg, degree = 2)2
## Site2_KSR
## Site3_Timmins
## poly(FVegAvg, degree = 2)1 ***
## poly(FVegAvg, degree = 2)2
## poly(FlwrAvg, degree = 2)1:Site2_KSR
## poly(FlwrAvg, degree = 2)2:Site2_KSR
## poly(FlwrAvg, degree = 2)1:Site3_Timmins
## poly(FlwrAvg, degree = 2)2:Site3_Timmins
## Site2_KSR:poly(FVegAvg, degree = 2)1
## Site3_Timmins:poly(FVegAvg, degree = 2)1 ***
## Site2_KSR:poly(FVegAvg, degree = 2)2 *
## Site3_Timmins:poly(FVegAvg, degree = 2)2 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.04 on 368 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4696
## F-statistic: 25.15 on 14 and 368 DF, p-value: < 2.2e-16
Our first problem is that there are a lot of coefficients. What a mess! And we are only including two phenotypic traits; imagine if we tried to do this for 3 or more.
A second problem is that our residuals do not give us confidence in this model.
qplot(x=predict(QuadMod),y=residuals(QuadMod))
We can try to improve the fit by log-transforming the response variable, or by using glm() with family=poisson. This will improve the residuals, but there is still room for improvement.
Try this: redo the QuadMod model as a generalized linear model and look at the residuals.
There is one little trick to get the glm to work with a Poisson family, which is to round your response variable to a whole number. Once you do this, you should have no trouble writing the code for the glm. If you do, it means you don't understand glm as well as you think you do! In that case, review the Generalized Linear Models Tutorial.
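As a rough sketch of one possible answer (rFit and QuadModGLM are placeholder names):
SelDat$rFit <- round(SelDat$Fitness) # Poisson models need whole-number responses
QuadModGLM <- glm(rFit ~ poly(FlwrAvg, degree=2)*Site + poly(FVegAvg, degree=2)*Site,
                  family=poisson, data=SelDat)
qplot(x=predict(QuadModGLM), y=residuals(QuadModGLM))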
This brings us to generalized additive models. With a simple line of code we can fit a complex function for each of our phenotypic traits.
We can run simple GAM analyses using the gam() function from the mgcv library.
Since gam is just an extension of lm and glm, the formula is very similar. The main difference is that we wrap a smoothing function s() around any predictors where we want to estimate a nonlinear curve:
GAMod <- gam(Fitness ~ s(FlwrAvg) + s(FVegAvg), data=SelDat)
summary(GAMod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg) + s(FVegAvg)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.403 2.565 24.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg) 4.702 5.793 21.05 <2e-16 ***
## s(FVegAvg) 3.932 4.908 23.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.369 Deviance explained = 38.4%
## GCV = 2585.6 Scale est. = 2520.6 n = 383
The main difference is that we have a single, smoothed term for each phenotypic trait, rather than linear and quadratic (or higher-order) coefficients. We also have an adjusted R-squared value showing the fit of the model.
But that model just fits a single curve for each trait. What if we want different intercepts for each site?
GAMod2 <- gam(Fitness ~ s(FlwrAvg) + s(FVegAvg) + Site, data=SelDat)
summary(GAMod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg) + s(FVegAvg) + Site
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.779 7.675 5.704 2.41e-08 ***
## Site2_KSR 45.687 10.179 4.488 9.61e-06 ***
## Site3_Timmins 9.612 14.330 0.671 0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg) 7.119 8.177 5.856 5.69e-07 ***
## s(FVegAvg) 4.110 5.124 23.826 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.428 Deviance explained = 44.8%
## GCV = 2375.2 Scale est. = 2287 n = 383
The above model accounts for differences in the mean across sites. But what if we want separate smoothing functions for each site? The obvious solution might be
GAMod3 <- gam(Fitness ~ s(FlwrAvg)*Site + s(FVegAvg)*Site, data=SelDat)
## Warning in sp[i] <- ind: number of items to replace is not a multiple of
## replacement length
## Warning in sp[i] <- ind: number of items to replace is not a multiple of
## replacement length
## Error in model.frame.default(formula = Fitness ~ Site + s(FlwrAvg):Site + : invalid type (list) for variable 's(FlwrAvg)'
But this generates an error. If we use the help to look at the smoothing function ?s, we can see the solution:
GAMod3 <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site), data=SelDat)
## Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, : Can't find by variable
summary(GAMod3)
## Error in summary(GAMod3): object 'GAMod3' not found
Still another error… Looking at the by parameter in ?s, we see a clue: “a numeric or factor variable of the same dimension as each covariate”. Let’s troubleshoot.
We know the dimension is the same because they are all vectors in the same data frame, with the same number of observations, so the error must be in the type of variable:
is.factor(SelDat$Site)
## [1] FALSE
is.numeric(SelDat$Site)
## [1] FALSE
We can solve this by using as.factor() inside the smoothing function (e.g. s(FlwrAvg, by=as.factor(Site))), but we know Site should be a factor anyway, so we can just change its type in the original dataset:
SelDat$Site <- as.factor(SelDat$Site)
GAMod3 <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site), data=SelDat)
summary(GAMod3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg, by = Site) + s(FVegAvg, by = Site)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.416 8.589 6.685 8.88e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg):Site1_BEF 4.791 5.276 3.600 0.003400 **
## s(FlwrAvg):Site2_KSR 5.761 6.505 15.461 < 2e-16 ***
## s(FlwrAvg):Site3_Timmins 1.000 1.000 12.457 0.000471 ***
## s(FVegAvg):Site1_BEF 1.382 1.670 42.535 < 2e-16 ***
## s(FVegAvg):Site2_KSR 8.560 8.878 15.086 < 2e-16 ***
## s(FVegAvg):Site3_Timmins 1.918 2.439 0.986 0.465720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.543 Deviance explained = 57.1%
## GCV = 1952.6 Scale est. = 1828.2 n = 383
Look carefully at the smooth terms and compare them to the quadratic model (QuadMod) above. Unlike the lm and glm fits, we don't have an overall intercept and slope with deviations from them for each site. Instead, we have a single smoothing term estimated separately for each site.
We can use model selection to compare the GAM with the quadratic model:
AIC(QuadMod) - AIC(GAMod3)
## [1] 47.81079
Remember that a smaller AIC means a better fit, with \(\Delta AIC < 2\) generally considered equivalent models. In this case the difference is quite large, indicating that the GAM is a much better model. Because gam is ‘generalized’ like glm, we can also try fitting a log-linear (Poisson) model, using a link function for the response variable.
BUT remember that the Poisson distribution is for integers, so we need to round our biomass to whole numbers or else we will get a very annoying set of warning messages (one for each row of data). This loses a small amount of information (tenths of a gram), so just to be safe we'll make a new column:
SelDat$iFit <- as.integer(SelDat$Fitness)
GAMod4 <- gam(iFit ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
              family=poisson, data=SelDat)
AIC(GAMod3) - AIC(GAMod4)
## [1] -5745.954
In this case, we get a negative \(\Delta AIC\), indicating a worse fit for GAMod4.
Looking back at our Fitness data, we can see that it is mostly the zeros in the northern site that make it look like it might be log-normal.
A normal distribution with lots of zeros is called zero-inflated, and it is a difficult problem for biological models. One option is to separate the data into two analyses: for example, recoding fitness as 1 wherever it is not zero and running a logistic regression, and then excluding observations with zero fitness for a second analysis of the remaining values.
In the case of our Lythrum dataset, we could do both: a logistic regression to predict survival to reproduction and a gaussian model to predict reproductive fitness.
In our case, we are just comparing the fit of different models, rather than testing whether a particular term is significant, so the zero-inflated data are less of a problem.
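For reference, a rough sketch of that two-part approach might look like this (Surv, SurvMod and RepMod are placeholder names):
SelDat$Surv <- ifelse(SelDat$Fitness > 0, 1, 0)    # did the plant reproduce at all?
SurvMod <- gam(Surv ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
               family=binomial, data=SelDat)       # logistic GAM for survival to reproduction
RepMod <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
              data=filter(SelDat, Fitness > 0))    # gaussian GAM for plants that reproduced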
We can also extend GAM to include random effects, giving us Generalized Additive Mixed Models (GAMM). This is an extension of the Generalized Linear Mixed Models (GLMM) that we covered in the Mixed Models Tutorial.
The gamm() function from the mgcv library can be used for this.
In our dataset, seed family would be a good random effect, since families were sampled at random from each population. However, we may have trouble fitting separate slopes for each family at each site. Remember that our maternal family column must be coded as a factor:
SelDat$Mat <- as.factor(SelDat$Mat)
Before doing this, we should check our sample size:
Nobs <- SelDat %>% group_by(Mat, Site) %>% summarize(N=n())
## `summarise()` has grouped output by 'Mat'. You can override using the `.groups` argument.
qplot(Nobs$N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unfortunately, there is not good replication at the family level (just 1 or 2 surviving plants per family per site).
Instead, we can just estimate the variance among family means/intercepts in each garden site:
GAMM <- gamm(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
             random=list(Mat=~1|Site), data=SelDat)
In the Mixed Models Tutorial we looked at a type of dataset called Repeated Measures because the same individuals were measured at different time points.
In that tutorial we also looked at a method for fitting random intercepts and slopes. This turns out to be a good fit to the data, but what if the responses over time were nonlinear? In that case we could use a GAMM. The model setup is a bit more complicated though.
First, load the data:
<-read.csv("https://colauttilab.github.io/Data/ImmuneData.csv",header=T) ImmuneDat
Remember that in this dataset we have paired male and female siblings with the same IndID, so we should recode to make the IDs unique:
ImmuneDat$IndID <- paste0(ImmuneDat$IndID, ImmuneDat$Sex)
We’ll also encode the ID variables and Sex as factors to avoid errors:
ImmuneDat <- ImmuneDat %>%
  mutate(FamID = as.factor(FamID),
         IndID = as.factor(IndID),
         Sex = as.factor(Sex))
There are a few different random-effects models we could use here. First, let's imagine that we are geneticists, so we are interested in these specific genetic families. For example, we might want to see how different families respond over time and then follow up with genetic screening to identify candidate genes. In this case, we should treat FamID as a fixed effect, with individuals as random effects because they are randomly chosen from each family.
So now we want to estimate separate curves for each family:
GAMi <- gam(IgG ~ FamID + s(Time, by=FamID), data=ImmuneDat)
AND we want to account for random variation among individuals. For this, we use the bs="fs" argument in the smoothing function. This is a bit different from the other kinds of mixed models we have seen.
One drawback of gam is that it is computationally much more intensive than glm, making it inefficient for large datasets. However, the bam() function from the same mgcv package is similar to gam and gamm but optimized for larger datasets.
Note that this can take a long time to run!
BAMi <- bam(IgG ~ FamID + s(Time, by=FamID) +
              s(Time, IndID, bs="fs"), data=ImmuneDat)