You should now be familiar with linear models (LM), generalized linear models (GLM) and (generalized) linear mixed models (MM).
You should be familiar with the similarities and differences so that you understand how GLM and MM expand on LM in different ways. In this tutorial we add another tool to the statistical toolbox. Generalized Additive Models (GAM) are an extension of polynomial regression but allowing more complex curves and interactions.
A general function for a GAM is:
\[g(Y) = \beta_0 + s_1(X_1) + ... + s_k(X_k) + \epsilon\]
The terms g() and s() denote functions or transformations of the data. We have seen g() before: it is the link function, just as in generalized linear models, and it transforms the response variable based on its error distribution (e.g. binomial, Poisson, Gaussian). The s() functions are smoothing functions.
The smoothing functions are estimated directly from the data, rather than being fit as slopes like in standard linear regression. This involves some complex math that you won't need to know to apply and interpret GAM results, so we won't examine it here. Just remember that s() is some kind of function. The complexity (wiggliness) of the smoothing function is controlled by a parameter \(\lambda\).
\(\lambda\) is a parameter that is commonly used in machine learning models. For now, just know that a higher \(\lambda\) value produces a smoother curve.
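To build some intuition, here is a minimal sketch (on simulated data, not the dataset used below) of how \(\lambda\) changes the fit. The sp argument of gam() lets us fix the smoothing parameter by hand instead of letting mgcv estimate it; the object names here are just placeholders.
library(mgcv) # loaded properly below
library(ggplot2)
set.seed(123)
SimDat <- data.frame(x = seq(0, 10, length.out = 200))
SimDat$y <- sin(SimDat$x) + rnorm(200, sd = 0.3)
Wiggly <- gam(y ~ s(x), data=SimDat, sp=0.0001) # small lambda = wiggly curve
Smooth <- gam(y ~ s(x), data=SimDat, sp=1000)   # large lambda = nearly a straight line
qplot(x=x, y=y, data=SimDat, alpha=I(0.3)) +
  geom_line(aes(y=predict(Wiggly)), colour="red") +
  geom_line(aes(y=predict(Smooth)), colour="blue")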
We'll be using the mgcv library for Generalized Additive Models (GAM) and Generalized Additive Mixed Models (GAMM):
library(mgcv)
## Warning: package 'mgcv' was built under R version 4.0.5
## Loading required package: nlme
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
The usual setup for plotting, data management, and likelihood ratio tests (for model selection):
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
Import data from the 2013 Science Paper “Rapid adaptation to climate facilitates range expansion of an invasive plant”.
<-read.csv("https://colauttilab.github.io/Data/ColauttiBarrett2013Data.csv",header=T) LythrumDat
This data is from a published paper, and it's available for free from the Dryad repository: https://doi.org/10.5061/dryad.878m3
The Dryad Repository DataDryad.org is a great place to find datasets to practice your analytical skills or to investigate new ideas.
Like most data in the repository, the dataset we will be working with is a bit more complicated than the examples used in the previous self-tutorials. So we'll take some time to understand the data before moving on. Full detail is available in the Science paper Colautti & Barrett (2013), but here we'll focus on a few aspects of the data.
The main question we’ll look at today is:
How does natural selection act on Flowering time?
The paper reports on a reciprocal transplant experiment. This is a classic experimental design in which genotypes are collected from different geographic locations, and reared together in different environments.
In this case, seeds were collected from different maternal ‘families’ along a latitudinal gradient, and then grown at 3 locations along the same gradient. This is a common approach for natural populations. Since two individuals from the same maternal family share somewhere between 1/4 (half-sibs) and 1/2 (full-sibs) of their genes in common, we can use statistical methods to estimate the effects of genetic similarity on phenotype.
In this experiment, there were 4 growing seasons from 2007 to 2010. Traits measured in each year are Julian Day of first flower, height of the vegetative portion of the plant, inflorescence biomass and number of fruits. Genetic families were sampled from six locations and the entire experiment was replicated at three geographic locations along a latitudinal gradient: Timmons (ON), Newmarket (ON) and northern Virginia.
This map from the Supplementary Material of the paper shows the locations of the seed collections (arrows) and common garden locations (squares).
Looking at the structure of the data:
str(LythrumDat)
NOTE: XX in the column names above is a 2-digit number that refers to the year of data collection, from 2007 through 2010.
Fitness in this experiment can be measured as lifetime reproduction over the four growing seasons. For this we can simply sum the InfMassXX columns, but first we should replace NA with 0.
SelDat <- LythrumDat %>%
  mutate(InfMass07 = ifelse(is.na(InfMass07), 0, InfMass07),
         InfMass08 = ifelse(is.na(InfMass08), 0, InfMass08),
         InfMass09 = ifelse(is.na(InfMass09), 0, InfMass09),
         InfMass10 = ifelse(is.na(InfMass10), 0, InfMass10)) %>%
  mutate(Fitness = InfMass07 + InfMass08 + InfMass09 + InfMass10)
Now we can try to visualize the distribution of fitness. We should probably look at each site separately:
qplot(x=Fitness, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The main traits under selection are flowering time and vegetative size. These are measured in each year, but the measurements are fairly correlated across years:
qplot(x=Flwr08, y=Flwr09, data=SelDat)
## Warning: Removed 106 rows containing missing values (geom_point).
To make the analysis easier, we can average the flowering time and size across years.
Before we can do this, we have to adjust the FlwrXX columns, because the values include the year of measurement. You can see this by comparing the values on the x- and y-axes above: the y values are advanced by about 365 days.
To standardize day within year, we’ll take a shortcut, and just adjust each to the earliest value within each year.
SelDat <- SelDat %>%
  mutate(Flwr07 = Flwr07 - min(Flwr07, na.rm=T),
         Flwr08 = Flwr08 - min(Flwr08, na.rm=T),
         Flwr09 = Flwr09 - min(Flwr09, na.rm=T),
         Flwr10 = Flwr10 - min(Flwr10, na.rm=T))
Now we can calculate the average across years:
SelDat <- SelDat %>% rowwise() %>%
  mutate(FlwrAvg = mean(c(Flwr07, Flwr08, Flwr09, Flwr10), na.rm=T),
         FVegAvg = mean(c(FVeg07, FVeg08, FVeg09, FVeg10), na.rm=T))
Note the use of rowwise()! What would happen if we didn't include rowwise() %>%? Try it and then run the code below to see if you are correct.
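If you want to check your answer, here is a minimal sketch of what goes wrong: without rowwise(), mean() is calculated over the entire columns at once, so every row gets the same value (NoRow and FlwrAvgWrong are just placeholder names).
NoRow <- SelDat %>%
  ungroup() %>%  # drop the rowwise grouping first
  mutate(FlwrAvgWrong = mean(c(Flwr07, Flwr08, Flwr09, Flwr10), na.rm=T))
length(unique(NoRow$FlwrAvgWrong)) # a single grand mean repeated for every plant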
And we should also look at these distributions:
qplot(x=FlwrAvg, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 49 rows containing non-finite values (stat_bin).
qplot(x=FVegAvg, facets="Site", data=SelDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 49 rows containing non-finite values (stat_bin).
We can measure natural selection on a trait by looking at the relationship between a trait and fitness. For example, we might predict that selection on flowering time will be stronger in the north than in the south:
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat)
## Warning: Removed 49 rows containing missing values (geom_point).
Before getting into the GAM, let’s review how we might analyze this using linear models.
In natural selection analyses, we are generally interested in two phenomena: directional selection, which appears as a linear relationship between a trait and fitness, and stabilizing or disruptive selection, which appears as curvature (a quadratic relationship).
Before we get into linear models, we must be careful to think about our missing data. In the case of fitness, we assume NA = 0, meaning that the plant either died or produced no viable seeds. But we also have NA for flowering time and vegetative size. In these cases, the plant may have died early in the experiment before reaching maturity, so we have no way of knowing what the phenotype would have been. These should be removed from our analysis:
SelDat <- SelDat %>%
  filter(!is.na(FlwrAvg) & !is.na(FVegAvg))
We can run a simple linear regression model with separate slopes for each Site:
LRModF <- lm(Fitness ~ FlwrAvg*Site, data=SelDat)
summary(LRModF)
##
## Call:
## lm(formula = Fitness ~ FlwrAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97.74 -35.74 -13.10 23.20 261.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.2330 13.2146 2.288 0.02270 *
## FlwrAvg 1.3365 0.6122 2.183 0.02966 *
## Site2_KSR 57.7041 22.1604 2.604 0.00958 **
## Site3_Timmins 74.0692 31.7812 2.331 0.02030 *
## FlwrAvg:Site2_KSR -1.1806 0.7449 -1.585 0.11383
## FlwrAvg:Site3_Timmins -2.5168 0.7621 -3.302 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.18 on 377 degrees of freedom
## Multiple R-squared: 0.1928, Adjusted R-squared: 0.182
## F-statistic: 18 on 5 and 377 DF, p-value: 5.08e-16
We see that the slope in the southern site (FlwrAvg) is positive, but is close to zero at the mid-latitude site (FlwrAvg + FlwrAvg:Site2_KSR), with a strongly negative slope in the northern site (FlwrAvg + FlwrAvg:Site3_Timmins).
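To make that concrete, here is a quick sketch of how the site-specific slopes can be recovered from the fitted coefficients:
coef(LRModF)["FlwrAvg"]                                         # slope in 1_BEF (south)
coef(LRModF)["FlwrAvg"] + coef(LRModF)["FlwrAvg:Site2_KSR"]     # slope in 2_KSR (mid-latitude)
coef(LRModF)["FlwrAvg"] + coef(LRModF)["FlwrAvg:Site3_Timmins"] # slope in 3_Timmins (north)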
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
We can do the same for vegetative size:
LRModV <- lm(Fitness ~ FVegAvg*Site, data=SelDat)
summary(LRModV)
##
## Call:
## lm(formula = Fitness ~ FVegAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.458 -28.697 -9.094 19.136 244.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.4359 16.6691 -3.806 0.000165 ***
## FVegAvg 1.7012 0.2265 7.509 4.35e-13 ***
## Site2_KSR 76.2705 23.8233 3.202 0.001483 **
## Site3_Timmins 112.0152 27.0907 4.135 4.38e-05 ***
## FVegAvg:Site2_KSR -0.7170 0.3014 -2.379 0.017855 *
## FVegAvg:Site3_Timmins -1.9019 0.3233 -5.884 8.86e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.63 on 377 degrees of freedom
## Multiple R-squared: 0.316, Adjusted R-squared: 0.3069
## F-statistic: 34.83 on 5 and 377 DF, p-value: < 2.2e-16
Again we get a shift from a positive to a negative slope going from the southern garden (1_BEF) to the northern garden (3_Timmins).
qplot(x=FVegAvg, y=Fitness, facets="Site", data=SelDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
But one problem with these separate analyses is that the traits are correlated:
qplot(x=FlwrAvg, y=FVegAvg, data=SelDat)
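The plot suggests a strong positive relationship; as a quick numeric check (a minimal sketch), we can calculate the correlation coefficient:
cor(SelDat$FlwrAvg, SelDat$FVegAvg, use="complete.obs")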
Plants that delay flowering generally get bigger than plants that flower early. This is the problem of collinearity, which in this case we can solve by including both in the model:
MRMod <- lm(Fitness ~ FlwrAvg*FVegAvg*Site, data=SelDat)
summary(MRMod)
##
## Call:
## lm(formula = Fitness ~ FlwrAvg * FVegAvg * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126.669 -25.110 -7.486 20.026 205.124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -113.74618 34.04146 -3.341 0.000918 ***
## FlwrAvg 1.44280 1.59714 0.903 0.366920
## FVegAvg 2.85601 0.52872 5.402 1.18e-07 ***
## Site2_KSR -227.70370 64.90667 -3.508 0.000507 ***
## Site3_Timmins 163.07499 96.23817 1.694 0.091011 .
## FlwrAvg:FVegAvg -0.04011 0.02170 -1.848 0.065368 .
## FlwrAvg:Site2_KSR 3.88728 2.13520 1.821 0.069477 .
## FlwrAvg:Site3_Timmins -1.93022 2.13471 -0.904 0.366473
## FVegAvg:Site2_KSR 4.00876 0.84895 4.722 3.32e-06 ***
## FVegAvg:Site3_Timmins -2.20459 1.08088 -2.040 0.042096 *
## FlwrAvg:FVegAvg:Site2_KSR -0.05796 0.02566 -2.259 0.024485 *
## FlwrAvg:FVegAvg:Site3_Timmins 0.03194 0.02584 1.236 0.217135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.29 on 371 degrees of freedom
## Multiple R-squared: 0.4792, Adjusted R-squared: 0.4638
## F-statistic: 31.04 on 11 and 371 DF, p-value: < 2.2e-16
Compare these coefficients with those above. We can also test the fit of the model using model selection. Since there are two simpler models (flowering time only, or vegetative size only), we should compare the full model against both.
lrtest(LRModF,MRMod)
## Likelihood ratio test
##
## Model 1: Fitness ~ FlwrAvg * Site
## Model 2: Fitness ~ FlwrAvg * FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -2090.1
## 2 13 -2006.2 6 167.89 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(LRModV,MRMod)
## Likelihood ratio test
##
## Model 1: Fitness ~ FVegAvg * Site
## Model 2: Fitness ~ FlwrAvg * FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -2058.4
## 2 13 -2006.2 6 104.46 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also do an LRT to figure out whether we need the three-way interaction and all of the two-way interactions.
Test the 3-way and each 2-way term in MRMod using LRT
Take the time to do this. If you have trouble, then you don’t know model selection as well as you think you do! Take time to review the Model Selection Tutorial if you need to.
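If you get stuck, here is a sketch of one way to start, using update() to drop just the three-way interaction (MRMod2 is a placeholder name); the two-way terms can be tested the same way:
MRMod2 <- update(MRMod, . ~ . - FlwrAvg:FVegAvg:Site) # drop only the 3-way term
lrtest(MRMod2, MRMod)                                 # is the 3-way interaction needed?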
Now let’s take a look at our model assumptions:
qplot(x=predict(MRMod),y=residuals(MRMod))
There is clearly a big problem with the assumption of homogeneous variance. Looking back at the graphs of fitness vs. FlwrAvg and FVegAvg, we can see that there may be some non-linearity in the relationships.
In the Linear Models Tutorial we saw how polynomial regression was a special case of a linear model.
In polynomial regression, we add different powers of the same predictor:
\[ Y \sim \beta_0 + \beta_1X_1 + \beta_2X_{1}^{2} + \beta_3X_{1}^{3} + ...\]
Technically, we should zero-centre each predictor (\(X_k\)) before raising it to a power, otherwise we will have collinear predictors.
We also want separate parameters for each common garden location, which means three slopes (one per site) for each \(\beta\) term in the model above.
Both of these steps are simplified by using the poly() function. Just make sure you understand that this function is a shortcut for writing out the longer polynomial.
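For example, a hand-rolled quadratic with a zero-centred predictor might look like the sketch below (FlwrC and ManualQuad are placeholder names). Note that poly() uses orthogonal polynomials by default, so the individual coefficients will differ, although the fitted values should be equivalent.
SelDat <- SelDat %>%
  mutate(FlwrC = FlwrAvg - mean(FlwrAvg, na.rm=T)) # zero-centre the predictor
ManualQuad <- lm(Fitness ~ (FlwrC + I(FlwrC^2))*Site, data=SelDat)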
For a selection analysis we would typically be interested in only the quadratic term, which tests for stabilizing or disruptive selection. However, we can also test for more complicated fitness relationships using model selection.
QuadModF <- lm(Fitness ~ poly(FlwrAvg, degree=2)*Site, data=SelDat)
CubModF <- lm(Fitness ~ poly(FlwrAvg, degree=3)*Site, data=SelDat)
lrtest(QuadModF, CubModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## Model 2: Fitness ~ poly(FlwrAvg, degree = 3) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2067.8
## 2 13 -2067.4 3 0.7794 0.8544
Here we can see that the cubic function does not fit better. Does the quadratic model fit better than the simpler linear model?
lrtest(QuadModF,LRModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## Model 2: Fitness ~ FlwrAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2067.8
## 2 7 -2090.1 -3 44.511 1.175e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, it’s a highly significant difference, so the quadratic model fits the data much better. This shows that there is curvature to the relationship. We can look at the summary and plot to better understand how:
summary(QuadModF)
##
## Call:
## lm(formula = Fitness ~ poly(FlwrAvg, degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108.693 -31.652 -8.849 20.444 233.147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.51 34.05 -0.338 0.73548
## poly(FlwrAvg, degree = 2)1 -2054.10 894.45 -2.297 0.02220
## poly(FlwrAvg, degree = 2)2 -1349.23 454.08 -2.971 0.00316
## Site2_KSR 54.91 35.34 1.554 0.12112
## Site3_Timmins 67.17 43.19 1.555 0.12076
## poly(FlwrAvg, degree = 2)1:Site2_KSR 2363.41 908.84 2.600 0.00968
## poly(FlwrAvg, degree = 2)2:Site2_KSR -263.41 525.35 -0.501 0.61638
## poly(FlwrAvg, degree = 2)1:Site3_Timmins 1634.58 1010.19 1.618 0.10649
## poly(FlwrAvg, degree = 2)2:Site3_Timmins 1330.42 503.01 2.645 0.00852
##
## (Intercept)
## poly(FlwrAvg, degree = 2)1 *
## poly(FlwrAvg, degree = 2)2 **
## Site2_KSR
## Site3_Timmins
## poly(FlwrAvg, degree = 2)1:Site2_KSR **
## poly(FlwrAvg, degree = 2)2:Site2_KSR
## poly(FlwrAvg, degree = 2)1:Site3_Timmins
## poly(FlwrAvg, degree = 2)2:Site3_Timmins **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.16 on 374 degrees of freedom
## Multiple R-squared: 0.2813, Adjusted R-squared: 0.266
## F-statistic: 18.3 on 8 and 374 DF, p-value: < 2.2e-16
From high-school calculus, remember that the sign of the squared term determines whether the curve is convex or concave. A positive quadratic coefficient represents disruptive selection, while a negative value represents stabilizing selection. In this case, the squared terms go from -1349 to almost zero, representing a shift from stabilizing selection to linear or slightly disruptive selection.
qplot(x=FlwrAvg, y=Fitness, facets="Site", data=SelDat) +
geom_smooth(method="lm", formula=y ~ poly(x,degree=3))
We can do the same for vegetative size
QuadModV <- lm(Fitness ~ poly(FVegAvg, degree=2)*Site, data=SelDat)
lrtest(QuadModV,LRModV)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FVegAvg, degree = 2) * Site
## Model 2: Fitness ~ FVegAvg * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -2038.6
## 2 7 -2058.4 -3 39.597 1.297e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(QuadModV)
##
## Call:
## lm(formula = Fitness ~ poly(FVegAvg, degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.248 -26.836 -5.335 18.067 224.208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.835 5.431 13.595 < 2e-16
## poly(FVegAvg, degree = 2)1 772.888 181.365 4.262 2.57e-05
## poly(FVegAvg, degree = 2)2 14.072 156.038 0.090 0.928191
## Site2_KSR 15.503 6.899 2.247 0.025206
## Site3_Timmins -44.154 7.800 -5.661 3.01e-08
## poly(FVegAvg, degree = 2)1:Site2_KSR -133.538 202.679 -0.659 0.510388
## poly(FVegAvg, degree = 2)2:Site2_KSR -619.472 183.940 -3.368 0.000836
## poly(FVegAvg, degree = 2)1:Site3_Timmins -755.281 219.104 -3.447 0.000631
## poly(FVegAvg, degree = 2)2:Site3_Timmins -173.994 191.251 -0.910 0.363532
##
## (Intercept) ***
## poly(FVegAvg, degree = 2)1 ***
## poly(FVegAvg, degree = 2)2
## Site2_KSR *
## Site3_Timmins ***
## poly(FVegAvg, degree = 2)1:Site2_KSR
## poly(FVegAvg, degree = 2)2:Site2_KSR ***
## poly(FVegAvg, degree = 2)1:Site3_Timmins ***
## poly(FVegAvg, degree = 2)2:Site3_Timmins
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.18 on 374 degrees of freedom
## Multiple R-squared: 0.3831, Adjusted R-squared: 0.37
## F-statistic: 29.04 on 8 and 374 DF, p-value: < 2.2e-16
Again, we find the quadratic is significant but this time selection is generally disruptive in the south but stabilizing in the north and intermediate at the mid-latitude site.
qplot(x=FVegAvg, y=Fitness, facets="Site", data=SelDat) +
geom_smooth(method="lm", formula=y ~ poly(x,degree=3))
Again, these traits are not independent, so we should include them both in the same model:
QuadMod <- lm(Fitness ~ poly(FlwrAvg, degree=2)*Site + poly(FVegAvg, degree=2)*Site, data=SelDat)
Does it fit better than each one individually?
lrtest(QuadMod,QuadModF)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg, degree = 2) *
## Site
## Model 2: Fitness ~ poly(FlwrAvg, degree = 2) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 16 -2002.5
## 2 10 -2067.8 -6 130.62 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(QuadMod,QuadModV)
## Likelihood ratio test
##
## Model 1: Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg, degree = 2) *
## Site
## Model 2: Fitness ~ poly(FVegAvg, degree = 2) * Site
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 16 -2002.5
## 2 10 -2038.6 -6 72.104 1.512e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, highly significant. Now compare these coefficients to the individual models:
summary(QuadMod)
##
## Call:
## lm(formula = Fitness ~ poly(FlwrAvg, degree = 2) * Site + poly(FVegAvg,
## degree = 2) * Site, data = SelDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.08 -23.96 -9.12 18.28 198.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.647 30.836 0.118 0.905921
## poly(FlwrAvg, degree = 2)1 -1853.721 831.155 -2.230 0.026330
## poly(FlwrAvg, degree = 2)2 -697.995 425.608 -1.640 0.101861
## Site2_KSR 47.030 32.182 1.461 0.144764
## Site3_Timmins 57.072 38.388 1.487 0.137945
## poly(FVegAvg, degree = 2)1 1039.247 193.520 5.370 1.4e-07
## poly(FVegAvg, degree = 2)2 189.334 157.384 1.203 0.229746
## poly(FlwrAvg, degree = 2)1:Site2_KSR 587.104 861.118 0.682 0.495799
## poly(FlwrAvg, degree = 2)2:Site2_KSR -592.250 506.524 -1.169 0.243061
## poly(FlwrAvg, degree = 2)1:Site3_Timmins 1281.159 927.975 1.381 0.168241
## poly(FlwrAvg, degree = 2)2:Site3_Timmins 744.959 466.852 1.596 0.111412
## Site2_KSR:poly(FVegAvg, degree = 2)1 171.472 234.100 0.732 0.464345
## Site3_Timmins:poly(FVegAvg, degree = 2)1 -896.203 227.966 -3.931 0.000101
## Site2_KSR:poly(FVegAvg, degree = 2)2 -428.963 192.086 -2.233 0.026138
## Site3_Timmins:poly(FVegAvg, degree = 2)2 -333.839 189.678 -1.760 0.079233
##
## (Intercept)
## poly(FlwrAvg, degree = 2)1 *
## poly(FlwrAvg, degree = 2)2
## Site2_KSR
## Site3_Timmins
## poly(FVegAvg, degree = 2)1 ***
## poly(FVegAvg, degree = 2)2
## poly(FlwrAvg, degree = 2)1:Site2_KSR
## poly(FlwrAvg, degree = 2)2:Site2_KSR
## poly(FlwrAvg, degree = 2)1:Site3_Timmins
## poly(FlwrAvg, degree = 2)2:Site3_Timmins
## Site2_KSR:poly(FVegAvg, degree = 2)1
## Site3_Timmins:poly(FVegAvg, degree = 2)1 ***
## Site2_KSR:poly(FVegAvg, degree = 2)2 *
## Site3_Timmins:poly(FVegAvg, degree = 2)2 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46.04 on 368 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4696
## F-statistic: 25.15 on 14 and 368 DF, p-value: < 2.2e-16
Our first problem is that there are a lot of coefficients. What a mess! And we are only including two phenotypic traits; imagine if we tried to do this for 3 or more.
A second problem is that our residuals do not give us confidence in this model.
qplot(x=predict(QuadMod),y=residuals(QuadMod))
We can try to improve the fit by log-transforming the response variable, or by using glm() with family=poisson. This will improve the residuals, but there is still room for improvement.
Try this: redo the QuadMod model as a generalized linear model and look at the residuals.
There is one little trick to get the glm to work with a Poisson family, which is to round your response variable to a whole number. Once you do this, you should have no trouble writing the code for the glm. If you do, it means you don't understand glm as well as you think you do! In that case, review the Generalized Linear Models Tutorial.
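As a rough sketch of one possible answer (rFit and QuadModGLM are placeholder names):
SelDat$rFit <- round(SelDat$Fitness) # Poisson models need whole-number responses
QuadModGLM <- glm(rFit ~ poly(FlwrAvg, degree=2)*Site + poly(FVegAvg, degree=2)*Site,
                  family=poisson, data=SelDat)
qplot(x=predict(QuadModGLM), y=residuals(QuadModGLM))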
This brings us to generalized additive models. With a simple line of code we can fit a complex function for each of our phenotypic traits.
We can run simple GAM analyses using the gam() function from the mgcv library.
Since gam is just an extension of lm and glm, the formula is very similar. The main difference is that we wrap a smoothing function s() around any predictors where we want to estimate a nonlinear curve:
GAMod <- gam(Fitness ~ s(FlwrAvg) + s(FVegAvg), data=SelDat)
summary(GAMod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg) + s(FVegAvg)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.403 2.565 24.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg) 4.702 5.793 21.05 <2e-16 ***
## s(FVegAvg) 3.932 4.908 23.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.369 Deviance explained = 38.4%
## GCV = 2585.6 Scale est. = 2520.6 n = 383
The main difference is that we have a single, smoothed term for each phenotypic trait, rather than linear and quadratic (or higher-order) coefficients. We also have an adjusted R-squared value showing the fit of the model.
But that model just fits a single curve for each trait. What if we want different intercepts for each site?
GAMod2 <- gam(Fitness ~ s(FlwrAvg) + s(FVegAvg) + Site, data=SelDat)
summary(GAMod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg) + s(FVegAvg) + Site
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.779 7.675 5.704 2.41e-08 ***
## Site2_KSR 45.687 10.179 4.488 9.61e-06 ***
## Site3_Timmins 9.612 14.330 0.671 0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg) 7.119 8.177 5.856 5.69e-07 ***
## s(FVegAvg) 4.110 5.124 23.826 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.428 Deviance explained = 44.8%
## GCV = 2375.2 Scale est. = 2287 n = 383
The above model accounts for differences in the mean across sites. But what if we want separate smoothing functions for each site? The obvious solution might be
GAMod3 <- gam(Fitness ~ s(FlwrAvg)*Site + s(FVegAvg)*Site, data=SelDat)
## Warning in sp[i] <- ind: number of items to replace is not a multiple of
## replacement length
## Warning in sp[i] <- ind: number of items to replace is not a multiple of
## replacement length
## Error in model.frame.default(formula = Fitness ~ Site + s(FlwrAvg):Site + : invalid type (list) for variable 's(FlwrAvg)'
But this generates an error. If we use the help to look at the smoothing function ?s, we can see the solution:
GAMod3 <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site), data=SelDat)
## Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, : Can't find by variable
summary(GAMod3)
## Error in summary(GAMod3): object 'GAMod3' not found
Still another error… Looking at the by parameter in ?s, we see a clue: “a numeric or factor variable of the same dimension as each covariate”. Let’s troubleshoot.
We know the dimension is the same because they are all vectors in the same data frame, with the same number of observations, so the error must be in the type of variable:
is.factor(SelDat$Site)
## [1] FALSE
is.numeric(SelDat$Site)
## [1] FALSE
We can solve this by using as.factor() inside the smoothing function (e.g. s(FlwrAvg, by=as.factor(Site))), but we know Site should be a factor anyway, so we can just change its type in the original dataset:
SelDat$Site <- as.factor(SelDat$Site)
GAMod3 <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site), data=SelDat)
summary(GAMod3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Fitness ~ s(FlwrAvg, by = Site) + s(FVegAvg, by = Site)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.416 8.589 6.685 8.88e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FlwrAvg):Site1_BEF 4.791 5.276 3.600 0.003400 **
## s(FlwrAvg):Site2_KSR 5.761 6.505 15.461 < 2e-16 ***
## s(FlwrAvg):Site3_Timmins 1.000 1.000 12.457 0.000471 ***
## s(FVegAvg):Site1_BEF 1.382 1.670 42.535 < 2e-16 ***
## s(FVegAvg):Site2_KSR 8.560 8.878 15.086 < 2e-16 ***
## s(FVegAvg):Site3_Timmins 1.918 2.439 0.986 0.465720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.543 Deviance explained = 57.1%
## GCV = 1952.6 Scale est. = 1828.2 n = 383
Look carefully at the smooth terms and compare them to the quadratic model (QuadMod) above. Unlike the lm and glm fits, we don't have an overall intercept and slope with deviations from them for each site. Instead, we have a single smoothing term estimated separately for each site.
We can use model selection to compare the GAM with the quadratic model:
AIC(QuadMod) - AIC(GAMod3)
## [1] 47.81079
Remember that a smaller AIC means a better fit, with \(\Delta AIC < 2\) generally considered equivalent models. In this case the difference is quite large, indicating that the GAM is a much better model. Because gam is ‘generalized’ like glm, we can also try fitting a log-linear (Poisson) model, using a link function for the response variable.
BUT remember that the Poisson distribution is for integers, so we need to round our biomass to whole numbers or else we will get a very annoying set of warning messages (one for each row of data). This loses a small amount of information (tenths of a gram), so just to be safe we'll make a new column:
SelDat$iFit <- as.integer(SelDat$Fitness)
GAMod4 <- gam(iFit ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
              family=poisson, data=SelDat)
AIC(GAMod3) - AIC(GAMod4)
## [1] -5745.954
In this case, we get a negative \(\Delta AIC\), indicating a worse fit for GAMod4.
Looking back at our Fitness data, we can see that it is mostly the zeros in the northern site that make it look like it might be log-normal.
A normal distribution with lots of zeros is called zero-inflated, and it is a difficult problem for biological models. One option is to separate the data into two analyses: for example, recoding fitness as 1 wherever it is not zero and running a logistic regression, and then excluding observations with zero fitness for a second analysis of the remaining values.
In the case of our Lythrum dataset, we could do both: a logistic regression to predict survival to reproduction and a gaussian model to predict reproductive fitness.
In our case, we are just comparing the fit of different models, rather than testing whether a particular term is significant, so the zero-inflated data are less of a problem.
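For reference, a rough sketch of that two-part approach might look like this (Surv, SurvMod and RepMod are placeholder names):
SelDat$Surv <- ifelse(SelDat$Fitness > 0, 1, 0)    # did the plant reproduce at all?
SurvMod <- gam(Surv ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
               family=binomial, data=SelDat)       # logistic GAM for survival to reproduction
RepMod <- gam(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
              data=filter(SelDat, Fitness > 0))    # gaussian GAM for plants that reproduced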
We can also extend GAM to include random effects, giving us Generalized Additive Mixed Models (GAMM). This is an extension of the Generalized Linear Mixed Models (GLMM) that we covered in the Mixed Models Tutorial.
The gamm() function from the mgcv library can be used for this.
In our dataset, seed family would be a good random effect, since families were sampled at random from each population. However, we may have trouble fitting separate slopes for each family at each site. Remember that our maternal family column must be coded as a factor:
SelDat$Mat <- as.factor(SelDat$Mat)
Before doing this, we should check our sample size:
Nobs <- SelDat %>% group_by(Mat, Site) %>% summarize(N=n())
## `summarise()` has grouped output by 'Mat'. You can override using the `.groups` argument.
qplot(Nobs$N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unfortunately, there is not good replication at the family level (just 1 or 2 surviving plants per family per site).
Instead, we can just estimate the variance among family means/intercepts in each garden site:
GAMM <- gamm(Fitness ~ s(FlwrAvg, by=Site) + s(FVegAvg, by=Site),
             random=list(Mat=~1|Site), data=SelDat)
In the Mixed Models Tutorial we looked at a type of dataset called Repeated Measures because the same individuals were measured at different time points.
In that tutorial we also looked at a method for fitting random intercepts and slopes. This turns out to be a good fit to the data, but what if the responses over time were nonlinear? In that case we could use a GAMM. The model setup is a bit more complicated though.
First, load the data:
<-read.csv("https://colauttilab.github.io/Data/ImmuneData.csv",header=T) ImmuneDat
Remember that in this dataset we have paired male and female siblings with the same IndID, so we should recode to make the IDs unique:
ImmuneDat$IndID <- paste0(ImmuneDat$IndID, ImmuneDat$Sex)
We’ll also encode the ID variables and Sex as factors to avoid errors:
ImmuneDat <- ImmuneDat %>%
  mutate(FamID = as.factor(FamID),
         IndID = as.factor(IndID),
         Sex = as.factor(Sex))
There are a few different random-effects models we could use here. First, let's imagine that we are geneticists, so we are interested in these specific genetic families. For example, we might want to see how different families respond over time and then follow up with genetic screening to identify candidate genes. In this case, we should treat FamID as a fixed effect, with individuals as random effects because they are randomly chosen from each family.
So now we want to estimate separate curves for each family:
GAMi <- gam(IgG ~ FamID + s(Time, by=FamID), data=ImmuneDat)
AND we want to account for random variation among individuals. For this, we use the bs="fs" argument in the smoothing function. This is a bit different from the other kinds of mixed models we have seen.
One drawback of gam is that it is computationally much more intensive than glm, making it inefficient for large datasets. However, the bam() function from the same mgcv package is similar to gam and gamm but optimized for larger datasets.
Note that this can take a long time to run!
BAMi <- bam(IgG ~ FamID + s(Time, by=FamID) +
              s(Time, IndID, bs="fs"), data=ImmuneDat)