We can think of each linear model as a specific hypothesis defined by the variables input into the model. By convention, we reject the null hypothesis in favour of the alternative hypothesis when the probability of observing our data under the null is small (e.g. p < 0.05). In the Advanced Linear Models Tutorial we looked at some of the problems with this approach. In particular, the problem of false positives (aka Type I errors) is amplified in the scientific literature. A Type I error occurs when the null hypothesis is falsely rejected; for example, we might find a significant effect of a new drug when in reality the drug has no effect. These errors are amplified because most studies that fail to reject a null hypothesis never get published.
But there is another way to think about statistical hypotheses. Instead of having a simple true/false or pass/fail or significant/non-significant binary outcome, we can take a different approach using model selection. In this case we are testing among many different alternative hypotheses instead of just rejecting (or failing to reject) a single hypothesis.
Although these represent different philosophies, in practice the difference usually comes down to the type of data. The ‘classic’ approach of null hypothesis testing is common in experiments with smaller sample sizes and a few, tightly controlled variables. Model selection is more common in large datasets with many observations and many potential predictor variables.
You will need to refer back to the Model Selection Metrics in the Advanced Linear Models tutorial.
Load libraries and custom plotting theme
library(ggplot2) # plotting library
library(dplyr) # data management
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
We are also going to use two new libraries: lmtest for our linear model testing, and MuMIn for Multi-Model Inference and model selection. We also use MASS, from the 2002 book Modern Applied Statistics with S. S is another programming language that R is based on. Don't forget to use install.packages() to update/install these before you load the library() for the first time.
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MuMIn)
## Warning: package 'MuMIn' was built under R version 4.0.5
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
So far, we have framed linear models as significant or non-significant. But there are some problems with this approach, including the False Discovery problem and a few others outlined in this nice YouTube video from Veritasium, and in this paper by John Ioannidis in [PLoS Medicine](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124).
An alternate approach to running multiple independent models is to test multiple models to see which one is best supported by the data. Each model represents a different hypothesis, so model selection is a way to test among alternative hypotheses, rather than just testing a series of single hypotheses against the null.
In order to select among models, we need some metric that describes how well the data fit each model. These are known as fit (or goodness-of-fit) metrics, and there are three main classes: \(R^2\), Likelihood, and Information Criteria.
We have already seen this in our linear model outputs. The \(R^2\) value ranges from 0 to 1 and describes how well the data fit the prediction. Let’s look at two examples:
X <- rnorm(100)
Y1 <- rnorm(100) + 0.3*X
Y2 <- X
qplot(x=X, y=Y1) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
Rmod1 <- lm(Y1 ~ X)
summary(Rmod1)
##
## Call:
## lm(formula = Y1 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0561 -0.6661 0.2279 0.6536 1.9661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03079 0.09454 -0.326 0.7454
## X 0.28973 0.09554 3.033 0.0031 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9454 on 98 degrees of freedom
## Multiple R-squared: 0.08579, Adjusted R-squared: 0.07646
## F-statistic: 9.196 on 1 and 98 DF, p-value: 0.003103
qplot(x=X,y=Y2)+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
Rmod2 <- lm(Y2 ~ X)
summary(Rmod2)
## Warning in summary.lm(Rmod2): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = Y2 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.322e-15 2.910e-17 5.910e-17 8.030e-17 6.828e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.163e-18 6.493e-17 6.400e-02 0.949
## X 1.000e+00 6.562e-17 1.524e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.493e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.322e+32 on 1 and 98 DF, p-value: < 2.2e-16
More generally, the \(R^2\) is the squared correlation between the predicted values from the model and the observed values:
cor(predict(Rmod1),Y1)^2
## [1] 0.08578977
cor(predict(Rmod2),Y2)^2
## [1] 1
Notice that the first one is slightly different from the Adjusted R-squared, because the Adjusted corrects for random correlations that arise by chance. As sample size increases, the Adjusted R-squared approaches the actual R-squared (i.e. squared correlation).
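As a quick check, we can apply the standard adjustment formula ourselves. This is a minimal sketch using Rmod1 and Y1 from above, where k is the number of predictors (not counting the intercept):
R2 <- cor(predict(Rmod1), Y1)^2          # Multiple R-squared
n <- length(Y1)                          # sample size
k <- 1                                   # one predictor (X)
1 - (1 - R2) * (n - 1) / (n - k - 1)     # should match the Adjusted R-squared (0.07646) above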
Run this code a few times to see the range of \(R^2\) values that come up:
cor(rnorm(100),rnorm(100))^2
## [1] 0.0006179643
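One way to see how big these chance correlations typically get is to repeat the simulation many times; here is a quick sketch using replicate():
Spurious <- replicate(1000, cor(rnorm(100), rnorm(100))^2)  # 1000 spurious R-squared values
mean(Spurious)             # average 'signal' expected from pure noise
quantile(Spurious, 0.95)   # 5% of random datasets exceed this value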
Since the Adjusted R-squared accounts for these chance correlations, it is generally the better measure to report when comparing models with different numbers of predictors.
If we have two models that are nested then we can calculate an ANOVA on the residual sums of squares. Nested just means that one model is a subset of the other. For example, ModB is nested within ModA, and ModC is nested in both ModB and ModA in the following models:
ModA <- lm(Y ~ X1 + X2 + X1:X2)
ModB <- lm(Y ~ X1 + X2)
ModC <- lm(Y ~ X1)
In contrast, this model would not be nested because it contains a predictor that isn’t found in the other models:
ModD <- lm(Y ~ X1 + X2 + X3)
We can test nested lm using the anova function in R, which tests the hypothesis that the larger or full model fits the data better than the smaller or reduced model. Here's a simple example:
X1 <- rnorm(1000)
X2 <- rnorm(1000)
Y <- X1 + X2 + rnorm(1000)
Full <- lm(Y ~ X1 + X2)
Reduced <- lm(Y ~ X1)
Now, we can ask:
Does the Full model fit the data better than the reduced model?
OR
Does adding X2 improve the model more than we would expect by chance?
anova(Reduced,Full)
## Analysis of Variance Table
##
## Model 1: Y ~ X1
## Model 2: Y ~ X1 + X2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 998 2833.1
## 2 997 1886.7 1 946.37 500.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this is an ANOVA with degrees of freedom equal to the difference in the number of parameters (1 regression coefficient in this case).
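To see where that F value comes from, we can reconstruct it from the residual sums of squares of the two models. This is a minimal sketch using the Full and Reduced models defined above:
RSS_red <- sum(residuals(Reduced)^2)   # residual sum of squares, reduced model
RSS_full <- sum(residuals(Full)^2)     # residual sum of squares, full model
Fstat <- ((RSS_red - RSS_full) / 1) / (RSS_full / df.residual(Full))  # 1 df difference
Fstat
pf(Fstat, df1 = 1, df2 = df.residual(Full), lower.tail = FALSE)  # same p-value as anova()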
The likelihood ratio test gets its name from the statistic:
\[-2\ln(\frac{\theta_0}{\theta_1})\]
Where \(\theta\) is the likelihood of a model, and the subscripts 0 and 1 denote two models that differ by at least one parameter. The likelihood of the model with fewer parameters goes in the numerator, and the likelihood of the model with more parameters goes in the denominator.
The likelihood function is quite complicated, but for now just know that you can calculate the likelihood of any model. Taking the natural log of the ratio and then multiplying by negative 2 gives you the likelihood ratio statistic, which you can test against a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters (predictors) between the two models.
Thus, the LRT is very similar to the anova function for linear models, except that the LRT uses the \(\chi^2\) distribution of likelihoods whereas the anova uses the \(F\)-distribution of variances.
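We can also compute the likelihood ratio statistic ourselves with logLik(); here is a minimal sketch using the same Full and Reduced models from above:
LR <- as.numeric(-2 * (logLik(Reduced) - logLik(Full)))  # -2 ln(likelihood ratio)
LR
pchisq(LR, df = 1, lower.tail = FALSE)  # df = difference in number of predictors (1 here)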
Information criteria are another set of statistics that measure the fit of a model to the data. The key difference is that there is no statistical test, and no p-value. You just compare the information criteria of different models to find the best one.
The most common information criterion is the Akaike Information Criterion or AIC. The mathematical definition of AIC is:
\[ AIC = 2k - 2\ln(\theta)\]
where \(k\) is the number of parameters in the model, and \(\ln(\theta)\) is the natural log of the likelihood of the model. Yes, the same likelihood as the Likelihood Ratio Test (LRT)!
The ‘best’ model is the one with the lowest AIC.
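In R, the parameter count \(k\) includes the residual variance, so we can reproduce the built-in AIC() from logLik(). A quick sketch using the Full model from above:
ll <- logLik(Full)
k <- attr(ll, "df")             # number of estimated parameters (coefficients + residual variance)
2 * k - 2 * as.numeric(ll)      # AIC 'by hand'
AIC(Full)                       # built-in version (should agree)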
Unlike the LRT, there is no p-value associated with AIC. However, a common rule of thumb for comparing models is to consider them to have a similar fit if the difference in AIC between the models is less than 2. This is the change in AIC, or delta AIC
\[\Delta _{AIC}\]
The relative likelihood of two models is:
\[ e^{1/2 (AIC_{small} - AIC_{big})} \]
where \(AIC_{small}\) is the model with the lower AIC (i.e. better fit to the data), and \(AIC_{big}\) is the model with the higher AIC. The value will be between zero (large difference) and 1 (small difference). For example, if we had AIC values of \(AIC_1 = 25\) and \(AIC_2 = 38\), our relative likelihood would be:
\[ e^{1/2(25-38)} = 0.001503439\]
We would say that the model with \(AIC_2\) is about 0.15% as probable as the model with \(AIC_1\), meaning that the model with \(AIC_1\) is a much better fit to the data.
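We can verify that calculation in R:
exp(0.5 * (25 - 38))  # relative likelihood of the model with the higher AIC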
Another common and important criterion is AICc. This is just the AIC adjusted for small sample size:
\[AICc = AIC +{\frac {2k^{2}+2k}{n-k-1}} \]
Where \(n\) is the sample size. As \(n\) increases, the correction term shrinks toward zero, so AICc approaches AIC. Note also that \(k\) appears in the correction term, so more complex models (larger \(k\)) are penalized more heavily unless the sample size is also larger.
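The MuMIn package (loaded above) provides an AICc() function, and we can also apply the correction by hand. A quick sketch using the Full model:
n <- nobs(Full)                       # sample size
k <- attr(logLik(Full), "df")         # number of estimated parameters
AIC(Full) + (2 * k^2 + 2 * k) / (n - k - 1)  # AICc by hand
AICc(Full)                            # MuMIn's version (should agree)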
The Bayes Information Criterion or BIC is related to Bayes’ theorem. It is similar to AICc but with a different adjustment for sample size:
\[ {BIC} =k\ln(n)-2\ln(\theta) \] As above, \(n\) denotes sample size, \(k\) the number of parameters estimated in the model, and \(\theta\) is the likelihood of the model.
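As with AIC, R's built-in BIC() uses the log-likelihood, so we can reproduce it by hand (a short sketch with the Full model):
n <- nobs(Full)
k <- attr(logLik(Full), "df")
k * log(n) - 2 * as.numeric(logLik(Full))  # BIC by hand
BIC(Full)                                  # built-in version (should agree)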
In addition to the usual assumptions of linear models, another important quality check for model selection is to make sure that each model has the exact same observations. For our simulated data it’s no problem, but with real data you may be missing observations for certain predictor variables. In that case, you will have different data in models that include or exclude predictor variables with one or more missing values.
A really simple way to deal with this is to make a dataset containing ONLY the predictor and response variables you want to use in model selection, and then use the complete.cases function to remove any rows with missing values.
Here’s an example:
Incomplete <- data.frame(A = c(0,1,2,3,4,5),
                         B = c(0,1,NA,3,4,5),
                         C = c(0,1,2,3,NA,5))
Incomplete
## A B C
## 1 0 0 0
## 2 1 1 1
## 3 2 NA 2
## 4 3 3 3
## 5 4 4 NA
## 6 5 5 5
Our incomplete data.frame has 6 rows.
Which rows have complete data?
complete.cases(Incomplete)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
Remove rows with missing data:
Incomplete %>%
  filter(complete.cases(Incomplete))
## A B C
## 1 0 0 0
## 2 1 1 1
## 3 3 3 3
## 4 5 5 5
The Likelihood Ratio Test (LRT) compares the likelihood of two models. The likelihood ratio statistic follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters.
To run a likelihood ratio test, we can use a function from the lmtest library:
library(lmtest)
Let’s run an example with completely random numbers but different numbers of predictors, then run the LRT using the lrtest function:
Y <- rnorm(1000)
X1 <- rnorm(1000)
X2 <- rnorm(1000)
X3 <- rnorm(1000)
Mod1 <- lm(Y ~ X1 + X2 + X3)
Mod2 <- lm(Y ~ X1)
lrtest(Mod2, Mod1)
## Likelihood ratio test
##
## Model 1: Y ~ X1
## Model 2: Y ~ X1 + X2 + X3
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -1380.4
## 2 5 -1379.8 2 1.1906 0.5514
Now run the above code a bunch of times and compare the output.
Each time we get the same number of degrees of freedom (#Df): 5 in model 2 and 3 in model 1. We get the same difference in degrees of freedom (Df) of 2 because there are 2 more predictors in Mod1 than in Mod2.
The Chisq statistic is twice the difference between the two LogLik values (equivalently, \(-2\ln\) of the ratio of the likelihoods). We can then calculate the probability from the chi-square distribution. For example, if we have 1.09 in our Chisq above, with 2 df for the difference in parameters:
pchisq(q=1.09,df=2, lower.tail=F)
## [1] 0.5798418
The LogLik is the natural-log-scaled likelihood of the model. The likelihood describes how well the data fit the model (how probable the observed data are under the model), so the specific value depends on the scatter of points around the prediction of the model (in this case, a single or multiple regression). The likelihood is a probability: specifically, it's the probability you would observe these exact data if the model were true. So if the LogLik is -140, then the likelihood of the model is the exponential \(e^{-140}\), or approximately:
\[0.000000000000000000000000000000000000000000000000000000000000158 \]
That’s a very small probability! But that doesn’t mean that the model is unlikely. It’s the same probability you might get if you flip a coin 1000 times and ask: “what’s the probability you would get EXACTLY 500 heads and 500 tails?” What if we only flip the coin 1 or 10 times?
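We can answer that with the binomial probability function; here is a quick sketch using dbinom():
dbinom(1, size = 2, prob = 0.5)        # exactly half heads in 2 flips: 0.5
dbinom(5, size = 10, prob = 0.5)       # exactly half heads in 10 flips: ~0.25
dbinom(500, size = 1000, prob = 0.5)   # exactly half heads in 1000 flips: ~0.025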
Any specific outcome becomes less likely as the sample size increases, but that doesn't mean the model is wrong! That's why we have to compare probabilities. We only have 1 set of observations (Y), but we can ask the probability of observing that data for any given statistical model.
Of course in this case, we also get a low likelihood of the model because we are choosing random variables with no association.
NOTE: Why do we use a natural-log scale?
The main reason is that computers cannot represent extremely small numbers accurately; below a certain size they are rounded to zero. For example:
exp(-1)
## [1] 0.3678794
exp(-10)
## [1] 4.539993e-05
exp(-100)
## [1] 3.720076e-44
exp(-1000)
## [1] 0
Note that \(e^{-1000}\) is a very small number, but it’s not zero!
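The same problem appears if we try to multiply many small probabilities together, which is exactly what a likelihood calculation does; summing on the log scale avoids the underflow. A minimal sketch:
p <- dnorm(rnorm(1000))   # densities of 1000 standard normal draws
prod(p)                   # underflows to 0
sum(log(p))               # the log-likelihood is a perfectly ordinary number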
We can similarly compare models with information criteria, using AIC:
AIC(Mod1)
## [1] 2769.53
AIC(Mod2)
## [1] 2766.721
AIC(Mod1) - AIC(Mod2)
## [1] 2.80937
and we can convert this to a relative likelihood. The larger model (Mod1), which has the higher AIC, is about
exp(0.5*AIC(Mod2)-0.5*AIC(Mod1))
## [1] 0.2454443
times as probable as the smaller model (Mod2), which has the lower AIC and is therefore the better-supported model.
Likewise, we can do model selection with BIC:
BIC(Mod1)
## [1] 2794.069
BIC(Mod2)
## [1] 2781.444
BIC(Mod1) - BIC(Mod2)
## [1] 12.62488
Remember the basic rule of thumb is that models are equivalent if their delta (difference in AIC or BIC) is less than 2. If the difference is bigger than 2, then we choose the model with the smaller AIC or BIC.
Try re-running the code to randomly generate the variables above, then re-run the LRT, AIC and BIC to compare the results. How often do you get a non-significant LRT but a delta AIC or BIC bigger than 2?
Now that we’ve simulated an example to compare the two main approaches for model fitting and model selection, let’s look at some automated methods.
With forward selection you add variables one at a time and compare the fit of the model to decide if the predictor should stay in the model in the next round. Typically, we start from an intercept-only model, add the single predictor that improves the fit the most, and repeat until no remaining predictor improves the model.
Backward selection starts with all of the variables in the model and then removes predictors one at a time to test the fit of the model. In biological research, this is commonly done using the LRT with nested models: the full model is compared to a reduced model with one predictor removed, and the predictor is dropped if its removal does not significantly worsen the fit.
The LRT with backward selection is a common approach in Biology when there aren’t too many predictors, and the models are nested.
This is a general term for comparing many models. The idea is you are running your statistical tool through the muck of models to find something valuable. If you have many predictor variables you would not typically include interaction terms because this would greatly increase the number of models you would have to test. As you can imagine, the more models you test the more likely you are to find a spurious relationship. Nevertheless, this can be a powerful approach as long as you are careful about interpretation (e.g. run follow-up experiments to test significant relationships). If you have a large dataset, you can split it into a ‘training’ dataset and a ‘validation’ dataset. In this case, you would dredge the training dataset and then validate the model on the validation dataset. This is a common approach in machine learning and is covered in other tutorials.
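For illustration, here is one way you might create that kind of split. This is just a sketch: MyData is a stand-in for your own dataset, and the 70/30 proportions are an arbitrary choice.
MyData <- data.frame(Y = rnorm(200), X1 = rnorm(200), X2 = rnorm(200))  # placeholder data
set.seed(123)
TrainRows <- sample(1:nrow(MyData), size = round(0.7 * nrow(MyData)))
Training <- MyData[TrainRows, ]     # dredge / select models using this subset
Validation <- MyData[-TrainRows, ]  # then check the chosen model on this subset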
The MuMIn and MASS packages in R have some convenient functions for running model selection.
library(MuMIn)
library(MASS)
We will try these different approaches on the FallopiaData.csv. These are plant biomass data for different species competing in pots.
InDat <- read.csv("https://colauttilab.github.io/RCrashCourse/FallopiaData.csv")
str(InDat)
## 'data.frame': 123 obs. of 13 variables:
## $ PotNum : int 1 2 3 5 6 7 8 9 10 11 ...
## $ Scenario : chr "low" "low" "low" "low" ...
## $ Nutrients : chr "low" "low" "low" "low" ...
## $ Taxon : chr "japon" "japon" "japon" "japon" ...
## $ Symphytum : num 9.81 8.64 2.65 1.44 9.15 ...
## $ Silene : num 36.4 29.6 36 21.4 23.9 ...
## $ Urtica : num 16.08 5.59 17.09 12.39 5.19 ...
## $ Geranium : num 4.68 5.75 5.13 5.37 0 9.05 3.51 9.64 7.3 6.36 ...
## $ Geum : num 0.12 0.55 0.09 0.31 0.17 0.97 0.4 0.01 0.47 0.33 ...
## $ All_Natives : num 67 50.2 61 40.9 38.4 ...
## $ Fallopia : num 0.01 0.04 0.09 0.77 3.4 0.54 2.05 0.26 0 0 ...
## $ Total : num 67.1 50.2 61.1 41.7 41.8 ...
## $ Pct_Fallopia: num 0.01 0.08 0.15 1.85 8.13 1.12 3.7 0.61 0 0 ...
WARNING: There are several columns that are calculated from other columns. Including everything will give us problems due to collinearity (see Collinearity in the Advanced LM tutorial).
In particular, the columns that are calculated from other columns (All_Natives, Total, and Pct_Fallopia) should not be used together as predictors in the model selection.
Let’s see how well we can predict total biomass. We already know that this is calculated as the sum of all other plant biomasses, so to prevent collinearity we will only look at the biomass of the native plants (i.e. exclude Fallopia and the treatment columns).
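As a quick check on the collinearity problem, we can look at the correlations among the derived columns (a sketch; the exact values depend on how these columns were calculated):
# Check correlations among the derived columns before deciding what to exclude
round(cor(InDat[, c("All_Natives", "Fallopia", "Total", "Pct_Fallopia")]), 2)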
The first step is to specify the full model. We'll also add in a random variable (RAND); since it has no real relationship with the response, a good selection procedure should leave it out of the final model.
set.seed(4567)
InDat$RAND <- rnorm(nrow(InDat))
MinMod <- lm(Total ~ 1, data=InDat)
FullMod <- lm(Total ~ Symphytum + Silene + Urtica + Geranium + Geum
              + RAND, data=InDat)
Then we can use the stepAIC function for stepwise selection using AIC:
ForSel <- stepAIC(MinMod, scope=formula(FullMod), direction="forward")
## Start: AIC=565.9
## Total ~ 1
##
## Df Sum of Sq RSS AIC
## + Silene 1 2772.24 9276.4 535.73
## + Urtica 1 1545.26 10503.4 551.01
## + Geum 1 510.16 11538.5 562.58
## <none> 12048.7 565.90
## + Geranium 1 187.22 11861.5 565.97
## + RAND 1 104.30 11944.4 566.83
## + Symphytum 1 77.31 11971.4 567.10
##
## Step: AIC=535.73
## Total ~ Silene
##
## Df Sum of Sq RSS AIC
## + Urtica 1 3710.1 5566.3 474.91
## + Geranium 1 1012.1 8264.3 523.52
## + Geum 1 641.9 8634.5 528.91
## + Symphytum 1 155.3 9121.1 535.66
## <none> 9276.4 535.73
## + RAND 1 80.9 9195.5 536.66
##
## Step: AIC=474.91
## Total ~ Silene + Urtica
##
## Df Sum of Sq RSS AIC
## + Geranium 1 1575.16 3991.1 436.00
## + Symphytum 1 1211.04 4355.2 446.74
## + Geum 1 116.07 5450.2 474.32
## + RAND 1 91.44 5474.8 474.88
## <none> 5566.3 474.91
##
## Step: AIC=436
## Total ~ Silene + Urtica + Geranium
##
## Df Sum of Sq RSS AIC
## + Symphytum 1 1251.40 2739.7 391.72
## + Geum 1 88.35 3902.8 435.24
## <none> 3991.1 436.00
## + RAND 1 19.85 3971.3 437.38
##
## Step: AIC=391.72
## Total ~ Silene + Urtica + Geranium + Symphytum
##
## Df Sum of Sq RSS AIC
## + Geum 1 58.681 2681.1 391.06
## <none> 2739.7 391.72
## + RAND 1 2.087 2737.7 393.63
##
## Step: AIC=391.06
## Total ~ Silene + Urtica + Geranium + Symphytum + Geum
##
## Df Sum of Sq RSS AIC
## <none> 2681.1 391.06
## + RAND 1 2.7969 2678.3 392.93
summary(ForSel)
##
## Call:
## lm(formula = Total ~ Silene + Urtica + Geranium + Symphytum +
## Geum, data = InDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.984 -2.910 -1.171 1.280 18.493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.21815 2.47000 4.947 2.55e-06 ***
## Silene 0.80007 0.04441 18.015 < 2e-16 ***
## Urtica 0.92486 0.06397 14.457 < 2e-16 ***
## Geranium 0.90367 0.10843 8.334 1.72e-13 ***
## Symphytum 0.76473 0.10473 7.302 3.73e-11 ***
## Geum 1.93416 1.20866 1.600 0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 117 degrees of freedom
## Multiple R-squared: 0.7775, Adjusted R-squared: 0.768
## F-statistic: 81.76 on 5 and 117 DF, p-value: < 2.2e-16
From the output, we can see the AIC for each model, and go down until the difference is < 2.
It’s almost trivial to adapt the code above for backward selection:
BackSel <- stepAIC(FullMod, direction="backward")
## Start: AIC=392.93
## Total ~ Symphytum + Silene + Urtica + Geranium + Geum + RAND
##
## Df Sum of Sq RSS AIC
## - RAND 1 2.8 2681.1 391.06
## <none> 2678.3 392.93
## - Geum 1 59.4 2737.7 393.63
## - Symphytum 1 1202.5 3880.7 436.55
## - Geranium 1 1546.4 4224.7 446.99
## - Urtica 1 4771.9 7450.2 516.77
## - Silene 1 7342.2 10020.5 553.23
##
## Step: AIC=391.06
## Total ~ Symphytum + Silene + Urtica + Geranium + Geum
##
## Df Sum of Sq RSS AIC
## <none> 2681.1 391.06
## - Geum 1 58.7 2739.7 391.72
## - Symphytum 1 1221.7 3902.8 435.24
## - Geranium 1 1591.7 4272.7 446.38
## - Urtica 1 4789.6 7470.7 515.11
## - Silene 1 7437.0 10118.0 552.42
summary(BackSel)
##
## Call:
## lm(formula = Total ~ Symphytum + Silene + Urtica + Geranium +
## Geum, data = InDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.984 -2.910 -1.171 1.280 18.493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.21815 2.47000 4.947 2.55e-06 ***
## Symphytum 0.76473 0.10473 7.302 3.73e-11 ***
## Silene 0.80007 0.04441 18.015 < 2e-16 ***
## Urtica 0.92486 0.06397 14.457 < 2e-16 ***
## Geranium 0.90367 0.10843 8.334 1.72e-13 ***
## Geum 1.93416 1.20866 1.600 0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 117 degrees of freedom
## Multiple R-squared: 0.7775, Adjusted R-squared: 0.768
## F-statistic: 81.76 on 5 and 117 DF, p-value: < 2.2e-16
When there is some degree of collinearity in the predictors, then the Forward and Backward selection can miss the ‘best’ model because it only tests one predictor at a time.
We can combine both directions for more robust model selection:
BothSel <- stepAIC(MinMod, scope=formula(FullMod), direction="both")
## Start: AIC=565.9
## Total ~ 1
##
## Df Sum of Sq RSS AIC
## + Silene 1 2772.24 9276.4 535.73
## + Urtica 1 1545.26 10503.4 551.01
## + Geum 1 510.16 11538.5 562.58
## <none> 12048.7 565.90
## + Geranium 1 187.22 11861.5 565.97
## + RAND 1 104.30 11944.4 566.83
## + Symphytum 1 77.31 11971.4 567.10
##
## Step: AIC=535.73
## Total ~ Silene
##
## Df Sum of Sq RSS AIC
## + Urtica 1 3710.1 5566.3 474.91
## + Geranium 1 1012.1 8264.3 523.52
## + Geum 1 641.9 8634.5 528.91
## + Symphytum 1 155.3 9121.1 535.66
## <none> 9276.4 535.73
## + RAND 1 80.9 9195.5 536.66
## - Silene 1 2772.2 12048.7 565.90
##
## Step: AIC=474.91
## Total ~ Silene + Urtica
##
## Df Sum of Sq RSS AIC
## + Geranium 1 1575.2 3991.1 436.00
## + Symphytum 1 1211.0 4355.2 446.74
## + Geum 1 116.1 5450.2 474.32
## + RAND 1 91.4 5474.8 474.88
## <none> 5566.3 474.91
## - Urtica 1 3710.1 9276.4 535.73
## - Silene 1 4937.1 10503.4 551.01
##
## Step: AIC=436
## Total ~ Silene + Urtica + Geranium
##
## Df Sum of Sq RSS AIC
## + Symphytum 1 1251.4 2739.7 391.72
## + Geum 1 88.3 3902.8 435.24
## <none> 3991.1 436.00
## + RAND 1 19.9 3971.3 437.38
## - Geranium 1 1575.2 5566.3 474.91
## - Urtica 1 4273.2 8264.3 523.52
## - Silene 1 6320.8 10312.0 550.75
##
## Step: AIC=391.72
## Total ~ Silene + Urtica + Geranium + Symphytum
##
## Df Sum of Sq RSS AIC
## + Geum 1 58.7 2681.1 391.06
## <none> 2739.7 391.72
## + RAND 1 2.1 2737.7 393.63
## - Symphytum 1 1251.4 3991.1 436.00
## - Geranium 1 1615.5 4355.2 446.74
## - Urtica 1 5390.6 8130.4 523.51
## - Silene 1 7550.3 10290.0 552.49
##
## Step: AIC=391.06
## Total ~ Silene + Urtica + Geranium + Symphytum + Geum
##
## Df Sum of Sq RSS AIC
## <none> 2681.1 391.06
## - Geum 1 58.7 2739.7 391.72
## + RAND 1 2.8 2678.3 392.93
## - Symphytum 1 1221.7 3902.8 435.24
## - Geranium 1 1591.7 4272.7 446.38
## - Urtica 1 4789.6 7470.7 515.11
## - Silene 1 7437.0 10118.0 552.42
summary(BothSel)
##
## Call:
## lm(formula = Total ~ Silene + Urtica + Geranium + Symphytum +
## Geum, data = InDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.984 -2.910 -1.171 1.280 18.493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.21815 2.47000 4.947 2.55e-06 ***
## Silene 0.80007 0.04441 18.015 < 2e-16 ***
## Urtica 0.92486 0.06397 14.457 < 2e-16 ***
## Geranium 0.90367 0.10843 8.334 1.72e-13 ***
## Symphytum 0.76473 0.10473 7.302 3.73e-11 ***
## Geum 1.93416 1.20866 1.600 0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 117 degrees of freedom
## Multiple R-squared: 0.7775, Adjusted R-squared: 0.768
## F-statistic: 81.76 on 5 and 117 DF, p-value: < 2.2e-16
Now we see many ‘base’ models showing how the fit changes as we add (+) or remove (-) predictors from each base model.
Interactions add a lot of complexity to model fitting. Here’s an example with just four variables:
ComplexMod <- lm(Total ~ Symphytum*Silene*Urtica*RAND, data=InDat)
CplxSel <- stepAIC(MinMod, scope=formula(ComplexMod), direction="both")
## Start: AIC=565.9
## Total ~ 1
##
## Df Sum of Sq RSS AIC
## + Silene 1 2772.24 9276.4 535.73
## + Urtica 1 1545.26 10503.4 551.01
## <none> 12048.7 565.90
## + RAND 1 104.30 11944.4 566.83
## + Symphytum 1 77.31 11971.4 567.10
##
## Step: AIC=535.73
## Total ~ Silene
##
## Df Sum of Sq RSS AIC
## + Urtica 1 3710.1 5566.3 474.91
## + Symphytum 1 155.3 9121.1 535.66
## <none> 9276.4 535.73
## + RAND 1 80.9 9195.5 536.66
## - Silene 1 2772.2 12048.7 565.90
##
## Step: AIC=474.91
## Total ~ Silene + Urtica
##
## Df Sum of Sq RSS AIC
## + Symphytum 1 1211.0 4355.2 446.74
## + RAND 1 91.4 5474.8 474.88
## <none> 5566.3 474.91
## + Silene:Urtica 1 40.1 5526.2 476.02
## - Urtica 1 3710.1 9276.4 535.73
## - Silene 1 4937.1 10503.4 551.01
##
## Step: AIC=446.74
## Total ~ Silene + Urtica + Symphytum
##
## Df Sum of Sq RSS AIC
## + Symphytum:Silene 1 319.5 4035.7 439.36
## <none> 4355.2 446.74
## + RAND 1 45.0 4310.2 447.46
## + Silene:Urtica 1 18.8 4336.4 448.20
## + Symphytum:Urtica 1 7.0 4348.2 448.54
## - Symphytum 1 1211.0 5566.3 474.91
## - Urtica 1 4765.9 9121.1 535.66
## - Silene 1 6141.4 10496.6 552.93
##
## Step: AIC=439.36
## Total ~ Silene + Urtica + Symphytum + Silene:Symphytum
##
## Df Sum of Sq RSS AIC
## + Symphytum:Urtica 1 86.6 3949.2 438.70
## <none> 4035.7 439.36
## + RAND 1 31.0 4004.7 440.41
## + Silene:Urtica 1 2.6 4033.1 441.28
## - Silene:Symphytum 1 319.5 4355.2 446.74
## - Urtica 1 4681.4 8717.1 532.09
##
## Step: AIC=438.7
## Total ~ Silene + Urtica + Symphytum + Silene:Symphytum + Urtica:Symphytum
##
## Df Sum of Sq RSS AIC
## <none> 3949.2 438.70
## + Silene:Urtica 1 45.85 3903.3 439.26
## + RAND 1 44.42 3904.7 439.30
## - Urtica:Symphytum 1 86.57 4035.7 439.36
## - Silene:Symphytum 1 399.08 4348.2 448.54
summary(CplxSel)
##
## Call:
## lm(formula = Total ~ Silene + Urtica + Symphytum + Silene:Symphytum +
## Urtica:Symphytum, data = InDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4483 -3.6256 0.0568 2.6176 20.4114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.18183 4.28716 2.608 0.010288 *
## Silene 1.00369 0.10427 9.626 < 2e-16 ***
## Urtica 1.14232 0.18032 6.335 4.58e-09 ***
## Symphytum 1.90793 0.37453 5.094 1.36e-06 ***
## Silene:Symphytum -0.03638 0.01058 -3.439 0.000811 ***
## Urtica:Symphytum -0.02870 0.01792 -1.602 0.111956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.81 on 117 degrees of freedom
## Multiple R-squared: 0.6722, Adjusted R-squared: 0.6582
## F-statistic: 47.99 on 5 and 117 DF, p-value: < 2.2e-16
The dredge function from the MuMIn package is useful for testing all possible models. It has several additional parameters to help with model selection. An important one is beta=, which is a method for standardizing the beta (coefficient) estimates so that they can be compared across models.
ComplexMod <- lm(Total ~ Symphytum*Silene*Urtica*RAND, data=InDat)
DMod <- dredge(ComplexMod, beta="sd")
## Error in dredge(ComplexMod, beta = "sd"): 'global.model''s 'na.action' argument is not set and options('na.action') is "na.omit"
Note the error. This is an important safeguard against missing data. In our case there are no missing values, but this error prevents us from accidentally comparing models fit to different subsets of the data (rows with missing values would be dropped from some models but not others). To run the model selection, we should set our global options to ‘na.fail’ instead of ‘na.omit’:
options(na.action="na.fail")
DMod <- dredge(ComplexMod, beta="sd")
## Fixed term is "(Intercept)"
head(DMod)
## Global model call: lm(formula = Total ~ Symphytum * Silene * Urtica * RAND, data = InDat)
## ---
## Model selection table
## (Int) RAN Sln Sym Urt RAN:Sym RAN:Urt Sln:Sym Sln:Urt
## 208 0 -0.2744 1.202 0.7579 0.7024 0.2699 -0.4547
## 9168 0 -0.2359 1.588 1.1870 1.4680 0.2154 -0.8123 -0.60830
## 720 0 -0.2576 1.240 0.8827 0.8412 0.2420 -0.4997
## 976 0 -0.2599 1.324 0.9350 1.0270 0.2537 -0.5038 -0.13360
## 464 0 -0.2806 1.233 0.7466 0.7546 0.2843 -0.4431 -0.06766
## 240 0 -0.2404 1.206 0.7639 0.7016 -0.03178 0.2627 -0.4579
## Sym:Urt Sln:Sym:Urt df logLik AICc delta weight
## 208 8 -384.672 786.6 0.00 0.298
## 9168 -0.7120 0.4163 11 -381.541 787.5 0.85 0.194
## 720 -0.1635 9 -384.104 787.8 1.19 0.164
## 976 -0.2612 10 -383.195 788.4 1.75 0.124
## 464 9 -384.391 788.4 1.77 0.123
## 240 9 -384.630 788.9 2.25 0.097
## Models ranked by AICc(x)
The output includes ALL of the models, so here we just look at the AICc, delta AICc, and coefficients of the top models.
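If we want to work with the top-ranked model directly, one option is MuMIn's get.models() function, which refits models from the dredge table; a quick sketch:
BestMod <- get.models(DMod, subset = 1)[[1]]  # refit the model with the lowest AICc
summary(BestMod)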
As always, we should inspect the residuals of the model to look for problems and test the assumptions as outlined in the QC section of the Linear Models tutorial. We should also make sure there are no missing values in the columns used in our models.
Finally, we should also check the overall fit of our best model. The reason to do this is that there will always be a ‘best fit’ model in our model selection analysis, even if all the models are complete garbage. A good way to do this is to compare the predictions of the model against the observed values. The better the model, the more tightly the points will form a line with the observed values:
qplot(x=predict(Mod2), y=Y) + xlab("Predicted") + ylab("Observed")
qplot(x=predict(ComplexMod),y=InDat$Total) + xlab("Predicted") + ylab("Observed")
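We can also plot the residuals against the predicted values for the selected model; if the assumptions hold there should be no obvious pattern. A quick sketch for ComplexMod:
qplot(x = predict(ComplexMod), y = residuals(ComplexMod)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Predicted") + ylab("Residual")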