Linear models are run in R using the lm() function, as detailed in the Linear Models Tutorial. In that tutorial, we also saw a wide range of ‘classical’ statistical models that differ only in the type of predictor variables (e.g. ANOVA for categorical predictors and regression for continuous ones).
We also learned how to inspect our residuals to make sure that they don’t violate model assumptions. One of the most important assumptions is that the error term follows a normal distribution with a mean of zero.
Generalized Linear Models use the glm() function in R, and can be used when our residuals DO NOT follow a normal distribution. If you understand how lm() works, then glm() will be very easy to understand. If you are still struggling with lm(), then it’s a good idea to go back and review/practice before continuing.
A good way to think of a Generalized Linear Model is as a regular linear model that has been transformed by a link function. A link function is just another equation that we use to transform the linear model. We’ll show how this works for Logistic Regression (binomial response variable) and Poisson Regression (count or log-normally distributed response variable).
Load libraries and custom plotting theme
library(ggplot2) # plotting library
library(dplyr) # data management
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
Let’s start with a typical linear model by generating pretend data:
N<-1000
X1<-rnorm(N, sd=3)
X2<-sample(c("Control","Treat1"), N, replace=T)
tDat<-data.frame(Pred1=X1, Pred2=X2) %>%
  mutate(Pred2dev=recode(Pred2, "Control"=0, "Treat1"=4)) %>%
  mutate(Resp= -1 + 0.3*Pred1 + Pred2dev + rnorm(N, sd=2))
This is similar to the models we set up in the Linear Models Tutorial: we have a sample size of N = 1000, a continuous predictor, and a categorical predictor. Our response variable is a function of the predictors, with a slope of 0.3 on \(X_1\), a treatment effect of +4, and random normal error. We keep the strings for the treatment category (Pred2) as the predictor, recoding them to numeric deviations (Pred2dev) only to simulate the response, so that we can run the linear model on the categorical variable itself.
Mod<-lm(Resp~Pred1 + Pred2, data=tDat)
summary(Mod)
##
## Call:
## lm(formula = Resp ~ Pred1 + Pred2, data = tDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.381 -1.370 -0.085 1.345 6.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03170 0.08589 -12.01 <2e-16 ***
## Pred1 0.26340 0.02027 13.00 <2e-16 ***
## Pred2Treat1 4.14499 0.12280 33.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.94 on 997 degrees of freedom
## Multiple R-squared: 0.5746, Adjusted R-squared: 0.5738
## F-statistic: 673.4 on 2 and 997 DF, p-value: < 2.2e-16
Recall that the model is:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
where \(Y\) is a vector (e.g. a column in a data frame) containing the response values for each individual, \(\beta_0\) is the intercept of the model, \(\beta_1\) is the slope for the predictor \(X_1\), which is also a vector containing the values of the first predictor, \(\beta_2\) is likewise the slope for the predictor vector \(X_2\), and \(\epsilon\) is the residual error term.
The key assumption that distinguishes lm() from glm() is that our residual error is normally distributed:
qplot(residuals(Mod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
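The histogram gives a quick visual check, but a quantile-quantile plot or a formal test can sharpen the assessment. Here’s a minimal sketch, using geom_qq()/geom_qq_line() from ggplot2 (already loaded above) and the base-R shapiro.test():
# Q-Q plot: points should fall along the line if residuals are normal
ggplot(data.frame(res=residuals(Mod)), aes(sample=res)) +
  geom_qq() +
  geom_qq_line()
# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro.test(residuals(Mod))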
We can fit the same model using GLM. Compare this code to the lm() model, above:
GLMod<-glm(Resp~Pred1 + Pred2, family=gaussian, data=tDat)
Note the addition of the family= parameter. Here we use gaussian, which is actually the default, so technically we don’t need to specify it for this model. However, we include it here since it is the main parameter that changes depending on the distribution of the response variable.
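Since gaussian is the default, we can verify that omitting family= fits the identical model (GLMod_default is just a name introduced for this check):
# gaussian is glm()'s default family, so this fits the same model as GLMod
GLMod_default<-glm(Resp~Pred1 + Pred2, data=tDat)
all.equal(coef(GLMod_default), coef(GLMod)) # should be TRUE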
Now compare the output:
summary(GLMod)
##
## Call:
## glm(formula = Resp ~ Pred1 + Pred2, family = gaussian, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.381 -1.370 -0.085 1.345 6.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03170 0.08589 -12.01 <2e-16 ***
## Pred1 0.26340 0.02027 13.00 <2e-16 ***
## Pred2Treat1 4.14499 0.12280 33.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.761942)
##
## Null deviance: 8817.3 on 999 degrees of freedom
## Residual deviance: 3750.7 on 997 degrees of freedom
## AIC: 4167.8
##
## Number of Fisher Scoring iterations: 2
It should be very similar, but you’ll notice that there is no p-value for the overall model. However, there is an AIC that we can use for model selection, and we can also run a likelihood ratio test. You can review these in the Model Selection Tutorial if you need a refresher.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
GLMod2<-glm(Resp~Pred1, data=tDat)
lrtest(GLMod2,GLMod)
## Likelihood ratio test
##
## Model 1: Resp ~ Pred1
## Model 2: Resp ~ Pred1 + Pred2
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -2461.0
## 2 4 -2079.9 1 762.11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
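The same likelihood ratio test is also available in base R through the anova() method for glm objects, so the lmtest package is optional here. A minimal equivalent sketch:
# Equivalent chi-squared (likelihood ratio) test without lmtest
anova(GLMod2, GLMod, test="Chisq")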
Now, what if our residuals do not come from a normal distribution? First, make sure you review the Distributions Tutorial.
A Logistic Regression is similar to a linear regression in lm(), except that the response variable is binary rather than continuous. This means that values of the response variable must be either 0 or 1. This is also known as a Bernoulli variable. Going back to our Distributions Tutorial, you may recall that we can sample from a random binomial distribution with the rbinom() function. In this case, the ‘size’ parameter is just 1 because each response observation is a 0 or 1. But to generate a statistical model, we have to define the probabilities of 0 and 1, and make sure that they add up to 1. In a logistic regression this is done by transforming the linear model (Y) into a probability (P). This is called a link function:
\[P = \frac{1}{1+e^{-Y}} \]
Or more specifically: \[ P = \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}} \]
Note that the residual error is not part of the link function.
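As a quick sanity check (an aside, not part of the simulation), the inverse-logit above is built into R as plogis():
# Manual inverse-logit vs the built-in: both map any Y onto (0,1)
Y<-c(-2, 0, 2) # arbitrary linear-predictor values
1/(1+exp(-Y))  # manual transformation from the equation above
plogis(Y)      # identical result
Applying this transformation to our simulated predictors: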
Logit<-1/(1+exp(-(-1 + 0.3*tDat$Pred1 + tDat$Pred2dev)))
qplot(x=tDat$Resp,y=Logit)
Notice how the y-axis runs from 0 to 1, which we can think of as the probability of observing a 1 in the dataset. Now we can use that probability to generate a binary response:
tDat<-tDat %>%
  mutate(LResp=rbinom(n=N, size=1, prob=Logit))
To run the glm() model, we have to specify that this is a logistic model, rather than Gaussian.
LMod<-glm(LResp ~ Pred1 + Pred2, family=binomial, data=tDat)
summary(LMod)
##
## Call:
## glm(formula = LResp ~ Pred1 + Pred2, family = binomial, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4172 -0.6562 0.2132 0.4358 2.2207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9395 0.1073 -8.754 <2e-16 ***
## Pred1 0.3240 0.0345 9.390 <2e-16 ***
## Pred2Treat1 3.8825 0.2359 16.460 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1333.85 on 999 degrees of freedom
## Residual deviance: 770.71 on 997 degrees of freedom
## AIC: 776.71
##
## Number of Fisher Scoring iterations: 5
Note how the estimates are very similar to the gaussian model, BUT REMEMBER that to generate our predictions we have to transform the linear model using the link equation above. You can see the difference by re-running the model with lm() (without the family parameter) and seeing how the Estimate values differ.
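Rather than applying the inverse-link by hand, predict() can do the back-transformation for us via its type= argument. A quick sketch with the fitted logistic model:
# Linear predictor (log-odds scale) vs fitted probabilities (0 to 1)
head(predict(LMod, type="link"))     # beta0 + beta1*X1 + beta2*X2
head(predict(LMod, type="response")) # inverse-logit of the above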
As with the gaussian distribution, we can use LRT to determine the significance of specific terms or models, or AIC for more general model selection.
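For example, a sketch following the same pattern as the gaussian model above (LMod2 is a reduced model introduced here for the comparison):
# LRT for the Pred2 term in the logistic model, then AIC comparison
LMod2<-glm(LResp ~ Pred1, family=binomial, data=tDat)
lrtest(LMod2, LMod) # lmtest::lrtest, loaded earlier
AIC(LMod2, LMod)    # lower AIC indicates better fit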
Poisson variables are common for count data (e.g. N offspring) but also can apply to log-normally distributed data with continuous measurements (e.g. biomass).
If we look at rpois(), which samples from a Poisson distribution, we can see the parameter lambda, which is the mean of the Poisson distribution. In a Poisson distribution, the variance (i.e. the residual error) scales with the mean. Specifically, the prediction is on the natural-log scale:
\[ N = e^{Y}\]
Or the more detailed version:
\[ N = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}\]
Similar to the logistic regression, we transform the model using this equation.
Log<-exp(-1 + 0.3*tDat$Pred1 + tDat$Pred2dev)
tDat<-tDat %>%
  mutate(PResp=rpois(N, lambda=Log))
And we can plot to better understand the relationship:
qplot(x=Resp,y=PResp,data=tDat)
Now run the model and compare the output.
PMod<-glm(PResp ~ Pred1 + Pred2, family=poisson, data=tDat)
summary(PMod)
##
## Call:
## glm(formula = PResp ~ Pred1 + Pred2, family = poisson, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4982 -0.7875 -0.4199 0.5073 2.9811
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.051461 0.062385 -16.85 <2e-16 ***
## Pred1 0.301898 0.002695 112.01 <2e-16 ***
## Pred2Treat1 4.047623 0.062532 64.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 34626.43 on 999 degrees of freedom
## Residual deviance: 938.83 on 997 degrees of freedom
## AIC: 3721.2
##
## Number of Fisher Scoring iterations: 5
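Again, the estimates are on the link (natural-log) scale. As a quick aside, exponentiating the coefficients recovers multiplicative effects on the expected count:
# Back-transform from the log scale to multiplicative effects on counts
exp(coef(PMod))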
?family provides a list of the families that you can use with glm() and related models.
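Each family is itself an R function that returns a family object carrying the link and its inverse, which is a handy way to see exactly what transformation a family applies. A small exploratory sketch:
# Family objects store the link name plus link and inverse-link functions
poisson()$link        # "log"
binomial()$link       # "logit"
binomial()$linkinv(0) # inverse-logit of 0 is 0.5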
Sometimes the link function is obvious given the data. For example, a response variable containing only 1s and 0s suggests the binomial family, as in our example above. But sometimes it’s not so easy: other families not covered here can produce the same type of response data. In these cases, we can use information criteria to compare the fit of different models. We can see an example of this for our Poisson model:
GausMod<-glm(PResp ~ Pred1 + Pred2, family=gaussian, data=tDat)
PoisMod<-glm(PResp ~ Pred1 + Pred2, family=poisson, data=tDat)
AIC(GausMod)
## [1] 8842.777
AIC(PoisMod)
## [1] 3721.235
We can see that the model with the poisson link fits much better than the model with the gaussian link.
Why can’t we use a likelihood ratio test?
HINT: Think about the parameters in an LRT.