Linear models are run in R using the lm() function, as detailed in the Linear Models Tutorial. In that tutorial, we also saw a wide range of ‘classical’ statistical models that differ only in the type of predictor variables (e.g. ANOVA for categorical predictors and regression for continuous ones).
We also learned how to inspect our residuals to make sure that they don’t violate model assumptions. One of the most important assumptions is that the error term follows a normal distribution with a mean of zero.
Generalized Linear Models use the glm() function in R, and can be used when our residuals DO NOT follow a normal distribution. If you understand how lm() works, then glm() will be very easy to understand. If you are still struggling with lm(), then it’s a good idea to go back and review/practice before continuing.
A good way to think of a Generalized Linear Model is as a regular linear model that has been transformed by a link function. A link function is just another equation that we use to transform the linear model. We’ll show how this works for Logistic Regression (binomial response variable) and Poisson Regression (count or log-normally distributed response variable).
Load libraries and custom plotting theme
library(ggplot2) # plotting library
library(dplyr) # data management
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
Let’s start with a typical linear model by generating pretend data:
N<-1000
X1<-rnorm(N, sd=3)
X2<-sample(c("Control","Treat1"), N, replace=T)
tDat<-data.frame(Pred1=X1, Pred2=X2) %>%
  mutate(Pred2dev=recode(Pred2, "Control"=0, "Treat1"=4)) %>%
  mutate(Resp= -1 + 0.3*Pred1 + Pred2dev + rnorm(N, sd=2))
This is similar to the models we set up in the Linear Models Tutorial: we have a sample size of N = 1000, a continuous predictor, and a categorical predictor. Our response variable is a function of the predictors, with a slope of 0.3 on \(X_1\), a treatment effect of +4, and random normal error. We keep the strings for the treatment category (Pred2) as the predictor, recoding them to numeric deviations (Pred2dev) only to simulate the response, so that we can run the linear model on the categorical variable itself.
Mod<-lm(Resp~Pred1 + Pred2, data=tDat)
summary(Mod)
##
## Call:
## lm(formula = Resp ~ Pred1 + Pred2, data = tDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.381 -1.370 -0.085 1.345 6.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03170 0.08589 -12.01 <2e-16 ***
## Pred1 0.26340 0.02027 13.00 <2e-16 ***
## Pred2Treat1 4.14499 0.12280 33.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.94 on 997 degrees of freedom
## Multiple R-squared: 0.5746, Adjusted R-squared: 0.5738
## F-statistic: 673.4 on 2 and 997 DF, p-value: < 2.2e-16
Recall that the model is:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
where \(Y\) is a vector (e.g. a column in a data frame) containing the response values for each individual, \(\beta_0\) is the intercept of the model, \(\beta_1\) is the slope for the predictor \(X_1\), which is also a vector containing the values of the first predictor, \(\beta_2\) is likewise the slope for the predictor vector \(X_2\), and \(\epsilon\) is the residual error term.
The key assumption that distinguishes lm() from glm() is that our residual error is normally distributed:
qplot(residuals(Mod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
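The histogram gives a quick visual check, but a quantile-quantile plot or a formal test can sharpen the assessment. Here’s a minimal sketch, using geom_qq()/geom_qq_line() from ggplot2 (already loaded above) and the base-R shapiro.test():
# Q-Q plot: points should fall along the line if residuals are normal
ggplot(data.frame(res=residuals(Mod)), aes(sample=res)) +
  geom_qq() +
  geom_qq_line()
# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro.test(residuals(Mod))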
We can fit the same model using GLM. Compare this code to the lm() model, above:
GLMod<-glm(Resp~Pred1 + Pred2, family=gaussian, data=tDat)
Note the addition of the family= parameter. Here we use gaussian, which is actually the default, so technically we don’t need to specify it for this model. However, we include it here since it is the main parameter that changes depending on the distribution of the response variable.
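Since gaussian is the default, we can verify that omitting family= fits the identical model (GLMod_default is just a name introduced for this check):
# gaussian is glm()'s default family, so this fits the same model as GLMod
GLMod_default<-glm(Resp~Pred1 + Pred2, data=tDat)
all.equal(coef(GLMod_default), coef(GLMod)) # should be TRUE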
Now compare the output:
summary(GLMod)
##
## Call:
## glm(formula = Resp ~ Pred1 + Pred2, family = gaussian, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.381 -1.370 -0.085 1.345 6.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03170 0.08589 -12.01 <2e-16 ***
## Pred1 0.26340 0.02027 13.00 <2e-16 ***
## Pred2Treat1 4.14499 0.12280 33.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3.761942)
##
## Null deviance: 8817.3 on 999 degrees of freedom
## Residual deviance: 3750.7 on 997 degrees of freedom
## AIC: 4167.8
##
## Number of Fisher Scoring iterations: 2
It should be very similar, but you’ll notice that there is no p-value for the overall model. However, there is an AIC that we can use for model selection, and we can also run a likelihood ratio test. You can review these in the Model Selection Tutorial if you need a refresher.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
GLMod2<-glm(Resp~Pred1, data=tDat)
lrtest(GLMod2,GLMod)
## Likelihood ratio test
##
## Model 1: Resp ~ Pred1
## Model 2: Resp ~ Pred1 + Pred2
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -2461.0
## 2 4 -2079.9 1 762.11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
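The same likelihood ratio test is also available in base R through the anova() method for glm objects, so the lmtest package is optional here. A minimal equivalent sketch:
# Equivalent chi-squared (likelihood ratio) test without lmtest
anova(GLMod2, GLMod, test="Chisq")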
Now, what if our residuals do not come from a normal distribution? First, make sure you review the Distributions Tutorial.
A Logistic Regression is similar to a linear regression in lm(), except that the response variable is binary rather than continuous. This means that values of the response variable must be either 0 or 1. This is also known as a Bernoulli variable. Going back to our Distributions Tutorial, you may recall that we can sample from a random binomial distribution with the rbinom() function. In this case, the ‘size’ parameter is just 1 because each response observation is a 0 or 1. But to generate a statistical model, we have to define the probabilities of 0 and 1, and make sure that they add up to 1. In a logistic regression this is done by transforming the linear model (Y) into a probability (P). This is called a link function:
\[P = \frac{1}{1+e^{-Y}} \]
Or more specifically: \[ P = \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}} \]
Note that the residual error is not part of the link function.
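As a quick sanity check (an aside, not part of the simulation), the inverse-logit above is built into R as plogis():
# Manual inverse-logit vs the built-in: both map any Y onto (0,1)
Y<-c(-2, 0, 2) # arbitrary linear-predictor values
1/(1+exp(-Y))  # manual transformation from the equation above
plogis(Y)      # identical result
Applying this transformation to our simulated predictors: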
Logit<-1/(1+exp(-(-1 + 0.3*tDat$Pred1 + tDat$Pred2dev)))
qplot(x=tDat$Resp,y=Logit)
Notice how the y-axis runs from 0 to 1, which we can think of as the probability of observing a 1 in the dataset. Now we can use that probability to generate a binary response:
tDat<-tDat %>%
  mutate(LResp=rbinom(n=N, size=1, prob=Logit))
To run the glm() model, we have to specify that this is a logistic model, rather than Gaussian.
LMod<-glm(LResp ~ Pred1 + Pred2, family=binomial, data=tDat)
summary(LMod)
##
## Call:
## glm(formula = LResp ~ Pred1 + Pred2, family = binomial, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4172 -0.6562 0.2132 0.4358 2.2207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9395 0.1073 -8.754 <2e-16 ***
## Pred1 0.3240 0.0345 9.390 <2e-16 ***
## Pred2Treat1 3.8825 0.2359 16.460 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1333.85 on 999 degrees of freedom
## Residual deviance: 770.71 on 997 degrees of freedom
## AIC: 776.71
##
## Number of Fisher Scoring iterations: 5
Note how the estimates are very similar to the gaussian model, BUT REMEMBER that to generate our predictions we have to transform the linear model using the link equation above. You can see the difference by re-running the model with lm() (without the family parameter) and seeing how the Estimate values differ.
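Rather than applying the inverse-link by hand, predict() can do the back-transformation for us via its type= argument. A quick sketch with the fitted logistic model:
# Linear predictor (log-odds scale) vs fitted probabilities (0 to 1)
head(predict(LMod, type="link"))     # beta0 + beta1*X1 + beta2*X2
head(predict(LMod, type="response")) # inverse-logit of the above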
As with the gaussian distribution, we can use LRT to determine the significance of specific terms or models, or AIC for more general model selection.
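For example, a sketch following the same pattern as the gaussian model above (LMod2 is a reduced model introduced here for the comparison):
# LRT for the Pred2 term in the logistic model, then AIC comparison
LMod2<-glm(LResp ~ Pred1, family=binomial, data=tDat)
lrtest(LMod2, LMod) # lmtest::lrtest, loaded earlier
AIC(LMod2, LMod)    # lower AIC indicates better fit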
Poisson variables are common for count data (e.g. N offspring) but also can apply to log-normally distributed data with continuous measurements (e.g. biomass).
If we look at rpois(), which samples from a Poisson distribution, we can see the parameter lambda, which is the mean of the Poisson distribution. In a Poisson distribution, the variance (i.e. the residual error) scales with the mean. Specifically, the prediction is on the natural-log scale:
\[ N = e^{Y}\]
Or the more detailed version:
\[ N = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}\]
Similar to the logistic regression, we transform the model using this equation.
Log<-exp(-1 + 0.3*tDat$Pred1 + tDat$Pred2dev)
tDat<-tDat %>%
  mutate(PResp=rpois(N, lambda=Log))
And we can plot to better understand the relationship:
qplot(x=Resp,y=PResp,data=tDat)
Now run the model and compare the output.
PMod<-glm(PResp ~ Pred1 + Pred2, family=poisson, data=tDat)
summary(PMod)
##
## Call:
## glm(formula = PResp ~ Pred1 + Pred2, family = poisson, data = tDat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4982 -0.7875 -0.4199 0.5073 2.9811
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.051461 0.062385 -16.85 <2e-16 ***
## Pred1 0.301898 0.002695 112.01 <2e-16 ***
## Pred2Treat1 4.047623 0.062532 64.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 34626.43 on 999 degrees of freedom
## Residual deviance: 938.83 on 997 degrees of freedom
## AIC: 3721.2
##
## Number of Fisher Scoring iterations: 5
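Again, the estimates are on the link (natural-log) scale. As a quick aside, exponentiating the coefficients recovers multiplicative effects on the expected count:
# Back-transform from the log scale to multiplicative effects on counts
exp(coef(PMod))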
?family provides a list of the families that you can use with glm() and related models.
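Each family is itself an R function that returns a family object carrying the link and its inverse, which is a handy way to see exactly what transformation a family applies. A small exploratory sketch:
# Family objects store the link name plus link and inverse-link functions
poisson()$link        # "log"
binomial()$link       # "logit"
binomial()$linkinv(0) # inverse-logit of 0 is 0.5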
Sometimes the link function is obvious given the data. For example, a response variable containing only 1s and 0s suggests the binomial family, as in our example above. But sometimes it’s not so easy: other families not covered here can produce the same type of response data. In these cases, we can use information criteria to compare the fit of different models. We can see an example of this for our Poisson model:
GausMod<-glm(PResp ~ Pred1 + Pred2, family=gaussian, data=tDat)
PoisMod<-glm(PResp ~ Pred1 + Pred2, family=poisson, data=tDat)
AIC(GausMod)
## [1] 8842.777
AIC(PoisMod)
## [1] 3721.235
We can see that the model with the poisson link fits much better than the model with the gaussian link.
Why can’t we use a likelihood ratio test?
HINT: Think about the parameters in an LRT.