Load libraries and our custom plotting theme.
library(ggplot2) # plotting library
library(dplyr) # data management
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
We will be working with the same FallopiaData.csv
data that we used in the R fundamentals tutorial and the qplot
tutorial. We could download to a local folder again and then read from our computer.
Another trick is to read the file directly from the internet. This will work for any file available through http://
InDat <- read.csv("http://colauttilab.github.io/RCrashCourse/FallopiaData.csv")
str(InDat)
## 'data.frame': 123 obs. of 13 variables:
## $ PotNum : int 1 2 3 5 6 7 8 9 10 11 ...
## $ Scenario : chr "low" "low" "low" "low" ...
## $ Nutrients : chr "low" "low" "low" "low" ...
## $ Taxon : chr "japon" "japon" "japon" "japon" ...
## $ Symphytum : num 9.81 8.64 2.65 1.44 9.15 ...
## $ Silene : num 36.4 29.6 36 21.4 23.9 ...
## $ Urtica : num 16.08 5.59 17.09 12.39 5.19 ...
## $ Geranium : num 4.68 5.75 5.13 5.37 0 9.05 3.51 9.64 7.3 6.36 ...
## $ Geum : num 0.12 0.55 0.09 0.31 0.17 0.97 0.4 0.01 0.47 0.33 ...
## $ All_Natives : num 67 50.2 61 40.9 38.4 ...
## $ Fallopia : num 0.01 0.04 0.09 0.77 3.4 0.54 2.05 0.26 0 0 ...
## $ Total : num 67.1 50.2 61.1 41.7 41.8 ...
## $ Pct_Fallopia: num 0.01 0.08 0.15 1.85 8.13 1.12 3.7 0.61 0 0 ...
The structure function str() is a handy way to inspect the data, but it doesn't show us all the different values for the nominal (i.e. categorical) columns. We can use the unique() function for that. We could apply it to each column individually, or we can pipe through select() from the dplyr package introduced in the data science tutorial:
InDat %>%
  select(Scenario, Nutrients, Taxon) %>%
  unique()
## Scenario Nutrients Taxon
## 1 low low japon
## 13 low low bohem
## 26 high high japon
## 38 high high bohem
## 49 gradual high japon
## 61 gradual high bohem
## 74 extreme high japon
## 88 extreme high bohem
## 100 fluctuations high japon
## 113 fluctuations high bohem
This gives us every combination of the nominal variables that occurs in the data. In this case, we can see that most of the different Scenario categories are only applied to the high Nutrients treatment.
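If we also want to know how many pots fall into each combination (an optional check, not shown in the original output), we can group and count with dplyr:

InDat %>%
  group_by(Scenario, Nutrients, Taxon) %>%
  summarize(N=n()) # number of pots per treatment combination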
Most statistical textbooks and courses will go through all of the standard models including ANOVA and linear regression, then slightly more complicated models like polynomial regression, multiple regression, multifactor ANOVA, and ANCOVA.
I’m not sure why stats are always taught this way. There could be a good reason but I suspect it has more to do with the history of statistics: Student’s t-test was invented, followed by Fisher’s ANOVA, and then linear regression. Then these were expanded to more complicated cases.
However, today we know that all of these models are special cases of a general model in statistics called Linear Models. In R, linear models are run with the lm()
function. Rather than go through the history of statistics, we’ll focus on the application of linear models. That said, it’s important to know the difference between all those models listed above because many people learn statistics that way, so they are used to talking about those models.
NOTE: Linear Models are different from General Linear Models, which are both different from Generalized Linear Models. We also have Linear Mixed Effects Models. For now, it’s just important to know that these are different classes of models even though they sound similar.
In the qplot
tutorial we looked at visualizations for individual and paired data.
We saw that data can come in a few flavours: binary, nominal, ordinal, and continuous.
Different types of linear models (e.g. ANOVA vs Regression) are defined by the type of input data they use. Importantly, Linear Models have two types of input: a response variable and one or more predictor variables.
A Linear Model will always have a single, continuous response variable, but it can have one or more predictor variables that may include a mix of binary, nominal and continuous variables. Ordinal variables are a bit more tricky, and will be treated as (approximately) nominal or continuous variables, depending on the nature of the data.
To understand the syntax of a linear model, look at the help:
?lm
There are just two main components that we need to specify, the formula and the data; the rest we can keep as defaults for now.
The formula for linear models looks like this:
Response ~ Predictor(s)
You just have to remember the tilde and which side to put your predictor vs response variables.
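For example, using the FallopiaData we loaded above, a minimal call might look like this (Total and Nutrients are columns from the str() output above; the model itself is just for illustration):

ExampleMod <- lm(Total ~ Nutrients, data=InDat) # Total biomass as a function of Nutrient treatment
ExampleMod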
Let’s start by looking at a single response variable and a single predictor. Our response will always be a continuous variable in a Linear Model. Similar to the qplot tutorial, we can organize our linear models based on the type of predictor.
To visualize these models, let’s first create some data for each type of response variables:
Our continuous response variable (Y)
set.seed(123)
Y <- rnorm(1000)
A continuous predictor variable (Xcon)
set.seed(234)
Xcon <- rnorm(1000)
A categorical predictor variable (Xcat).
We can use the sample() function (see ?sample).
set.seed(345)
Xcat <- sample(c("A","B","C"), 1000, replace=T)
What about ordinal data? There are more advanced models that can handle this kind of data but we can also run them in a linear model if we make a choice whether to treat it as categorical or continuous.
For example, if we were looking at offspring numbers in Canadian moose, most would have 0, 1 or 2, but some would have 3 or more. We could recode every observation into one of 4 categories: 0, 1, 2, 3+ and analyze offspring number as a categorical variable.
Another example: if we look at the number of seeds on the head of a dandelion, we might find that the distribution follows a normal or log-normal distribution. In that case we can treat it as a continuous variable, possibly taking the log before using it in our linear model.
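As a rough sketch of those two options (the Offspring vector here is made up for illustration, not part of the tutorial data):

# Hypothetical counts of offspring per individual
Offspring <- c(0, 1, 2, 0, 3, 5, 1, 2, 0, 4)

# Option 1: treat as categorical by collapsing rare values into a '3+' group
OffspringCat <- ifelse(Offspring >= 3, "3+", as.character(Offspring))
table(OffspringCat)

# Option 2: treat as (approximately) continuous, e.g. on a log scale
# (adding 1 avoids taking log(0))
OffspringLog <- log(Offspring + 1)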
Linear regression uses a continuous predictor
Continuous Response ~ Continuous Predictor
CCplot <- qplot(x=Xcon, y=Y)
CCplot
In linear regression, we want to find a straight line that fits the majority of the data. We can visualize this using geom_smooth():
CCplot <- CCplot + geom_smooth(method="lm")
lm(Y ~ Xcon)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Coefficients:
## (Intercept) Xcon
## 0.01684 0.03298
The output of the lm
function is not too informative. It just shows us the input formula and then the estimated coefficients. There are some additional functions to extract more information about the model.
LinReg <- lm(Y ~ Xcon)
summary(LinReg)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7766 -0.6543 -0.0101 0.6544 3.2131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01684 0.03137 0.537 0.591
## Xcon 0.03298 0.03186 1.035 0.301
##
## Residual standard error: 0.9917 on 998 degrees of freedom
## Multiple R-squared: 0.001073, Adjusted R-squared: 7.187e-05
## F-statistic: 1.072 on 1 and 998 DF, p-value: 0.3008
The Estimate column gives us the equation for the line of best fit, with the intercept and slope. Recall the linear equation:
\[ y = mX + b \]
In linear models, this exact same equation is usually written as:
\[ Y_i \sim \beta_0 + \beta_1 X_i\]
This gives us the equation for the estimated line, which would be okay if every single observation fell exactly on the line of best fit. We can plot for comparison:
CCplot
## `geom_smooth()` using formula 'y ~ x'
CCplot + geom_abline(slope=0.033, intercept=0.017, colour="red")
## `geom_smooth()` using formula 'y ~ x'
The red line that we added sits right on top of the lm line added from geom_smooth.
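Rather than typing the rounded estimates by hand, we could also pull them straight from the model object with coef(), a base R function (the blue line here is just to confirm it matches):

coef(LinReg) # intercept and slope estimated by lm()
CCplot + geom_abline(intercept=coef(LinReg)[1], slope=coef(LinReg)[2], colour="blue")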
We can also make specific predictions for each observation:
Predicted <- 0.033*Xcon + 0.017
CCplot + geom_point(aes(y=Predicted), colour="red")
## `geom_smooth()` using formula 'y ~ x'
However, we can see that most observations do not fall on the line of best fit. To fully describe the relationship between Y and X, we need to account for the deviation of each point from the line.
This is called the error term or residual error:
\[ \epsilon_i = Observed_i - Predicted_i \]
In other words, we subtract the predicted value from the observed value for each observation:
Resid <- Y - Predicted
head(Resid)
## [1] -0.599281048 -0.179429050 1.591182114 0.004957699 0.064136164
## [6] 1.693440399
Accounting for this error fully describes each point, giving the correct Linear Model equation:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Our regression model has three important vectors: the observed response \(Y\), the observed predictor \(X\), and the residual error \(\epsilon\).
Take a minute to review and make sure you understand this. The \(Y_i\) is the observed value of our response variable. The subscript \(i\) represents an individual value, which tells us that \(Y\) is a vector of values.
Similarly, the \(X_i\) is the observed value of our predictor variable, with the same subscript, meaning that \(X\) is also a vector.
And again, \(\epsilon\) is also a vector. This vector represents the difference between the observed Y and the predicted Y from our model. This is known as the vector of residuals or residual error. We can get these from our lm object.
LinResid <- residuals(LinReg)
head(LinResid)
## 1 2 3 4 5 6
## -0.599107663 -0.179313246 1.591309668 0.005148279 0.064326488 1.693602736
Compare with our calculation above:
head(Resid)
## [1] -0.599281048 -0.179429050 1.591182114 0.004957699 0.064136164
## [6] 1.693440399
Looking again at our linear regression output, we can see that each estimate has its own standard error, t-value and p-value.
LinReg <- lm(Y ~ Xcon)
summary(LinReg)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7766 -0.6543 -0.0101 0.6544 3.2131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01684 0.03137 0.537 0.591
## Xcon 0.03298 0.03186 1.035 0.301
##
## Residual standard error: 0.9917 on 998 degrees of freedom
## Multiple R-squared: 0.001073, Adjusted R-squared: 7.187e-05
## F-statistic: 1.072 on 1 and 998 DF, p-value: 0.3008
In our Distributions Tutorial we looked at the t-test equation for a mean and standard error. The same idea applies here: for each coefficient, the t-value is the estimate divided by its standard error, testing the null hypothesis that the coefficient (slope or intercept) is equal to zero.
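To see where those t-values come from, we can recompute one by hand from the coefficient table returned by summary() (a quick sketch, not part of the original tutorial):

CoefTab <- summary(LinReg)$coefficients # Estimate, Std. Error, t value, Pr(>|t|)
CoefTab["Xcon", "Estimate"] / CoefTab["Xcon", "Std. Error"] # same as the reported t value for the slope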
We can also see the Adjusted R-squared value, which is very close to zero. This R-squared value is an estimate of the amount of variation explained by the statistical model, ranging from 0 (no variation explained) to 1 (100% explained).
The F-statistic tests the overall fit of the model. In this case you can see that our model is not significant. We can see this more clearly using the anova() function:
anova(LinReg)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## Xcon 1 1.05 1.05400 1.0718 0.3008
## Residuals 998 981.42 0.98339
Let’s try to make a significant model by creating a predictor that is related to the response. A simple way is to add a random value to each observation:
CorPred <- Y + rnorm(length(Y))
CorMod <- lm(Y ~ CorPred)
qplot(x=CorPred, y=Y) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
summary(CorMod)
##
## Call:
## lm(formula = Y ~ CorPred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96917 -0.44505 -0.02051 0.47141 2.65104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01044 0.02246 -0.465 0.642
## CorPred 0.49200 0.01595 30.855 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7098 on 998 degrees of freedom
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4877
## F-statistic: 952.1 on 1 and 998 DF, p-value: < 2.2e-16
anova(CorMod)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## CorPred 1 479.66 479.66 952.05 < 2.2e-16 ***
## Residuals 998 502.81 0.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we get a significant slope but not a significant intercept, based on the t-tests of the estimates.
We also get a highly significant F-test and our model explains almost 50% of the variation.
Let’s look at a linear model with a categorical predictor.
CatMod <- lm(Y ~ Xcat)
qplot(x=Xcat, y=Y, geom="boxplot")
summary(CatMod)
##
## Call:
## lm(formula = Y ~ Xcat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
Similar to linear regression we have an estimate column, and for each estimate we also have a standard error, t-value and probability.
We also have R-squared and F-values describing the fit of the model, along with an overall p-value for the model.
BUT there is one important difference. Our predictor variable \(X_i\) is categorical in this model, rather than continuous. So how do we get from a categorical predictor to an estimated slope?
We have 3 categories: A, B, C, and we also have 3 estimates: (Intercept), XcatB, and XcatC.
We can define 3 categories using 2 binary variables.
RecodeDat <- data.frame(Response=Y,
                        PredGroup=Xcat) %>%
  mutate(XcatB=recode(Xcat,"A"=0,"B"=1,"C"=0),
         XcatC=recode(Xcat,"A"=0,"B"=0,"C"=1))
Why no XcatA? We know that Xcat=A when XcatB=0 AND XcatC=0.
Any predictor with N categories can be recoded into N-1 binary columns
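This recoding is exactly what lm() does internally. As a quick check (not part of the original workflow), we can peek at the design matrix it builds with the base R function model.matrix():

head(model.matrix(~ Xcat)) # an intercept column plus N-1 binary columns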
Check that the data are recoded properly
head(RecodeDat)
## Response PredGroup XcatB XcatC
## 1 -0.56047565 A 0 0
## 2 -0.23017749 C 0 1
## 3 1.55870831 C 0 1
## 4 0.07050839 C 0 1
## 5 0.12928774 A 0 0
## 6 1.71506499 B 1 0
And run the model
RecLM <- lm(Response ~ XcatB + XcatC, data=RecodeDat)
summary(RecLM)
##
## Call:
## lm(formula = Response ~ XcatB + XcatC, data = RecodeDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
and compare to the original ANOVA
summary(CatMod)
##
## Call:
## lm(formula = Y ~ Xcat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
This is equivalent to taking the deviation of the means:
A <- RecodeDat %>%
  filter(PredGroup=="A") %>%
  summarize(mean(Response))
B <- RecodeDat %>%
  filter(PredGroup=="B") %>%
  summarize(mean(Response))
C <- RecodeDat %>%
  filter(PredGroup=="C") %>%
  summarize(mean(Response))
A
## mean(Response)
## 1 0.04109208
B-A
## mean(Response)
## 1 -0.01376664
C-A
## mean(Response)
## 1 -0.05980686
Let’s now look at the output of the anova() function:
anova(CatMod)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## Xcat 2 0.65 0.32558 0.3306 0.7186
## Residuals 997 981.82 0.98478
Notice how there is now just a single predictor (Xcat) rather than separate predictors for each binary subcategory.
The equation for this ANOVA can be written the same as the regression model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
We just have to keep in mind that \(X_i\) is a nominal variable with N categories recoded as N-1 binary variables. \(\beta_0\) is the mean of one category of data (Intercept). There are N-1 \(\beta_1\) terms specifying the deviation of each group from the Intercept.
NOTE: The lm() function chooses the category for \(\beta_0\) (the Intercept) as the first category in alphabetical order.
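If we want a different baseline category, one option (a side note, not in the original tutorial) is to convert the predictor to a factor and change its reference level with relevel():

XcatRef <- relevel(factor(Xcat), ref="C") # make 'C' the reference category instead of 'A'
coef(lm(Y ~ XcatRef))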
Contrast this with a continuous variable where \(\beta_1\) is a single value for the slope, which is multiplied by \(X_i\)
So far we have looked at models with a single predictor. What if we have 2 or more predictors?
We can simply add additional predictor \(X\) terms to our linear model
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + ... + \epsilon_i\]
There are a few specific ‘flavours’ of these models, as shown below. You should know these because they are still used in the scientific literature, but remember that they are just different combinations of predictor variables:
Now that you understand the basic structure of a linear model, let’s build one using simulated data in R, with a few different parameters we can adjust to see how these changes affect the output of lm().
In some of the above models, we made \(Y\) independent of \(X\) by randomly sampling \(Y\) from a normal distribution. In the models below, we will make \(Y\) dependent on the predictors.
In each case, we’ll look at the linear model equation, and code it in R.
ANCOVA is short for ANalysis of COVariance, meaning it is an ANOVA (nominal predictor \(cat\)) combined with a covariate (continuous predictor \(con\)).
\[ Y_i = \beta_0 + \beta_{cat} X_{cat,i} + \beta_{con} X_{con,i} + \epsilon_i\]
Let’s imagine a cancer drug trial where we want to test the effect of a drug (+ placebo) and we want to include age as a covariate. Our response variable would have to be a continuous measurement – let’s say tumor growth.
First, let’s create our predictor variables
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
BetaCon <- 0.03 # Slope of our continuous variable
SimDat <- data.frame(
  Treat=sample(c("Drug","Placebo"), N, replace=T), # Treatment
  Age=sample(18:80, N, replace=T), # Age (between 18 and 80 years)
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(TreatMean=recode(Treat,"Drug"=0,"Placebo"=1.2)) # Means of treatments
head(SimDat)
## Treat Age Err TreatMean
## 1 Placebo 32 -1.0249403 1.2
## 2 Placebo 56 -1.3919035 1.2
## 3 Placebo 63 0.8392222 1.2
## 4 Placebo 77 -1.0132896 1.2
## 5 Drug 38 1.1013809 0.0
## 6 Placebo 62 -0.3482810 1.2
Now we just plug those into our linear model equation to generate our dependent variable Growth. Then plot the data and run lm():
SimDat <- SimDat %>%
  mutate(Growth = Beta0 + TreatMean + BetaCon*Age + Err)
qplot(x=Age, y=Growth, colour=Treat, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
GrowthMod <- lm(Growth ~ Age + Treat, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 259.16 259.16 274.40 < 2.2e-16 ***
## Treat 1 345.86 345.86 366.21 < 2.2e-16 ***
## Residuals 997 941.60 0.94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ Age + Treat, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71009 -0.68516 0.00184 0.65625 2.77062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.122878 0.094807 1.296 0.195
## Age 0.028373 0.001708 16.612 <2e-16 ***
## TreatPlacebo 1.176208 0.061463 19.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9718 on 997 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.39
## F-statistic: 320.3 on 2 and 997 DF, p-value: < 2.2e-16
Try changing the parameters below and re-running the code above. How do the graph and statistical output change?

N – Sample Size
Beta0 – Overall mean
BetaCon – Slope of our continuous variable
Err=rnorm(..., mean=??, sd=??) – mean & sd of error term
TreatMean=recode(Treat,"Drug"=??,"Placebo"=??) – Means of treatments

A multifactor ANOVA is an analysis of variance with 2 or more categorical predictors (factors). For example, if we wanted to test tumour growth from three chemicals on male vs female mice.
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i\]
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SimDat <- data.frame(
  Treat=sample(c("Control","DrugA","DrugB","DrugC"), N, replace=T), # Treatment
  Sex=sample(c("M","F"), N, replace=T), # Sex
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(TreatMean=recode(Treat,"Control"=0,"DrugA"=1.2,"DrugB"=-2,"DrugC"=0),
         SexMean=recode(Sex,"M"=0.5,"F"=0)) # Means of treatments
head(SimDat)
## Treat Sex Err TreatMean SexMean
## 1 DrugA F 1.7310308 1.2 0.0
## 2 DrugA M -0.2693376 1.2 0.5
## 3 DrugB M 0.8458420 -2.0 0.5
## 4 Control M -0.9024574 0.0 0.5
## 5 Control F -0.9099187 0.0 0.0
## 6 Control M 1.1299475 0.0 0.5
SimDat <- SimDat %>%
  mutate(Growth = Beta0 + TreatMean + SexMean + Err)
qplot(x=Treat, y=Growth, fill=Sex, data=SimDat, geom="boxplot")
GrowthMod <- lm(Growth ~ Sex + Treat, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 73.84 73.84 75.428 < 2.2e-16 ***
## Treat 3 1359.54 453.18 462.902 < 2.2e-16 ***
## Residuals 995 974.10 0.98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ Sex + Treat, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8074 -0.6043 -0.0043 0.6214 3.1800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13261 0.06985 1.899 0.0579 .
## SexM 0.46489 0.06272 7.412 2.65e-13 ***
## TreatDrugA 1.04237 0.08681 12.007 < 2e-16 ***
## TreatDrugB -2.21853 0.08983 -24.696 < 2e-16 ***
## TreatDrugC -0.04354 0.08805 -0.495 0.6211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9894 on 995 degrees of freedom
## Multiple R-squared: 0.5954, Adjusted R-squared: 0.5938
## F-statistic: 366 on 4 and 995 DF, p-value: < 2.2e-16
A multiple regression is just a regression with 2+ continuous predictors.
CHALLENGE: Make up an experiment with two predictors and try to encode them
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i\]
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SlopeA <- 0.3 # Slope for 1st continuous predictor
SlopeB <- -0.4 # Slope for 2nd continuous predictor
SimDat <- data.frame(
  TreatA=sample(0:100, N, replace=T),
  TreatB=sample(0:100, N, replace=T),
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(Growth = Beta0 + SlopeA*TreatA + SlopeB*TreatB + Err)
Visualizing the model is tricky because we have two predictors now. We could plot each separately on the x-axis, or we could scale point size to one of them (sketched after the two plots below).
qplot(x=TreatA, y=Growth, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
qplot(x=TreatB, y=Growth, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
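The point-size option mentioned above might look something like this (just one sketch of many possibilities; alpha is added to reduce overplotting):

qplot(x=TreatA, y=Growth, size=TreatB, alpha=I(0.3), data=SimDat)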
GrowthMod <- lm(Growth ~ TreatA + TreatB, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## TreatA 1 74035 74035 76446 < 2.2e-16 ***
## TreatB 1 135522 135522 139935 < 2.2e-16 ***
## Residuals 997 966 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ TreatA + TreatB, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7720 -0.7192 0.0347 0.6743 3.2024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.114359 0.079801 -1.433 0.152
## TreatA 0.299755 0.001073 279.287 <2e-16 ***
## TreatB -0.398664 0.001066 -374.079 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9841 on 997 degrees of freedom
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.082e+05 on 2 and 997 DF, p-value: < 2.2e-16
A polynomial regression is a special type of multiple regression where the different predictors are polynomials of the same variable.
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X^2_{1,i} + ... + \beta_n X^n_{1,i} + \epsilon_i\]

This is easier to understand with an example.
Let’s say we want to look at the relationship between plant height and plant biomass, with biomass following a quadratic relationship:
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SlopeA <- 0.3 # Slope for 1st continuous predictor
SlopeB <- -0.4 # Slope for 2nd (quadratic) predictor
SimDat <- data.frame(
  Height=sample(c(0:1000)/100, N, replace=T),
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(Biomass = Beta0 + SlopeA*Height + SlopeB*Height^2 + Err)
qplot(x=Height, y=Biomass, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
To fit a polynomial, we use the identity function I() inside the lm() formula:
GrowthMod <- lm(Biomass ~ Height + I(Height^2), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 115483 115483 113850.8 < 2.2e-16 ***
## I(Height^2) 1 9198 9198 9067.9 < 2.2e-16 ***
## Residuals 997 1011 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass ~ Height + I(Height^2), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.00146 -0.71496 -0.01533 0.68273 2.94893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.137145 0.098036 -1.399 0.162
## Height 0.371397 0.044576 8.332 2.61e-16 ***
## I(Height^2) -0.407120 0.004275 -95.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.9919
## F-statistic: 6.146e+04 on 2 and 997 DF, p-value: < 2.2e-16
Let’s look at a cubic relationship for comparison:
SlopeC <- -2
SimDat <- SimDat %>%
  mutate(Biomass2 = Beta0 + SlopeA*Height + SlopeB*Height^2 + SlopeC*Height^3 + Err)
GrowthMod <- lm(Biomass2 ~ Height + I(Height^2) + I(Height^3), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass2
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 287699795 287699795 283740631 < 2.2e-16 ***
## I(Height^2) 1 51789157 51789157 51076463 < 2.2e-16 ***
## I(Height^3) 1 1466001 1466001 1445827 < 2.2e-16 ***
## Residuals 996 1010 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass2 ~ Height + I(Height^2) + I(Height^3), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9589 -0.7127 -0.0211 0.6813 2.9242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.237635 0.130236 -1.825 0.0684 .
## Height 0.489653 0.110319 4.438 1.01e-05 ***
## I(Height^2) -0.436466 0.025405 -17.180 < 2e-16 ***
## I(Height^3) -1.998053 0.001662 -1202.425 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 996 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.121e+08 on 3 and 996 DF, p-value: < 2.2e-16
Notice how the estimates and standard errors for the lower-order terms change once we add the cubic term, even though the slopes we used to simulate the data haven't changed. The reason for this is complicated, but it has to do with the fact that the polynomial predictors are not independent: Height, Height^2 and Height^3 are strongly correlated with one another.
We can address this problem with the poly() function, but note that it changes the estimates.
GrowthMod <- lm(Biomass2 ~ poly(Height, degree=3), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass2
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(Height, degree = 3) 3 340954953 113651651 112087640 < 2.2e-16 ***
## Residuals 996 1010 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass2 ~ poly(Height, degree = 3), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9589 -0.7127 -0.0211 0.6813 2.9242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.241e+02 3.184e-02 -16458 <2e-16 ***
## poly(Height, degree = 3)1 -1.696e+04 1.007e+00 -16845 <2e-16 ***
## poly(Height, degree = 3)2 -7.196e+03 1.007e+00 -7147 <2e-16 ***
## poly(Height, degree = 3)3 -1.211e+03 1.007e+00 -1202 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 996 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.121e+08 on 3 and 996 DF, p-value: < 2.2e-16
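The change in the estimates happens because poly() uses orthogonal polynomials by default. As an aside, setting raw=TRUE returns coefficients on the original Height scale, matching the Height + I(Height^2) + I(Height^3) model above:

summary(lm(Biomass2 ~ poly(Height, degree=3, raw=TRUE), data=SimDat))$coefficients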
We can also add the polynomial term to a ggplot object if we use the formula parameter in geom_smooth():
qplot(x=Height, y=Biomass2, data=SimDat) + geom_smooth(formula=y ~ poly(x,3, raw=TRUE))
## `geom_smooth()` using method = 'gam'
The key assumptions of our linear models are that the residual error follows a normal distribution with a mean of zero, AND that the error variance is independent of the predictors.
There are a few convenient ways to check this, but first, let’s set up a few models, starting with a base dataset.
N <- 1000
B0 <- 0
B1 <- 0.03
TestDat <- data.frame(
  PredCon=sample(c(1:100)/10, N, replace=T),
  PredNom=sample(c("A","B","C"), N, replace=T),
  Err=rnorm(N, mean=0, sd=1)
) %>%
  mutate(PredNomMean=recode(PredNom,"A"=0,"B"=5,"C"=-5)) %>%
  mutate(Response=B0+B1*PredCon+PredNomMean+Err)
TestMod <- lm(Response ~ PredCon + PredNom, data=TestDat)
anova(TestMod)
## Analysis of Variance Table
##
## Response: Response
## Df Sum Sq Mean Sq F value Pr(>F)
## PredCon 1 0.2 0.2 0.2407 0.6238
## PredNom 2 16850.7 8425.3 8839.1708 <2e-16 ***
## Residuals 996 949.4 1.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(TestMod)
##
## Call:
## lm(formula = Response ~ PredCon + PredNom, data = TestDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1999 -0.6896 -0.0052 0.6190 2.8556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13321 0.07588 -1.756 0.079463 .
## PredCon 0.03593 0.01073 3.347 0.000847 ***
## PredNomB 5.14029 0.07566 67.936 < 2e-16 ***
## PredNomC -4.91573 0.07563 -65.001 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9763 on 996 degrees of freedom
## Multiple R-squared: 0.9467, Adjusted R-squared: 0.9465
## F-statistic: 5893 on 3 and 996 DF, p-value: < 2.2e-16
This model should meet our assumptions, but let’s check. There are three main ways to do this, all using the residuals of our model, which we can calculate manually or extract with the residuals() function on our model object.
qplot(residuals(TestMod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This looks normal and is centered around zero, but what is a normal distribution supposed to look like? We can use our qnorm() function to generate the expected quantile of each observation. We could then plot the observed vs. expected values.
This is called the quantile-quantile or q-q plot
The Quantile-Quantile (Q-Q) Plot bins the residuals the same way we would for a histogram, but then calculates the number of observations we expect in each bin for a normal distribution. If the data are perfectly normal, then all the points will fall along the 1:1 line
qplot(sample=residuals(TestMod)) + stat_qq() + stat_qq_line()
We can see some very slight deviation from the line at the extreme values, where we have a small sample size.
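If we want a formal test to go along with the visual check, base R's shapiro.test() can be applied to the residuals (an optional aside, not part of the original tutorial; with large samples it will flag even trivial deviations, so the plots are usually more informative):

shapiro.test(residuals(TestMod))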
The last test is to make sure that variance is constant across our predictors. We can do this for each predictor variable alone, as well as the overall prediction from the model. To do this, let’s add residuals to the dataset:
TestDat$Residuals <- residuals(TestMod)
qplot(x=PredCon, y=Residuals, data=TestDat)
qplot(x=PredNom, y=Residuals, data=TestDat, geom="boxplot")
Note how the residuals span a similar range and are centred at zero across all values of x in both graphs. We should also not see any correlation between the x- and y-axes.
We can also generate the predicted values using the estimates from the linear model output, but an easier way is the predict() function.
TestDat$Predicted <- predict(TestMod)
qplot(x=Predicted, y=Residuals, data=TestDat)
Since we have continuous + nominal predictors, we want to make sure that each group is centred at zero with similar variance, and that there are no correlations between the x- and y-axes.
Let’s make a couple of models that violate the assumptions for comparison.
The first example is the one above – fitting a linear model to a quadratic relationship
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response2=B0+B1*PredCon+0.2*PredCon^2+PredNomMean+Err)

# Run the linear model
TestMod2 <- lm(Response2 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted2 <- predict(TestMod2)
TestDat$Residuals2 <- residuals(TestMod2)
# Plot to look for problems
qplot(x=Residuals2, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals2, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals2, data=TestDat)
qplot(x=PredNom, y=Residuals2, data=TestDat, geom="boxplot")
qplot(x=Predicted2, y=Residuals2, data=TestDat)
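For comparison, if we refit the model to match how Response2 was generated, adding the quadratic term with I() as above, the curvature in the residuals should disappear (TestMod2b is just an illustrative name):

TestMod2b <- lm(Response2 ~ PredCon + I(PredCon^2) + PredNom, data=TestDat)
qplot(x=predict(TestMod2b), y=residuals(TestMod2b)) # residuals now show no curvature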
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response3=exp(B0+B1*PredCon+PredNomMean+Err))

# Run the linear model
TestMod3 <- lm(Response3 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted3 <- predict(TestMod3)
TestDat$Residuals3 <- residuals(TestMod3)
# Plot to look for problems
qplot(x=Residuals3, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals3, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals3, data=TestDat)
qplot(x=PredNom, y=Residuals3, data=TestDat, geom="boxplot")
qplot(x=Predicted3, y=Residuals3, data=TestDat)
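Because Response3 was generated by exponentiating a linear model, log-transforming the response should restore well-behaved residuals (TestMod3b is just an illustrative name):

TestMod3b <- lm(log(Response3) ~ PredCon + PredNom, data=TestDat)
qplot(sample=residuals(TestMod3b)) + stat_qq() + stat_qq_line() # much closer to the 1:1 line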
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response4=B0+B1*PredCon*Err+PredNomMean*Err/10)

# Run the linear model
TestMod4 <- lm(Response4 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted4 <- predict(TestMod4)
TestDat$Residuals4 <- residuals(TestMod4)
# Plot to look for problems
qplot(x=Residuals4, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals4, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals4, data=TestDat)
qplot(x=PredNom, y=Residuals4, data=TestDat, geom="boxplot")
qplot(x=Predicted4, y=Residuals4, data=TestDat)
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response5=B0+B1*PredCon+PredNomMean+rpois(N,lambda=1))

# Run the linear model
TestMod5 <- lm(Response5 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted5 <- predict(TestMod5)
TestDat$Residuals5 <- residuals(TestMod5)
# Plot to look for problems
qplot(x=Residuals5, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals5, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals5, data=TestDat)
qplot(x=PredNom, y=Residuals5, data=TestDat, geom="boxplot")
qplot(x=Predicted5, y=Residuals5, data=TestDat)
Note that in our analysis above, we are analyzing the RESIDUALS. Our assumptions with linear models are about the residuals NOT the predictor or response variables.
Let’s look back at our original, well-behaved model as an example. Compare the histograms of the response and predictors:
qplot(TestDat$Response)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TestDat$PredCon)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TestDat$PredNom)
In the case of our continuous response variable, we have a tri-modal distribution. But remember that our residuals look normally distributed:
qplot(residuals(TestMod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=residuals(TestMod)) + stat_qq() + stat_qq_line()
There is nothing wrong with the sample or the model! It doesn’t matter that the predictors and response are not normally distributed, as long as the residuals are.
Why is there a trimodal distribution in the response variable but not the residuals?
We have three groups with different means, showing up in the response as three peaks. When we take out the means by including the predictor in our linear model, then the residuals are the deviation from the group means, resulting in a single peak.
Once you understand how these quality checks work, there are some tools you can use to quickly plot and view the diagnostic graphs.
The autoplot() function in the ggfortify package is a nice complement to ggplot for generating diagnostic graphs. You can even colour-code your predictors:
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.5
autoplot(TestMod, which=1:4, data=TestDat, colour = 'PredNom')
Note the Cook’s distance plot. This checks whether you have a few outlier data points that contribute disproportionately to the fit of the model. The row numbers are shown for potential outliers, but looking at the figure we can see that the Cook’s distance measure is not much higher for these than for several other points. Try adding an outlier to the TestDat dataset to see what this would look like in the Cook’s distance plot.
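One way to try this (a sketch, with made-up values for the extra row) is to append a single extreme observation and re-run the diagnostics:

OutlierDat <- TestDat %>%
  select(PredCon, PredNom, Response) %>%
  rbind(data.frame(PredCon=5, PredNom="B", Response=50)) # one extreme, made-up point
OutlierMod <- lm(Response ~ PredCon + PredNom, data=OutlierDat)
autoplot(OutlierMod, which=1:4, data=OutlierDat, colour='PredNom')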
You can generate other diagnostic plots by changing the which parameter. See ?autoplot for more detail.