Load libraries and our custom plotting theme.
library(ggplot2) # plotting library
library(dplyr) # data management
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
We will be working with the same FallopiaData.csv
data that we used in the R fundamentals tutorial and the qplot
tutorial. We could download to a local folder again and then read from our computer.
Another trick is to read the file directly from the internet. This will work for any file available through http://
InDat <- read.csv("http://colauttilab.github.io/RCrashCourse/FallopiaData.csv")
str(InDat)
## 'data.frame': 123 obs. of 13 variables:
## $ PotNum : int 1 2 3 5 6 7 8 9 10 11 ...
## $ Scenario : chr "low" "low" "low" "low" ...
## $ Nutrients : chr "low" "low" "low" "low" ...
## $ Taxon : chr "japon" "japon" "japon" "japon" ...
## $ Symphytum : num 9.81 8.64 2.65 1.44 9.15 ...
## $ Silene : num 36.4 29.6 36 21.4 23.9 ...
## $ Urtica : num 16.08 5.59 17.09 12.39 5.19 ...
## $ Geranium : num 4.68 5.75 5.13 5.37 0 9.05 3.51 9.64 7.3 6.36 ...
## $ Geum : num 0.12 0.55 0.09 0.31 0.17 0.97 0.4 0.01 0.47 0.33 ...
## $ All_Natives : num 67 50.2 61 40.9 38.4 ...
## $ Fallopia : num 0.01 0.04 0.09 0.77 3.4 0.54 2.05 0.26 0 0 ...
## $ Total : num 67.1 50.2 61.1 41.7 41.8 ...
## $ Pct_Fallopia: num 0.01 0.08 0.15 1.85 8.13 1.12 3.7 0.61 0 0 ...
The structure function str() is a handy way to inspect the data, but it doesn't show us all the different values for the nominal (i.e. categorical) columns. We can use the unique() function for that. We could apply it to each column individually, or we can pipe through select() from the dplyr package introduced in the data science tutorial:
InDat %>%
  select(Scenario, Nutrients, Taxon) %>%
  unique()
## Scenario Nutrients Taxon
## 1 low low japon
## 13 low low bohem
## 26 high high japon
## 38 high high bohem
## 49 gradual high japon
## 61 gradual high bohem
## 74 extreme high japon
## 88 extreme high bohem
## 100 fluctuations high japon
## 113 fluctuations high bohem
This gives us every combination of the nominal variables that occurs in the data. In this case, we can see that most of the different Scenario categories are only applied to the high Nutrients treatment.
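If we also want to know how many pots fall into each combination (an optional check, not shown in the original output), we can group and count with dplyr:

InDat %>%
  group_by(Scenario, Nutrients, Taxon) %>%
  summarize(N=n()) # number of pots per treatment combination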
Most statistical textbooks and courses will go through all of the standard models including ANOVA and linear regression, then slightly more complicated models like polynomial regression, multiple regression, multifactor ANOVA, and ANCOVA.
I’m not sure why stats are always taught this way. There could be a good reason but I suspect it has more to do with the history of statistics: Student’s t-test was invented, followed by Fisher’s ANOVA, and then linear regression. Then these were expanded to more complicated cases.
However, today we know that all of these models are special cases of a general model in statistics called Linear Models. In R, linear models are run with the lm()
function. Rather than go through the history of statistics, we’ll focus on the application of linear models. That said, it’s important to know the difference between all those models listed above because many people learn statistics that way, so they are used to talking about those models.
NOTE: Linear Models are different from General Linear Models, which are both different from Generalized Linear Models. We also have Linear Mixed Effects Models. For now, it’s just important to know that these are different classes of models even though they sound similar.
In the qplot
tutorial we looked at visualizations for individual and paired data.
We saw that data can come in a few flavours: binary, nominal, ordinal, and continuous.
Different types of linear models (e.g. ANOVA vs Regression) are defined by the type of input data they use. Importantly, Linear Models have two types of input: a response variable and one or more predictor variables.
A Linear Model will always have a single, continuous response variable, but it can have one or more predictor variables that may include a mix of binary, nominal and continuous variables. Ordinal variables are a bit more tricky, and will be treated as (approximately) nominal or continuous variables, depending on the nature of the data.
To understand the syntax of a linear model, look at the help:
?lm
There are just two main components that we need to specify, the formula and the data; the rest we can keep as defaults for now.
The formula for linear models looks like this:
Response ~ Predictor(s)
You just have to remember the tilde and which side to put your predictor vs response variables.
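For example, using the FallopiaData we loaded above, a minimal call might look like this (Total and Nutrients are columns from the str() output above; the model itself is just for illustration):

ExampleMod <- lm(Total ~ Nutrients, data=InDat) # Total biomass as a function of Nutrient treatment
ExampleMod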
Let’s start by looking at a single response variable and a single predictor. Our response will always be a continuous variable in a Linear Model. Similar to the qplot tutorial, we can organize our linear models based on the type of predictor.
To visualize these models, let’s first create some data for each type of response variables:
Our continuous response variable (Y)
set.seed(123)
Y <- rnorm(1000)
A continuous predictor variable (Xcon)
set.seed(234)
Xcon <- rnorm(1000)
A categorical predictor variable (Xcat).
We can use the sample() function (see ?sample).
set.seed(345)
Xcat <- sample(c("A","B","C"), 1000, replace=T)
What about ordinal data? There are more advanced models that can handle this kind of data but we can also run them in a linear model if we make a choice whether to treat it as categorical or continuous.
For example, if we were looking at offspring numbers in Canadian moose, most would have 0, 1 or 2, but some would have 3 or more. We could recode every observation into one of 4 categories: 0, 1, 2, 3+ and analyze offspring number as a categorical variable.
Another example: if we look at the number of seeds on the head of a dandelion, we might find that the distribution follows a normal or log-normal distribution. In that case we can treat it as a continuous variable, possibly taking the log before using it in our linear model.
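As a rough sketch of those two options (the Offspring vector here is made up for illustration, not part of the tutorial data):

# Hypothetical counts of offspring per individual
Offspring <- c(0, 1, 2, 0, 3, 5, 1, 2, 0, 4)

# Option 1: treat as categorical by collapsing rare values into a '3+' group
OffspringCat <- ifelse(Offspring >= 3, "3+", as.character(Offspring))
table(OffspringCat)

# Option 2: treat as (approximately) continuous, e.g. on a log scale
# (adding 1 avoids taking log(0))
OffspringLog <- log(Offspring + 1)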
Linear regression uses a continuous predictor
Continuous Response ~ Continuous Predictor
CCplot <- qplot(x=Xcon, y=Y)
CCplot
In linear regression, we want to find a straight line that fits the majority of the data. We can visualize this using geom_smooth():
CCplot <- CCplot + geom_smooth(method="lm")
lm(Y ~ Xcon)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Coefficients:
## (Intercept) Xcon
## 0.01684 0.03298
The output of the lm
function is not too informative. It just shows us the input formula and then the estimated coefficients. There are some additional functions to extract more information about the model.
LinReg <- lm(Y ~ Xcon)
summary(LinReg)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7766 -0.6543 -0.0101 0.6544 3.2131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01684 0.03137 0.537 0.591
## Xcon 0.03298 0.03186 1.035 0.301
##
## Residual standard error: 0.9917 on 998 degrees of freedom
## Multiple R-squared: 0.001073, Adjusted R-squared: 7.187e-05
## F-statistic: 1.072 on 1 and 998 DF, p-value: 0.3008
The Estimate column gives us the equation for the line of best fit, with the intercept and slope. Recall the linear equation:
\[ y = mX + b \]
In linear models, this exact same equation is usually written as:
\[ Y_i \sim \beta_0 + \beta_1 X_i\]
This gives us the equation for the estimated line, which would be okay if every single observation fell exactly on the line of best fit. We can plot for comparison:
CCplot
## `geom_smooth()` using formula 'y ~ x'
CCplot + geom_abline(slope=0.033, intercept=0.017, colour="red")
## `geom_smooth()` using formula 'y ~ x'
The red line that we added sits right on top of the lm line added from geom_smooth.
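Rather than typing the rounded estimates by hand, we could also pull them straight from the model object with coef(), a base R function (the blue line here is just to confirm it matches):

coef(LinReg) # intercept and slope estimated by lm()
CCplot + geom_abline(intercept=coef(LinReg)[1], slope=coef(LinReg)[2], colour="blue")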
We can also make specific predictions for each observation:
Predicted <- 0.033*Xcon + 0.017
CCplot + geom_point(aes(y=Predicted), colour="red")
## `geom_smooth()` using formula 'y ~ x'
However, we can see that most observations do not fall on the line of best fit. To fully describe the relationship between Y and X, we need to account for the deviation of each point from the line.
This is called the error term or residual error:
\[ \epsilon_i = Observed_i - Predicted_i \]
In other words, we subtract the predicted value from the observed value for each observation:
Resid <- Y - Predicted
head(Resid)
## [1] -0.599281048 -0.179429050 1.591182114 0.004957699 0.064136164
## [6] 1.693440399
Accounting for this error fully describes each point, giving the correct Linear Model equation:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Our regression model has three important vectors: the observed response \(Y\), the observed predictor \(X\), and the residual error \(\epsilon\).
Take a minute to review and make sure you understand this. The \(Y_i\) is the observed value of our response variable. The subscript \(i\) represents an individual value, which tells us that \(Y\) is a vector of values.
Similarly, the \(X_i\) is the observed value of our predictor variable, with the same subscript, meaning that \(X\) is also a vector.
And again, \(\epsilon\) is also a vector. This vector represents the difference between the observed Y and the predicted Y from our model. This is known as the vector of residuals or residual error. We can get these from our lm object.
LinResid <- residuals(LinReg)
head(LinResid)
## 1 2 3 4 5 6
## -0.599107663 -0.179313246 1.591309668 0.005148279 0.064326488 1.693602736
Compare with our calculation above:
head(Resid)
## [1] -0.599281048 -0.179429050 1.591182114 0.004957699 0.064136164
## [6] 1.693440399
Looking again at our linear regression output, we can see that each estimate has its own standard error, t-value and p-value.
LinReg <- lm(Y ~ Xcon)
summary(LinReg)
##
## Call:
## lm(formula = Y ~ Xcon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7766 -0.6543 -0.0101 0.6544 3.2131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01684 0.03137 0.537 0.591
## Xcon 0.03298 0.03186 1.035 0.301
##
## Residual standard error: 0.9917 on 998 degrees of freedom
## Multiple R-squared: 0.001073, Adjusted R-squared: 7.187e-05
## F-statistic: 1.072 on 1 and 998 DF, p-value: 0.3008
In our Distributions Tutorial we looked at the t-test equation for a mean and standard error. The same idea applies here: for each coefficient, the t-value is the estimate divided by its standard error, testing the null hypothesis that the coefficient (slope or intercept) is equal to zero.
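To see where those t-values come from, we can recompute one by hand from the coefficient table returned by summary() (a quick sketch, not part of the original tutorial):

CoefTab <- summary(LinReg)$coefficients # Estimate, Std. Error, t value, Pr(>|t|)
CoefTab["Xcon", "Estimate"] / CoefTab["Xcon", "Std. Error"] # same as the reported t value for the slope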
We can also see the Adjusted R-squared value, which is very close to zero. This R-squared value is an estimate of the amount of variation explained by the statistical model, ranging from 0 (no variation explained) to 1 (100% explained).
The F-statistic tests the overall fit of the model. In this case you can see that our model is not significant. We can see this more clearly using the anova() function:
anova(LinReg)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## Xcon 1 1.05 1.05400 1.0718 0.3008
## Residuals 998 981.42 0.98339
Let’s try to make a significant model by creating a predictor that is related to the response. A simple way is to add a random value to each observation:
CorPred <- Y + rnorm(length(Y))
CorMod <- lm(Y ~ CorPred)
qplot(x=CorPred, y=Y) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
summary(CorMod)
##
## Call:
## lm(formula = Y ~ CorPred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96917 -0.44505 -0.02051 0.47141 2.65104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01044 0.02246 -0.465 0.642
## CorPred 0.49200 0.01595 30.855 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7098 on 998 degrees of freedom
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4877
## F-statistic: 952.1 on 1 and 998 DF, p-value: < 2.2e-16
anova(CorMod)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## CorPred 1 479.66 479.66 952.05 < 2.2e-16 ***
## Residuals 998 502.81 0.50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we get a significant slope but not a significant intercept, based on the t-tests of the estimates.
We also get a highly significant F-test and our model explains almost 50% of the variation.
Let’s look at a linear model with a categorical predictor.
CatMod <- lm(Y ~ Xcat)
qplot(x=Xcat, y=Y, geom="boxplot")
summary(CatMod)
##
## Call:
## lm(formula = Y ~ Xcat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
Similar to linear regression we have an estimate column, and for each estimate we also have a standard error, t-value and probability.
We also have R-squared and F-values describing the fit of the model, along with an overall p-value for the model.
BUT there is one important difference. Our predictor variable \(X_i\) is categorical in this model, rather than continuous. So how do we get from a categorical predictor to an estimated slope?
We have 3 categories: A, B, C, and we also have 3 estimates: (Intercept), XcatB, and XcatC.
We can define 3 categories using 2 binary variables.
RecodeDat <- data.frame(Response=Y,
                        PredGroup=Xcat) %>%
  mutate(XcatB=recode(Xcat,"A"=0,"B"=1,"C"=0),
         XcatC=recode(Xcat,"A"=0,"B"=0,"C"=1))
Why no XcatA? We know that Xcat=A when XcatB=0 AND XcatC=0.
Any predictor with N categories can be recoded into N-1 binary columns
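This recoding is exactly what lm() does internally. As a quick check (not part of the original workflow), we can peek at the design matrix it builds with the base R function model.matrix():

head(model.matrix(~ Xcat)) # an intercept column plus N-1 binary columns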
Check that the data are recoded properly
head(RecodeDat)
## Response PredGroup XcatB XcatC
## 1 -0.56047565 A 0 0
## 2 -0.23017749 C 0 1
## 3 1.55870831 C 0 1
## 4 0.07050839 C 0 1
## 5 0.12928774 A 0 0
## 6 1.71506499 B 1 0
And run the model
RecLM <- lm(Response ~ XcatB + XcatC, data=RecodeDat)
summary(RecLM)
##
## Call:
## lm(formula = Response ~ XcatB + XcatC, data = RecodeDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
and compare to the original ANOVA
summary(CatMod)
##
## Call:
## lm(formula = Y ~ Xcat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7911 -0.6442 -0.0092 0.6606 3.1999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04109 0.05574 0.737 0.461
## XcatB -0.01377 0.07721 -0.178 0.859
## XcatC -0.05981 0.07759 -0.771 0.441
##
## Residual standard error: 0.9924 on 997 degrees of freedom
## Multiple R-squared: 0.0006628, Adjusted R-squared: -0.001342
## F-statistic: 0.3306 on 2 and 997 DF, p-value: 0.7186
This is equivalent to taking the deviation of the means:
A <- RecodeDat %>%
  filter(PredGroup=="A") %>%
  summarize(mean(Response))
B <- RecodeDat %>%
  filter(PredGroup=="B") %>%
  summarize(mean(Response))
C <- RecodeDat %>%
  filter(PredGroup=="C") %>%
  summarize(mean(Response))
A
## mean(Response)
## 1 0.04109208
B-A
## mean(Response)
## 1 -0.01376664
C-A
## mean(Response)
## 1 -0.05980686
Let’s now look at the output of the anova() function:
anova(CatMod)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## Xcat 2 0.65 0.32558 0.3306 0.7186
## Residuals 997 981.82 0.98478
Notice how there is now just a single predictor (Xcat) rather than separate predictors for each binary subcategory.
The equation for this ANOVA can be written the same as the regression model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
We just have to keep in mind that \(X_i\) is a nominal variable with N categories recoded as N-1 binary variables. \(\beta_0\) is the mean of one category of data (Intercept). There are N-1 \(\beta_1\) terms specifying the deviation of each group from the Intercept.
NOTE: The lm() function chooses the category for \(\beta_0\) (the Intercept) as the first category in alphabetical order.
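If we want a different baseline category, one option (a side note, not in the original tutorial) is to convert the predictor to a factor and change its reference level with relevel():

XcatRef <- relevel(factor(Xcat), ref="C") # make 'C' the reference category instead of 'A'
coef(lm(Y ~ XcatRef))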
Contrast this with a continuous variable where \(\beta_1\) is a single value for the slope, which is multiplied by \(X_i\)
So far we have looked at models with a single predictor. What if we have 2 or more predictors?
We can simply add additional predictor \(X\) terms to our linear model
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + ... + \epsilon_i\]
There are a few specific ‘flavours’ of these models, as shown below. You should know these because they are still used in the scientific literature, but remember that they are just different combinations of predictor variables:
Now that you understand the basic structure of a linear model, let’s build one using simulated data in R, with a few different parameters we can adjust to see how these changes affect the output of lm().
In some of the above models, we made \(Y\) independent of \(X\) by randomly sampling \(Y\) from a normal distribution. In the models below, we will make \(Y\) dependent on the predictors.
In each case, we’ll look at the linear model equation, and code it in R.
ANCOVA is short for ANalysis of COVariance, meaning it is an ANOVA (nominal predictor \(cat\)) combined with a covariate (continuous predictor \(con\)).
\[ Y_i = \beta_0 + \beta_{cat} X_{cat,i} + \beta_{con} X_{con,i} + \epsilon_i\]
Let’s imagine a cancer drug trial where we want to test the effect of a drug (+ placebo) and we want to include age as a covariate. Our response variable would have to be a continuous measurement – let’s say tumor growth.
First, let’s create our predictor variables
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
BetaCon <- 0.03 # Slope of our continuous variable
SimDat <- data.frame(
  Treat=sample(c("Drug","Placebo"), N, replace=T), # Treatment
  Age=sample(18:80, N, replace=T), # Age (between 18 and 80 years)
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(TreatMean=recode(Treat,"Drug"=0,"Placebo"=1.2)) # Means of treatments
head(SimDat)
## Treat Age Err TreatMean
## 1 Placebo 32 -1.0249403 1.2
## 2 Placebo 56 -1.3919035 1.2
## 3 Placebo 63 0.8392222 1.2
## 4 Placebo 77 -1.0132896 1.2
## 5 Drug 38 1.1013809 0.0
## 6 Placebo 62 -0.3482810 1.2
Now we just plug those into our linear model equation to generate our dependent variable Growth. Then plot the data and run lm():
SimDat <- SimDat %>%
  mutate(Growth = Beta0 + TreatMean + BetaCon*Age + Err)
qplot(x=Age, y=Growth, colour=Treat, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
GrowthMod <- lm(Growth ~ Age + Treat, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 259.16 259.16 274.40 < 2.2e-16 ***
## Treat 1 345.86 345.86 366.21 < 2.2e-16 ***
## Residuals 997 941.60 0.94
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ Age + Treat, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71009 -0.68516 0.00184 0.65625 2.77062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.122878 0.094807 1.296 0.195
## Age 0.028373 0.001708 16.612 <2e-16 ***
## TreatPlacebo 1.176208 0.061463 19.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9718 on 997 degrees of freedom
## Multiple R-squared: 0.3912, Adjusted R-squared: 0.39
## F-statistic: 320.3 on 2 and 997 DF, p-value: < 2.2e-16
Try changing the parameters below and re-running the code above. How do the graph and statistical output change?

N – Sample Size
Beta0 – Overall mean
BetaCon – Slope of our continuous variable
Err=rnorm(..., mean=??, sd=??) – mean & sd of error term
TreatMean=recode(Treat,"Drug"=??,"Placebo"=??) – Means of treatments

A multifactor ANOVA is an analysis of variance with 2 or more categorical predictors (factors). For example, if we wanted to test tumour growth from three chemicals on male vs female mice.
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i\]
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SimDat <- data.frame(
  Treat=sample(c("Control","DrugA","DrugB","DrugC"), N, replace=T), # Treatment
  Sex=sample(c("M","F"), N, replace=T), # Sex
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(TreatMean=recode(Treat,"Control"=0,"DrugA"=1.2,"DrugB"=-2,"DrugC"=0),
         SexMean=recode(Sex,"M"=0.5,"F"=0)) # Means of treatments
head(SimDat)
## Treat Sex Err TreatMean SexMean
## 1 DrugA F 1.7310308 1.2 0.0
## 2 DrugA M -0.2693376 1.2 0.5
## 3 DrugB M 0.8458420 -2.0 0.5
## 4 Control M -0.9024574 0.0 0.5
## 5 Control F -0.9099187 0.0 0.0
## 6 Control M 1.1299475 0.0 0.5
SimDat <- SimDat %>%
  mutate(Growth = Beta0 + TreatMean + SexMean + Err)
qplot(x=Treat, y=Growth, fill=Sex, data=SimDat, geom="boxplot")
GrowthMod <- lm(Growth ~ Sex + Treat, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 73.84 73.84 75.428 < 2.2e-16 ***
## Treat 3 1359.54 453.18 462.902 < 2.2e-16 ***
## Residuals 995 974.10 0.98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ Sex + Treat, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8074 -0.6043 -0.0043 0.6214 3.1800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13261 0.06985 1.899 0.0579 .
## SexM 0.46489 0.06272 7.412 2.65e-13 ***
## TreatDrugA 1.04237 0.08681 12.007 < 2e-16 ***
## TreatDrugB -2.21853 0.08983 -24.696 < 2e-16 ***
## TreatDrugC -0.04354 0.08805 -0.495 0.6211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9894 on 995 degrees of freedom
## Multiple R-squared: 0.5954, Adjusted R-squared: 0.5938
## F-statistic: 366 on 4 and 995 DF, p-value: < 2.2e-16
A multiple regression is just a regression with 2+ continuous predictors.
CHALLENGE: Make up an experiment with two predictors and try to encode them
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \epsilon_i\]
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SlopeA <- 0.3 # Slope for 1st continuous predictor
SlopeB <- -0.4 # Slope for 2nd continuous predictor
SimDat <- data.frame(
  TreatA=sample(0:100, N, replace=T),
  TreatB=sample(0:100, N, replace=T),
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(Growth = Beta0 + SlopeA*TreatA + SlopeB*TreatB + Err)
Visualizing the model is tricky because we have two predictors now. We could plot each separately on the x-axis, or we could scale point size to one of them (sketched after the two plots below).
qplot(x=TreatA, y=Growth, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
qplot(x=TreatB, y=Growth, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
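The point-size option mentioned above might look something like this (just one sketch of many possibilities; alpha is added to reduce overplotting):

qplot(x=TreatA, y=Growth, size=TreatB, alpha=I(0.3), data=SimDat)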
GrowthMod <- lm(Growth ~ TreatA + TreatB, data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Growth
## Df Sum Sq Mean Sq F value Pr(>F)
## TreatA 1 74035 74035 76446 < 2.2e-16 ***
## TreatB 1 135522 135522 139935 < 2.2e-16 ***
## Residuals 997 966 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Growth ~ TreatA + TreatB, data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7720 -0.7192 0.0347 0.6743 3.2024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.114359 0.079801 -1.433 0.152
## TreatA 0.299755 0.001073 279.287 <2e-16 ***
## TreatB -0.398664 0.001066 -374.079 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9841 on 997 degrees of freedom
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.082e+05 on 2 and 997 DF, p-value: < 2.2e-16
A polynomial regression is a special type of multiple regression where the different predictors are polynomials of the same variable.
\[ Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X^2_{1,i} + ... + \beta_n X^n_{1,i} + \epsilon_i\]

This is easier to understand with an example.
Let’s say we want to look at the relationship between plant height and plant biomass, with biomass following a quadratic relationship:
N <- 1000 # Sample Size
Beta0 <- 0 # Overall mean
SlopeA <- 0.3 # Slope for 1st continuous predictor
SlopeB <- -0.4 # Slope for 2nd (quadratic) predictor
SimDat <- data.frame(
  Height=sample(c(0:1000)/100, N, replace=T),
  Err=rnorm(N, mean=0, sd=1) # Individual error with a mean of 0 and sd of 1
) %>%
  mutate(Biomass = Beta0 + SlopeA*Height + SlopeB*Height^2 + Err)
qplot(x=Height, y=Biomass, data=SimDat) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
To fit a polynomial, we use the identity function I() inside the lm() formula:
GrowthMod <- lm(Biomass ~ Height + I(Height^2), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 115483 115483 113850.8 < 2.2e-16 ***
## I(Height^2) 1 9198 9198 9067.9 < 2.2e-16 ***
## Residuals 997 1011 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass ~ Height + I(Height^2), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.00146 -0.71496 -0.01533 0.68273 2.94893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.137145 0.098036 -1.399 0.162
## Height 0.371397 0.044576 8.332 2.61e-16 ***
## I(Height^2) -0.407120 0.004275 -95.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 997 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.9919
## F-statistic: 6.146e+04 on 2 and 997 DF, p-value: < 2.2e-16
Let’s look at a cubic relationship for comparison:
SlopeC <- -2
SimDat <- SimDat %>%
  mutate(Biomass2 = Beta0 + SlopeA*Height + SlopeB*Height^2 + SlopeC*Height^3 + Err)
GrowthMod <- lm(Biomass2 ~ Height + I(Height^2) + I(Height^3), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass2
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 287699795 287699795 283740631 < 2.2e-16 ***
## I(Height^2) 1 51789157 51789157 51076463 < 2.2e-16 ***
## I(Height^3) 1 1466001 1466001 1445827 < 2.2e-16 ***
## Residuals 996 1010 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass2 ~ Height + I(Height^2) + I(Height^3), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9589 -0.7127 -0.0211 0.6813 2.9242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.237635 0.130236 -1.825 0.0684 .
## Height 0.489653 0.110319 4.438 1.01e-05 ***
## I(Height^2) -0.436466 0.025405 -17.180 < 2e-16 ***
## I(Height^3) -1.998053 0.001662 -1202.425 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 996 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.121e+08 on 3 and 996 DF, p-value: < 2.2e-16
Notice how the estimates and standard errors for the lower-order terms change once we add the cubic term, even though the slopes we used to simulate the data haven't changed. The reason for this is complicated, but it has to do with the fact that the polynomial predictors are not independent: Height, Height^2 and Height^3 are strongly correlated with one another.
We can address this problem with the poly() function, but note that it changes the estimates.
GrowthMod <- lm(Biomass2 ~ poly(Height, degree=3), data=SimDat)
anova(GrowthMod)
## Analysis of Variance Table
##
## Response: Biomass2
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(Height, degree = 3) 3 340954953 113651651 112087640 < 2.2e-16 ***
## Residuals 996 1010 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(GrowthMod)
##
## Call:
## lm(formula = Biomass2 ~ poly(Height, degree = 3), data = SimDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9589 -0.7127 -0.0211 0.6813 2.9242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.241e+02 3.184e-02 -16458 <2e-16 ***
## poly(Height, degree = 3)1 -1.696e+04 1.007e+00 -16845 <2e-16 ***
## poly(Height, degree = 3)2 -7.196e+03 1.007e+00 -7147 <2e-16 ***
## poly(Height, degree = 3)3 -1.211e+03 1.007e+00 -1202 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 996 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.121e+08 on 3 and 996 DF, p-value: < 2.2e-16
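The change in the estimates happens because poly() uses orthogonal polynomials by default. As an aside, setting raw=TRUE returns coefficients on the original Height scale, matching the Height + I(Height^2) + I(Height^3) model above:

summary(lm(Biomass2 ~ poly(Height, degree=3, raw=TRUE), data=SimDat))$coefficients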
We can also add the polynomial term to a ggplot object if we use the formula parameter in geom_smooth():
qplot(x=Height, y=Biomass2, data=SimDat) + geom_smooth(formula=y ~ poly(x,3, raw=TRUE))
## `geom_smooth()` using method = 'gam'
The key assumptions of our linear models are that the residual error follows a normal distribution with a mean of zero, AND that the error variance is independent of the predictors.
There are a few convenient ways to check this, but first, let’s set up a few models, starting with a base dataset.
N <- 1000
B0 <- 0
B1 <- 0.03
TestDat <- data.frame(
  PredCon=sample(c(1:100)/10, N, replace=T),
  PredNom=sample(c("A","B","C"), N, replace=T),
  Err=rnorm(N, mean=0, sd=1)
) %>%
  mutate(PredNomMean=recode(PredNom,"A"=0,"B"=5,"C"=-5)) %>%
  mutate(Response=B0+B1*PredCon+PredNomMean+Err)
TestMod <- lm(Response ~ PredCon + PredNom, data=TestDat)
anova(TestMod)
## Analysis of Variance Table
##
## Response: Response
## Df Sum Sq Mean Sq F value Pr(>F)
## PredCon 1 0.2 0.2 0.2407 0.6238
## PredNom 2 16850.7 8425.3 8839.1708 <2e-16 ***
## Residuals 996 949.4 1.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(TestMod)
##
## Call:
## lm(formula = Response ~ PredCon + PredNom, data = TestDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1999 -0.6896 -0.0052 0.6190 2.8556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13321 0.07588 -1.756 0.079463 .
## PredCon 0.03593 0.01073 3.347 0.000847 ***
## PredNomB 5.14029 0.07566 67.936 < 2e-16 ***
## PredNomC -4.91573 0.07563 -65.001 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9763 on 996 degrees of freedom
## Multiple R-squared: 0.9467, Adjusted R-squared: 0.9465
## F-statistic: 5893 on 3 and 996 DF, p-value: < 2.2e-16
This model should meet our assumptions, but let’s check. There are three main ways to do this, all using the residuals of our model, which we can calculate manually or extract with the residuals() function on our model object.
qplot(residuals(TestMod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This looks normal and is centered around zero, but what is a normal distribution supposed to look like? We can use our qnorm() function to generate the expected quantile of each observation. We could then plot the observed vs. expected values.
This is called the quantile-quantile or q-q plot
The Quantile-Quantile (Q-Q) Plot bins the residuals the same way we would for a histogram, but then calculates the number of observations we expect in each bin for a normal distribution. If the data are perfectly normal, then all the points will fall along the 1:1 line
qplot(sample=residuals(TestMod)) + stat_qq() + stat_qq_line()
We can see some very slight deviation from the line at the extreme values, where we have a small sample size.
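If we want a formal test to go along with the visual check, base R's shapiro.test() can be applied to the residuals (an optional aside, not part of the original tutorial; with large samples it will flag even trivial deviations, so the plots are usually more informative):

shapiro.test(residuals(TestMod))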
The last test is to make sure that variance is constant across our predictors. We can do this for each predictor variable alone, as well as the overall prediction from the model. To do this, let’s add residuals to the dataset:
TestDat$Residuals <- residuals(TestMod)
qplot(x=PredCon, y=Residuals, data=TestDat)
qplot(x=PredNom, y=Residuals, data=TestDat, geom="boxplot")
Note how the residuals span a similar range and are centred at zero across all values of x in both graphs. We should also not see any correlation between the x- and y-axes.
We can also generate the predicted values using the estimates from the linear model output, but an easier way is the predict() function.
TestDat$Predicted <- predict(TestMod)
qplot(x=Predicted, y=Residuals, data=TestDat)
Since we have continuous + nominal predictors, we want to make sure that each group is centred at zero with similar variance, and that there are no correlations between the x- and y-axes.
Let’s make a couple of models that violate the assumptions for comparison.
The first example is the one above – fitting a linear model to a quadratic relationship
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response2=B0+B1*PredCon+0.2*PredCon^2+PredNomMean+Err)

# Run the linear model
TestMod2 <- lm(Response2 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted2 <- predict(TestMod2)
TestDat$Residuals2 <- residuals(TestMod2)
# Plot to look for problems
qplot(x=Residuals2, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals2, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals2, data=TestDat)
qplot(x=PredNom, y=Residuals2, data=TestDat, geom="boxplot")
qplot(x=Predicted2, y=Residuals2, data=TestDat)
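For comparison, if we refit the model to match how Response2 was generated, adding the quadratic term with I() as above, the curvature in the residuals should disappear (TestMod2b is just an illustrative name):

TestMod2b <- lm(Response2 ~ PredCon + I(PredCon^2) + PredNom, data=TestDat)
qplot(x=predict(TestMod2b), y=residuals(TestMod2b)) # residuals now show no curvature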
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response3=exp(B0+B1*PredCon+PredNomMean+Err))

# Run the linear model
TestMod3 <- lm(Response3 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted3 <- predict(TestMod3)
TestDat$Residuals3 <- residuals(TestMod3)
# Plot to look for problems
qplot(x=Residuals3, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals3, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals3, data=TestDat)
qplot(x=PredNom, y=Residuals3, data=TestDat, geom="boxplot")
qplot(x=Predicted3, y=Residuals3, data=TestDat)
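Because Response3 was generated by exponentiating a linear model, log-transforming the response should restore well-behaved residuals (TestMod3b is just an illustrative name):

TestMod3b <- lm(log(Response3) ~ PredCon + PredNom, data=TestDat)
qplot(sample=residuals(TestMod3b)) + stat_qq() + stat_qq_line() # much closer to the 1:1 line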
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response4=B0+B1*PredCon*Err+PredNomMean*Err/10)

# Run the linear model
TestMod4 <- lm(Response4 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted4 <- predict(TestMod4)
TestDat$Residuals4 <- residuals(TestMod4)
# Plot to look for problems
qplot(x=Residuals4, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals4, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals4, data=TestDat)
qplot(x=PredNom, y=Residuals4, data=TestDat, geom="boxplot")
qplot(x=Predicted4, y=Residuals4, data=TestDat)
# Make a new response variable
TestDat <- TestDat %>%
  mutate(Response5=B0+B1*PredCon+PredNomMean+rpois(N,lambda=1))

# Run the linear model
TestMod5 <- lm(Response5 ~ PredCon + PredNom, data=TestDat)

# Calculate predictions and residuals of the linear model
TestDat$Predicted5 <- predict(TestMod5)
TestDat$Residuals5 <- residuals(TestMod5)
# Plot to look for problems
qplot(x=Residuals5, data=TestDat) # Histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=Residuals5, data=TestDat) + stat_qq() + stat_qq_line() # QQPlot
qplot(x=PredCon, y=Residuals5, data=TestDat)
qplot(x=PredNom, y=Residuals5, data=TestDat, geom="boxplot")
qplot(x=Predicted5, y=Residuals5, data=TestDat)
Note that in our analysis above, we are analyzing the RESIDUALS. Our assumptions with linear models are about the residuals NOT the predictor or response variables.
Let’s look back at our original, well-behaved model as an example. Compare the histograms of the response and predictors:
qplot(TestDat$Response)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TestDat$PredCon)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(TestDat$PredNom)
In the case of our continuous response variable, we have a tri-modal distribution. But remember that our residuals look normally distributed:
qplot(residuals(TestMod))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sample=residuals(TestMod)) + stat_qq() + stat_qq_line()
There is nothing wrong with the sample or the model! It doesn’t matter that the predictors and response are not normally distributed, as long as the residuals are.
Why is there a trimodal distribution in the response variable but not the residuals?
We have three groups with different means, showing up in the response as three peaks. When we take out the means by including the predictor in our linear model, then the residuals are the deviation from the group means, resulting in a single peak.
Once you understand how these quality checks work, there are some tools you can use to quickly plot and view the diagnostic graphs.
The autoplot() function in the ggfortify package is a nice complement to ggplot for generating diagnostic graphs. You can even colour-code your predictors:
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.5
autoplot(TestMod, which=1:4, data=TestDat, colour = 'PredNom')
Note the Cook’s distance plot. This checks whether you have a few outlier data points that contribute disproportionately to the fit of the model. The row numbers are shown for potential outliers, but looking at the figure we can see that the Cook’s distance measure is not much higher for these than for several other points. Try adding an outlier to the TestDat dataset to see what this would look like in the Cook’s distance plot.
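One way to try this (a sketch, with made-up values for the extra row) is to append a single extreme observation and re-run the diagnostics:

OutlierDat <- TestDat %>%
  select(PredCon, PredNom, Response) %>%
  rbind(data.frame(PredCon=5, PredNom="B", Response=50)) # one extreme, made-up point
OutlierMod <- lm(Response ~ PredCon + PredNom, data=OutlierDat)
autoplot(OutlierMod, which=1:4, data=OutlierDat, colour='PredNom')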
You can generate other diagnostic plots by changing the which parameter. See ?autoplot for more detail.