Overview

Principal Components Analysis (PCA) is an example of an unsupervised machine learning model. However, PCA has long been used by ecologists and evolutionary biologists as a dimension reduction tool and to look for structure in biological data. PCA also forms the basis for other methods such as discriminant analysis – a form of supervised machine learning.

PCA is one type of multivariate analysis and related methods are commonly used to analyze all kinds of biological data including plant and animal communities, microbiomes, gene expression, metabolomics & proteomics, and climate.

In this tutorial, we revisit the same data on Lythrum salicaria that we used for selection analysis in the GAM Tutorial.

Setup

Load the usual plotting and data management functions:

library(ggplot2) # plotting library
library(dplyr) # data management
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())

and the fitness data:

LythrumDat<-read.csv("https://colauttilab.github.io/Data/ColauttiBarrett2013Data.csv",header=T)

Dimensionality

Recall the structure of the data:

str(LythrumDat)
## 'data.frame':    432 obs. of  23 variables:
##  $ Ind      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Site     : chr  "1_BEF" "1_BEF" "1_BEF" "1_BEF" ...
##  $ Row      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Pos      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Mat      : chr  "C13" "C11" "A16" "C20" ...
##  $ Pop      : chr  "C" "C" "A" "C" ...
##  $ Region   : chr  "NORTH" "NORTH" "NORTH" "NORTH" ...
##  $ Flwr07   : int  39285 39293 39286 39305 39293 39291 39287 39293 39287 39303 ...
##  $ FVeg07   : num  41 40 32.5 45 47.5 55 51.5 38 51 37.5 ...
##  $ InfMass07: num  9.7 6.5 16 1.5 17.4 13.1 3 0.6 0 12.2 ...
##  $ Fruits07 : int  625 329 1029 46 1121 504 122 15 0 896 ...
##  $ Flwr08   : int  39613 39625 39610 39638 39616 39625 39614 39621 39617 39638 ...
##  $ FVeg08   : num  79 117 70 94.5 97 ...
##  $ HVeg08   : num  71 108 62.5 87.5 90 124 85 76.5 91 96.5 ...
##  $ InfMass08: num  9.7 36.1 28.9 18.2 47.9 30.7 23.3 8.5 28.5 16.6 ...
##  $ Flwr09   : int  39980 39993 NA 40000 39986 39995 39990 39993 39996 39997 ...
##  $ FVeg09   : num  64 70 NA 84 66 91 80 62 58 99 ...
##  $ HVeg09   : num  82 65.5 NA 72 65.5 82 76 60.5 56.5 93.5 ...
##  $ InfMass09: num  11.7 10.1 NA 9.6 38.1 16.7 11.1 2.6 1.7 18.8 ...
##  $ Flwr10   : int  40345 40350 NA 40369 40353 40357 40350 NA 40358 40357 ...
##  $ FVeg10   : num  68.7 83.4 NA 69.9 70.1 97 105 NA 50.8 85.1 ...
##  $ HVeg10   : num  66 96 NA 57.5 65.5 77 98 57 49.5 82 ...
##  $ InfMass10: num  1.9 7.3 NA 5.7 27.9 4.9 5.3 5.5 1.5 15.3 ...

We have the same traits measured in multiple years, in particular:

  • Flwr – Days to first flower
  • FVeg – Plant size (vegetative height) at first flower

Each trait was measured over 4 years, from 2007 through 2010. We can think of each trait as ‘multivariate’ or ‘multi-dimensional’ with \(D = 4\) dimensions.

The Principal Components of a PCA are new vectors that are calculated as linear combinations of other vectors.

Think of a linear model where we try to predict a single response variable \(Y\) by adding together N individual predictors \(X_1 ... X_N\), each multiplied by its own coefficient. PCA is similar, except that instead of predicting a measured variable \(Y\), we define N new component axes \(PC_1 ... PC_N\) that are:

  1. Linear combinations of the input features \(X_1 ... X_N\) (‘predictors’ in LM lingo)
  2. Uncorrelated with each other – the correlation coefficient between any two PC axes should be ~ 0.

One common goal of PCA is to reduce the dimensionality of our data, by redefining many correlated predictors into one or a few uncorrelated principal component axes. These PC axes are themselves vectors of data representing different combinations of the input features.
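Written out, each principal component is just a weighted sum of the (centred, and usually scaled) input features. Using \(a_1 ... a_N\) for the coefficients of the first component:

\[PC_1 = a_1X_1 + a_2X_2 + ... + a_NX_N\]

The coefficients are chosen so that \(PC_1\) captures as much of the total variance as possible, \(PC_2\) captures as much of the remaining variance as possible (while staying uncorrelated with \(PC_1\)), and so on.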

In the GAM tutorial, we averaged across years in order to reduce the dimensionality from 4 to 1 for each of the key measurements (Flwr & FVeg). One problem with that approach is that it doesn’t account for measurement error in each year.

Correlated Traits

PCA works best when there is collinearity among the predictors. We can see this by looking at the pairwise correlations. To do this, let’s first make two separate datasets, one for each of the main measurements. We could use the select() function from dplyr and type out every single column name, but if we look at the help ?select we can see a short-cut, since the column names all share the same prefix:

Flwr<-LythrumDat %>%
  select(starts_with("Flwr"))
FVeg<-LythrumDat %>%
  select(starts_with("FVeg"))
head(Flwr)
##   Flwr07 Flwr08 Flwr09 Flwr10
## 1  39285  39613  39980  40345
## 2  39293  39625  39993  40350
## 3  39286  39610     NA     NA
## 4  39305  39638  40000  40369
## 5  39293  39616  39986  40353
## 6  39291  39625  39995  40357
head(FVeg)
##   FVeg07 FVeg08 FVeg09 FVeg10
## 1   41.0   79.0     64   68.7
## 2   40.0  117.0     70   83.4
## 3   32.5   70.0     NA     NA
## 4   45.0   94.5     84   69.9
## 5   47.5   97.0     66   70.1
## 6   55.0  128.5     91   97.0

We can calculate a correlation matrix using cor(), but:

cor(Flwr)
##        Flwr07 Flwr08 Flwr09 Flwr10
## Flwr07      1     NA     NA     NA
## Flwr08     NA      1     NA     NA
## Flwr09     NA     NA      1     NA
## Flwr10     NA     NA     NA      1

We just get NAs because we have missing data. If we look at the help ?cor we can see there is a use= parameter with different options for excluding NA. In this case, we’ll calculate each pairwise correlation using only the rows where neither value in the pair is missing:

cor(Flwr,use="pairwise.complete.obs")
##           Flwr07    Flwr08    Flwr09    Flwr10
## Flwr07 1.0000000 0.7889657 0.7237457 0.8024986
## Flwr08 0.7889657 1.0000000 0.8621346 0.8489979
## Flwr09 0.7237457 0.8621346 1.0000000 0.8058162
## Flwr10 0.8024986 0.8489979 0.8058162 1.0000000

We can see some pretty high correlation coefficients. We can square these to get R-squared values, which tell us how much variation in one measurement is explained by another. We might also want to round the output to make the values easier to compare:

round(cor(Flwr,use="pairwise.complete.obs")^2,3)
##        Flwr07 Flwr08 Flwr09 Flwr10
## Flwr07  1.000  0.622  0.524  0.644
## Flwr08  0.622  1.000  0.743  0.721
## Flwr09  0.524  0.743  1.000  0.649
## Flwr10  0.644  0.721  0.649  1.000
round(cor(FVeg,use="pairwise.complete.obs")^2,3)
##        FVeg07 FVeg08 FVeg09 FVeg10
## FVeg07  1.000  0.516  0.353  0.349
## FVeg08  0.516  1.000  0.492  0.453
## FVeg09  0.353  0.492  1.000  0.642
## FVeg10  0.349  0.453  0.642  1.000

There is quite a bit of year-to-year variation, but overall one year’s measurement explains roughly 35-75% of the variation in another.

Visualizing these correlations is a bit tricky. One way would be to convert the data from ‘wide’ to ‘long’ format using pivot_longer from the tidyverse package, and then use facets with ggplot. We would do this to look for outliers and nonlinear relationships among variables; for the sake of time we won’t dwell on it here, but a quick sketch follows.
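For reference, here is a minimal sketch of that approach, assuming the tidyr package is installed and plotting each later year against Flwr07 (the object name FlwrLong and the choice of Flwr07 as the common x-axis are just illustrative):

library(tidyr) # for pivot_longer
FlwrLong<-Flwr %>%
  pivot_longer(cols=Flwr08:Flwr10, # keep Flwr07 as a common x-axis
               names_to="Year", values_to="Day")
ggplot(FlwrLong, aes(x=Flwr07, y=Day)) +
  geom_point(alpha=0.3) + # rows with NA are dropped (with a warning)
  facet_wrap(vars(Year)) # one panel per later year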

PCA

Running PCA in R is very straight-forward.

Missing Data

One major limitation of PCA is that it can’t handle missing data.

The simplest work-around is to remove any row that has a missing observation, but this is wasteful: if we have observations of growth for a plant in 2007, 2008, and 2010 but are missing 2009, we have to remove the entire row!

Imputation

We only have a few hundred rows of data so we want to avoid removing any observations that we have. Rather than delete entire rows, we can try to ‘impute’ the missing data.

One simple method is just to replace the NA with the mean, median, or mode of the column of data.

A more complicated method is to use some kind of predictive model. For example, a linear model:

Mod<-lm(Flwr09 ~ Flwr07 + Flwr08 + Flwr10, data=Flwr)
summary(Mod)
## 
## Call:
## lm(formula = Flwr09 ~ Flwr07 + Flwr08 + Flwr10, data = Flwr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.272  -5.945  -1.167   6.089  23.699 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.090e+03  2.267e+03  -2.245   0.0258 *  
## Flwr07       6.790e-02  5.456e-02   1.245   0.2147    
## Flwr08       7.898e-01  8.387e-02   9.417   <2e-16 ***
## Flwr10       2.756e-01  9.844e-02   2.800   0.0056 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.869 on 205 degrees of freedom
##   (223 observations deleted due to missingness)
## Multiple R-squared:  0.7253, Adjusted R-squared:  0.7212 
## F-statistic: 180.4 on 3 and 205 DF,  p-value: < 2.2e-16

Here we can see that we can predict about 72% of the variation in 2009 flowering time by knowing the other 3 years, so we could replace each NA with the prediction from this model. There are more sophisticated imputation methods that use machine learning algorithms to predict missing values. Because everything is scripted in R, we can try different approaches (and transformations) and see how they affect the final visualizations and statistical analyses.
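For example, a minimal sketch of model-based imputation using Mod from above (Flwr09pred and FlwrLM are names made up for this example; we don’t use them below):

Flwr09pred<-predict(Mod, newdata=Flwr) # NA wherever a predictor year is missing
FlwrLM<-Flwr %>%
  mutate(Flwr09=ifelse(is.na(Flwr09) & !is.na(Flwr09pred), Flwr09pred, Flwr09)) # fill only the missing cells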

We’ll keep it simple for now and replace NA with the column average.

Flwr<-Flwr %>%
  mutate(Flwr07 = ifelse(is.na(Flwr07),mean(Flwr07,na.rm=T),Flwr07),
         Flwr08 = ifelse(is.na(Flwr08),mean(Flwr08,na.rm=T),Flwr08),
         Flwr09 = ifelse(is.na(Flwr09),mean(Flwr09,na.rm=T),Flwr09),
         Flwr10 = ifelse(is.na(Flwr10),mean(Flwr10,na.rm=T),Flwr10))

FVeg<-FVeg %>%
  mutate(FVeg07 = ifelse(is.na(FVeg07),mean(FVeg07,na.rm=T),FVeg07),
         FVeg08 = ifelse(is.na(FVeg08),mean(FVeg08,na.rm=T),FVeg08),
         FVeg09 = ifelse(is.na(FVeg09),mean(FVeg09,na.rm=T),FVeg09),
         FVeg10 = ifelse(is.na(FVeg10),mean(FVeg10,na.rm=T),FVeg10))

Now we are ready to run the PCA.

Princomp

The princomp function in base R is all we need for a principal components analysis.

Scaling

In most cases, we would want to scale our predictors, as described in the Intro to ML Tutorial.

With princomp, we can instead use the cor=T parameter to calculate the principal components from the correlation matrix rather than the covariance matrix. This is equivalent to standardizing each column to z-scores first.

FlwrPCA<-princomp(Flwr, cor=T)
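As a quick check of that claim (a sketch: signs of whole components may flip, and the component standard deviations differ by a small factor because princomp divides by n rather than n-1), we can z-score the columns ourselves and compare the loadings:

ScaledPCA<-princomp(scale(Flwr)) # covariance PCA on z-scored columns
round(unclass(ScaledPCA$loadings),3) # compare with FlwrPCA$loadings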

Looking at the structure of our PCA:

str(FlwrPCA)
## List of 7
##  $ sdev    : Named num [1:4] 1.647 0.828 0.598 0.495
##   ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ loadings: 'loadings' num [1:4, 1:4] 0.4 0.531 0.548 0.508 0.901 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Flwr07" "Flwr08" "Flwr09" "Flwr10"
##   .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ center  : Named num [1:4] 39310 39645 40019 40371
##   ..- attr(*, "names")= chr [1:4] "Flwr07" "Flwr08" "Flwr09" "Flwr10"
##  $ scale   : Named num [1:4] 16.1 18.9 16.7 15.2
##   ..- attr(*, "names")= chr [1:4] "Flwr07" "Flwr08" "Flwr09" "Flwr10"
##  $ n.obs   : int 432
##  $ scores  : num [1:432, 1:4] -3.67 -2.54 -1.58 -1.01 -2.93 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ call    : language princomp(x = Flwr, cor = T)
##  - attr(*, "class")= chr "princomp"

We can see that the output is a list object, with some important components that we’ll explore here.

Principal Components

Here’s where Principal Components Analysis gets its name: Every PCA analysis will return a number of principal component vectors (aka ‘axes’) equal to the number of input features. In this case, we have four columns of Flwr data from 2007 through 2010. This results in four PC axes (Comp.1 to Comp.4), which are scores in the output of a princomp list object.

head(FlwrPCA$scores)
##         Comp.1       Comp.2      Comp.3     Comp.4
## [1,] -3.672955 -0.091325637 -0.03597351  0.4083552
## [2,] -2.543498  0.008066057 -0.24953917  0.2260170
## [3,] -1.580257 -1.057073046  1.00537762 -1.0146239
## [4,] -1.014452  0.047847804  0.25170169  0.6508622
## [5,] -2.925403  0.077955621  0.26614576  0.3827034
## [6,] -2.293567 -0.294768201  0.04472168  0.2629520

A key characteristic of PC axes is that they are uncorrelated with each other.

round(cor(FlwrPCA$scores),3)
##        Comp.1 Comp.2 Comp.3 Comp.4
## Comp.1      1      0      0      0
## Comp.2      0      1      0      0
## Comp.3      0      0      1      0
## Comp.4      0      0      0      1

Here we see the correlation coefficients of the PC axes, rounded to 3 decimal places. Note that the diagonal is 1, representing the correlation of each PC axis with itself. The off-diagonal values are zero after rounding, meaning the correlation between any two different PC axes is effectively zero (< 0.001).

Compare this matrix to the correlation matrix for the original Flwr data, shown above. By taking linear combinations of the original data, we have transformed the 4 correlated features into 4 uncorrelated features.

In other words, we have re-expressed four correlated measurements as four uncorrelated PC axes.

Eigenvalues

Each of the four PC axes has an eigenvalue, which we can see in the summary:

summary(FlwrPCA)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     1.6466525 0.8281700 0.59815863 0.49484968
## Proportion of Variance 0.6778661 0.1714664 0.08944844 0.06121905
## Cumulative Proportion  0.6778661 0.8493325 0.93878095 1.00000000

The first row shows the Standard deviation for each PC axis. This is literally just the standard deviation of the scores of each PC:

round(sd(FlwrPCA$scores[,1]),2)
## [1] 1.65

Compare this value to the Standard deviation for Comp.1 above.

Squaring the standard deviation gives the variance, which is also known as the PC eigenvalue:

round(sd(FlwrPCA$scores[,1])^2,2)
## [1] 2.72

The eigenvalue of a PC axis is just the amount of variation (variance) in the observed data that can be explained by the PC axis. Notice above that the sd declines from PC1 to PC4. This is always the case: PCs are sorted by their eigenvalues, from highest to lowest.

The second row shows the Proportion of variance, which is how much of the total variance (= sum of all eigenvalues) is represented by each Principal Component axis.

The third row is the Cumulative Proportion, calculated by summing the proportions: the first column is the same as the row above it, the second column is the first + second, and so on.
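As a quick check, these quantities can be recomputed directly from the component standard deviations:

Eig<-FlwrPCA$sdev^2 # eigenvalues (variances)
round(Eig/sum(Eig),3) # Proportion of Variance
round(cumsum(Eig)/sum(Eig),3) # Cumulative Proportion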

If we are using PCA for dimension reduction, we can use the Proportion of Variance explained by each PC to help us decide how many PCs to keep for downstream analysis. For example, recall how we averaged Flwr across years in the GAM tutorial to collapse 4 columns of data into 1. We can do something similar with PCA, but let the data tell us how many PCs we should keep.

A common method for this is to plot the variance (eigenvalue) of each component in order.

Scree Plot

Looking back at the list items for FlwrPCA above:

qplot(x=c(1:4),y=FlwrPCA$sdev^2) +
  geom_line() +
  xlab("Component") +
  ylab("Eigenvalue")

The x-axis is the PC number, ranked by eigenvalue. The y-axis is the eigenvalue, representing the amount of variation explained.

This is called a Scree Plot. It can help us choose the number of principal components to keep in the analysis. We look at the shape and try to find the PC axes that account for most of the variance. Visually, we can see this as the change in slope.

In this case, we can see a big drop from 1 to 2 and then a much slower decline from 2 through 4. This is a good indication that the first eigenvector (PC1) captures most of the variation in Flowering Time. The table above tells us that it is more than half (Proportion of Variance = 68%).

Run the above analysis for FVeg and compare the outputs. Which PC axes should we retain?
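If you get stuck, here is one possible sketch (FVegPCA is just a name chosen for this example):

FVegPCA<-princomp(FVeg, cor=T)
summary(FVegPCA) # proportion of variance for each component
qplot(x=c(1:4),y=FVegPCA$sdev^2) + # scree plot, as above
  geom_line() +
  xlab("Component") +
  ylab("Eigenvalue")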

Eigenvectors & Loadings

Each principal component axis is a vector calculated as a linear transformation of the input features. You should by now have a clear idea of how a linear model predicts Y from one or more X predictors, with an overall intercept and a coefficient (slope) for each X. A principal component is similar, except that there is no intercept and no measured Y variable: each of the four PC axes is a different linear combination of the original Flwr07-Flwr10 columns. If we take the four coefficients that define one PC axis and put them into a vector, we get what is called an eigenvector. In PCA lingo, the coefficients are called loadings and an eigenvector is just a vector of loadings.

FlwrPCA$loadings
## 
## Loadings:
##        Comp.1 Comp.2 Comp.3 Comp.4
## Flwr07  0.400  0.901  0.159       
## Flwr08  0.531 -0.145 -0.664  0.505
## Flwr09  0.548 -0.175 -0.137 -0.807
## Flwr10  0.508 -0.368  0.717  0.303
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

The top part of the output shows the loadings for each eigenvector. You can see that for PC1, all four measurements have similar, positive loadings. For PC2, Flwr07 has a strong positive loading while the other 3 have weaker negative loadings.

Think of eigenvector loadings as a measure of how much a particular measurement ‘loads’ onto a given PC axis. Loadings with larger magnitude indicate a stronger influence on the PC, and the sign gives the direction.

Compare the loadings of PC2 for Flwr07 and Flwr08 above, and compare to plotted graphs of the actual data:

qplot(x=FlwrPCA$scores[,2], y=Flwr$Flwr07)

qplot(x=FlwrPCA$scores[,2], y=Flwr$Flwr08)

cor(FlwrPCA$scores[,2], Flwr$Flwr07)
## [1] 0.7465783
cor(FlwrPCA$scores[,2], Flwr$Flwr08)
## [1] -0.1204679

Note the odd line of points in the middle of each plot, corresponding to the missing values that we replaced with the column mean at the beginning.

Now do the same for PC1. How do those graphs compare to these ones?

We can extract the first column of coefficients as the first 4 elements:

Load<-FlwrPCA$loadings[1:4]
print(Load)
## [1] 0.3996131 0.5311839 0.5475365 0.5082881

If we multiply each loading by its corresponding feature vector, and then sum, we get the PC axis, which we can confirm by plotting. We also have to make sure we scale each measurement to z-scores. The scale function in R is good for this:

testDat<-Flwr %>%
  mutate(PCcalc=scale(Flwr07)*Load[1] + 
           scale(Flwr08)*Load[2] +
           scale(Flwr09)*Load[3] +
           scale(Flwr10)*Load[4])
testDat$Comp.1<-FlwrPCA$scores[,1]
qplot(Comp.1,PCcalc,data=testDat)

Take a second to review and make sure you understand these. Before proceeding, you should be able to define the following (verbally and mathematically); a short numerical check follows the list:

  • Eigenvectors
  • Eigenvalues
  • Loadings
  • Principal Components Axes
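One way to connect these terms (a sketch): applying R’s eigen() function to the correlation matrix of the imputed Flwr data returns the eigenvalues, which should match FlwrPCA$sdev^2, and the eigenvectors, which should match the loadings (up to sign flips of whole columns).

FlwrEig<-eigen(cor(Flwr))
round(FlwrEig$values,3) # compare with FlwrPCA$sdev^2
round(FlwrEig$vectors,3) # compare with FlwrPCA$loadings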

Biological ‘Traits’

We can also see some degree of correlation between the Flwr PCs and the FVeg PCs (e.g. if you compare the PC1 scores from the two analyses). This is because we ran separate PC analyses for the two sets of traits: PCA only removes correlations among axes within a single analysis. Keeping them separate might make sense biologically, because the vegetative growth of a plant serves a different function than the timing of reproduction.

However, there are other cases where we may want to be more agnostic about traits. In this case, we have 6 populations with different genes for both flowering time and size. So an alternative approach would be to put all the vegetative and reproductive measurements into a single PCA – again, using cor=T to account for the fact that different measurements in different years have different means and variances.

Let’s do a new PCA by using all of the measurements, across all of the years.

For the PCA we only want the measurement columns, and we want to exclude the 2007 measurements because there are a lot of missing data for 2007 at the Timmins site. We also want to keep Site and Pop for later analyses.

names(LythrumDat)
##  [1] "Ind"       "Site"      "Row"       "Pos"       "Mat"       "Pop"      
##  [7] "Region"    "Flwr07"    "FVeg07"    "InfMass07" "Fruits07"  "Flwr08"   
## [13] "FVeg08"    "HVeg08"    "InfMass08" "Flwr09"    "FVeg09"    "HVeg09"   
## [19] "InfMass09" "Flwr10"    "FVeg10"    "HVeg10"    "InfMass10"
PCDat<-LythrumDat[,c(2,6,12:23)]

As a shortcut we’ll delete rows with missing data by piping into the na.omit function. This discards a lot of data, so some kind of imputation of missing values would be better, but it’s okay for demonstration purposes.

PCDat<-PCDat %>%
  na.omit
names(PCDat)
##  [1] "Site"      "Pop"       "Flwr08"    "FVeg08"    "HVeg08"    "InfMass08"
##  [7] "Flwr09"    "FVeg09"    "HVeg09"    "InfMass09" "Flwr10"    "FVeg10"   
## [13] "HVeg10"    "InfMass10"

Remember to exclude the first two columns, since Site and Pop are not measurements that we want to include in the PCA:

PCfull<-princomp(PCDat[,3:14],cor=T)
summary(PCfull)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.4100500 1.6655040 1.1850639 0.75447888 0.68564472
## Proportion of Variance 0.4840284 0.2311586 0.1170314 0.04743653 0.03917572
## Cumulative Proportion  0.4840284 0.7151871 0.8322184 0.87965496 0.91883068
##                           Comp.6     Comp.7     Comp.8      Comp.9     Comp.10
## Standard deviation     0.5816700 0.50573110 0.38269718 0.319314353 0.272287578
## Proportion of Variance 0.0281950 0.02131366 0.01220476 0.008496805 0.006178377
## Cumulative Proportion  0.9470257 0.96833934 0.98054410 0.989040908 0.995219285
##                            Comp.11    Comp.12
## Standard deviation     0.189701702 0.14622532
## Proportion of Variance 0.002998895 0.00178182
## Cumulative Proportion  0.998218180 1.00000000
loadings(PCfull)
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Flwr08     0.216  0.441  0.264         0.214         0.180  0.459  0.108
## FVeg08     0.329        -0.444  0.277  0.226 -0.148         0.148       
## HVeg08     0.322        -0.468  0.268  0.245        -0.121  0.151       
## InfMass08  0.149 -0.386  0.289  0.613 -0.428 -0.305 -0.108              
## Flwr09     0.214  0.394  0.391        -0.219 -0.116 -0.181  0.341 -0.265
## FVeg09     0.386                      -0.254  0.466 -0.163              
## HVeg09     0.382                      -0.159  0.566 -0.128 -0.179  0.105
## InfMass09  0.175 -0.412  0.315  0.185  0.403  0.233  0.594        -0.112
## Flwr10     0.201  0.442  0.217  0.260  0.234 -0.165        -0.741       
## FVeg10     0.365               -0.373 -0.187 -0.278  0.231 -0.186 -0.668
## HVeg10     0.365               -0.355 -0.207 -0.362  0.316         0.648
## InfMass10  0.202 -0.345  0.343 -0.323  0.477 -0.188 -0.590              
##           Comp.10 Comp.11 Comp.12
## Flwr08     0.618                 
## FVeg08                    -0.709 
## HVeg08    -0.105           0.698 
## InfMass08  0.263                 
## Flwr09    -0.598                 
## FVeg09            -0.728         
## HVeg09             0.667         
## InfMass09 -0.275                 
## Flwr10                           
## FVeg10     0.248                 
## HVeg10    -0.186                 
## InfMass10                        
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083
## Cumulative Var  0.083  0.167  0.250  0.333  0.417  0.500  0.583  0.667  0.750
##                Comp.10 Comp.11 Comp.12
## SS loadings      1.000   1.000   1.000
## Proportion Var   0.083   0.083   0.083
## Cumulative Var   0.833   0.917   1.000

We could do a Scree Plot but we’ll just retain the first 2 PCs for visualizing. Remember that PC axes are just linear combinations of the input data, so we can treat them as new variables, even adding them back to our dataset:

PCDat$PC1<-PCfull$scores[,1]
PCDat$PC2<-PCfull$scores[,2]

Now we can just plot our two PC axes and colour code based on common garden location:

qplot(x=PC1,y=PC2,colour=Site,data=PCDat)

We can see that there is more overlap between the red and green and less overlap with the blue if we look along the PC2 axis. This represents environmental differences (e.g. plasticity) in the phenotypic measurements that we put into our PCA.

So what explains the variation along PC1? Let’s plot each site separately and colour-code by population:

qplot(x=PC1,y=PC2,colour=Pop,facets="Site",data=PCDat)

With a separate graph for each site, we can clearly see separation of the different genetic populations. This represents population-level genetic differentiation for the traits that load onto PC1.

There are a lot of other things we could explore here. For example, we could use PC1 or PC2 as a predictor or response variable in a linear model.

Mod<-lm(PC1 ~ Pop*Site, data=PCDat)
summary(Mod)
## 
## Call:
## lm(formula = PC1 ~ Pop * Site, data = PCDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2125 -0.9086 -0.1082  0.8773  4.6184 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.47078    0.52433  -6.620 1.86e-10 ***
## PopC                2.03753    0.62864   3.241 0.001336 ** 
## PopE               -1.02187    0.68364  -1.495 0.136115    
## PopJ                3.84150    0.59590   6.447 5.07e-10 ***
## PopS                4.59424    0.60199   7.632 3.75e-13 ***
## PopT                2.04988    0.69910   2.932 0.003647 ** 
## Site2_KSR           1.45882    0.65035   2.243 0.025678 *  
## Site3_Timmins       2.45431    0.68364   3.590 0.000391 ***
## PopC:Site2_KSR     -0.32679    0.78390  -0.417 0.677095    
## PopE:Site2_KSR      0.59462    0.82712   0.719 0.472800    
## PopJ:Site2_KSR      0.47681    0.75970   0.628 0.530762    
## PopS:Site2_KSR      0.98043    0.77324   1.268 0.205878    
## PopT:Site2_KSR      0.04439    0.87990   0.050 0.959801    
## PopC:Site3_Timmins -0.31723    0.81720  -0.388 0.698174    
## PopE:Site3_Timmins  0.79117    0.87562   0.904 0.367015    
## PopJ:Site3_Timmins -0.12607    0.82209  -0.153 0.878231    
## PopS:Site3_Timmins -1.97851    0.96939  -2.041 0.042200 *  
## PopT:Site3_Timmins -0.70143    1.03251  -0.679 0.497483    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.387 on 277 degrees of freedom
## Multiple R-squared:  0.6889, Adjusted R-squared:  0.6698 
## F-statistic: 36.08 on 17 and 277 DF,  p-value: < 2.2e-16
anova(Mod)
## Analysis of Variance Table
## 
## Response: PC1
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## Pop         5 893.54 178.709 92.8637 < 2e-16 ***
## Site        2 246.33 123.166 64.0015 < 2e-16 ***
## Pop:Site   10  40.52   4.052  2.1056 0.02418 *  
## Residuals 277 533.06   1.924                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both Site and Population affect PC1, and these factors alone explain almost 70% of the variation in PC1.

Projection

You may remember from high school math or physics that a vector is defined by a direction and a magnitude. We have already seen that the eigenvalue is the magnitude of an eigenvector and represents the amount of variation captured by a principal component axis.

So what is the direction of an eigenvector? It’s a direction in multivariate space, with dimensions equal to the number of input features. In the examples so far, we have seen 4-D space for each of Flwr and FVeg, corresponding to four years of data.

Of course, more than 2 dimensions can’t be visualized on a computer screen or report, and more than 3 dimensions is a completely foreign concept to the human brain. But we can still visualize multiple dimensions by using projection.

Think about what happens when you watch a movie on a 2-dimensional screen. Let’s say a scene takes place in a room with a table and some chairs. That’s a completely flat image, yet you are able to get a sense of the third dimension. In fact, your view of the real world starts as a 2-dimensional field of cells in your retina; you intuitively reconstruct 3D space based on the angles of the lines of the room, table, and chairs.

This is called projection. The 3D space is projected onto 2D space. We can do the same thing for the higher dimensions of a PCA. Rather than do these calculations by hand, we can use the autoplot function from the ggfortify package:

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.5
autoplot(PCfull)

We can also use the original dataset to color-code the data.

autoplot(PCfull, data=PCDat, colour="Site")

And use loadings=TRUE to project the eigenvectors, to see which measurements contribute most to PC1 vs PC2:

autoplot(PCfull, data=PCDat, colour="Site", loadings=T, loadings.label=T)

Compare the direction of these projected eigenvectors to the loadings in the output for PCfull, above. We can see that PC1 is affected by all of the measurements, whereas Comp.2 is driven mainly by InfMass and Flwr. Notice how the same measurement in different years has a very similar vector. This shows that the same measurements in different years are collinear.

QA/QC

PCA has important assumptions that should be checked. As with linear models, we may want to transform some of the input variables to help meet these assumptions. The key assumption is that the input variables are multivariate normal. This means that if we graph each pair of input variables, they should form a ‘shotgun’ pattern that is approximately circular (or oval), with the highest density of points in the middle. Because Principal Components are linear combinations of the input variables, outliers and non-normal distributions can bias the loadings of particular PC axes.
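A minimal sketch of one such check, using the base-R pairs() function on the imputed Flwr data to scan for outliers and strongly non-elliptical shapes:

pairs(Flwr, pch=19, col=rgb(0,0,0,0.2)) # scatterplot matrix of all pairwise combinations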

Another important assumption is that the major axes of variation are the axes of interest. In the examples above, we saw how variation along PC1 is largely due to genetic variation while PC2 is largely due to environmental variation. But it may not always be like this. For example, we may have multiple axes of environmental variation and genetic variation might be spread across many axes. In such a case, we may want to rescale or ‘rotate’ our axes in a way that maximizes variation among populations. Although we can’t do this with PCA, there are other methods based on PCA that we can try.

Beyond PCA

PCA forms the basis for a number of additional analyses in machine learning, so it is worth spending some time to review and understand the concepts in this tutorial. Here are a few examples of PCA-based analyses.

  • Discriminant Analysis – This is a supervised learning approach based on PCA. The key difference is that the eigenvectors are rescaled in a way that maximizes their ability to discriminate between the different groups of the response variable (a short sketch follows this list).
  
  • Factor Analysis – FA builds on PCA by adding unobserved variables called factors. For example, we may have gene expression data from a set of particular pathways, and we want to use factor analysis to look at the behaviour of the smaller subset of pathways rather than the many genes themselves.
  • Correspondence Analysis – Think of this as 2 separate PCAs that you then compare to figure out how one set of data affects the other. A good example of this is community ecology where one set of data is the species community and the other set includes a bunch of environmental variables.
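For illustration, here is a hedged sketch (not part of this tutorial’s analysis) of a linear discriminant analysis on the same measurements, using lda() from the MASS package with Site as the grouping variable; LDmod and LDscores are names made up for this example:

library(MASS) # note: MASS masks dplyr::select
LDmod<-lda(x=PCDat[,3:14], grouping=PCDat$Site) # may warn if features are nearly collinear
LDscores<-predict(LDmod)$x # discriminant axes (LD1, LD2)
qplot(x=LDscores[,1], y=LDscores[,2], colour=PCDat$Site)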