In the PCA Tutorial we looked at Principal Components Analysis as an example of Unsupervised Machine Learning because there were no a priori groups that we could use to train the model. Instead, we just used PCA to identify uncorrelated (i.e. orthogonal) axes of variation, and then looked at our groups post hoc to see how they mapped onto the major PC axes.
In this tutorial we look at Regularized Discriminant Analysis (RDA) as an extension of PCA that uses a categorical variable for Supervised Learning. In supervised learning, we know the groups ahead of time and try to find a model that distinguishes them. With PCA, we defined component axes based on variation in the features themselves. In our example case, we found that the major axes correspond to genetic population (PC1) and growing environment (PC2). BUT that won’t always be the case.
With RDA, we try to redefine the First Component Axis by changing the axis loadings, applying a linear or nonlinear equation to predict the response variable. The closest analogy is a logistic regression with a binary response variable and multiple continuous predictors. With an RDA we also have a binary response variable and multiple continuous predictors. Remember that we can recode a variable with N categories as N-1 binary variables, as we’ll see in our example below.
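As a quick illustration (a minimal sketch using a made-up factor, not the tutorial data), model.matrix() in R shows how a variable with N categories expands into N-1 binary ‘dummy’ variables:

Group <- factor(c("Control","COVID19","Influenza")) # hypothetical 3-level categorical variable
model.matrix(~ Group) # intercept plus N-1 = 2 binary dummy columns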
The terminology can get a bit confusing. You may see the term Discriminant Analysis (DA), which is really a Linear Discriminant Analysis (LDA), sometimes called a Discriminant Function Analysis (DFA). The LDA is a generalization of Fisher’s Linear Discriminant – the same Fisher who invented ANOVA and Fisher’s Exact Test, and who was also an Evolutionary Biologist who contributed key insights into population genetics.
Later, the LDA was generalized to the Quadratic Discriminant Analysis (QDA), which adds a tuning parameter to allow for unequal variances among the groups. And later still, the QDA was generalized to the Regularized Discriminant Analysis (RDA) by adding a second tuning parameter. We’ll look at these tuning parameters (\(\gamma\) and \(\lambda\)) in the example below.
Load the usual plotting and data management functions:
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
For RDA we will use functions from the MASS package. Don’t forget to install.packages("MASS") the first time you use it.
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
The data we’ll be working on today come from nasal swabs of patients that were analyzed using a method called metabolomics. This is just a fancy word for chemical analysis using High Performance Liquid Chromatography (HPLC). For our purposes, just know that there are a set of chemicals that we can identify and then measure their concentrations. These concentrations vary by orders of magnitude so this is a prime candidate for scaling our data, as discussed in the Intro to ML Tutorial.
VirDat <- read.csv("https://colauttilab.github.io/Data/ViralMetData.csv", header=T)
head(VirDat)
## Sample.Name Batch.Number Class.name Sex Age CT Gly Ala Ser Histamine
## 1 VTM a 1 VTM <NA> <NA> <NA> 33.6 7.48 5.16 0.183
## 2 VTM a1 1 VTM <NA> <NA> <NA> 33.9 10.50 7.49 0.138
## 3 VTM a2 1 VTM <NA> <NA> <NA> 32.1 19.20 7.70 0.159
## 4 VTM b 2 VTM <NA> <NA> <NA> 28.9 13.00 7.25 0.363
## 5 VTM b1 2 VTM <NA> <NA> <NA> 29.6 11.90 4.63 0.181
## 6 VTM b2 2 VTM <NA> <NA> <NA> 28.1 12.10 2.37 0.219
## Pro Val Thr Taurine Putrescine t4.OH.Pro Leu Ile Asn Asp Gln Glu
## 1 3.99 2.81 4.02 4.70 0.000 0.247 0.00 2.49 4.860 31.1 0.00 5840
## 2 4.20 3.42 6.78 3.87 0.000 0.213 0.00 1.54 5.750 35.2 0.00 5550
## 3 3.66 3.56 3.87 3.60 0.000 0.206 0.00 2.28 7.540 35.4 0.00 5230
## 4 3.95 2.80 5.70 4.27 0.472 3.290 3.46 2.37 3.610 22.9 2.43 0
## 5 4.30 3.12 7.96 3.73 0.510 3.420 5.50 2.10 0.353 25.6 1.97 0
## 6 3.43 2.66 5.91 3.98 0.467 2.910 3.50 2.14 0.967 30.1 1.86 0
## Met His Phe Met.SO Arg Ac.Orn Cit Tyr DOPA total.DMA Kynurenine
## 1 0.0674 2.21 2.81 0.114 6.90 0.0964 0.766 6.03 0.481 0.0000 0.000
## 2 0.0313 2.91 3.11 0.177 6.09 0.1510 1.480 5.78 0.516 0.1260 0.000
## 3 0.1380 2.81 3.16 0.246 6.79 0.1420 0.822 7.08 0.315 0.0000 0.000
## 4 0.3870 2.78 3.03 0.281 6.83 0.0000 0.000 7.11 0.641 0.0798 1.320
## 5 0.6350 1.94 3.17 0.284 6.14 0.1300 0.000 6.60 0.649 0.0000 0.938
## 6 0.5710 2.56 3.15 0.152 6.41 0.0000 0.000 6.49 0.190 0.0548 0.286
## Carnosine Nitro.Tyr Orn Lys Spermidine Spermine Sarcosine Tyramine
## 1 2.91 0.1710 1.230 3.58 0.2740 0.1580 0.0126 0.149
## 2 2.60 0.1620 1.300 3.84 0.0000 0.1350 0.0000 0.157
## 3 2.72 0.1540 0.987 3.80 0.0525 0.1280 0.0282 0.182
## 4 2.55 0.0505 1.240 4.66 1.8400 0.0442 0.1400 0.118
## 5 2.60 0.0620 1.010 3.82 3.0400 0.0403 0.1360 0.104
## 6 2.73 0.0479 0.938 4.31 1.6400 0.0000 0.1550 0.126
## Creatine Betaine Choline TMAO Lactic.acid beta.Hydroxybutyric.acid
## 1 0.895 23.6 2.25 0.151 30.2 0.000
## 2 0.847 21.8 2.46 0.191 22.8 0.000
## 3 0.858 20.5 3.06 0.144 29.1 0.000
## 4 1.030 20.3 2.88 0.683 30.8 0.522
## 5 1.010 21.9 1.31 0.503 36.3 0.512
## 6 0.869 18.5 1.60 0.506 27.2 0.509
## Citric.acid Butyric.acid Propionic.acid Succinic.acid Fumaric.acid
## 1 4.12 0.838 5.99 0.460 0.0476
## 2 4.15 1.040 8.58 0.211 0.0423
## 3 4.07 1.280 10.50 0.538 0.0473
## 4 4.87 0.327 2.60 1.870 0.0740
## 5 4.41 0.627 4.50 1.040 0.0700
## 6 4.26 0.598 4.17 0.782 0.0595
## Pyruvic.acid Isobutyric.acid Hippuric.acid Glucose LYSOC14.0 LYSOC16.1
## 1 3.56 1.200 0.160 28780 0.2513 0.0221
## 2 2.64 1.890 0.115 28971 0.2271 0.0159
## 3 3.77 2.350 0.145 28047 0.3094 0.0239
## 4 27.00 0.643 0.217 36007 0.4712 0.0855
## 5 4.95 1.210 0.223 37108 0.4591 0.0994
## 6 4.60 1.220 0.225 37698 0.4360 0.0806
## LYSOC16.0 LYSOC17.0 LYSOC18.2 LYSOC18.1 LYSOC18.0 LYSOC20.4 LYSOC20.3
## 1 0.1046 0.0293 0.0811 0.1958 0.0980 0.1013 0.9735
## 2 0.0741 0.0468 0.1276 0.1647 0.0944 0.1135 1.3377
## 3 0.1036 0.0444 0.1445 0.1153 0.1211 0.0981 1.3555
## 4 0.1951 0.1390 0.8328 1.0528 0.2628 0.3331 1.2771
## 5 0.1871 0.1907 0.8766 0.4555 0.3154 0.4256 2.4573
## 6 0.1688 0.0983 0.6199 0.5133 0.2453 0.4074 1.7006
## LYSOC24.0 LYSOC26.1 LYSOC26.0 LYSOC28.1 LYSOC28.0 X14.1SMOH X16.1SM X16.0SM
## 1 0.3398 0.3130 0.5302 0.0874 0.2037 0.1319 0.0947 0.1016
## 2 0.3465 0.3562 0.4859 0.0882 0.2694 0.0995 0.0938 0.1280
## 3 0.2983 0.3351 0.5359 0.1080 0.3199 0.1242 0.1380 0.1306
## 4 0.8346 0.7999 1.6933 0.4765 0.9388 0.9707 0.7634 0.8328
## 5 0.9145 1.0149 1.7288 0.5151 1.0953 1.1390 0.9097 1.1899
## 6 0.8560 0.7375 1.5725 0.4250 1.0660 0.7859 0.7659 0.8999
## X16.1SMOH X18.1SM PC32.2AA X18.0SM X20.2SM PC36.0AE PC36.6AA PC36.0AA
## 1 0.0672 0.1107 0.1325 0.1304 0.0349 0.0605 0.1681 0.1674
## 2 0.0629 0.0997 0.1121 0.1185 0.0337 0.1111 0.2039 0.2051
## 3 0.0859 0.1399 0.0993 0.1292 0.0406 0.0868 0.1289 0.1229
## 4 0.8141 0.7329 1.4576 0.8903 0.7397 1.0864 1.2992 1.7285
## 5 1.0085 0.9001 2.0743 0.8559 0.8003 1.8082 1.1285 1.3786
## 6 0.7482 0.7201 1.0170 0.6784 0.7201 1.1852 1.0221 0.6879
## X22.2SMOH X22.1SMOH PC38.6AA PC38.0AA PC40.6AE X24.1SMOH PC40.6AA PC40.2AA
## 1 0.0469 1.0360 0.0433 0.0583 0.0476 0.0355 0.0444 0.0287
## 2 0.0287 0.9698 0.0473 0.0606 0.0280 0.0242 0.0468 0.0276
## 3 0.0452 0.9878 0.0556 0.0606 0.0404 0.0402 0.0339 0.0180
## 4 0.5306 3.1635 0.7399 0.5914 0.7078 0.4832 0.5457 0.4900
## 5 0.7034 2.9238 1.0035 0.8721 0.9650 0.5335 0.7759 0.7727
## 6 0.5170 2.8234 0.4993 0.5475 0.4972 0.4367 0.4655 0.4162
## PC40.1AA C0 C2 C3.1 C3 C4.1 C4 C3OH C5.1 C5
## 1 0.0253 0.7811 0.1734 0.0418 0.0738 0.0514 0.0969 0.1107 0.0328 0.0366
## 2 0.0349 0.8777 0.1731 0.0561 0.0847 0.0711 0.0962 0.1275 0.0388 0.0420
## 3 0.0301 0.7166 0.1827 0.0332 0.0948 0.0797 0.0704 0.0896 0.0511 0.0365
## 4 0.5444 10.3667 1.4544 0.0968 0.3844 0.1764 0.8741 0.1950 0.0871 0.1848
## 5 0.6797 6.5042 1.0957 0.2116 0.4547 0.2367 0.5804 0.1676 0.0914 0.1451
## 6 0.4747 3.1333 0.5489 0.2988 0.4535 0.1486 0.3734 0.0825 0.0549 0.0795
## C4OH C6.1 C6 C5OH C5.1DC C5DC C8 C5MDC C9 C7DC
## 1 0.2561 0.1601 0.4502 2.0265 0.0547 0.0893 0.2407 0.0408 0.0481 0.0499
## 2 0.2341 0.2025 0.5572 2.1246 0.0796 0.1204 0.1818 0.0632 0.0548 0.0753
## 3 0.2128 0.1853 0.5039 2.1108 0.0857 0.0961 0.2002 0.0462 0.0417 0.0436
## 4 0.1919 0.0654 0.6398 6.0530 0.0684 0.0749 0.5027 0.0624 0.1417 0.0418
## 5 0.1918 0.0981 0.3269 7.2123 0.0838 0.0729 0.2413 0.0572 0.2149 0.0388
## 6 0.1338 0.0635 0.2497 7.6669 0.0839 0.0557 0.2117 0.0499 0.2214 0.0404
## C10.2 C10.1 C10 C12.1 C12 C14.2 C14.1 C14 C12DC
## 1 0.07403994 0.3575579 0.2306 0.2175 0.3944 0.0643 0.0124 1.1884 0.3570
## 2 0.04419391 0.3615120 0.2614 0.2451 0.4553 0.0885 0.1427 1.0042 0.4498
## 3 0.03847946 0.3324331 0.2935 0.2501 0.4633 0.0836 0.1344 0.9804 0.4447
## 4 0.24881239 0.2241368 0.3145 0.0611 0.2671 0.0114 -0.1195 0.1990 0.0947
## 5 0.19996217 0.2384443 0.1436 0.0680 0.2248 0.0208 -0.3304 0.1534 0.1300
## 6 0.13533840 0.2486022 0.1176 0.0623 0.1673 0.0135 0.0157 0.1201 0.0939
## C14.2OH C14.1OH C16.2 C16.1 C16 C16.2OH C16.1OH C16OH C18.2 C18.1
## 1 0.0384 0.1149 0.0367 0.0381 0.0338 0.0193 0.0258 0.1279 0.0391 0.0641
## 2 0.0439 0.1248 0.0386 0.0332 0.0346 0.0246 0.0333 0.1222 0.0395 0.0705
## 3 0.0385 0.1227 0.0395 0.0356 0.0513 0.0236 0.0346 0.1254 0.0466 0.0726
## 4 0.0151 0.0222 0.0192 0.0203 0.1364 0.0156 0.0178 0.0893 0.0334 0.0446
## 5 0.0250 0.0271 0.0290 0.0258 0.0812 0.0202 0.0288 0.0999 0.0327 0.0532
## 6 0.0225 0.0254 0.0363 0.0240 0.0528 0.0190 0.0258 0.0855 0.0450 0.0410
## C18 C18.1OH
## 1 0.0382 0.1378
## 2 0.0476 0.1566
## 3 0.0626 0.1735
## 4 0.1363 0.0396
## 5 0.0929 0.0583
## 6 0.0688 0.0648
Note: Quantitative PCR (qPCR) is also known as Real-Time PCR (RT-PCR), which is NOT THE SAME as reverse-transcription PCR (also RT-PCR). See Wikipedia: qPCR/RT-PCR vs. reverse-transcription PCR.
The CT or ‘cycle threshold’ is the number of PCR cycles that have run before the target sequence reaches the detection threshold. So the LARGER the CT, the more PCR cycles, and therefore the LESS template. Less template DNA represents a lower viral load. Therefore: Higher CT = Lower viral load.
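As a rough numerical sketch (assuming ~100% PCR efficiency, so the template doubles every cycle; these CT values are hypothetical):

CT_A <- 20 # patient A
CT_B <- 23 # patient B crosses the detection threshold 3 cycles later
2^(CT_B - CT_A) # patient A has ~8x the starting template (viral load) of patient B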
Looking at a few of the concentrations, we can see characteristic log-normal distributions, with very different median concentrations (compare the x-axes):
qplot(x=Gly,data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(x=Pyruvic.acid,data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
qplot(x=log(Gly+1),data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
First we should do some quality checks and modify data to fit assumptions of multivariate normality.
One thing we should do is set our batch effects as a factor so that we don’t accidentally analyze it as a numeric variable:
VirDat$Batch.Number <- as.factor(VirDat$Batch.Number)
The next thing we’ll want to do, to make things easier, is create separate data sets for our predictor and response variables.
From above, we see that the first six columns contain the sample information and response variables, and the rest are our features:
RespDat <- VirDat %>%
  select(1:6)
## Error in select(., 1:6): unused argument (1:6)
What does this error mean? It’s a bit tricky, but there is a clue if we look at the help for ?select. If you do that, you will see that select is a function with the same name in BOTH the dplyr AND the MASS libraries. Since MASS was loaded second in our setup above, R assumes that the MASS version of select is the one we want to use. To avoid this error, we use :: to specify the package:
RespDat <- VirDat %>%
  dplyr::select(1:6)
Features <- VirDat %>%
  dplyr::select(-c(1:6))
Now we have two datasets: RespDat containing the response/grouping variables, and Features containing the potential predictors.
Verify that the correct columns are subset in each dataset
Scaling is an important part of QA/QC, but it’s important enough to get its own heading.
What we want to do is scale all of our chemical columns to the same mean and standard deviation. Rather than code a very long pipe to scale each column one at a time, we can take a shortcut now that the features are in their own object and scale every column at once. Really, we should look at each scaled histogram to decide whether the feature should also be log-transformed, but for now we’ll just use the regular scaling:
Scaled <- Features %>%
  mutate_all(scale)
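If the histograms above suggest log-normal features, a hedged variant is to log-transform before scaling, using the same log(x+1) adjustment as in the plots above (a sketch, not run here):

LogScaled <- Features %>%
  mutate_all(~ scale(log(.x + 1)))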
As with PCA, missing data will cause a problem for our analysis, so we should check whether any of our features have missing values:
Scaled %>%
  select_if(function(x) any(is.na(x))) %>%
  names()
## [1] "Putrescine" "Leu" "Asp" "Lactic.acid"
## [5] "Butyric.acid" "Succinic.acid" "Pyruvic.acid"
As with our PCA, we want to impute missing data or else we have to exclude the entire row even if only one value is missing. Since we have scaled everything to a mean of 0, there is a quick and easy way to do this now that we have our features in a separate object:
ScalComp <- Scaled %>%
  mutate(Putrescine = ifelse(is.na(Putrescine),0,Putrescine),
         Leu = ifelse(is.na(Leu),0,Leu),
         Asp = ifelse(is.na(Asp),0,Asp),
         Lactic.acid = ifelse(is.na(Lactic.acid),0,Lactic.acid),
         Butyric.acid = ifelse(is.na(Butyric.acid),0,Butyric.acid),
         Succinic.acid = ifelse(is.na(Succinic.acid),0,Succinic.acid),
         Pyruvic.acid = ifelse(is.na(Pyruvic.acid),0,Pyruvic.acid))
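Listing each affected column works, but with many columns a more compact equivalent (a sketch assuming dplyr >= 1.0 for across()) replaces every NA in one step:

ScalComp <- Scaled %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x)))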
Now check our QA/QC output:
mean(ScalComp$Gly)
## [1] 1.932406e-17
sd(ScalComp$Gly)
## [1] 1
qplot(x=Gly,data=ScalComp)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note the mean = 0 and sd = 1. The data are not perfectly normal, but it’s close enough to continue our analysis. Real-world data are messy.
Looking at the dimensions of our scaled features:
dim(ScalComp)
## [1] 221 124
We can see that we have almost as many columns of predictors (\(k=124\)) as rows of observations (\(n=221\)). If we just throw everything into an RDA we risk ‘overfitting’ our model, just as we would run into a false discovery problem if we ran a separate linear model for each feature as a predictor.
This problem is discussed in the Intro ML Tutorial.
One solution would be to do a PCA and retain only the major axes. Since we have already scaled our variables, we do not need cor=TRUE as we did in the PCA Tutorial.
PCA <- princomp(ScalComp)
summary(PCA)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 6.3821252 5.1867984 2.66306699 2.31134733 2.28112784
## Proportion of Variance 0.3301668 0.2180728 0.05748663 0.04330451 0.04217955
## Cumulative Proportion 0.3301668 0.5482395 0.60572616 0.64903067 0.69121022
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 2.19188178 1.91730212 1.90138064 1.72335124 1.54361558
## Proportion of Variance 0.03894368 0.02979777 0.02930494 0.02407411 0.01931439
## Cumulative Proportion 0.73015390 0.75995167 0.78925661 0.81333072 0.83264512
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 1.36845303 1.25622475 1.18204835 1.17288175 1.093284717
## Proportion of Variance 0.01517968 0.01279197 0.01132591 0.01115093 0.009688784
## Cumulative Proportion 0.84782479 0.86061676 0.87194267 0.88309360 0.892782387
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 1.003211070 0.893831514 0.856258086 0.848119859
## Proportion of Variance 0.008158068 0.006476107 0.005943087 0.005830653
## Cumulative Proportion 0.900940455 0.907416562 0.913359649 0.919190302
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.782919361 0.750597233 0.735602081 0.715453180
## Proportion of Variance 0.004968631 0.004566849 0.004386202 0.004149207
## Cumulative Proportion 0.924158933 0.928725782 0.933111983 0.937261191
## Comp.24 Comp.25 Comp.26 Comp.27
## Standard deviation 0.684060473 0.643251089 0.621057226 0.603620066
## Proportion of Variance 0.003793077 0.003354005 0.003126554 0.002953453
## Cumulative Proportion 0.941054268 0.944408274 0.947534828 0.950488280
## Comp.28 Comp.29 Comp.30 Comp.31
## Standard deviation 0.597518251 0.579210835 0.54923767 0.536302211
## Proportion of Variance 0.002894043 0.002719419 0.00244525 0.002331427
## Cumulative Proportion 0.953382324 0.956101742 0.95854699 0.960878420
## Comp.32 Comp.33 Comp.34 Comp.35
## Standard deviation 0.507611912 0.499457327 0.489584585 0.475509121
## Proportion of Variance 0.002088653 0.002022085 0.001942935 0.001832822
## Cumulative Proportion 0.962967073 0.964989158 0.966932093 0.968764915
## Comp.36 Comp.37 Comp.38 Comp.39
## Standard deviation 0.467444736 0.453521722 0.430367887 0.423244431
## Proportion of Variance 0.001771182 0.001667243 0.001501352 0.001452062
## Cumulative Proportion 0.970536098 0.972203341 0.973704692 0.975156754
## Comp.40 Comp.41 Comp.42 Comp.43
## Standard deviation 0.41543478 0.394554362 0.37686464 0.367478215
## Proportion of Variance 0.00139897 0.001261875 0.00115126 0.001094626
## Cumulative Proportion 0.97655572 0.977817600 0.97896886 0.980063486
## Comp.44 Comp.45 Comp.46 Comp.47
## Standard deviation 0.363294995 0.353788911 0.3401805528 0.3366898090
## Proportion of Variance 0.001069847 0.001014591 0.0009380407 0.0009188881
## Cumulative Proportion 0.981133333 0.982147924 0.9830859647 0.9840048529
## Comp.48 Comp.49 Comp.50 Comp.51
## Standard deviation 0.3327567474 0.3090193922 0.3055648632 0.2927015712
## Proportion of Variance 0.0008975454 0.0007740592 0.0007568495 0.0006944689
## Cumulative Proportion 0.9849023983 0.9856764575 0.9864333070 0.9871277759
## Comp.52 Comp.53 Comp.54 Comp.55
## Standard deviation 0.2840360071 0.2798492127 0.2627268047 0.2516579531
## Proportion of Variance 0.0006539575 0.0006348204 0.0005595147 0.0005133624
## Cumulative Proportion 0.9877817333 0.9884165538 0.9889760684 0.9894894308
## Comp.56 Comp.57 Comp.58 Comp.59
## Standard deviation 0.2487533502 0.2406933129 0.2363567318 0.2323744829
## Proportion of Variance 0.0005015804 0.0004696029 0.0004528336 0.0004377031
## Cumulative Proportion 0.9899910112 0.9904606141 0.9909134477 0.9913511507
## Comp.60 Comp.61 Comp.62 Comp.63
## Standard deviation 0.2277088209 0.221973178 0.2164625404 0.208967530
## Proportion of Variance 0.0004203029 0.000399396 0.0003798116 0.000353965
## Cumulative Proportion 0.9917714537 0.992170850 0.9925506612 0.992904626
## Comp.64 Comp.65 Comp.66 Comp.67
## Standard deviation 0.2073410339 0.2040356432 0.1956967524 0.191130312
## Proportion of Variance 0.0003484763 0.0003374542 0.0003104345 0.000296116
## Cumulative Proportion 0.9932531025 0.9935905567 0.9939009911 0.994197107
## Comp.68 Comp.69 Comp.70 Comp.71
## Standard deviation 0.1853142226 0.1838355514 0.1804110054 0.1757225048
## Proportion of Variance 0.0002783686 0.0002739439 0.0002638328 0.0002502981
## Cumulative Proportion 0.9944754757 0.9947494196 0.9950132524 0.9952635505
## Comp.72 Comp.73 Comp.74 Comp.75
## Standard deviation 0.1729722998 0.1665406314 0.1636398334 0.1601132101
## Proportion of Variance 0.0002425246 0.0002248242 0.0002170605 0.0002078055
## Cumulative Proportion 0.9955060751 0.9957308993 0.9959479598 0.9961557653
## Comp.76 Comp.77 Comp.78 Comp.79
## Standard deviation 0.1548932263 0.1538713272 0.1493139357 0.1418543039
## Proportion of Variance 0.0001944767 0.0001919191 0.0001807188 0.0001631127
## Cumulative Proportion 0.9963502420 0.9965421610 0.9967228798 0.9968859925
## Comp.80 Comp.81 Comp.82 Comp.83
## Standard deviation 0.1403878890 0.1384831961 0.1357036337 0.1324809584
## Proportion of Variance 0.0001597578 0.0001554522 0.0001492745 0.0001422688
## Cumulative Proportion 0.9970457503 0.9972012025 0.9973504770 0.9974927458
## Comp.84 Comp.85 Comp.86 Comp.87
## Standard deviation 0.1318259677 0.1282629147 0.1260543403 0.1211127
## Proportion of Variance 0.0001408655 0.0001333537 0.0001288007 0.0001189
## Cumulative Proportion 0.9976336113 0.9977669649 0.9978957657 0.9980147
## Comp.88 Comp.89 Comp.90 Comp.91
## Standard deviation 0.1199902258 0.1163319481 0.1154746915 0.1135568412
## Proportion of Variance 0.0001167063 0.0001096985 0.0001080877 0.0001045272
## Cumulative Proportion 0.9981313720 0.9982410705 0.9983491582 0.9984536854
## Comp.92 Comp.93 Comp.94 Comp.95
## Standard deviation 0.1126370546 1.081564e-01 1.062043e-01 1.034501e-01
## Proportion of Variance 0.0001028408 9.482162e-05 9.142964e-05 8.674902e-05
## Cumulative Proportion 0.9985565262 9.986513e-01 9.987428e-01 9.988295e-01
## Comp.96 Comp.97 Comp.98 Comp.99
## Standard deviation 1.016806e-01 9.931112e-02 9.791585e-02 9.516176e-02
## Proportion of Variance 8.380665e-05 7.994632e-05 7.771568e-05 7.340534e-05
## Cumulative Proportion 9.989133e-01 9.989933e-01 9.990710e-01 9.991444e-01
## Comp.100 Comp.101 Comp.102 Comp.103
## Standard deviation 9.173623e-02 9.009086e-02 8.454985e-02 8.319505e-02
## Proportion of Variance 6.821572e-05 6.579065e-05 5.794666e-05 5.610449e-05
## Cumulative Proportion 9.992126e-01 9.992784e-01 9.993364e-01 9.993925e-01
## Comp.104 Comp.105 Comp.106 Comp.107
## Standard deviation 8.150336e-02 7.978429e-02 7.786984e-02 7.622690e-02
## Proportion of Variance 5.384604e-05 5.159855e-05 4.915201e-05 4.709982e-05
## Cumulative Proportion 9.994463e-01 9.994979e-01 9.995471e-01 9.995942e-01
## Comp.108 Comp.109 Comp.110 Comp.111
## Standard deviation 7.242232e-02 6.965001e-02 6.890923e-02 6.608172e-02
## Proportion of Variance 4.251553e-05 3.932285e-05 3.849084e-05 3.539691e-05
## Cumulative Proportion 9.996367e-01 9.996760e-01 9.997145e-01 9.997499e-01
## Comp.112 Comp.113 Comp.114 Comp.115
## Standard deviation 6.381898e-02 6.024836e-02 5.825819e-02 5.485355e-02
## Proportion of Variance 3.301432e-05 2.942342e-05 2.751165e-05 2.439002e-05
## Cumulative Proportion 9.997829e-01 9.998123e-01 9.998398e-01 9.998642e-01
## Comp.116 Comp.117 Comp.118 Comp.119
## Standard deviation 0.0540414728 4.982572e-02 4.948327e-02 0.0469600934
## Proportion of Variance 0.0000236732 2.012379e-05 1.984812e-05 0.0000178756
## Cumulative Proportion 0.9998878932 9.999080e-01 9.999279e-01 0.9999457407
## Comp.120 Comp.121 Comp.122 Comp.123
## Standard deviation 4.299296e-02 4.241059e-02 3.986074e-02 3.395811e-02
## Proportion of Variance 1.498295e-05 1.457979e-05 1.287933e-05 9.347377e-06
## Cumulative Proportion 9.999607e-01 9.999753e-01 9.999882e-01 9.999975e-01
## Comp.124
## Standard deviation 1.745566e-02
## Proportion of Variance 2.469876e-06
## Cumulative Proportion 1.000000e+00
Here we can see most of the variation is explained by the first few PC axes. BUT we also know that humans can have a lot of variation in their metabolomic profiles. What if the difference due to infection accounts for only a small amount of variation in metabolites? In that case, we might be excluding valuable data.
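If we did take the PCA route, a minimal sketch would be to subset the scores matrix from princomp (keeping five axes here is an arbitrary choice for illustration):

PCkeep <- as.data.frame(PCA$scores[, 1:5]) # first 5 PC axes as new predictors
dim(PCkeep) # 221 rows, 5 columns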
Since we have a categorical response variable, we have another option for feature selection. Instead of collapsing to a few PC axes, we can quickly look at the predictive power of each feature and only retain those that are reasonably good at distinguishing among groups.
‘Reasonably good’ is a subjective term, and the definition we use will depend on the number of features that we want to include.
Probably the simplest approach is to fit a simple linear model for each feature, with Class.name as the predictor and the feature as the response. Then we can use the R^2 or p-value to define a cutoff. We aren’t too worried about the False Discovery Rate here because we aren’t testing 124 independent hypotheses. Instead, we are just using the fit of the models to decide which features to include in the analysis.
Now the problem is that we would have to write code for 124 different linear models, look at the output, and then decide which p-value to use as a cutoff.
We can use a trick here that we first covered in the Mixed Models Tutorial when we talked about repeated measures data and the ‘long’ format.
In this case, the ‘long’ format means that we collapse all 124 feature columns into two columns: one for the value observed for each feature (i.e. the number in each cell) and a second column specifying which feature it is (i.e. the column name). This is easily done with the pivot_longer() function from the tidyr library:
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
FeatureSel <- ScalComp %>%
  mutate(Class.name=VirDat$Class.name) %>% # Add the Response variable
  pivot_longer(cols=-Class.name,
               names_to="Chem",
               values_to="Conc")
str(FeatureSel)
## tibble [27,404 x 3] (S3: tbl_df/tbl/data.frame)
## $ Class.name: chr [1:27404] "VTM" "VTM" "VTM" "VTM" ...
## $ Chem : chr [1:27404] "Gly" "Ala" "Ser" "Histamine" ...
## $ Conc : num [1:27404, 1] -1.31 -1.23 -1.39 -0.33 -1.12 ...
Here we see just three columns: Class.name, the specific chemical (formerly the column header), and the concentration (formerly the cell value under each column header).
We have 27,404 rows instead of 221 because each of the original rows is repeated 124 times (once for each feature). We can do a lot of things with these data. One obvious one would be to use facets=Chem in a qplot or ggplot to create a separate graph for each of the chemicals, as sketched below. This would be useful for visually inspecting for outliers, log-normal distributions, etc. We might want to output this to a large JPEG or PDF file since it would be hard to squeeze all those graphs into our small plotting area.
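For example, a hedged sketch of that faceted figure, written to a large PDF so that all 124 panels stay legible (the file name is just an example):

AllHist <- ggplot(aes(x=Conc), data=FeatureSel) +
  geom_histogram() +
  facet_wrap(~ Chem, scales="free_x") # one panel per chemical
ggsave("FeatureHistograms.pdf", AllHist, width=40, height=40, limitsize=FALSE)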
Another thing we can do now is calculate summary statistics to make sure we have mean ~ 0 and sd ~ 1, and check the max and min values:
FeatureSel %>%
  group_by(Chem) %>%
  summarize(MeanConc=mean(Conc),
            sd=sd(Conc),
            max=max(Conc),
            min=min(Conc))
## # A tibble: 124 x 5
## Chem MeanConc sd max min
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Ac.Orn -4.33e-17 1 11.8 -0.420
## 2 Ala -4.37e-17 1 5.25 -1.23
## 3 Arg 8.19e-17 1 3.66 -1.34
## 4 Asn -6.19e-17 1 5.29 -1.16
## 5 Asp -2.65e-17 0.995 3.64 -1.47
## 6 beta.Hydroxybutyric.acid 1.25e-17 1 11.6 -0.370
## 7 Betaine -6.06e-17 1 2.71 -2.92
## 8 Butyric.acid -3.10e-17 0.998 11.2 -0.595
## 9 C0 -1.20e-16 1 7.25 -1.23
## 10 C10 7.05e-17 1 2.24 -1.09
## # ... with 114 more rows
We can also run a separate linear model for each chemical using group_by, except that we want to extract the p-values specifically. To think about how to do this, let’s look at a single linear model:
Mod <- lm(ScalComp$Gly ~ RespDat$Class.name)
ModOut <- anova(Mod)
ModOut
## Analysis of Variance Table
##
## Response: ScalComp$Gly
## Df Sum Sq Mean Sq F value Pr(>F)
## RespDat$Class.name 4 60.042 15.0105 20.27 3.427e-14 ***
## Residuals 216 159.958 0.7405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova function gives us the p-values, which we can pull out if we assign the output of anova to an object:
ModOut[1,"Pr(>F)"]
## [1] 3.427345e-14
Or we can just create a nested function call:
anova(lm(ScalComp$Gly ~ RespDat$Class.name))[1,"Pr(>F)"]
## [1] 3.427345e-14
Now we apply this with dplyr, which gives us a slightly weird object that we can convert back to a data.frame:
PVals <- FeatureSel %>%
  group_by(Chem) %>%
  summarize(P = anova(lm(Conc ~ Class.name))[1,"Pr(>F)"]) %>%
  dplyr::select(Chem,P)
head(PVals)
## # A tibble: 6 x 2
## Chem P
## <chr> <dbl>
## 1 Ac.Orn 7.04e- 1
## 2 Ala 2.07e-12
## 3 Arg 9.53e-18
## 4 Asn 3.32e-10
## 5 Asp 1.93e-20
## 6 beta.Hydroxybutyric.acid 1.12e- 1
So how do we find a good cutoff value? The more stringent the cutoff (i.e. the smaller the P), the stronger our prediction, but the fewer features we will retain. We need to find a balance. A good first step is to plot a histogram:
qplot(x=P,data=PVals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see more than 40 features close to zero. That shows us that there are a lot of features (i.e. biological compounds) that differ among the groups. It’s much better than 124, but still quite a few for this small dataset of just 221 observations. We can try to reduce this further by thinking about the biology of the system and the goal of our model.
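To count exactly how many features fall below a candidate cutoff:

sum(PVals$P < 0.05) # number of features with P < 0.05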
We want to predict the differences among groups:
unique(VirDat$Class.name)
## [1] "VTM" "Control" "COVID19" "Influenza" "RSV"
But we also need to come up with binary groupings to run an RDA, so we could focus on a few different goals: for example, distinguishing infected from uninfected samples, distinguishing COVID19 from everything else, or distinguishing COVID19 from the other respiratory infections (Influenza and RSV).
There are of course many more goals/models we could come up with using different binary combinations of these groups, but these are probably the more interesting ones. For each model we could do a separate feature selection step. Let’s focus on the last one – distinguishing COVID19 patients from other respiratory infections. This means we’ll have to drop VTM and Control, and then combine Influenza and RSV into a single group. Then we re-run our feature selection pipeline as above:
PCOV <- FeatureSel %>%
  filter(Class.name %in% c("COVID19","Influenza","RSV")) %>%
  mutate(NewGroup = replace(Class.name, Class.name == "Influenza", "NonCOV")) %>%
  mutate(NewGroup = replace(NewGroup, Class.name == "RSV", "NonCOV")) %>%
  group_by(Chem) %>%
  summarize(P = anova(lm(Conc ~ NewGroup))[1,"Pr(>F)"]) %>%
  dplyr::select(Chem,P)
qplot(x=P,data=PCOV) + xlim(0,0.1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Now we are down to ~15 features with P < 0.05, which we can use to define a new Features dataset:
Keep <- PCOV %>%
  filter(P < 0.05)
Keep <- paste(Keep$Chem)
This gives us a vector of chemical names that we can use to select columns:
ScaledSub <- ScalComp %>%
  dplyr::select(all_of(Keep))
names(ScaledSub)
## [1] "Betaine" "C14.1OH" "C16.1OH" "C2" "C7DC"
## [6] "Carnosine" "Cit" "Glucose" "LYSOC16.0" "LYSOC18.0"
## [11] "Met.SO" "Putrescine" "Taurine" "Tyramine"
Now that we have our subset of features, we can run a Linear Discriminant Analysis (LDA). The lda function from the MASS package is pretty straightforward. There are a couple of ways to specify the model. The first is to use an equation similar to a linear model: Y ~ ., where Y is the column name of the categorical variable and . means ‘all other columns of data’; we would also need to specify the data object with data =. Since we have our categorical response variable and features in two different objects, we can instead plug them in without the data = parameter.
We also have to remember to recode our response into the same binary variables as above.
RDAResp <- RespDat %>%
  mutate(NewGroup = replace(Class.name, Class.name != "COVID19", "NonCOV"))
LDAMod <- lda(x=ScaledSub, grouping=RDAResp$NewGroup)
Let’s take a look at the output of the LDAMod. What type of data structure does it create?
str(LDAMod)
## List of 8
## $ prior : Named num [1:2] 0.249 0.751
## ..- attr(*, "names")= chr [1:2] "COVID19" "NonCOV"
## $ counts : Named int [1:2] 55 166
## ..- attr(*, "names")= chr [1:2] "COVID19" "NonCOV"
## $ means : num [1:2, 1:14] 0.1595 -0.0528 0.4789 -0.1587 0.7463 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "COVID19" "NonCOV"
## .. ..$ : chr [1:14] "Betaine" "C14.1OH" "C16.1OH" "C2" ...
## $ scaling: num [1:14, 1] -0.5058 -0.0492 -0.5285 0.4728 0.1058 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:14] "Betaine" "C14.1OH" "C16.1OH" "C2" ...
## .. ..$ : chr "LD1"
## $ lev : chr [1:2] "COVID19" "NonCOV"
## $ svd : num 13.6
## $ N : int 221
## $ call : language lda(x = ScaledSub, grouping = RDAResp$NewGroup)
## - attr(*, "class")= chr "lda"
We can try the summary() function:
summary(LDAMod)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 28 -none- numeric
## scaling 14 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
Unlike a linear model (lm) object, the summary() function for an lda object summarizes the object itself. The left-hand column gives the names of the list items, and ‘Length’ gives the number of elements.
Let’s take a quick look at a few of the list items, which are also explained in the lda help: ?lda.
Counts show the sample size for each group:
LDAMod$counts
## COVID19 NonCOV
## 55 166
Scaling shows the factor loadings. Compare these to the eigenvectors of a Principal Components Analysis. There are some important differences: we only have ONE LD axis for 14 features, whereas PCA would give us 14 axes. The number of axes in an LDA is determined by the number of categories of the response variable, rather than the number of features.
The number of LD axes = Number of Categories - 1
Apart from that, the idea is the same. The loadings for LD1 show how each feature contributes to the LD1 axis.
LDAMod$scaling
## LD1
## Betaine -0.50577476
## C14.1OH -0.04919074
## C16.1OH -0.52854092
## C2 0.47279012
## C7DC 0.10577965
## Carnosine 0.38982307
## Cit -0.05929275
## Glucose 0.18166895
## LYSOC16.0 0.23176779
## LYSOC18.0 -0.53863330
## Met.SO 0.24488153
## Putrescine -0.29620077
## Taurine -0.35756495
## Tyramine 0.64130142
We can see above that higher values of LD1 are determined largely by higher values of Tyramine and C2, and lower values of LYSOC18.0, C16.1OH, Betaine, and Taurine. This might also tell us something about the pathology of the disease, but here we are just focusing on the data analysis.
Another important distinction from the princomp output is that there are no scores in the LDA output. To find the scores, we use a different approach: the predict function with the LDA object generates this additional info:
PredOut <- predict(LDAMod)
summary(PredOut)
## Length Class Mode
## class 221 factor numeric
## posterior 442 -none- numeric
## x 221 -none- numeric
Note that we have a class prediction and an x prediction, and both have the same length as the number of observations in our dataset.
The x object here is the predicted score for the LD axis:
head(PredOut$x)
## LD1
## [1,] 1.089149
## [2,] 1.142026
## [3,] 1.033887
## [4,] 2.002258
## [5,] 1.425790
## [6,] 1.625491
and the class object is the predicted category:
head(PredOut$class)
## [1] NonCOV NonCOV NonCOV NonCOV NonCOV NonCOV
## Levels: COVID19 NonCOV
We can generate a confusion matrix by comparing the predicted vs. observed.
CatDat <- data.frame(Observed=as.factor(RDAResp$NewGroup), Predicted=PredOut$class)
table(CatDat)
## Predicted
## Observed COVID19 NonCOV
## COVID19 36 19
## NonCOV 7 159
This is called a confusion matrix and it is related to the false discovery rates covered in the Advanced LM Tutorial.
Be careful with the rows and columns. By convention, the prediction is usually given as columns and the observed/actual given as rows, but it’s worth checking.
Using our LDA model, we have 36 True Positive, 159 True Negative, 7 False Positive, and 19 False Negative cases. We can calculate the model accuracy, specificity, and sensitivity using these values.
Calculate the Accuracy, Specificity, and Sensitivity of this LDA model
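One way to check your answers (a sketch using the counts from the confusion matrix above, treating COVID19 as the ‘positive’ class):

TP <- 36; FN <- 19; FP <- 7; TN <- 159 # counts from the table above
(TP + TN) / (TP + TN + FP + FN) # Accuracy: ~0.88
TP / (TP + FN) # Sensitivity (true positive rate): ~0.65
TN / (TN + FP) # Specificity (true negative rate): ~0.96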
The posterior object gives the posterior probability for each category. This is a concept from Bayesian statistics that isn’t covered here. For now, just know that these are the probabilities for assigning each observation to each group:
Post <- as.data.frame(PredOut$posterior)
head(Post)
## COVID19 NonCOV
## 1 0.010602495 0.9893975
## 2 0.009490279 0.9905097
## 3 0.011902721 0.9880973
## 4 0.001548195 0.9984518
## 5 0.005226896 0.9947731
## 6 0.003430964 0.9965690
The rows add up to 1, and the columns represent the probability that each individual (row) belongs to that category. For example, we see in all 6 cases that there is predicted to be a >98% chance that these individuals belong to the NonCOV group. We can get a sense of how well our model performs by plotting these:
Post$Group <- RDAResp$NewGroup
qplot(x=COVID19, y=NonCOV, colour=Group, data=Post, alpha=I(0.3))
A perfect model would have all of the COVID samples in the lower right corner and all of the NonCOV samples in the top left corner.
The X-axis of this graph is the predicted probability that the patient has COVID19. Let’s compare this to the LD1 scores:
Post$LD1 <- as.vector(PredOut$x)
qplot(LD1, COVID19, data=Post)
We can see how the probability is a nonlinear function of LD1. Now imagine a new patient comes in and we look at their metabolite profile to predict whether they have COVID19. We can define the probability, but policy is usually based on a firm decision. Ultimately we have to classify this patient one way or the other, and that will have real-world consequences for the patient and the potential spread of infection.
Which value of LD1 should we use to categorize the patient?
One answer might be the point along LD that corresponds to 0.5 on the y-axis.
On the other hand, we might want to err on the side of caution to limit our false-negative rate (i.e. patients with COVID who are told they don’t have it).
There may be other (non-COVID) applications where we instead want to err on the side of limiting our false-positive rate.
Regardless, it would be helpful to know how different threshold values of LD1 affect the number of false positives vs. false negatives. This is shown in a graph called the ROC curve.
The Receiver Operating Characteristic (ROC) curve is a measure of model performance based on the trade-off between false positives and false negatives. To trace the curve, we slide a threshold value along LD1; at each threshold we build the confusion matrix and calculate the true positive rate (sensitivity: the proportion of actual positives correctly classified) and the false positive rate (1 - specificity: the proportion of actual negatives incorrectly classified).
Rather than calculate this by hand, we can use the pROC package:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
plot.roc(Group~LD1, data=Post)
## Setting levels: control = COVID19, case = NonCOV
## Setting direction: controls < cases
Let’s take a moment to think about what this curve means. If our LD axis were a random predictor, we would expect predictions along the grey line. Our actual model is the dark black line.
Now looking at the x-axis, you can see we start on the left-hand side with 100% specificity and 0% sensitivity. As we move right, we quickly increase sensitivity with only a tiny decrease in specificity, up to about 0.8 on the y-axis. After this, we have to sacrifice a lot of specificity for a small increase in sensitivity.
Overall, this is a pretty good model.
What would the curve look like for a super-accurate model?
We can measure the Area Under the Curve (AUC), also called the Area Under the ROC (AUROC), as a measure of the performance of the model. The random model is a simple triangle (grey line, above), which shows that the minimum AUC is 0.5. As our model improves, we get something closer to a square (100% sensitivity across all values of specificity), giving us the maximum value of 1.
AUC ranges from 0.5 to 1.
We can use the auc function from the same pROC package to calculate the area under the curve:
auc(Group~LD1, data=Post)
## Setting levels: control = COVID19, case = NonCOV
## Setting direction: controls < cases
## Area under the curve: 0.9298
What if we want to expand our model to look at all of the different classes, not just COVID vs nonCOV? A good approach is to set different binary axes and run a separate LDA for each.
On the other hand, if we want to throw everything into an LDA, we can do that too:
LDAMult <- lda(x=ScalComp, grouping=RDAResp$Class.name)
summary(LDAMult)
## Length Class Mode
## prior 5 -none- numeric
## counts 5 -none- numeric
## means 620 -none- numeric
## scaling 496 -none- numeric
## lev 5 -none- character
## svd 4 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
Note that we now have 5 categories and 4 LD axes. Each LD axis is like a mini LDA for each category vs. all of the others grouped together.
Notice in the LDA how our LD1 axis is a linear combination of our features. Just like principal component axes, we can calculate a score for each observation (i.e. patient) by multiplying each standardized feature value by its factor loading and summing across features.
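Since our scaled features already have means near zero, a minimal sketch of this calculation by hand; the result should closely match PredOut$x, up to the centering that predict() applies internally:

manualLD1 <- as.matrix(ScaledSub) %*% LDAMod$scaling # (observations x features) %*% (features x 1)
head(manualLD1)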
But what if factors have nonlinear effects? The QDA is a nonlinear form of discriminant analysis that is useful when the different groups have different variances. In our case, we grouped VTM and uninfected patients together with the Influenza and RSV patients; these are likely more variable than the COVID19 patients.
The Quadratic Discriminant Analysis (QDA) scales the transformation matrix that is used to calculate the LD axes. The ‘Quadratic’ in QDA refers to the predictor, which is quadratic as opposed to the linear LD1 axis in LDA. In R this is as simple as changing the function name:
QDAMod <- qda(x=ScaledSub, grouping=RDAResp$NewGroup)
summary(QDAMod)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 28 -none- numeric
## scaling 392 -none- numeric
## ldet 2 -none- numeric
## lev 2 -none- character
## N 1 -none- numeric
## call 3 -none- call
and the predictions:
QpredOut <- predict(QDAMod)
summary(QpredOut)
## Length Class Mode
## class 221 factor numeric
## posterior 442 -none- numeric
Compare these to LDA. What is missing?
The biggest change is the loss of the linear predictor (x in the LDA prediction output). That’s because of the nonlinear transformation of the data.
How good are the predictions? Let’s look at the confusion matrix:
Qcon <- data.frame(Obs=as.factor(RDAResp$NewGroup), Pred=QpredOut$class)
table(Qcon)
## Pred
## Obs COVID19 NonCOV
## COVID19 36 19
## NonCOV 2 164
Compared to the LDA, this model performs noticeably better: only 2 false positives instead of 7, with the same number of false negatives. On the other hand, we can’t draw an ROC curve or calculate the AUROC because we don’t have a linear predictor.
Comparing the LDA to the QDA is a good example of the trade-off between prediction and interpretability that we covered in the Intro to ML Tutorial. The QDA is better at predicting the category, but the LDA is more interpretable because we can look at the scaling of LD1 to see how the features (biochemicals) map to the prediction (LD1).
Finally, we arrive at the Regularized Discriminant Analysis (RDA), which incorporates elements of both the LDA and QDA.
The analysis requires two ‘tuning parameters’, \(\gamma\) and \(\lambda\). Each ranges from 0 to 1, and together they define a family of models with 4 special cases at the corners of the parameter space:
- \(\gamma=0, \lambda=0\): QDA (each group keeps its own covariance matrix)
- \(\gamma=0, \lambda=1\): LDA (one covariance matrix pooled across groups)
- \(\gamma=1, \lambda=0\): each group’s covariance matrix is replaced by a scaled identity matrix
- \(\gamma=1, \lambda=1\): a single scaled identity matrix, so classification is effectively by distance to the group means
From these special cases, you can see that the \(\lambda\) parameter ranges from QDA to LDA when \(\gamma=0\); it controls how much the within-group variances are pooled.
The \(\gamma\) parameter itself is called a ‘regularization’ parameter. It defines the scale of the covariance matrix:
\[COV_\gamma = (1-\gamma)COV + \gamma I \]
Where COV is the covariance matrix and I is the identity matrix (1 along the diagonal and 0 everywhere else). So \(\gamma=1\) replaces the covariance matrix with the identity matrix.
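A minimal sketch of that equation in R, using our scaled features and a hypothetical value of \(\gamma\):

gamma <- 0.5 # hypothetical value
X <- as.matrix(ScaledSub)
CovReg <- (1 - gamma) * cov(X) + gamma * diag(ncol(X)) # regularized covariance matrix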
One BIG advantage of the RDA over LDA is that it can be applied to data where the number of features is much larger than the number of observations. The reason is that LDA (like PCA) works on the covariance matrix, which cannot be inverted when Features > Observations. Instead of the raw covariance (or correlation) matrix, the RDA uses the regularized covariance matrix.
The important question is: how do we choose values for \(\gamma\) and \(\lambda\)? The beauty of machine learning is that we can let the data tell us, as part of the model training process.
First, we’ll load the klaR and caret libraries, which include tools for RDA and machine learning, respectively:
library(klaR)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
We have two tuning parameters that we want to ‘optimize’. One way to do this is to set a range of values for each, run a model with each combination of parameter values, and then compare the fit of the different models. This is called a grid search. Think of the parameters as a grid, say with \(\lambda\) on the x-axis and \(\gamma\) on the y-axis (or more axes for more parameters) – each grid square/cube/hypercube is a specific set of parameter values.
The problem with a grid search is that the number of models increases exponentially with the number of parameters and the degree of precision we want in the parameters (e.g. 0.5 vs. 0.1 vs. 0.01, etc.). There are more efficient methods for more complicated models.
But for an RDA, we don’t care so much about getting high-precision values since we just want to find a ‘good’ model even if it is not the ‘best’ model.
Instead of defining regularly spaced intervals, we can randomly select values between 0 and 1 for each parameter, run the model, and then measure its performance. This is called a random search; the sketch below contrasts the two approaches.
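To make the contrast concrete, here is a quick sketch of generating the two kinds of candidate parameter sets by hand (caret automates this for us below):

GridSearch <- expand.grid(gamma=seq(0, 1, by=0.25),
                          lambda=seq(0, 1, by=0.25)) # grid search: 5 x 5 = 25 combinations
set.seed(123)
RandSearch <- data.frame(gamma=runif(24), lambda=runif(24)) # random search: 24 random combinations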
trainControl
Rather than write some long code to do a parameter search, we can use the train function from the caret package to automate a lot of this.
The first step is to make a trainControl object specifying some of the details of the training options that we want to use:
CTL <- trainControl(method="repeatedcv",
                    number=4,
                    repeats=24,
                    classProbs=T,
                    verboseIter=F,
                    search="random")
method="LOOCV"
– the method of sampling from the dataset. We are using the Leave-One-Out Cross Validation (LOOCV) approach because we don’t have a very big dataset. If we had thousands or millions of rows, we might want to do a repeated k-fold cross-validation (see Intro to ML Tutorial).classProbs=T
– specifies that we want to use the class probabilities specific to each subsample (rather than use the global probability)verboseIter=F
– the word verbose is often used in programming to mean “Give the user more detailed feedback as the program is running”. This is handy for debugging and also to make sure the program is running and you computer is not just ‘stuck’. On the other hand, it adds to the runtime, which can significantly slow down large analyses. You can try re-running this code and the RDA with verboseIter=T
to compare the output.search="random"
– specifies a random search (alternative = grid
)To understand what we’ve made, look at the structure of the object. Note the extra (unspecified) parameters are just thee default:
str(CTL)
## List of 27
## $ method : chr "repeatedcv"
## $ number : num 4
## $ repeats : num 24
## $ search : chr "random"
## $ p : num 0.75
## $ initialWindow : NULL
## $ horizon : num 1
## $ fixedWindow : logi TRUE
## $ skip : num 0
## $ verboseIter : logi FALSE
## $ returnData : logi TRUE
## $ returnResamp : chr "final"
## $ savePredictions : logi FALSE
## $ classProbs : logi TRUE
## $ summaryFunction :function (data, lev = NULL, model = NULL)
## $ selectionFunction: chr "best"
## $ preProcOptions :List of 6
## ..$ thresh : num 0.95
## ..$ ICAcomp : num 3
## ..$ k : num 5
## ..$ freqCut : num 19
## ..$ uniqueCut: num 10
## ..$ cutoff : num 0.9
## $ sampling : NULL
## $ index : NULL
## $ indexOut : NULL
## $ indexFinal : NULL
## $ timingSamps : num 0
## $ predictionBounds : logi [1:2] FALSE FALSE
## $ seeds : logi NA
## $ adaptive :List of 4
## ..$ min : num 5
## ..$ alpha : num 0.05
## ..$ method : chr "gls"
## ..$ complete: logi TRUE
## $ trim : logi FALSE
## $ allowParallel : logi TRUE
train()
The train function does the heavy lifting, with the above object as input for the trControl parameter. Note that this will take a few seconds to run. If it is taking too long, reduce the tuneLength parameter to a smaller number like 3 or 4.
Since we are using a ‘random’ search, we will use set.seed() to make the results reproducible. See the R Crash Course Tutorial if you need a refresher on this.
set.seed(123)
randomRDA <- train(x=ScaledSub, y=RDAResp$NewGroup,
                   method="rda",
                   metric="Accuracy",
                   tuneLength = 24,
                   trControl=CTL)
Most of these parameters are self-explanatory; tuneLength sets the number of random parameter combinations to try.
Now we can inspect the output and plot the parameters:
randomRDA
## Regularized Discriminant Analysis
##
## 221 samples
## 14 predictor
## 2 classes: 'COVID19', 'NonCOV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 24 times)
## Summary of sample sizes: 167, 165, 165, 166, 166, 166, ...
## Resampling results across tuning parameters:
##
## gamma lambda Accuracy Kappa
## 0.04205953 0.41454634 0.8486348 0.5550309
## 0.04555650 0.14711365 0.8439167 0.5335700
## 0.10292468 0.31818101 0.8429900 0.5294530
## 0.24608773 0.14280002 0.8280944 0.4663718
## 0.32792072 0.41372433 0.8320140 0.4814076
## 0.40897692 0.54406602 0.8357953 0.4985482
## 0.45333416 0.47779597 0.8274951 0.4675415
## 0.45661474 0.79546742 0.8487937 0.5535279
## 0.52810549 0.96302423 0.8546381 0.5803430
## 0.55143501 0.69070528 0.8376590 0.5156362
## 0.57263340 0.21640794 0.8118554 0.4069088
## 0.64050681 0.23303410 0.8069371 0.3959042
## 0.65570580 0.26597264 0.8067578 0.3983065
## 0.67757064 0.75845954 0.8376352 0.5248663
## 0.69280341 0.13880606 0.7986641 0.3696990
## 0.70853047 0.85782772 0.8446127 0.5505070
## 0.88301740 0.59414202 0.8121891 0.4523258
## 0.88953932 0.15244475 0.7873601 0.3545434
## 0.89241904 0.90229905 0.8353410 0.5369691
## 0.89982497 0.23162579 0.7894196 0.3656448
## 0.94046728 0.28915974 0.7899674 0.3741002
## 0.95450365 0.36884545 0.7931533 0.3879286
## 0.95683335 0.02461368 0.7751258 0.3272864
## 0.99426978 0.46596245 0.7980647 0.4148300
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.5281055 and lambda
## = 0.9630242.
ggplot(randomRDA) + theme(legend.position="bottom")
The size of the point corresponds to the accuracy of the model. Here we can see that there isn’t much variation as long as \(\gamma\) is less than 0.6 and \(\lambda\) is above 0.5.
If we set \(\gamma=0\) and \(\lambda=1\) then it is just the LDA by definition, as described above. Since the LDA is also more interpretable, it seems like a good choice.
rda()
In other cases, we may want to run the RDA with parameters chosen from the above analysis, which is done using the rda function from the klaR package:
RDAmod <- rda(x=ScaledSub, grouping=RDAResp$NewGroup,
              regularization=c(gamma=0.3, lambda=0.9))
summary(RDAmod)
## Length Class Mode
## call 4 -none- call
## regularization 2 -none- numeric
## classes 2 -none- character
## prior 2 -none- numeric
## error.rate 1 -none- numeric
## varnames 14 -none- character
## means 28 -none- numeric
## covariances 392 -none- numeric
## covpooled 196 -none- numeric
## converged 1 -none- logical
## iter 1 -none- numeric
Compared to the QDA output we have a few more list items: the regularization parameters that we specified, an estimate of the error rate, the individual group covariances, and the covariances pooled across all the samples.
Recall the problem of overfitting from the Intro to ML Tutorial. In the RDA example above, we used cross-validation to fit the parameters. In addition, we can split our data into a training set and a validation set.
Data splitting is as simple as breaking the data into two sets: one used to train the model and the second one used to validate the model. We want to be careful to make sure each subset has enough observations, and a good representation of the data. In the extreme example, we wouldn’t want 90% of the COVID cases in the training dataset and 90% of the NonCOV cases in the validation dataset.
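One hedged way to guard against that kind of imbalance is a stratified split, e.g. with createDataPartition() from the caret package we loaded above, which samples within each class. Below we use a simpler deterministic split instead:

set.seed(123)
TrainIdx <- createDataPartition(RDAResp$NewGroup, p=0.5, list=FALSE) # stratified row indices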
A typical way to break up the dataset is just to select every Nth row. For example, if we want to split the data in half, then we can just split up the odd vs. even rows of data.
Since we are working with multiple datasets, we can define a vector of row numbers and use it as an index for each dataset. A simple way to do this is to divide each row number by 2 with the %% (modulo) operator, which returns the remainder: 1 for odd rows and 0 for even rows.
Rows <- c(1:nrow(RDAResp))
Train <- Rows %% 2 == 1
Validate <- Rows %% 2 == 0
Take a minute to review each object to understand what we did
Now we have a training set (Train) and validation set (Validate):
head(RDAResp[Train,])
## Sample.Name Batch.Number Class.name Sex Age CT NewGroup
## 1 VTM a 1 VTM <NA> <NA> <NA> NonCOV
## 3 VTM a2 1 VTM <NA> <NA> <NA> NonCOV
## 5 VTM b1 2 VTM <NA> <NA> <NA> NonCOV
## 7 VTM b3 2 VTM <NA> <NA> <NA> NonCOV
## 9 VTM c1 3 VTM <NA> <NA> <NA> NonCOV
## 11 VTM c3 3 VTM <NA> <NA> <NA> NonCOV
head(RDAResp[Validate,])
## Sample.Name Batch.Number Class.name Sex Age CT NewGroup
## 2 VTM a1 1 VTM <NA> <NA> <NA> NonCOV
## 4 VTM b 2 VTM <NA> <NA> <NA> NonCOV
## 6 VTM b2 2 VTM <NA> <NA> <NA> NonCOV
## 8 VTM c 3 VTM <NA> <NA> <NA> NonCOV
## 10 VTM c2 3 VTM <NA> <NA> <NA> NonCOV
## 12 SLB3000 1 Control Male 25 Negative NonCOV
Notice the row numbers along the left-hand side of the output.
Now we re-run the code above on the Training dataset and then generate predictions for the Validation dataset. We should start back at the parameter estimation step, specifying the training rows with square brackets: [Train,] and [Train].
set.seed(123)
randomRDA2 <- train(x=ScaledSub[Train,], y=RDAResp$NewGroup[Train],
                    method="rda",
                    metric="Accuracy",
                    tuneLength = 24,
                    trControl=CTL)
Now inspect the graph and select appropriate parameter values:
randomRDA2
## Regularized Discriminant Analysis
##
## 111 samples
## 14 predictor
## 2 classes: 'COVID19', 'NonCOV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 24 times)
## Summary of sample sizes: 84, 83, 83, 83, 83, 83, ...
## Resampling results across tuning parameters:
##
## gamma lambda Accuracy Kappa
## 0.04205953 0.41454634 0.8064512 0.4892385
## 0.04555650 0.14711365 0.8087109 0.4854357
## 0.10292468 0.31818101 0.8113288 0.5059791
## 0.24608773 0.14280002 0.8147321 0.5205551
## 0.32792072 0.41372433 0.8129271 0.5026774
## 0.40897692 0.54406602 0.8083940 0.4863681
## 0.45333416 0.47779597 0.8106674 0.4935945
## 0.45661474 0.79546742 0.8150766 0.4912073
## 0.52810549 0.96302423 0.8124587 0.4714484
## 0.55143501 0.69070528 0.8143463 0.4918643
## 0.57263340 0.21640794 0.8121142 0.5024089
## 0.64050681 0.23303410 0.8109706 0.4972834
## 0.65570580 0.26597264 0.8098683 0.4927514
## 0.67757064 0.75845954 0.8125000 0.4845934
## 0.69280341 0.13880606 0.8105848 0.4965317
## 0.70853047 0.85782772 0.8113839 0.4762306
## 0.88301740 0.59414202 0.8109843 0.4800787
## 0.88953932 0.15244475 0.8106537 0.4902058
## 0.89241904 0.90229905 0.8053351 0.4596441
## 0.89982497 0.23162579 0.8117697 0.4901212
## 0.94046728 0.28915974 0.8109981 0.4859356
## 0.95450365 0.36884545 0.8102403 0.4813170
## 0.95683335 0.02461368 0.8136299 0.4951022
## 0.99426978 0.46596245 0.8094687 0.4742717
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.4566147 and lambda
## = 0.7954674.
ggplot(randomRDA2) + theme(legend.position="bottom")
And finally, run the model on the training dataset:
RDAmod2 <- rda(x=ScaledSub[Train,], grouping=RDAResp$NewGroup[Train],
               regularization=c(gamma=0.3, lambda=0.9))
To generate the confusion matrix, we first have to predict the classes for our Validation set, using the model generated from our Training set:
Pred <- predict(RDAmod2, ScaledSub[Validate,])
Inspect the structure of Pred to see what kind of object we created
Now we have our predictions that we can compare against the observations
CatDat <- data.frame(Obs=as.factor(RDAResp$NewGroup[Validate]),
                     Pred=Pred$class)
table(CatDat)
## Pred
## Obs COVID19 NonCOV
## COVID19 19 9
## NonCOV 4 78
Take a minute to think about what we did and why this represents a very robust analysis. First, we separated the data into two sets and set aside the Validation dataset. Then we used cross-validation on only the Training dataset to train a robust predictive model. Finally, we applied the model back to the Validation dataset to see how well it performs on new data that were not included in the training steps.
How could we improve this even further? Collect new data to test the model!