In the PCA Tutorial we looked at Principal Components Analysis as an example of Unsupervised Machine Learning because there were no a priori groups that we could use to train the model. Instead, we just used PCA to identify uncorrelated (i.e. orthogonal) axes of variation, and then looked at our groups post hoc to see how they mapped onto the major PC axes.
In this tutorial we look at Regularized Discriminant Analysis (RDA) as an extension of PCA that uses a categorical variable for Supervised Learning. In supervised learning, we know the groups ahead of time and try to find a model that distinguishes them. With PCA, we defined component axes based on variation in the features themselves. In our example case, we found that the major axes correspond to genetic population (PC1) and growing environment (PC2). BUT that won’t always be the case.
With RDA, we try to redefine the First Component Axis by changing the axis loadings, applying a linear or nonlinear equation to predict the response variable. The closest analogy is a logistic regression with a binary response variable and multiple continuous predictors. With an RDA we also have a binary response variable and multiple continuous predictors. Remember that we can recode a variable with N categories as N-1 binary variables, as we’ll see in our example below.
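As a quick illustration (a minimal sketch using a made-up factor, not the tutorial data), model.matrix() in R shows how a variable with N categories expands into N-1 binary ‘dummy’ variables:

Group <- factor(c("Control","COVID19","Influenza")) # hypothetical 3-level categorical variable
model.matrix(~ Group) # intercept plus N-1 = 2 binary dummy columns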
The terminology can get a bit confusing. You may see the term Discriminant Analysis (DA), which is really a Linear Discriminant Analysis (LDA), sometimes called a Discriminant Function Analysis (DFA). The LDA is a generalization of Fisher’s Linear Discriminant – the same Fisher who invented ANOVA and Fisher’s Exact Test, and who was also an Evolutionary Biologist who contributed key insights into population genetics.
Later, the LDA was generalized to the Quadratic Discriminant Analysis (QDA), which adds a tuning parameter to allow for unequal variances among the groups. And later still, the QDA was generalized to the Regularized Discriminant Analysis (RDA) by adding a second tuning parameter. We’ll look at these tuning parameters (\(\gamma\) and \(\lambda\)) in the example below.
Load the usual plotting and data management functions:
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
For RDA we will use functions from the MASS package. Don’t forget to install.packages("MASS") the first time you use it.
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.5
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
The data we’ll be working on today come from nasal swabs of patients that were analyzed using a method called metabolomics. This is just a fancy word for chemical analysis using High Performance Liquid Chromatography (HPLC). For our purposes, just know that there are a set of chemicals that we can identify and then measure their concentrations. These concentrations vary by orders of magnitude so this is a prime candidate for scaling our data, as discussed in the Intro to ML Tutorial.
VirDat <- read.csv("https://colauttilab.github.io/Data/ViralMetData.csv", header=T)
head(VirDat)
## Sample.Name Batch.Number Class.name Sex Age CT Gly Ala Ser Histamine
## 1 VTM a 1 VTM <NA> <NA> <NA> 33.6 7.48 5.16 0.183
## 2 VTM a1 1 VTM <NA> <NA> <NA> 33.9 10.50 7.49 0.138
## 3 VTM a2 1 VTM <NA> <NA> <NA> 32.1 19.20 7.70 0.159
## 4 VTM b 2 VTM <NA> <NA> <NA> 28.9 13.00 7.25 0.363
## 5 VTM b1 2 VTM <NA> <NA> <NA> 29.6 11.90 4.63 0.181
## 6 VTM b2 2 VTM <NA> <NA> <NA> 28.1 12.10 2.37 0.219
## Pro Val Thr Taurine Putrescine t4.OH.Pro Leu Ile Asn Asp Gln Glu
## 1 3.99 2.81 4.02 4.70 0.000 0.247 0.00 2.49 4.860 31.1 0.00 5840
## 2 4.20 3.42 6.78 3.87 0.000 0.213 0.00 1.54 5.750 35.2 0.00 5550
## 3 3.66 3.56 3.87 3.60 0.000 0.206 0.00 2.28 7.540 35.4 0.00 5230
## 4 3.95 2.80 5.70 4.27 0.472 3.290 3.46 2.37 3.610 22.9 2.43 0
## 5 4.30 3.12 7.96 3.73 0.510 3.420 5.50 2.10 0.353 25.6 1.97 0
## 6 3.43 2.66 5.91 3.98 0.467 2.910 3.50 2.14 0.967 30.1 1.86 0
## Met His Phe Met.SO Arg Ac.Orn Cit Tyr DOPA total.DMA Kynurenine
## 1 0.0674 2.21 2.81 0.114 6.90 0.0964 0.766 6.03 0.481 0.0000 0.000
## 2 0.0313 2.91 3.11 0.177 6.09 0.1510 1.480 5.78 0.516 0.1260 0.000
## 3 0.1380 2.81 3.16 0.246 6.79 0.1420 0.822 7.08 0.315 0.0000 0.000
## 4 0.3870 2.78 3.03 0.281 6.83 0.0000 0.000 7.11 0.641 0.0798 1.320
## 5 0.6350 1.94 3.17 0.284 6.14 0.1300 0.000 6.60 0.649 0.0000 0.938
## 6 0.5710 2.56 3.15 0.152 6.41 0.0000 0.000 6.49 0.190 0.0548 0.286
## Carnosine Nitro.Tyr Orn Lys Spermidine Spermine Sarcosine Tyramine
## 1 2.91 0.1710 1.230 3.58 0.2740 0.1580 0.0126 0.149
## 2 2.60 0.1620 1.300 3.84 0.0000 0.1350 0.0000 0.157
## 3 2.72 0.1540 0.987 3.80 0.0525 0.1280 0.0282 0.182
## 4 2.55 0.0505 1.240 4.66 1.8400 0.0442 0.1400 0.118
## 5 2.60 0.0620 1.010 3.82 3.0400 0.0403 0.1360 0.104
## 6 2.73 0.0479 0.938 4.31 1.6400 0.0000 0.1550 0.126
## Creatine Betaine Choline TMAO Lactic.acid beta.Hydroxybutyric.acid
## 1 0.895 23.6 2.25 0.151 30.2 0.000
## 2 0.847 21.8 2.46 0.191 22.8 0.000
## 3 0.858 20.5 3.06 0.144 29.1 0.000
## 4 1.030 20.3 2.88 0.683 30.8 0.522
## 5 1.010 21.9 1.31 0.503 36.3 0.512
## 6 0.869 18.5 1.60 0.506 27.2 0.509
## Citric.acid Butyric.acid Propionic.acid Succinic.acid Fumaric.acid
## 1 4.12 0.838 5.99 0.460 0.0476
## 2 4.15 1.040 8.58 0.211 0.0423
## 3 4.07 1.280 10.50 0.538 0.0473
## 4 4.87 0.327 2.60 1.870 0.0740
## 5 4.41 0.627 4.50 1.040 0.0700
## 6 4.26 0.598 4.17 0.782 0.0595
## Pyruvic.acid Isobutyric.acid Hippuric.acid Glucose LYSOC14.0 LYSOC16.1
## 1 3.56 1.200 0.160 28780 0.2513 0.0221
## 2 2.64 1.890 0.115 28971 0.2271 0.0159
## 3 3.77 2.350 0.145 28047 0.3094 0.0239
## 4 27.00 0.643 0.217 36007 0.4712 0.0855
## 5 4.95 1.210 0.223 37108 0.4591 0.0994
## 6 4.60 1.220 0.225 37698 0.4360 0.0806
## LYSOC16.0 LYSOC17.0 LYSOC18.2 LYSOC18.1 LYSOC18.0 LYSOC20.4 LYSOC20.3
## 1 0.1046 0.0293 0.0811 0.1958 0.0980 0.1013 0.9735
## 2 0.0741 0.0468 0.1276 0.1647 0.0944 0.1135 1.3377
## 3 0.1036 0.0444 0.1445 0.1153 0.1211 0.0981 1.3555
## 4 0.1951 0.1390 0.8328 1.0528 0.2628 0.3331 1.2771
## 5 0.1871 0.1907 0.8766 0.4555 0.3154 0.4256 2.4573
## 6 0.1688 0.0983 0.6199 0.5133 0.2453 0.4074 1.7006
## LYSOC24.0 LYSOC26.1 LYSOC26.0 LYSOC28.1 LYSOC28.0 X14.1SMOH X16.1SM X16.0SM
## 1 0.3398 0.3130 0.5302 0.0874 0.2037 0.1319 0.0947 0.1016
## 2 0.3465 0.3562 0.4859 0.0882 0.2694 0.0995 0.0938 0.1280
## 3 0.2983 0.3351 0.5359 0.1080 0.3199 0.1242 0.1380 0.1306
## 4 0.8346 0.7999 1.6933 0.4765 0.9388 0.9707 0.7634 0.8328
## 5 0.9145 1.0149 1.7288 0.5151 1.0953 1.1390 0.9097 1.1899
## 6 0.8560 0.7375 1.5725 0.4250 1.0660 0.7859 0.7659 0.8999
## X16.1SMOH X18.1SM PC32.2AA X18.0SM X20.2SM PC36.0AE PC36.6AA PC36.0AA
## 1 0.0672 0.1107 0.1325 0.1304 0.0349 0.0605 0.1681 0.1674
## 2 0.0629 0.0997 0.1121 0.1185 0.0337 0.1111 0.2039 0.2051
## 3 0.0859 0.1399 0.0993 0.1292 0.0406 0.0868 0.1289 0.1229
## 4 0.8141 0.7329 1.4576 0.8903 0.7397 1.0864 1.2992 1.7285
## 5 1.0085 0.9001 2.0743 0.8559 0.8003 1.8082 1.1285 1.3786
## 6 0.7482 0.7201 1.0170 0.6784 0.7201 1.1852 1.0221 0.6879
## X22.2SMOH X22.1SMOH PC38.6AA PC38.0AA PC40.6AE X24.1SMOH PC40.6AA PC40.2AA
## 1 0.0469 1.0360 0.0433 0.0583 0.0476 0.0355 0.0444 0.0287
## 2 0.0287 0.9698 0.0473 0.0606 0.0280 0.0242 0.0468 0.0276
## 3 0.0452 0.9878 0.0556 0.0606 0.0404 0.0402 0.0339 0.0180
## 4 0.5306 3.1635 0.7399 0.5914 0.7078 0.4832 0.5457 0.4900
## 5 0.7034 2.9238 1.0035 0.8721 0.9650 0.5335 0.7759 0.7727
## 6 0.5170 2.8234 0.4993 0.5475 0.4972 0.4367 0.4655 0.4162
## PC40.1AA C0 C2 C3.1 C3 C4.1 C4 C3OH C5.1 C5
## 1 0.0253 0.7811 0.1734 0.0418 0.0738 0.0514 0.0969 0.1107 0.0328 0.0366
## 2 0.0349 0.8777 0.1731 0.0561 0.0847 0.0711 0.0962 0.1275 0.0388 0.0420
## 3 0.0301 0.7166 0.1827 0.0332 0.0948 0.0797 0.0704 0.0896 0.0511 0.0365
## 4 0.5444 10.3667 1.4544 0.0968 0.3844 0.1764 0.8741 0.1950 0.0871 0.1848
## 5 0.6797 6.5042 1.0957 0.2116 0.4547 0.2367 0.5804 0.1676 0.0914 0.1451
## 6 0.4747 3.1333 0.5489 0.2988 0.4535 0.1486 0.3734 0.0825 0.0549 0.0795
## C4OH C6.1 C6 C5OH C5.1DC C5DC C8 C5MDC C9 C7DC
## 1 0.2561 0.1601 0.4502 2.0265 0.0547 0.0893 0.2407 0.0408 0.0481 0.0499
## 2 0.2341 0.2025 0.5572 2.1246 0.0796 0.1204 0.1818 0.0632 0.0548 0.0753
## 3 0.2128 0.1853 0.5039 2.1108 0.0857 0.0961 0.2002 0.0462 0.0417 0.0436
## 4 0.1919 0.0654 0.6398 6.0530 0.0684 0.0749 0.5027 0.0624 0.1417 0.0418
## 5 0.1918 0.0981 0.3269 7.2123 0.0838 0.0729 0.2413 0.0572 0.2149 0.0388
## 6 0.1338 0.0635 0.2497 7.6669 0.0839 0.0557 0.2117 0.0499 0.2214 0.0404
## C10.2 C10.1 C10 C12.1 C12 C14.2 C14.1 C14 C12DC
## 1 0.07403994 0.3575579 0.2306 0.2175 0.3944 0.0643 0.0124 1.1884 0.3570
## 2 0.04419391 0.3615120 0.2614 0.2451 0.4553 0.0885 0.1427 1.0042 0.4498
## 3 0.03847946 0.3324331 0.2935 0.2501 0.4633 0.0836 0.1344 0.9804 0.4447
## 4 0.24881239 0.2241368 0.3145 0.0611 0.2671 0.0114 -0.1195 0.1990 0.0947
## 5 0.19996217 0.2384443 0.1436 0.0680 0.2248 0.0208 -0.3304 0.1534 0.1300
## 6 0.13533840 0.2486022 0.1176 0.0623 0.1673 0.0135 0.0157 0.1201 0.0939
## C14.2OH C14.1OH C16.2 C16.1 C16 C16.2OH C16.1OH C16OH C18.2 C18.1
## 1 0.0384 0.1149 0.0367 0.0381 0.0338 0.0193 0.0258 0.1279 0.0391 0.0641
## 2 0.0439 0.1248 0.0386 0.0332 0.0346 0.0246 0.0333 0.1222 0.0395 0.0705
## 3 0.0385 0.1227 0.0395 0.0356 0.0513 0.0236 0.0346 0.1254 0.0466 0.0726
## 4 0.0151 0.0222 0.0192 0.0203 0.1364 0.0156 0.0178 0.0893 0.0334 0.0446
## 5 0.0250 0.0271 0.0290 0.0258 0.0812 0.0202 0.0288 0.0999 0.0327 0.0532
## 6 0.0225 0.0254 0.0363 0.0240 0.0528 0.0190 0.0258 0.0855 0.0450 0.0410
## C18 C18.1OH
## 1 0.0382 0.1378
## 2 0.0476 0.1566
## 3 0.0626 0.1735
## 4 0.1363 0.0396
## 5 0.0929 0.0583
## 6 0.0688 0.0648
Note: Quantitative PCR (qPCR) is also known as Real-Time PCR (RT-PCR), which is NOT THE SAME as reverse-transcription PCR (also RT-PCR). See Wikipedia: qPCR/RT-PCR vs. reverse-transcription PCR.
The CT or ‘cycle threshold’ is the number of PCR cycles that have run before the target sequence reaches the detection threshold. So the LARGER the CT, the more PCR cycles, and therefore the LESS template. Less template DNA represents a lower viral load. Therefore: Higher CT = Lower viral load.
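As a rough numerical sketch (assuming ~100% PCR efficiency, so the template doubles every cycle; these CT values are hypothetical):

CT_A <- 20 # patient A
CT_B <- 23 # patient B crosses the detection threshold 3 cycles later
2^(CT_B - CT_A) # patient A has ~8x the starting template (viral load) of patient B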
Looking at a few of the concentrations, we can see characteristic log-normal distributions, with very different median concentrations (compare the x-axes):
qplot(x=Gly,data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(x=Pyruvic.acid,data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
qplot(x=log(Gly+1),data=VirDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
First we should do some quality checks and modify data to fit assumptions of multivariate normality.
One thing we should do is set our batch effects as a factor so that we don’t accidentally analyze it as a numeric variable:
VirDat$Batch.Number <- as.factor(VirDat$Batch.Number)
The next thing we’ll want to do, to make things easier, is create separate data sets for our predictor and response variables.
From above, we see that the first six columns contain the sample information and response variables, and the rest are our features:
RespDat <- VirDat %>%
  select(1:6)
## Error in select(., 1:6): unused argument (1:6)
What does this error mean? It’s a bit tricky, but there is a clue if we look at the help for ?select. If you do that, you will see that select is a function with the same name in BOTH the dplyr AND the MASS libraries. Since MASS was loaded second in our setup above, R assumes that the MASS version of select is the one we want to use. To avoid this error, we use :: to specify the package:
RespDat <- VirDat %>%
  dplyr::select(1:6)
Features <- VirDat %>%
  dplyr::select(-c(1:6))
Now we have two datasets: RespDat containing the response/grouping variables, and Features containing the potential predictors.
Verify that the correct columns are subset in each dataset
Scaling is an important part of QA/QC, but it’s important enough to get its own heading.
What we want to do is scale all of our chemical columns to the same mean and standard deviation. Rather than code a very long pipe to scale each column one at a time, we can take a shortcut now that the features are in their own object and scale every column at once. Really, we should look at each scaled histogram to decide whether the feature should also be log-transformed, but for now we’ll just use the regular scaling:
Scaled <- Features %>%
  mutate_all(scale)
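If the histograms above suggest log-normal features, a hedged variant is to log-transform before scaling, using the same log(x+1) adjustment as in the plots above (a sketch, not run here):

LogScaled <- Features %>%
  mutate_all(~ scale(log(.x + 1)))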
As with PCA, missing data will cause a problem for our analysis, so we should check whether any of our features have missing values:
Scaled %>%
  select_if(function(x) any(is.na(x))) %>%
  names()
## [1] "Putrescine" "Leu" "Asp" "Lactic.acid"
## [5] "Butyric.acid" "Succinic.acid" "Pyruvic.acid"
As with our PCA, we want to impute missing data or else we have to exclude the entire row even if only one value is missing. Since we have scaled everything to a mean of 0, there is a quick and easy way to do this now that we have our features in a separate object:
ScalComp <- Scaled %>%
  mutate(Putrescine = ifelse(is.na(Putrescine),0,Putrescine),
         Leu = ifelse(is.na(Leu),0,Leu),
         Asp = ifelse(is.na(Asp),0,Asp),
         Lactic.acid = ifelse(is.na(Lactic.acid),0,Lactic.acid),
         Butyric.acid = ifelse(is.na(Butyric.acid),0,Butyric.acid),
         Succinic.acid = ifelse(is.na(Succinic.acid),0,Succinic.acid),
         Pyruvic.acid = ifelse(is.na(Pyruvic.acid),0,Pyruvic.acid))
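Listing each affected column works, but with many columns a more compact equivalent (a sketch assuming dplyr >= 1.0 for across()) replaces every NA in one step:

ScalComp <- Scaled %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x)))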
Now check our QA/QC output:
mean(ScalComp$Gly)
## [1] 1.932406e-17
sd(ScalComp$Gly)
## [1] 1
qplot(x=Gly,data=ScalComp)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note the mean = 0 and sd = 1. The data are not perfectly normal, but it’s close enough to continue our analysis. Real-world data are messy.
Looking at the dimensions of our scaled features:
dim(ScalComp)
## [1] 221 124
We can see that we have almost as many columns of predictors (\(k=124\)) as rows of observations (\(n=221\)). If we just throw everything into an RDA we risk ‘overfitting’ our model, just as we would run into a false discovery problem if we ran a separate linear model for each feature as a predictor.
This problem is discussed in the Intro ML Tutorial.
One solution would be to do a PCA and retain only the major axes. Since we have already scaled our variables, we do not need cor=TRUE as we did in the PCA Tutorial.
PCA <- princomp(ScalComp)
summary(PCA)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 6.3821252 5.1867984 2.66306699 2.31134733 2.28112784
## Proportion of Variance 0.3301668 0.2180728 0.05748663 0.04330451 0.04217955
## Cumulative Proportion 0.3301668 0.5482395 0.60572616 0.64903067 0.69121022
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 2.19188178 1.91730212 1.90138064 1.72335124 1.54361558
## Proportion of Variance 0.03894368 0.02979777 0.02930494 0.02407411 0.01931439
## Cumulative Proportion 0.73015390 0.75995167 0.78925661 0.81333072 0.83264512
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 1.36845303 1.25622475 1.18204835 1.17288175 1.093284717
## Proportion of Variance 0.01517968 0.01279197 0.01132591 0.01115093 0.009688784
## Cumulative Proportion 0.84782479 0.86061676 0.87194267 0.88309360 0.892782387
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 1.003211070 0.893831514 0.856258086 0.848119859
## Proportion of Variance 0.008158068 0.006476107 0.005943087 0.005830653
## Cumulative Proportion 0.900940455 0.907416562 0.913359649 0.919190302
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.782919361 0.750597233 0.735602081 0.715453180
## Proportion of Variance 0.004968631 0.004566849 0.004386202 0.004149207
## Cumulative Proportion 0.924158933 0.928725782 0.933111983 0.937261191
## Comp.24 Comp.25 Comp.26 Comp.27
## Standard deviation 0.684060473 0.643251089 0.621057226 0.603620066
## Proportion of Variance 0.003793077 0.003354005 0.003126554 0.002953453
## Cumulative Proportion 0.941054268 0.944408274 0.947534828 0.950488280
## Comp.28 Comp.29 Comp.30 Comp.31
## Standard deviation 0.597518251 0.579210835 0.54923767 0.536302211
## Proportion of Variance 0.002894043 0.002719419 0.00244525 0.002331427
## Cumulative Proportion 0.953382324 0.956101742 0.95854699 0.960878420
## Comp.32 Comp.33 Comp.34 Comp.35
## Standard deviation 0.507611912 0.499457327 0.489584585 0.475509121
## Proportion of Variance 0.002088653 0.002022085 0.001942935 0.001832822
## Cumulative Proportion 0.962967073 0.964989158 0.966932093 0.968764915
## Comp.36 Comp.37 Comp.38 Comp.39
## Standard deviation 0.467444736 0.453521722 0.430367887 0.423244431
## Proportion of Variance 0.001771182 0.001667243 0.001501352 0.001452062
## Cumulative Proportion 0.970536098 0.972203341 0.973704692 0.975156754
## Comp.40 Comp.41 Comp.42 Comp.43
## Standard deviation 0.41543478 0.394554362 0.37686464 0.367478215
## Proportion of Variance 0.00139897 0.001261875 0.00115126 0.001094626
## Cumulative Proportion 0.97655572 0.977817600 0.97896886 0.980063486
## Comp.44 Comp.45 Comp.46 Comp.47
## Standard deviation 0.363294995 0.353788911 0.3401805528 0.3366898090
## Proportion of Variance 0.001069847 0.001014591 0.0009380407 0.0009188881
## Cumulative Proportion 0.981133333 0.982147924 0.9830859647 0.9840048529
## Comp.48 Comp.49 Comp.50 Comp.51
## Standard deviation 0.3327567474 0.3090193922 0.3055648632 0.2927015712
## Proportion of Variance 0.0008975454 0.0007740592 0.0007568495 0.0006944689
## Cumulative Proportion 0.9849023983 0.9856764575 0.9864333070 0.9871277759
## Comp.52 Comp.53 Comp.54 Comp.55
## Standard deviation 0.2840360071 0.2798492127 0.2627268047 0.2516579531
## Proportion of Variance 0.0006539575 0.0006348204 0.0005595147 0.0005133624
## Cumulative Proportion 0.9877817333 0.9884165538 0.9889760684 0.9894894308
## Comp.56 Comp.57 Comp.58 Comp.59
## Standard deviation 0.2487533502 0.2406933129 0.2363567318 0.2323744829
## Proportion of Variance 0.0005015804 0.0004696029 0.0004528336 0.0004377031
## Cumulative Proportion 0.9899910112 0.9904606141 0.9909134477 0.9913511507
## Comp.60 Comp.61 Comp.62 Comp.63
## Standard deviation 0.2277088209 0.221973178 0.2164625404 0.208967530
## Proportion of Variance 0.0004203029 0.000399396 0.0003798116 0.000353965
## Cumulative Proportion 0.9917714537 0.992170850 0.9925506612 0.992904626
## Comp.64 Comp.65 Comp.66 Comp.67
## Standard deviation 0.2073410339 0.2040356432 0.1956967524 0.191130312
## Proportion of Variance 0.0003484763 0.0003374542 0.0003104345 0.000296116
## Cumulative Proportion 0.9932531025 0.9935905567 0.9939009911 0.994197107
## Comp.68 Comp.69 Comp.70 Comp.71
## Standard deviation 0.1853142226 0.1838355514 0.1804110054 0.1757225048
## Proportion of Variance 0.0002783686 0.0002739439 0.0002638328 0.0002502981
## Cumulative Proportion 0.9944754757 0.9947494196 0.9950132524 0.9952635505
## Comp.72 Comp.73 Comp.74 Comp.75
## Standard deviation 0.1729722998 0.1665406314 0.1636398334 0.1601132101
## Proportion of Variance 0.0002425246 0.0002248242 0.0002170605 0.0002078055
## Cumulative Proportion 0.9955060751 0.9957308993 0.9959479598 0.9961557653
## Comp.76 Comp.77 Comp.78 Comp.79
## Standard deviation 0.1548932263 0.1538713272 0.1493139357 0.1418543039
## Proportion of Variance 0.0001944767 0.0001919191 0.0001807188 0.0001631127
## Cumulative Proportion 0.9963502420 0.9965421610 0.9967228798 0.9968859925
## Comp.80 Comp.81 Comp.82 Comp.83
## Standard deviation 0.1403878890 0.1384831961 0.1357036337 0.1324809584
## Proportion of Variance 0.0001597578 0.0001554522 0.0001492745 0.0001422688
## Cumulative Proportion 0.9970457503 0.9972012025 0.9973504770 0.9974927458
## Comp.84 Comp.85 Comp.86 Comp.87
## Standard deviation 0.1318259677 0.1282629147 0.1260543403 0.1211127
## Proportion of Variance 0.0001408655 0.0001333537 0.0001288007 0.0001189
## Cumulative Proportion 0.9976336113 0.9977669649 0.9978957657 0.9980147
## Comp.88 Comp.89 Comp.90 Comp.91
## Standard deviation 0.1199902258 0.1163319481 0.1154746915 0.1135568412
## Proportion of Variance 0.0001167063 0.0001096985 0.0001080877 0.0001045272
## Cumulative Proportion 0.9981313720 0.9982410705 0.9983491582 0.9984536854
## Comp.92 Comp.93 Comp.94 Comp.95
## Standard deviation 0.1126370546 1.081564e-01 1.062043e-01 1.034501e-01
## Proportion of Variance 0.0001028408 9.482162e-05 9.142964e-05 8.674902e-05
## Cumulative Proportion 0.9985565262 9.986513e-01 9.987428e-01 9.988295e-01
## Comp.96 Comp.97 Comp.98 Comp.99
## Standard deviation 1.016806e-01 9.931112e-02 9.791585e-02 9.516176e-02
## Proportion of Variance 8.380665e-05 7.994632e-05 7.771568e-05 7.340534e-05
## Cumulative Proportion 9.989133e-01 9.989933e-01 9.990710e-01 9.991444e-01
## Comp.100 Comp.101 Comp.102 Comp.103
## Standard deviation 9.173623e-02 9.009086e-02 8.454985e-02 8.319505e-02
## Proportion of Variance 6.821572e-05 6.579065e-05 5.794666e-05 5.610449e-05
## Cumulative Proportion 9.992126e-01 9.992784e-01 9.993364e-01 9.993925e-01
## Comp.104 Comp.105 Comp.106 Comp.107
## Standard deviation 8.150336e-02 7.978429e-02 7.786984e-02 7.622690e-02
## Proportion of Variance 5.384604e-05 5.159855e-05 4.915201e-05 4.709982e-05
## Cumulative Proportion 9.994463e-01 9.994979e-01 9.995471e-01 9.995942e-01
## Comp.108 Comp.109 Comp.110 Comp.111
## Standard deviation 7.242232e-02 6.965001e-02 6.890923e-02 6.608172e-02
## Proportion of Variance 4.251553e-05 3.932285e-05 3.849084e-05 3.539691e-05
## Cumulative Proportion 9.996367e-01 9.996760e-01 9.997145e-01 9.997499e-01
## Comp.112 Comp.113 Comp.114 Comp.115
## Standard deviation 6.381898e-02 6.024836e-02 5.825819e-02 5.485355e-02
## Proportion of Variance 3.301432e-05 2.942342e-05 2.751165e-05 2.439002e-05
## Cumulative Proportion 9.997829e-01 9.998123e-01 9.998398e-01 9.998642e-01
## Comp.116 Comp.117 Comp.118 Comp.119
## Standard deviation 0.0540414728 4.982572e-02 4.948327e-02 0.0469600934
## Proportion of Variance 0.0000236732 2.012379e-05 1.984812e-05 0.0000178756
## Cumulative Proportion 0.9998878932 9.999080e-01 9.999279e-01 0.9999457407
## Comp.120 Comp.121 Comp.122 Comp.123
## Standard deviation 4.299296e-02 4.241059e-02 3.986074e-02 3.395811e-02
## Proportion of Variance 1.498295e-05 1.457979e-05 1.287933e-05 9.347377e-06
## Cumulative Proportion 9.999607e-01 9.999753e-01 9.999882e-01 9.999975e-01
## Comp.124
## Standard deviation 1.745566e-02
## Proportion of Variance 2.469876e-06
## Cumulative Proportion 1.000000e+00
Here we can see most of the variation is explained by the first few PC axes. BUT we also know that humans can have a lot of variation in their metabolomic profiles. What if the difference due to infection accounts for only a small amount of variation in metabolites? In that case, we might be excluding valuable data.
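If we did take the PCA route, a minimal sketch would be to subset the scores matrix from princomp (keeping five axes here is an arbitrary choice for illustration):

PCkeep <- as.data.frame(PCA$scores[, 1:5]) # first 5 PC axes as new predictors
dim(PCkeep) # 221 rows, 5 columns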
Since we have a categorical response variable, we have another option for feature selection. Instead of collapsing to a few PC axes, we can quickly look at the predictive power of each feature and only retain those that are reasonably good at distinguishing among groups.
‘Reasonably good’ is a subjective term, and the definition we use will depend on the number of features that we want to include.
Probably the simplest approach is to fit a simple linear model for each feature, with Class.name as the predictor and the feature as the response. Then we can use the R^2 or p-value to define a cutoff. We aren’t too worried about the False Discovery Rate here because we aren’t testing 124 independent hypotheses. Instead, we are just using the fit of the models to decide which features to include in the analysis.
Now the problem is that we would have to write code for 124 different linear models, look at the output, and then decide which p-value to use as a cutoff.
We can use a trick here that we first covered in the Mixed Models Tutorial when we talked about repeated measures data and the ‘long’ format.
In this case, the ‘long’ format means that we collapse all 124 feature columns into two columns: one for the value observed for each feature (i.e. the number in each cell) and a second column specifying which feature it is (i.e. the column name). This is easily done with the pivot_longer() function from the tidyr library:
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
FeatureSel <- ScalComp %>%
  mutate(Class.name=VirDat$Class.name) %>% # Add the Response variable
  pivot_longer(cols=-Class.name,
               names_to="Chem",
               values_to="Conc")
str(FeatureSel)
## tibble [27,404 x 3] (S3: tbl_df/tbl/data.frame)
## $ Class.name: chr [1:27404] "VTM" "VTM" "VTM" "VTM" ...
## $ Chem : chr [1:27404] "Gly" "Ala" "Ser" "Histamine" ...
## $ Conc : num [1:27404, 1] -1.31 -1.23 -1.39 -0.33 -1.12 ...
Here we see just three columns: Class.name, the specific chemical (formerly the column header), and the concentration (formerly the cell value under each column header).
We have 27,404 rows instead of 221 because each of the original rows is repeated 124 times (once for each feature). We can do a lot of things with these data. One obvious one would be to use facets=Chem in a qplot or ggplot to create a separate graph for each of the chemicals, as sketched below. This would be useful for visually inspecting for outliers, log-normal distributions, etc. We might want to output this to a large JPEG or PDF file since it would be hard to squeeze all those graphs into our small plotting area.
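For example, a hedged sketch of that faceted figure, written to a large PDF so that all 124 panels stay legible (the file name is just an example):

AllHist <- ggplot(aes(x=Conc), data=FeatureSel) +
  geom_histogram() +
  facet_wrap(~ Chem, scales="free_x") # one panel per chemical
ggsave("FeatureHistograms.pdf", AllHist, width=40, height=40, limitsize=FALSE)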
Another thing we can do now is calculate summary statistics to make sure we have mean ~ 0 and sd ~ 1, and check the max and min values:
FeatureSel %>%
  group_by(Chem) %>%
  summarize(MeanConc=mean(Conc),
            sd=sd(Conc),
            max=max(Conc),
            min=min(Conc))
## # A tibble: 124 x 5
## Chem MeanConc sd max min
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Ac.Orn -4.33e-17 1 11.8 -0.420
## 2 Ala -4.37e-17 1 5.25 -1.23
## 3 Arg 8.19e-17 1 3.66 -1.34
## 4 Asn -6.19e-17 1 5.29 -1.16
## 5 Asp -2.65e-17 0.995 3.64 -1.47
## 6 beta.Hydroxybutyric.acid 1.25e-17 1 11.6 -0.370
## 7 Betaine -6.06e-17 1 2.71 -2.92
## 8 Butyric.acid -3.10e-17 0.998 11.2 -0.595
## 9 C0 -1.20e-16 1 7.25 -1.23
## 10 C10 7.05e-17 1 2.24 -1.09
## # ... with 114 more rows
We can also run a separate linear model for each chemical using group_by, except that we want to extract the p-values specifically. To think about how to do this, let’s look at a single linear model:
Mod <- lm(ScalComp$Gly ~ RespDat$Class.name)
ModOut <- anova(Mod)
ModOut
## Analysis of Variance Table
##
## Response: ScalComp$Gly
## Df Sum Sq Mean Sq F value Pr(>F)
## RespDat$Class.name 4 60.042 15.0105 20.27 3.427e-14 ***
## Residuals 216 159.958 0.7405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova function gives us the p-values, which we can pull out if we assign the output of anova to an object:
ModOut[1,"Pr(>F)"]
## [1] 3.427345e-14
Or we can just create a nested function call:
anova(lm(ScalComp$Gly ~ RespDat$Class.name))[1,"Pr(>F)"]
## [1] 3.427345e-14
Now we apply this with dplyr, which gives us a slightly weird object that we can convert back to a data.frame:
PVals <- FeatureSel %>%
  group_by(Chem) %>%
  summarize(P = anova(lm(Conc ~ Class.name))[1,"Pr(>F)"]) %>%
  dplyr::select(Chem,P)
head(PVals)
## # A tibble: 6 x 2
## Chem P
## <chr> <dbl>
## 1 Ac.Orn 7.04e- 1
## 2 Ala 2.07e-12
## 3 Arg 9.53e-18
## 4 Asn 3.32e-10
## 5 Asp 1.93e-20
## 6 beta.Hydroxybutyric.acid 1.12e- 1
So how do we find a good cutoff value? The more stringent the cutoff (i.e. the smaller the P), the stronger our prediction, but the fewer features we will retain. We need to find a balance. A good first step is to plot a histogram:
qplot(x=P,data=PVals)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see more than 40 features close to zero. That shows us that there are a lot of features (i.e. biological compounds) that differ among the groups. It’s much better than 124, but still quite a few for this small dataset of just 221 observations. We can try to reduce this further by thinking about the biology of the system and the goal of our model.
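To count exactly how many features fall below a candidate cutoff:

sum(PVals$P < 0.05) # number of features with P < 0.05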
We want to predict the differences among groups:
unique(VirDat$Class.name)
## [1] "VTM" "Control" "COVID19" "Influenza" "RSV"
But we also need to come up with binary groupings to run an RDA, so we could focus on a few different goals: for example, distinguishing infected from uninfected samples, distinguishing COVID19 from everything else, or distinguishing COVID19 from the other respiratory infections (Influenza and RSV).
There are of course many more goals/models we could come up with using different binary combinations of these groups, but these are probably the more interesting ones. For each model we could do a separate feature selection step. Let’s focus on the last one – distinguishing COVID19 patients from other respiratory infections. This means we’ll have to drop VTM and Control, and then combine Influenza and RSV into a single group. Then we re-run our feature selection pipeline as above:
PCOV <- FeatureSel %>%
  filter(Class.name %in% c("COVID19","Influenza","RSV")) %>%
  mutate(NewGroup = replace(Class.name, Class.name == "Influenza", "NonCOV")) %>%
  mutate(NewGroup = replace(NewGroup, Class.name == "RSV", "NonCOV")) %>%
  group_by(Chem) %>%
  summarize(P = anova(lm(Conc ~ NewGroup))[1,"Pr(>F)"]) %>%
  dplyr::select(Chem,P)
qplot(x=P,data=PCOV) + xlim(0,0.1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 102 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Now we are down to ~15 features with P < 0.05, which we can use to define a new Features dataset:
Keep <- PCOV %>%
  filter(P < 0.05)
Keep <- paste(Keep$Chem)
This gives us a vector of chemical names that we can use to select columns:
ScaledSub <- ScalComp %>%
  dplyr::select(all_of(Keep))
names(ScaledSub)
## [1] "Betaine" "C14.1OH" "C16.1OH" "C2" "C7DC"
## [6] "Carnosine" "Cit" "Glucose" "LYSOC16.0" "LYSOC18.0"
## [11] "Met.SO" "Putrescine" "Taurine" "Tyramine"
Now that we have our subset of features, we can run a Linear Discriminant Analysis (LDA). The lda function from the MASS package is pretty straightforward. There are a couple of ways to specify the model. The first is to use an equation similar to a linear model: Y ~ ., where Y is the column name of the categorical variable and . means ‘all other columns of data’; we would also need to specify the data object with data =. Since we have our categorical response variable and features in two different objects, we can instead plug them in without the data = parameter.
We also have to remember to recode our response into the same binary variables as above.
RDAResp <- RespDat %>%
  mutate(NewGroup = replace(Class.name, Class.name != "COVID19", "NonCOV"))
LDAMod <- lda(x=ScaledSub, grouping=RDAResp$NewGroup)
Let’s take a look at the output of the LDAMod. What type of data structure does it create?
str(LDAMod)
## List of 8
## $ prior : Named num [1:2] 0.249 0.751
## ..- attr(*, "names")= chr [1:2] "COVID19" "NonCOV"
## $ counts : Named int [1:2] 55 166
## ..- attr(*, "names")= chr [1:2] "COVID19" "NonCOV"
## $ means : num [1:2, 1:14] 0.1595 -0.0528 0.4789 -0.1587 0.7463 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "COVID19" "NonCOV"
## .. ..$ : chr [1:14] "Betaine" "C14.1OH" "C16.1OH" "C2" ...
## $ scaling: num [1:14, 1] -0.5058 -0.0492 -0.5285 0.4728 0.1058 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:14] "Betaine" "C14.1OH" "C16.1OH" "C2" ...
## .. ..$ : chr "LD1"
## $ lev : chr [1:2] "COVID19" "NonCOV"
## $ svd : num 13.6
## $ N : int 221
## $ call : language lda(x = ScaledSub, grouping = RDAResp$NewGroup)
## - attr(*, "class")= chr "lda"
We can try the summary() function:
summary(LDAMod)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 28 -none- numeric
## scaling 14 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
Unlike a linear model (lm) object, the summary() function for an lda object summarizes the object itself. The left-hand column gives the names of the list items, and ‘Length’ gives the number of elements.
Let’s take a quick look at a few of the list items, which are also explained in the lda help: ?lda.
Counts show the sample size for each group:
LDAMod$counts
## COVID19 NonCOV
## 55 166
Scaling shows the factor loadings. Compare these to the eigenvectors of a Principal Components Analysis. There are some important differences: we only have ONE LD axis for 14 features, whereas PCA would give us 14 axes. The number of axes in an LDA is determined by the number of categories of the response variable, rather than the number of features.
The number of LD axes = Number of Categories - 1
Apart from that, the idea is the same. The loadings for LD1 show how each feature contributes to the LD1 axis.
LDAMod$scaling
## LD1
## Betaine -0.50577476
## C14.1OH -0.04919074
## C16.1OH -0.52854092
## C2 0.47279012
## C7DC 0.10577965
## Carnosine 0.38982307
## Cit -0.05929275
## Glucose 0.18166895
## LYSOC16.0 0.23176779
## LYSOC18.0 -0.53863330
## Met.SO 0.24488153
## Putrescine -0.29620077
## Taurine -0.35756495
## Tyramine 0.64130142
We can see above that higher values of LD1 are determined largely by higher values of Tyramine and C2, and lower values of LYSOC18.0, C16.1OH, Betaine, and Taurine. This might also tell us something about the pathology of the disease, but here we are just focusing on the data analysis.
Another important distinction from the princomp output is that there are no scores in the LDA output. To find the scores, we use a different approach: the predict function with the LDA object generates this additional info:
PredOut <- predict(LDAMod)
summary(PredOut)
## Length Class Mode
## class 221 factor numeric
## posterior 442 -none- numeric
## x 221 -none- numeric
Note that we have a class prediction and an x prediction, and both have the same length as the number of observations in our dataset.
The x object here is the predicted score for the LD axis:
head(PredOut$x)
## LD1
## [1,] 1.089149
## [2,] 1.142026
## [3,] 1.033887
## [4,] 2.002258
## [5,] 1.425790
## [6,] 1.625491
and the class object is the predicted category:
head(PredOut$class)
## [1] NonCOV NonCOV NonCOV NonCOV NonCOV NonCOV
## Levels: COVID19 NonCOV
We can generate a confusion matrix by comparing the predicted vs. observed.
CatDat <- data.frame(Observed=as.factor(RDAResp$NewGroup), Predicted=PredOut$class)
table(CatDat)
## Predicted
## Observed COVID19 NonCOV
## COVID19 36 19
## NonCOV 7 159
This is called a confusion matrix and it is related to the false discovery rates covered in the Advanced LM Tutorial.
Be careful with the rows and columns. By convention, the prediction is usually given as columns and the observed/actual given as rows, but it’s worth checking.
Using our LDA model, we have 36 True Positive, 159 True Negative, 7 False Positive, and 19 False Negative cases. We can calculate the model accuracy, specificity, and sensitivity using these values.
Calculate the Accuracy, Specificity, and Sensitivity of this LDA model
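One way to check your answers (a sketch using the counts from the confusion matrix above, treating COVID19 as the ‘positive’ class):

TP <- 36; FN <- 19; FP <- 7; TN <- 159 # counts from the table above
(TP + TN) / (TP + TN + FP + FN) # Accuracy: ~0.88
TP / (TP + FN) # Sensitivity (true positive rate): ~0.65
TN / (TN + FP) # Specificity (true negative rate): ~0.96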
The posterior object gives the posterior probability for each category. This is a concept from Bayesian statistics that isn’t covered here. For now, just know that these are the probabilities for assigning each observation to each group:
Post <- as.data.frame(PredOut$posterior)
head(Post)
## COVID19 NonCOV
## 1 0.010602495 0.9893975
## 2 0.009490279 0.9905097
## 3 0.011902721 0.9880973
## 4 0.001548195 0.9984518
## 5 0.005226896 0.9947731
## 6 0.003430964 0.9965690
The rows add up to 1, and the columns represent the probability that each individual (row) belongs to that category. For example, we see in all 6 cases that there is predicted to be a >98% chance that these individuals belong to the NonCOV group. We can get a sense of how well our model performs by plotting these:
Post$Group <- RDAResp$NewGroup
qplot(x=COVID19, y=NonCOV, colour=Group, data=Post, alpha=I(0.3))
A perfect model would have all of the COVID samples in the lower right corner and all of the NonCOV samples in the top left corner.
The X-axis of this graph is the predicted probability that the patient has COVID19. Let’s compare this to the LD1 scores:
Post$LD1 <- as.vector(PredOut$x)
qplot(LD1, COVID19, data=Post)
We can see how the probability is a nonlinear function of LD1. Now imagine a new patient comes in and we look at their metabolite profile to predict whether they have COVID19. We can define the probability, but policy is usually based on a firm decision. Ultimately we have to classify this patient one way or the other, and that will have real-world consequences for the patient and the potential spread of infection.
Which value of LD1 should we use to categorize the patient?
One answer might be the point along LD that corresponds to 0.5 on the y-axis.
On the other hand, we might want to err on the side of caution to limit our false-negative rate (i.e. patients with COVID who are told they don’t have it).
There may be other (non-COVID) applications where we instead want to err on the side of limiting our false-positive rate.
Regardless, it would be helpful to know how different threshold values of LD1 affect the number of false positives vs. false negatives. This is shown in a graph called the ROC curve.
The Receiver Operating Characteristic (ROC) curve is a measure of model performance based on the trade-off between false positives and false negatives. To trace the curve, we slide a threshold value along LD1; at each threshold we build the confusion matrix and calculate the true positive rate (sensitivity: the proportion of actual positives correctly classified) and the false positive rate (1 - specificity: the proportion of actual negatives incorrectly classified).
Rather than calculate this by hand, we can use the pROC package:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
plot.roc(Group~LD1, data=Post)
## Setting levels: control = COVID19, case = NonCOV
## Setting direction: controls < cases
Let’s take a moment to think about what this curve means. If our LD axis were a random predictor, we would expect predictions along the grey line. Our actual model is the dark black line.
Now looking at the x-axis, you can see we start on the left-hand side with 100% specificity and 0% sensitivity. As we move right, we quickly increase sensitivity with only a tiny decrease in specificity, up to about 0.8 on the y-axis. After this, we have to sacrifice a lot of specificity for a small increase in sensitivity.
Overall, this is a pretty good model.
What would the curve look like for a super-accurate model?
We can measure the Area Under the Curve (AUC), also called the Area Under the ROC (AUROC), as a measure of the performance of the model. The random model is a simple triangle (grey line, above), which shows that the minimum AUC is 0.5. As our model improves, we get something closer to a square (100% sensitivity across all values of specificity), giving us the maximum value of 1.
AUC ranges from 0.5 to 1.
We can use the auc function from the same pROC package to calculate the area under the curve:
auc(Group~LD1, data=Post)
## Setting levels: control = COVID19, case = NonCOV
## Setting direction: controls < cases
## Area under the curve: 0.9298
What if we want to expand our model to look at all of the different classes, not just COVID vs nonCOV? A good approach is to set different binary axes and run a separate LDA for each.
On the other hand, if we want to throw everything into an LDA, we can do that too:
LDAMult <- lda(x=ScalComp, grouping=RDAResp$Class.name)
summary(LDAMult)
## Length Class Mode
## prior 5 -none- numeric
## counts 5 -none- numeric
## means 620 -none- numeric
## scaling 496 -none- numeric
## lev 5 -none- character
## svd 4 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
Note that we now have 5 categories and 4 LD axes. Each LD axis is like a mini LDA for each category vs. all of the others grouped together.
Notice in the LDA how our LD1 axis is a linear combination of our features. Just like principal component axes, we can calculate a score for each observation (i.e. patient) by multiplying each standardized feature value by its factor loading and summing across features.
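Since our scaled features already have means near zero, a minimal sketch of this calculation by hand; the result should closely match PredOut$x, up to the centering that predict() applies internally:

manualLD1 <- as.matrix(ScaledSub) %*% LDAMod$scaling # (observations x features) %*% (features x 1)
head(manualLD1)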
But what if factors have nonlinear effects? The QDA is a nonlinear form of discriminant analysis that is useful when the different groups have different variances. In our case, we grouped VTM and uninfected patients together with the Influenza and RSV patients; these are likely more variable than the COVID19 patients.
The Quadratic Discriminant Analysis (QDA) scales the transformation matrix that is used to calculate the LD axes. The ‘Quadratic’ in QDA refers to the predictor, which is quadratic as opposed to the linear LD1 axis in LDA. In R this is as simple as changing the function name:
QDAMod <- qda(x=ScaledSub, grouping=RDAResp$NewGroup)
summary(QDAMod)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 28 -none- numeric
## scaling 392 -none- numeric
## ldet 2 -none- numeric
## lev 2 -none- character
## N 1 -none- numeric
## call 3 -none- call
and the predictions:
QpredOut <- predict(QDAMod)
summary(QpredOut)
## Length Class Mode
## class 221 factor numeric
## posterior 442 -none- numeric
Compare these to LDA. What is missing?
The biggest change is the loss of the linear predictor (x in the LDA prediction output). That’s because of the nonlinear transformation of the data.
How good are the predictions? Let’s look at the confusion matrix:
Qcon <- data.frame(Obs=as.factor(RDAResp$NewGroup), Pred=QpredOut$class)
table(Qcon)
## Pred
## Obs COVID19 NonCOV
## COVID19 36 19
## NonCOV 2 164
Compared to the LDA, this model performs noticeably better: only 2 false positives instead of 7, with the same number of false negatives. On the other hand, we can’t draw an ROC curve or calculate the AUROC because we don’t have a linear predictor.
Comparing the LDA to the QDA is a good example of the trade-off between prediction and interpretability that we covered in the Intro to ML Tutorial. The QDA is better at predicting the category, but the LDA is more interpretable because we can look at the scaling of LD1 to see how the features (biochemicals) map to the prediction (LD1).
Finally, we arrive at the Regularized Discriminant Analysis (RDA), which incorporates elements of both the LDA and QDA.
The analysis requires two ‘tuning parameters’, \(\gamma\) and \(\lambda\). Each ranges from 0 to 1, and together they define a family of models with 4 special cases at the corners of the parameter space:
- \(\gamma=0, \lambda=0\): QDA (each group keeps its own covariance matrix)
- \(\gamma=0, \lambda=1\): LDA (one covariance matrix pooled across groups)
- \(\gamma=1, \lambda=0\): each group’s covariance matrix is replaced by a scaled identity matrix
- \(\gamma=1, \lambda=1\): a single scaled identity matrix, so classification is effectively by distance to the group means
From these special cases, you can see that the \(\lambda\) parameter ranges from QDA to LDA when \(\gamma=0\); it controls how much the within-group variances are pooled.
The \(\gamma\) parameter itself is called a ‘regularization’ parameter. It defines the scale of the covariance matrix:
\[COV_\gamma = (1-\gamma)COV + \gamma I \]
Where COV is the covariance matrix and I is the identity matrix (1 along the diagonal and 0 everywhere else). So \(\gamma=1\) replaces the covariance matrix with the identity matrix.
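A minimal sketch of that equation in R, using our scaled features and a hypothetical value of \(\gamma\):

gamma <- 0.5 # hypothetical value
X <- as.matrix(ScaledSub)
CovReg <- (1 - gamma) * cov(X) + gamma * diag(ncol(X)) # regularized covariance matrix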
One BIG advantage of the RDA over LDA is that it can be applied to data where the number of features is much larger than the number of observations. The reason is that LDA (like PCA) works on the covariance matrix, which cannot be inverted when Features > Observations. Instead of the raw covariance (or correlation) matrix, the RDA uses the regularized covariance matrix.
The important question is: how do we choose values for \(\gamma\) and \(\lambda\)? The beauty of machine learning is that we can let the data tell us, as part of the model training process.
First, we’ll load the klaR and caret libraries, which include tools for RDA and machine learning, respectively:
library(klaR)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
We have two tuning parameters that we want to ‘optimize’. One way to do this is to set a range of values for each, run a model with each combination of parameter values, and then compare the fit of the different models. This is called a grid search. Think of the parameters as a grid, say with \(\lambda\) on the x-axis and \(\gamma\) on the y-axis (or more axes for more parameters) – each grid square/cube/hypercube is a specific set of parameter values.
The problem with a grid search is that the number of models increases exponentially with the number of parameters and the degree of precision we want in the parameters (e.g. 0.5 vs. 0.1 vs. 0.01, etc.). There are more efficient methods for more complicated models.
But for an RDA, we don’t care so much about getting high-precision values since we just want to find a ‘good’ model even if it is not the ‘best’ model.
Instead of defining regularly spaced intervals, we can randomly select values between 0 and 1 for each parameter, run the model, and then measure its performance. This is called a random search; the sketch below contrasts the two approaches.
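To make the contrast concrete, here is a quick sketch of generating the two kinds of candidate parameter sets by hand (caret automates this for us below):

GridSearch <- expand.grid(gamma=seq(0, 1, by=0.25),
                          lambda=seq(0, 1, by=0.25)) # grid search: 5 x 5 = 25 combinations
set.seed(123)
RandSearch <- data.frame(gamma=runif(24), lambda=runif(24)) # random search: 24 random combinations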
trainControl
Rather than write some long code to do a parameter search, we can use the train function from the caret package to automate a lot of this.
The first step is to make a trainControl object specifying some of the details of the training options that we want to use:
CTL <- trainControl(method="repeatedcv",
                    number=4,
                    repeats=24,
                    classProbs=T,
                    verboseIter=F,
                    search="random")
method="LOOCV"
– the method of sampling from the dataset. We are using the Leave-One-Out Cross Validation (LOOCV) approach because we don’t have a very big dataset. If we had thousands or millions of rows, we might want to do a repeated k-fold cross-validation (see Intro to ML Tutorial).classProbs=T
– specifies that we want to use the class probabilities specific to each subsample (rather than use the global probability)verboseIter=F
– the word verbose is often used in programming to mean “Give the user more detailed feedback as the program is running”. This is handy for debugging and also to make sure the program is running and you computer is not just ‘stuck’. On the other hand, it adds to the runtime, which can significantly slow down large analyses. You can try re-running this code and the RDA with verboseIter=T
to compare the output.search="random"
– specifies a random search (alternative = grid
)To understand what we’ve made, look at the structure of the object. Note the extra (unspecified) parameters are just thee default:
str(CTL)
## List of 27
## $ method : chr "repeatedcv"
## $ number : num 4
## $ repeats : num 24
## $ search : chr "random"
## $ p : num 0.75
## $ initialWindow : NULL
## $ horizon : num 1
## $ fixedWindow : logi TRUE
## $ skip : num 0
## $ verboseIter : logi FALSE
## $ returnData : logi TRUE
## $ returnResamp : chr "final"
## $ savePredictions : logi FALSE
## $ classProbs : logi TRUE
## $ summaryFunction :function (data, lev = NULL, model = NULL)
## $ selectionFunction: chr "best"
## $ preProcOptions :List of 6
## ..$ thresh : num 0.95
## ..$ ICAcomp : num 3
## ..$ k : num 5
## ..$ freqCut : num 19
## ..$ uniqueCut: num 10
## ..$ cutoff : num 0.9
## $ sampling : NULL
## $ index : NULL
## $ indexOut : NULL
## $ indexFinal : NULL
## $ timingSamps : num 0
## $ predictionBounds : logi [1:2] FALSE FALSE
## $ seeds : logi NA
## $ adaptive :List of 4
## ..$ min : num 5
## ..$ alpha : num 0.05
## ..$ method : chr "gls"
## ..$ complete: logi TRUE
## $ trim : logi FALSE
## $ allowParallel : logi TRUE
train()
The train function does the heavy lifting, with the above object as input for the trControl parameter. Note that this will take a few seconds to run. If it is taking too long, reduce the tuneLength parameter to a smaller number like 3 or 4.
Since we are using a ‘random’ search, we will use set.seed() to make the results reproducible. See the R Crash Course Tutorial if you need a refresher on this.
set.seed(123)
randomRDA <- train(x=ScaledSub, y=RDAResp$NewGroup,
                   method="rda",
                   metric="Accuracy",
                   tuneLength = 24,
                   trControl=CTL)
Most of these parameters are self-explanatory; tuneLength sets the number of random parameter combinations to try.
Now we can inspect the output and plot the parameters:
randomRDA
## Regularized Discriminant Analysis
##
## 221 samples
## 14 predictor
## 2 classes: 'COVID19', 'NonCOV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 24 times)
## Summary of sample sizes: 167, 165, 165, 166, 166, 166, ...
## Resampling results across tuning parameters:
##
## gamma lambda Accuracy Kappa
## 0.04205953 0.41454634 0.8486348 0.5550309
## 0.04555650 0.14711365 0.8439167 0.5335700
## 0.10292468 0.31818101 0.8429900 0.5294530
## 0.24608773 0.14280002 0.8280944 0.4663718
## 0.32792072 0.41372433 0.8320140 0.4814076
## 0.40897692 0.54406602 0.8357953 0.4985482
## 0.45333416 0.47779597 0.8274951 0.4675415
## 0.45661474 0.79546742 0.8487937 0.5535279
## 0.52810549 0.96302423 0.8546381 0.5803430
## 0.55143501 0.69070528 0.8376590 0.5156362
## 0.57263340 0.21640794 0.8118554 0.4069088
## 0.64050681 0.23303410 0.8069371 0.3959042
## 0.65570580 0.26597264 0.8067578 0.3983065
## 0.67757064 0.75845954 0.8376352 0.5248663
## 0.69280341 0.13880606 0.7986641 0.3696990
## 0.70853047 0.85782772 0.8446127 0.5505070
## 0.88301740 0.59414202 0.8121891 0.4523258
## 0.88953932 0.15244475 0.7873601 0.3545434
## 0.89241904 0.90229905 0.8353410 0.5369691
## 0.89982497 0.23162579 0.7894196 0.3656448
## 0.94046728 0.28915974 0.7899674 0.3741002
## 0.95450365 0.36884545 0.7931533 0.3879286
## 0.95683335 0.02461368 0.7751258 0.3272864
## 0.99426978 0.46596245 0.7980647 0.4148300
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.5281055 and lambda
## = 0.9630242.
ggplot(randomRDA) + theme(legend.position="bottom")
The size of the point corresponds to the accuracy of the model. Here we can see that there isn’t much variation as long as \(\gamma\) is less than 0.6 and \(\lambda\) is above 0.5.
If we set \(\gamma=0\) and \(\lambda=1\) then it is just the LDA by definition, as described above. Since the LDA is also more interpretable, it seems like a good choice.
rda()
In other cases, we may want to run the RDA with parameters chosen from the above analysis, which is done using the rda function from the klaR package:
RDAmod <- rda(x=ScaledSub, grouping=RDAResp$NewGroup,
              regularization=c(gamma=0.3, lambda=0.9))
summary(RDAmod)
## Length Class Mode
## call 4 -none- call
## regularization 2 -none- numeric
## classes 2 -none- character
## prior 2 -none- numeric
## error.rate 1 -none- numeric
## varnames 14 -none- character
## means 28 -none- numeric
## covariances 392 -none- numeric
## covpooled 196 -none- numeric
## converged 1 -none- logical
## iter 1 -none- numeric
Compared to the QDA output we have a few more list items: the regularization parameters that we specified, an estimate of the error rate, the individual group covariances, and the covariances pooled across all the samples.
Recall the problem of overfitting from the Intro to ML Tutorial. In the RDA example above, we used cross-validation to fit the parameters. In addition, we can split our data into a training set and a validation set.
Data splitting is as simple as breaking the data into two sets: one used to train the model and the second one used to validate the model. We want to be careful to make sure each subset has enough observations, and a good representation of the data. In the extreme example, we wouldn’t want 90% of the COVID cases in the training dataset and 90% of the NonCOV cases in the validation dataset.
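One hedged way to guard against that kind of imbalance is a stratified split, e.g. with createDataPartition() from the caret package we loaded above, which samples within each class. Below we use a simpler deterministic split instead:

set.seed(123)
TrainIdx <- createDataPartition(RDAResp$NewGroup, p=0.5, list=FALSE) # stratified row indices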
A typical way to break up the dataset is just to select every Nth row. For example, if we want to split the data in half, then we can just split up the odd vs. even rows of data.
Since we are working with multiple datasets, we can define a vector of row numbers and use it as an index for each dataset. A simple way to do this is to divide each row number by 2 with the %% (modulo) operator, which returns the remainder: 1 for odd rows and 0 for even rows.
Rows <- c(1:nrow(RDAResp))
Train <- Rows %% 2 == 1
Validate <- Rows %% 2 == 0
Take a minute to review each object to understand what we did
Now we have a training set (Train) and validation set (Validate):
head(RDAResp[Train,])
## Sample.Name Batch.Number Class.name Sex Age CT NewGroup
## 1 VTM a 1 VTM <NA> <NA> <NA> NonCOV
## 3 VTM a2 1 VTM <NA> <NA> <NA> NonCOV
## 5 VTM b1 2 VTM <NA> <NA> <NA> NonCOV
## 7 VTM b3 2 VTM <NA> <NA> <NA> NonCOV
## 9 VTM c1 3 VTM <NA> <NA> <NA> NonCOV
## 11 VTM c3 3 VTM <NA> <NA> <NA> NonCOV
head(RDAResp[Validate,])
## Sample.Name Batch.Number Class.name Sex Age CT NewGroup
## 2 VTM a1 1 VTM <NA> <NA> <NA> NonCOV
## 4 VTM b 2 VTM <NA> <NA> <NA> NonCOV
## 6 VTM b2 2 VTM <NA> <NA> <NA> NonCOV
## 8 VTM c 3 VTM <NA> <NA> <NA> NonCOV
## 10 VTM c2 3 VTM <NA> <NA> <NA> NonCOV
## 12 SLB3000 1 Control Male 25 Negative NonCOV
Notice the row numbers along the left-hand side of the output.
Now we re-run the code above on the Training dataset and then generate predictions for the Validation dataset. We should start back at the parameter estimation step, specifying the training rows with square brackets: [Train,] and [Train].
set.seed(123)
randomRDA2 <- train(x=ScaledSub[Train,], y=RDAResp$NewGroup[Train],
                    method="rda",
                    metric="Accuracy",
                    tuneLength = 24,
                    trControl=CTL)
Now inspect the graph and select appropriate parameter values:
randomRDA2
## Regularized Discriminant Analysis
##
## 111 samples
## 14 predictor
## 2 classes: 'COVID19', 'NonCOV'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 24 times)
## Summary of sample sizes: 84, 83, 83, 83, 83, 83, ...
## Resampling results across tuning parameters:
##
## gamma lambda Accuracy Kappa
## 0.04205953 0.41454634 0.8064512 0.4892385
## 0.04555650 0.14711365 0.8087109 0.4854357
## 0.10292468 0.31818101 0.8113288 0.5059791
## 0.24608773 0.14280002 0.8147321 0.5205551
## 0.32792072 0.41372433 0.8129271 0.5026774
## 0.40897692 0.54406602 0.8083940 0.4863681
## 0.45333416 0.47779597 0.8106674 0.4935945
## 0.45661474 0.79546742 0.8150766 0.4912073
## 0.52810549 0.96302423 0.8124587 0.4714484
## 0.55143501 0.69070528 0.8143463 0.4918643
## 0.57263340 0.21640794 0.8121142 0.5024089
## 0.64050681 0.23303410 0.8109706 0.4972834
## 0.65570580 0.26597264 0.8098683 0.4927514
## 0.67757064 0.75845954 0.8125000 0.4845934
## 0.69280341 0.13880606 0.8105848 0.4965317
## 0.70853047 0.85782772 0.8113839 0.4762306
## 0.88301740 0.59414202 0.8109843 0.4800787
## 0.88953932 0.15244475 0.8106537 0.4902058
## 0.89241904 0.90229905 0.8053351 0.4596441
## 0.89982497 0.23162579 0.8117697 0.4901212
## 0.94046728 0.28915974 0.8109981 0.4859356
## 0.95450365 0.36884545 0.8102403 0.4813170
## 0.95683335 0.02461368 0.8136299 0.4951022
## 0.99426978 0.46596245 0.8094687 0.4742717
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0.4566147 and lambda
## = 0.7954674.
ggplot(randomRDA2) + theme(legend.position="bottom")
And finally, run the model on the training dataset:
RDAmod2 <- rda(x=ScaledSub[Train,], grouping=RDAResp$NewGroup[Train],
               regularization=c(gamma=0.3, lambda=0.9))
To generate the confusion matrix, we first have to predict the classes for our Validation set, using the model generated from our Training set:
Pred <- predict(RDAmod2, ScaledSub[Validate,])
Inspect the structure of Pred to see what kind of object we created
Now we have our predictions that we can compare against the observations
CatDat <- data.frame(Obs=as.factor(RDAResp$NewGroup[Validate]),
                     Pred=Pred$class)
table(CatDat)
## Pred
## Obs COVID19 NonCOV
## COVID19 19 9
## NonCOV 4 78
Take a minute to think about what we did and why this represents a very robust analysis. First, we separated the data into two sets and set aside the Validation dataset. Then we used cross-validation on only the Training dataset to train a robust predictive model. Finally, we applied the model back to the Validation dataset to see how well it performs on new data that were not included in the training steps.
How could we improve this even further? Collect new data to test the model!