By now you should be comfortable with the PCA Tutorial and the LDA/QDA/RDA Tutorial.
Remember that PCA is a type of unsupervised learning because we only include the features in the analysis. We plotted PC axes 1 and 2 and colour-coded based on genotype and environment to see how the PC axes capture structure, but the categories themselves were not involved in the analysis.
Remember also that the LDA, QDA, and RDA are different formulations of discriminant analysis. These are all examples of supervised learning because we included the grouping variable of interest (COVID/non-COVID) in the analysis. We saw how the LDA produces a multivariate axis LD1
that is conceptually very similar to a principal components axis. Once we move to QDA and RDA we lose the interpretability of a linear feature but gain more powerful models that can (but don’t always) improve predictability.
More generally, the LDA, QDA, and RDA try to optimize classification. There are many other kinds of classification models, but we’ll look at three more algorithms that are similar to each other yet different from the LDA/QDA/RDA: the maximal margin classifier (MMC), the support vector classifier (SVC), and the support vector machine (SVM).
Conceptually, the first two (MMC and SVC) are very similar to the models covered in the LM Tutorial, while support vector machines are more like the spline models covered in the GAM Tutorial.
There is one critical difference though. With LM/GAM models we are trying to explain variation in the response variable by fitting lines/curves to the features (aka predictors). But with MMC/SVC/SVM we are trying to find a hyperplane that separates the data into categories.
What is a hyperplane?
The hyperplane is a geometric shape with dimension equal to the number of features minus 1. It represents a ‘flat’ surface that separates the classes of interest.
This can be a bit tough to conceptualize, but here is an example that can help.
Imagine you have two features \(X_1\) and \(X_2\). This means we have 2 dimensions of features, so our hyperplane has dimension of 1, which is just a straight line, as shown in the example below.
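To make the two-feature case concrete, here is a small simulated sketch (toy data, not the viral dataset used below): two groups that differ in both features, separated by a 1-dimensional hyperplane, which is just a straight line. The intercept and slope of the line are chosen by eye for illustration.
library(ggplot2) # plotting library
set.seed(42)
Toy<-data.frame(X1=c(rnorm(50,mean=-2),rnorm(50,mean=2)), # Feature 1
                X2=c(rnorm(50,mean=-2),rnorm(50,mean=2)), # Feature 2
                Group=rep(c("A","B"),each=50)) # class labels
qplot(x=X1,y=X2,colour=Group,data=Toy) +
  geom_abline(intercept=0,slope=-1) # a candidate 1-dimensional hyperplane (a line)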
Now imagine adding a 3rd feature \(X_3\) to the above analysis. We would add a third axis to the graph and then our hyperplane would have 2 dimensions. You can imagine this as your points floating in 3D space with a sheet of paper trying to separate them. In principle, we can keep adding dimensions to both the features and the hyperplane, which is not intuitive for the human brain. But just remember these 1D to 3D examples and know that you can expand to any number of features \(k\) that define your points in \(k\)-dimensional space, which you can try to separate using a \((k-1)\)-dimensional hyperplane.
One other difference between LM/GAM and MMC/SVC/SVM is in the way the models are fit. The lm function in R uses the least squares method, while GAMs typically use maximum likelihood, or restricted maximum likelihood for mixed models.
The MMC, SVC, and SVM use different optimization algorithms that quickly increase in complexity. It’s not important to understand exactly how these work in order to apply the models, but it is helpful to know this: the function used to compare pairs of observations when finding the optimal ‘fit’ or solution is called the kernel. You will see this word a lot in the ML literature, and you can get pretty far by just knowing that it is some kind of mathematical function and/or computational algorithm for fitting the model to the data.
The specific kernels are described below with each method. However, you don’t need to have a deep understanding of these algorithms to apply and interpret the models. In the end, you are just running an algorithm to find a hyperplane that separates your classes of interest. Conceptually, these are not much different from a QDA/RDA, except that you are using different kernels.
Load the usual plotting and data management functions.
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
We’ll use the pROC library for ROC curves and a new library called e1071 for fitting support vector models.
library(pROC) # For ROC curves
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(e1071) # For fitting support vector machines
And we’ll work with the same viral infection data as described in the RDA Tutorial.
SVCdat<-read.csv("https://colauttilab.github.io/Data/ViralMetData.csv", header=T)
The code below is copied from the RDA Tutorial, where we separate our Features so that we can scale them to mean = 0 and sd = 1…
RespDat<-SVCdat %>%
  select(Class.name) %>%
  filter(Class.name %in% c("COVID19","Influenza","RSV")) %>%
  mutate(Class.name = replace(Class.name, Class.name %in% c("Influenza","RSV"), "NonCOV"))
Features<-SVCdat %>%
  filter(Class.name %in% c("COVID19","Influenza","RSV")) %>%
  select(c("Betaine","C14.1OH","C16.1OH","C2","C7DC","Carnosine","Cit","Glucose","LYSOC16.0","LYSOC18.0","Met.SO","Putrescine","Taurine","Tyramine"))
ScalComp<-Features %>%
  mutate_all(scale)
… and replace missing data. Here we use a short-cut to find all NA values anywhere in the data.frame and replace them with 0.
ScalComp[is.na(ScalComp)]<-0
Finally, we’ll change the name to ‘Class’ and move it to the first column.
SVDat<-ScalComp %>%
  mutate(Class=as.factor(RespDat$Class.name)) %>%
  select(Class,everything())
head(SVDat)
## Class Betaine C14.1OH C16.1OH C2 C7DC Carnosine
## 1 COVID19 0.2409352 1.508541 0.8896714 -0.68780381 1.46873340 -0.4845749
## 2 COVID19 1.3505343 2.031696 1.2410205 0.03439207 1.52099421 -0.9322041
## 3 COVID19 -0.9680309 1.611545 0.8741707 -0.89837016 0.05769153 -0.4877496
## 4 COVID19 -1.2330097 1.841950 1.4890316 -0.69995900 1.67777664 -0.7960106
## 5 COVID19 1.7645638 1.882610 1.1376825 -0.10605962 -0.15135171 -0.9322041
## 6 COVID19 -1.3986215 1.101944 0.1353043 -0.97041187 0.10341974 -0.1004392
## Cit Glucose LYSOC16.0 LYSOC18.0 Met.SO Putrescine
## 1 -0.221708173 -0.49290147 1.2129254 1.83929439 -0.08362552 -0.3547133
## 2 0.002385734 -0.40130071 0.3902849 -0.01582664 0.86929911 0.0000000
## 3 -0.278504388 -0.39702779 -0.1046710 0.08039177 -0.97418848 -0.4658633
## 4 -0.312891212 -0.13437663 0.4458326 0.45755940 -0.90291799 -0.4438996
## 5 -0.293959140 -0.08056452 -0.5072743 -0.17936584 -0.21089426 0.4386449
## 6 -0.384948994 -0.55592707 -0.7670555 -0.42520978 -0.99152885 -0.5678285
## Taurine Tyramine
## 1 -0.08819386 -0.2081689
## 2 1.30214703 -1.1750035
## 3 -0.89283996 -1.0543820
## 4 -0.76694295 -1.0231788
## 5 0.40992040 -1.1936322
## 6 -1.41558624 -0.7810044
Note also the as.factor function above, which avoids an error if we try to run the SVM on character data.
The support vector classifier is an extension of the MMC that allows some overlap in points but tries to find the hyperplane that separates most of the points without being unduly influenced by individual points.
For a set of \(k\) features, each observation \(x_i\) is a vector of values for the \(k\) features, with a group assignment \(y_i\) taking a value of -1 or +1.
The support vector classifier is the \((k-1)\)-dimensional hyperplane that best separates the data, found by the optimization below (note the difference from the MMC, described after this).
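The exact notation varies between sources; in the standard ISLR-style formulation, with margin \(M\) and hyperplane coefficients \(\beta_j\), the optimization is:
\[\underset{\beta_0,\dots,\beta_k,\ \epsilon_1,\dots,\epsilon_n,\ M}{\text{maximize}}\ M \quad \text{subject to} \quad \sum_{j=1}^{k}\beta_j^2=1,\]
\[y_i(\beta_0+\beta_1x_{i1}+\dots+\beta_kx_{ik}) \geq M(1-\epsilon_i), \quad \epsilon_i \geq 0, \quad \sum_{i=1}^{n}\epsilon_i \leq C\]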
Here the \(\epsilon_i\) are slack variables that allow individual points to fall on the wrong side of the margin or hyperplane.
\(C\) is a tuning parameter that is typically chosen by cross-validation. A higher \(C\) allows more total slack, so you can think of it as a tolerance parameter: it controls how much the hyperplane ‘tolerates’ misclassified points.
As above, for a set of \(k\) features, each observation \(x_i\) is a vector of values for the \(k\) features, with a group assignment \(y_i\) taking a value of -1 or +1.
The maximal margin classifier (MMC) is the \((k-1)\)-dimensional hyperplane that best separates the data, found by the optimization below.
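In the same ISLR-style notation as for the SVC, but with no slack variables (every point must sit on the correct side of the margin):
\[\underset{\beta_0,\dots,\beta_k,\ M}{\text{maximize}}\ M \quad \text{subject to} \quad \sum_{j=1}^{k}\beta_j^2=1,\]
\[y_i(\beta_0+\beta_1x_{i1}+\dots+\beta_kx_{ik}) \geq M \ \text{ for all } i\]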
This method assumes that it is possible to draw a hyperplane that correctly separates every single point. In reality there may be some overlap, in which case the MMC is not very useful. Compare the linear combination of features in this optimization to the linear equations in the LM Tutorial.
Let’s plot two of our features to take a look
qplot(x=Betaine,y=C2,colour=Class, data=SVDat)
There seems to be a lot of overlap in the data points. However, we are only looking at two features. There might be a higher-dimensional hyperplane that can separate these groups.
The SVC is run with the svm function using a linear kernel. We also have to define a cost parameter, which determines how much overlap we allow our points to have. A higher cost means that we penalize the model more for any points that overlap the hyperplane, so note that it works in the opposite direction to the slack budget \(C\) in the formulation above.
Mod1<-svm(Class~., data=SVDat, kernel="linear",
          cost=10, scale=F)
summary(Mod1)
##
## Call:
## svm(formula = Class ~ ., data = SVDat, kernel = "linear", cost = 10,
## scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 38
##
## ( 19 19 )
##
##
## Number of Classes: 2
##
## Levels:
## COVID19 NonCOV
As we saw in the RDA Tutorial, we can compare the model’s predictions to the observed classes of our response variable to generate a confusion matrix.
CatDat<-data.frame(Obs=SVDat$Class, Pred=predict(Mod1))
table(CatDat)
## Pred
## Obs COVID19 NonCOV
## COVID19 48 7
## NonCOV 3 108
Calculate the model accuracy
Hint: This was covered in the Advanced LM Tutorial
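As a sketch of one way to do this, the accuracy is just the proportion of observations on the diagonal of the confusion matrix:
ConfMat<-table(CatDat) # confusion matrix from above
sum(diag(ConfMat))/sum(ConfMat) # proportion correctly classified (accuracy)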
Let’s try with a smaller cost parameter:
Mod2<-svm(Class~., data=SVDat, kernel="linear", cost=0.1, scale=F)
CatDat2<-data.frame(Obs=SVDat$Class, Pred=predict(Mod2))
table(CatDat2)
## Pred
## Obs COVID19 NonCOV
## COVID19 45 10
## NonCOV 5 106
Now try a higher cost like 100
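For example (a sketch; Mod2b is just a name chosen here for illustration):
Mod2b<-svm(Class~., data=SVDat, kernel="linear", cost=100, scale=F)
table(Obs=SVDat$Class, Pred=predict(Mod2b))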
What cost should we use? We can use cross-validation, just like we did with the tuning parameters in the RDA Tutorial.
The tune
function from the e1071
library controls the cross-validation models, similar to the train
function in the RDA Tutorial. We can do a grid search by defining a range of cost values to test.
set.seed(123)
Mod3<-tune(svm, Class~., data=SVDat, kernel="linear",
           ranges=list(cost=10^c(-3:2)))
We can look at the performance of the models across the different values of the cost parameter:
Mod3$performances
## cost error dispersion
## 1 1e-03 0.3319853 0.05414531
## 2 1e-02 0.2224265 0.07357535
## 3 1e-01 0.1323529 0.05439299
## 4 1e+00 0.1319853 0.06153010
## 5 1e+01 0.1386029 0.04978116
## 6 1e+02 0.1621324 0.04801201
We look for the model with the lowest error, which in this case is a cost of 1, with cost = 0.1 a very close second.
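The object returned by tune also stores the best parameter combination and its cross-validation error directly, which we can check against the table above:
Mod3$best.parameters
Mod3$best.performance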
Now that we have had some practice doing cross-validation, we should also follow the best practice of dividing our data into a training set and a validation set. The latter is a subset of observations (rows) that we set aside during the model training process and then bring it back at the end to test how well the model performs on data it hasn’t ‘seen’ before. The training set is the subset of data used to train the model. Note, however, that we can still use resampling and cross-validation algorithms to train the model using only the training set.
A simple but effective way to split the dataset is to alternate rows, putting every other row in the training set and the remaining rows in the validation set. An easy way to do this is with the modulo (remainder) operator %%, which returns a 0 if there is no remainder and a 1 if there is a remainder. If we divide our row numbers by 2 then odd numbers will have a remainder and even numbers won’t. We can use this to define two indicator vectors that we can use to index our original dataset.
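For example, applied to the first few row numbers:
c(1,2,3,4,5,6) %% 2 # returns 1 0 1 0 1 0 (odd rows flagged 1, even rows flagged 0)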
Train<-c(1:nrow(SVDat)) %% 2
Validate<-1-Train
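Here is one way these vectors might be used to split the data into the two sets (a sketch; the names TrainDat and ValidDat are just chosen here for illustration):
TrainDat<-SVDat[Train==1,] # rows flagged 1, used to train the model
ValidDat<-SVDat[Validate==1,] # held-out rows, used to validate the model
nrow(TrainDat)
nrow(ValidDat)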
Support vector machines are a non-linear extension of support vector classifiers. Here the math is a bit more complicated to write out, but it is conceptually not too bad. Recall that each ‘point’ is defined by a vector \(x_i\) containing its value for each of the \(k\) features.
The SVM compares the inner product of each pair of points. For example, in the 2-d version with vector \(x_1 = [0.2,0.8]\) and vector \(x_2 = [0.3,0.5]\), the inner product is:
v1<-c(0.2,0.8)
v2<-c(0.3,0.5)
v1*v2
## [1] 0.06 0.40
sum(v1*v2) ## <--- This is the inner product
## [1] 0.46
What about the inner product of a point with itself?
sum(c(0.2,0.2)*c(0.2,0.2))
## [1] 0.08
If all of that sounds complicated, here is a very simple way to understand it well enough to run the model on your data:
The difference between the SVC and SVM is the kernel parameter, which is linear for the SVC and nonlinear for the SVM. To fit an SVM we again use the svm() function with the same code but a nonlinear kernel parameter. We’ll try the sigmoid kernel, which requires optimizing both the cost and gamma tuning parameters, as shown in the equation in the ?svm help:
\[K(x_i,x_j) = tanh(\gamma x_i' x_j + \tau)\]
To reduce the number of parameters to train, we can keep \(\tau = 0\) (aka coef0) for a sigmoid kernel (see ?svm).
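To connect this formula to the inner products we calculated above, here is a small sketch using the toy vectors v1 and v2 (gamma = 0.5 is an arbitrary value chosen just for illustration):
gamma<-0.5
tanh(gamma * sum(v1*v2) + 0) # sigmoid kernel value for v1 and v2, with tau (coef0) = 0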
set.seed(123)
Mod4<-tune(svm, Class~., data=SVDat, kernel="sigmoid",
           ranges=list(cost=10^c(-3:2), gamma=c(0.5,1,2,3,4)))
We could print out the performances, but now we have to consider 2 different parameters. An easier way is to graph.
PDat<-Mod4$performances
qplot(x=cost, y=gamma, size=error, data=PDat) + scale_x_log10()
Remember that we want to minimize error, so the smaller dots are better models. In this case the best models are cost = 0.1 with gamma = 0.5 or 1.
Now run the SVM model with the ‘best’ cost and gamma parameters.
Then compare the predicted vs. observed to generate a confusion matrix. Calculate model accuracy, sensitivity and specificity.
Try ‘fine tuning’ your parameters by choosing only values close to cost = 0.1 and gamma = 0.5, as in the sketch below.
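Here is a minimal sketch of those steps, assuming we go with cost = 0.1 and gamma = 0.5 from the grid above (the object names ModBest, ConfBest, and ModFine are just chosen for illustration):
# Refit the SVM with the 'best' tuning parameters
ModBest<-svm(Class~., data=SVDat, kernel="sigmoid", cost=0.1, gamma=0.5, scale=F)
# Confusion matrix of observed vs. predicted classes
ConfBest<-table(Obs=SVDat$Class, Pred=predict(ModBest))
ConfBest
# Accuracy, sensitivity, and specificity (treating COVID19 as the 'positive' class)
sum(diag(ConfBest))/sum(ConfBest) # accuracy
ConfBest["COVID19","COVID19"]/sum(ConfBest["COVID19",]) # sensitivity
ConfBest["NonCOV","NonCOV"]/sum(ConfBest["NonCOV",]) # specificity
# Fine-tune with a narrower grid around the best values from the coarse search
set.seed(123)
ModFine<-tune(svm, Class~., data=SVDat, kernel="sigmoid",
              ranges=list(cost=c(0.05,0.1,0.2,0.5), gamma=c(0.25,0.5,0.75,1)))
ModFine$best.parameters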