In the RDA Tutorial we looked at discriminant analysis, a form of classification. In the SVM Tutorial we looked at support vectors as another form of classification.
In this tutorial, we look at another class of classification models called Decision Trees. These trees represent another type of supervised learning because we try to predict a response variable (\(Y\)) from several features (\(X_N\)). The ‘trees’ get their name from the bifurcating structure, which resembles a trunk (root) and branching pattern with nodes connecting pairs of branches.
Classification Trees are decision trees that predict categorical responses, while Regression Trees are decision trees that predict continuous responses. These are computationally very similar, and you will often see them grouped under the acronym CART (Classification and Regression Tree).
As with RDA and SVM, we can use cross validation to look at multiple models. However, the branching pattern of decision trees can change from one set of data to the next. This provides a challenge but also an opportunity to combine multiple (weak) models into a single, large (strong) model.
An Ensemble in machine learning is a group of similar models. Ensemble Learning is a general term for models that are built by combining other models, just as different instruments in a musical ensemble contribute different sounds to make a more complex sound.
Different decision tree models can be generated from the same dataset by cross-validation or random sampling of the input data. Combining many trees generated this way produces Random Forest models, which are among the most powerful models in Machine Learning!
Once we have multiple models, we can combine them in different ways.
When we build separate models based on subsets of our data, we can sample with or without replacement. We can subsample from the full set of features but include all observations – this is called Random Subspaces sampling. Or we can subsample both individual observations (rows) and feature subsets (columns) – this is called Random Patches sampling.
A Random Forest is generally just an ensemble of many decision trees created with many options for subsampling and combining models.
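To make these sampling schemes concrete, here is a minimal base-R sketch using the built-in iris data as a stand-in (the dataset and column choices are just for illustration, not part of this tutorial's analysis):
set.seed(123)
Features <- setdiff(names(iris), "Species") # candidate predictor columns
# Random Subspaces: keep every row, subsample the feature columns
SubSpace <- iris[ , c(sample(Features, 2), "Species")]
# Random Patches: bootstrap the rows AND subsample the feature columns
Rows <- sample(1:nrow(iris), replace=TRUE)
Patch <- iris[Rows, c(sample(Features, 2), "Species")]
Each subsample could then be used to grow its own decision tree before the predictions are combined.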
Load the usual plotting and data management functions:
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
library(tree) # Used to build classification and regression trees
library(rpart) # Adds recursive partitioning for bootstrapping
library(randomForest) # For random forests
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm) # For boosting
## Loaded gbm 2.1.8
For this tutorial, we’ll work on the same data from Lythrum salicaria that we used in the PCA Tutorial.
LythrumDat <- read.csv("https://colauttilab.github.io/Data/ColauttiBarrett2013Data.csv", header=T)
str(LythrumDat)
## 'data.frame': 432 obs. of 23 variables:
## $ Ind : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Site : chr "1_BEF" "1_BEF" "1_BEF" "1_BEF" ...
## $ Row : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Pos : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Mat : chr "C13" "C11" "A16" "C20" ...
## $ Pop : chr "C" "C" "A" "C" ...
## $ Region : chr "NORTH" "NORTH" "NORTH" "NORTH" ...
## $ Flwr07 : int 39285 39293 39286 39305 39293 39291 39287 39293 39287 39303 ...
## $ FVeg07 : num 41 40 32.5 45 47.5 55 51.5 38 51 37.5 ...
## $ InfMass07: num 9.7 6.5 16 1.5 17.4 13.1 3 0.6 0 12.2 ...
## $ Fruits07 : int 625 329 1029 46 1121 504 122 15 0 896 ...
## $ Flwr08 : int 39613 39625 39610 39638 39616 39625 39614 39621 39617 39638 ...
## $ FVeg08 : num 79 117 70 94.5 97 ...
## $ HVeg08 : num 71 108 62.5 87.5 90 124 85 76.5 91 96.5 ...
## $ InfMass08: num 9.7 36.1 28.9 18.2 47.9 30.7 23.3 8.5 28.5 16.6 ...
## $ Flwr09 : int 39980 39993 NA 40000 39986 39995 39990 39993 39996 39997 ...
## $ FVeg09 : num 64 70 NA 84 66 91 80 62 58 99 ...
## $ HVeg09 : num 82 65.5 NA 72 65.5 82 76 60.5 56.5 93.5 ...
## $ InfMass09: num 11.7 10.1 NA 9.6 38.1 16.7 11.1 2.6 1.7 18.8 ...
## $ Flwr10 : int 40345 40350 NA 40369 40353 40357 40350 NA 40358 40357 ...
## $ FVeg10 : num 68.7 83.4 NA 69.9 70.1 97 105 NA 50.8 85.1 ...
## $ HVeg10 : num 66 96 NA 57.5 65.5 77 98 57 49.5 82 ...
## $ InfMass10: num 1.9 7.3 NA 5.7 27.9 4.9 5.3 5.5 1.5 15.3 ...
The important things to remember about the data: the Pop column identifies the population of origin, and the Site column identifies the common garden site where each plant was grown.
As with RDA and SVM, we should make sure that our categorical response variables are factors.
tDat <- LythrumDat %>%
  mutate(Site=as.factor(Site),
         Pop=as.factor(Pop))
Finally, we’ll break this into two datasets. One to predict population and the other to predict garden site.
popDat <- tDat %>%
  select(-c("Ind","Site","Row","Pos","Mat","Region"))
siteDat <- tDat %>%
  select(-c("Ind","Row","Pos","Mat","Pop","Region"))
Looking back at the PC projection figures in that tutorial, we can imagine drawing vertical lines to separate genotypes along PC1 or horizontal lines to separate growing sites along PC2. This is the idea with decision trees – find a threshold value that distinguishes the group(s) of interest.
Note: No need to transform/standardize the data!
A really nice feature of CART analysis is that it makes very few assumptions about the underlying distribution of data. That makes for a very robust analysis.
We can run a CART analysis to try to predict which population a plant comes from based on its phenotype:
PopTree <- tree(Pop ~ ., data=popDat)
PopTree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 190 663.900 J ( 0.105263 0.210526 0.168421 0.236842 0.173684 0.105263 )
## 2) FVeg07 < 45.75 89 257.200 E ( 0.168539 0.314607 0.348315 0.067416 0.000000 0.101124 )
## 4) HVeg10 < 64.3 44 90.090 E ( 0.272727 0.045455 0.590909 0.000000 0.000000 0.090909 )
## 8) FVeg10 < 57 19 7.835 E ( 0.052632 0.000000 0.947368 0.000000 0.000000 0.000000 ) *
## 9) FVeg10 > 57 25 61.060 A ( 0.440000 0.080000 0.320000 0.000000 0.000000 0.160000 )
## 18) FVeg07 < 30 5 5.004 E ( 0.000000 0.200000 0.800000 0.000000 0.000000 0.000000 ) *
## 19) FVeg07 > 30 20 44.890 A ( 0.550000 0.050000 0.200000 0.000000 0.000000 0.200000 )
## 38) Flwr07 < 39309.5 11 16.710 A ( 0.727273 0.090909 0.000000 0.000000 0.000000 0.181818 ) *
## 39) Flwr07 > 39309.5 9 19.100 E ( 0.333333 0.000000 0.444444 0.000000 0.000000 0.222222 ) *
## 5) HVeg10 > 64.3 45 112.900 C ( 0.066667 0.577778 0.111111 0.133333 0.000000 0.111111 )
## 10) Flwr07 < 39307.5 28 42.500 C ( 0.000000 0.714286 0.000000 0.214286 0.000000 0.071429 )
## 20) InfMass10 < 25.3 22 13.400 C ( 0.000000 0.909091 0.000000 0.090909 0.000000 0.000000 )
## 40) Fruits07 < 359 17 0.000 C ( 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 41) Fruits07 > 359 5 6.730 C ( 0.000000 0.600000 0.000000 0.400000 0.000000 0.000000 ) *
## 21) InfMass10 > 25.3 6 7.638 J ( 0.000000 0.000000 0.000000 0.666667 0.000000 0.333333 ) *
## 11) Flwr07 > 39307.5 17 45.550 C ( 0.176471 0.352941 0.294118 0.000000 0.000000 0.176471 )
## 22) InfMass07 < 6.05 7 15.110 A ( 0.428571 0.000000 0.285714 0.000000 0.000000 0.285714 ) *
## 23) InfMass07 > 6.05 10 17.960 C ( 0.000000 0.600000 0.300000 0.000000 0.000000 0.100000 )
## 46) Flwr09 < 40019 5 9.503 E ( 0.000000 0.200000 0.600000 0.000000 0.000000 0.200000 ) *
## 47) Flwr09 > 40019 5 0.000 C ( 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 3) FVeg07 > 45.75 101 287.200 J ( 0.049505 0.118812 0.009901 0.386139 0.326733 0.108911 )
## 6) InfMass10 < 13.9 36 108.900 C ( 0.138889 0.333333 0.000000 0.250000 0.194444 0.083333 )
## 12) HVeg08 < 80.75 10 23.370 A ( 0.500000 0.000000 0.000000 0.100000 0.100000 0.300000 )
## 24) Flwr08 < 39637.5 5 0.000 A ( 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 25) Flwr08 > 39637.5 5 9.503 T ( 0.000000 0.000000 0.000000 0.200000 0.200000 0.600000 ) *
## 13) HVeg08 > 80.75 26 55.010 C ( 0.000000 0.461538 0.000000 0.307692 0.230769 0.000000 )
## 26) FVeg07 < 58.25 18 31.230 C ( 0.000000 0.666667 0.000000 0.166667 0.166667 0.000000 )
## 52) Flwr08 < 39637.5 12 13.500 C ( 0.000000 0.750000 0.000000 0.250000 0.000000 0.000000 ) *
## 53) Flwr08 > 39637.5 6 8.318 C ( 0.000000 0.500000 0.000000 0.000000 0.500000 0.000000 ) *
## 27) FVeg07 > 58.25 8 10.590 J ( 0.000000 0.000000 0.000000 0.625000 0.375000 0.000000 ) *
## 7) InfMass10 > 13.9 65 135.900 J ( 0.000000 0.000000 0.015385 0.461538 0.400000 0.123077 )
## 14) HVeg10 < 77 8 10.590 T ( 0.000000 0.000000 0.000000 0.000000 0.375000 0.625000 ) *
## 15) HVeg10 > 77 57 106.000 J ( 0.000000 0.000000 0.017544 0.526316 0.403509 0.052632 )
## 30) HVeg08 < 92 24 35.310 J ( 0.000000 0.000000 0.000000 0.750000 0.125000 0.125000 )
## 60) InfMass07 < 24.1 9 19.780 J ( 0.000000 0.000000 0.000000 0.333333 0.333333 0.333333 ) *
## 61) InfMass07 > 24.1 15 0.000 J ( 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ) *
## 31) HVeg08 > 92 33 51.300 S ( 0.000000 0.000000 0.030303 0.363636 0.606061 0.000000 )
## 62) Flwr07 < 39297.5 8 6.028 J ( 0.000000 0.000000 0.000000 0.875000 0.125000 0.000000 ) *
## 63) Flwr07 > 39297.5 25 32.960 S ( 0.000000 0.000000 0.040000 0.200000 0.760000 0.000000 )
## 126) Fruits07 < 872 15 25.600 S ( 0.000000 0.000000 0.066667 0.333333 0.600000 0.000000 )
## 252) Fruits07 < 489.5 6 0.000 S ( 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ) *
## 253) Fruits07 > 489.5 9 16.860 J ( 0.000000 0.000000 0.111111 0.555556 0.333333 0.000000 ) *
## 127) Fruits07 > 872 10 0.000 S ( 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ) *
Note that the default output for PopTree is only part of the object, which you can see if you type str(PopTree).
The output above shows the structure of the bifurcating tree from the CART analysis. It’s a bit hard to understand the structure, but luckily we can use the basic plot
function in base R to visualize it and the text
function to add labels.
plot(PopTree)
text(PopTree, cex=0.7)
Use ?text
to see what the cex
parameter does and some other options for formatting.
It’s no ggplot, but it gets the job done. Now we can compare the figure to the output table to identify the nodes, starting with the root (top; FVeg07) and then moving down through its two daughter nodes (HVeg10 and InfMass10).
The bottom of the tree shows the predicted populations (A, C, E, J, S, T). These are called the tips of the tree.
We can also see that the lengths of the branches differ along the y-axis. Branch lengths are proportional to the reduction in deviance at each split, so longer branches represent stronger splits that do more of the work of classifying the observations. In this case, the root split on FVeg07 distinguishes most of the northern populations (A, C, E) from the southern ones (J, S, T).
We can use the summary
function to summarize the model and how well it performs
summary(PopTree)
##
## Classification tree:
## tree(formula = Pop ~ ., data = popDat)
## Variables actually used in tree construction:
## [1] "FVeg07" "HVeg10" "FVeg10" "Flwr07" "InfMass10" "Fruits07"
## [7] "InfMass07" "Flwr09" "HVeg08" "Flwr08"
## Number of terminal nodes: 22
## Residual mean deviance: 1.088 = 182.8 / 168
## Misclassification error rate: 0.2368 = 45 / 190
The top rows show the model structure. The last two rows show the performance of the model.
As with other ML models, we can look at the confusion matrix to test model performance.
CatDat <- data.frame(Obs=popDat$Pop, Pred=predict(PopTree, popDat, type="class"))
table(CatDat)
## Pred
## Obs A C E J S T
## A 16 0 11 18 0 0
## C 3 44 10 23 4 2
## E 2 1 48 34 1 0
## J 0 6 0 72 1 3
## S 0 3 0 53 19 5
## T 4 1 4 32 2 10
One interesting thing to note in this confusion matrix is that the populations are sorted alphabetically, which also corresponds to latitude of origin from north to south. We might therefore expect adjacent populations to be more similar to each other (e.g. A vs C, or E vs J) whereas more distant populations would be more different (e.g. A vs T). The data seem to support that hypothesis.
Note that we have 6 categories instead of the usual 2, so we can't simply talk about true positives and true negatives. Instead, correct classifications fall along the diagonal of the matrix and misclassifications fall in the off-diagonal cells.
Correct classifications are simply where the prediction matches the observation, shown along the diagonal. We can calculate the correct classification rate if we divide by the total.
Correct <- CatDat %>%
  filter(Obs==Pred)
nrow(Correct)/nrow(CatDat)
## [1] 0.4837963
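Equivalently, because the correct classifications sit on the diagonal of the confusion matrix, we could sum the diagonal of the table itself (a small alternative sketch):
ConfMat <- table(CatDat) # rows = observed, columns = predicted
sum(diag(ConfMat)) / sum(ConfMat) # correct classification rate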
To calculate the misclassification rate, we can subtract the value above from 1, or calculate it directly from the data.
MisClass <- CatDat %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat)
## [1] 0.5162037
Why is this so much higher than the summary?
Note the sample size in the summary, compared to the full dataset
nrow(popDat)
## [1] 432
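The summary above reported a sample size of 190, which suggests that rows containing missing values were not used when fitting the tree. A quick check (not part of the original code) of how many fully complete rows we have:
sum(complete.cases(popDat)) # rows with no missing values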
The difference seems to be that the model performs poorly for rows with missing data. A lot of the missing values are from plants that didn't reproduce or died before they could be measured. If we are confident that these should be 0 rather than NA, then we can replace the values in the dataset and re-run the analysis. We'll use a quick shortcut to replace all of the NA values with 0.
popDat[is.na(popDat)] <- 0
PopTree <- tree(Pop ~ ., data=popDat)
plot(PopTree)
text(PopTree, cex=0.7)
summary(PopTree)
##
## Classification tree:
## tree(formula = Pop ~ ., data = popDat)
## Variables actually used in tree construction:
## [1] "HVeg09" "InfMass08" "HVeg08" "FVeg10" "FVeg07" "Fruits07"
## [7] "InfMass10" "InfMass09" "HVeg10" "InfMass07"
## Number of terminal nodes: 18
## Residual mean deviance: 2.17 = 898.6 / 414
## Misclassification error rate: 0.4514 = 195 / 432
We see a different tree structure and the correct sample size, but not really any improvement in the misclassification rate.
We can use k-fold cross-validation with CART to see how the structure of the tree changes depending on the input data. If we compare multiple trees grown from different datasets we may find that some nodes are consistently good at prediction but others are unstable – they only work well for a few particular datasets. We can then remove or ‘prune’ those unstable branches to build a tree that is more robust and less prone to overfitting.
There is a simple function in the tree
package that does all of this at once.
PopPrune <- cv.tree(PopTree, k=24, FUN=prune)
plot(PopPrune)
text(PopPrune, cex=0.7)
Note that there are a few pruning options available that use different algorithms. You can learn more from the help functions:
?prune
?prune.misclass
?prune.rpart
?prune.tree
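For example, here is a minimal sketch (assuming the PopTree object fit above) that uses prune.misclass to compare cross-validated misclassification across tree sizes and then prunes to the best size:
set.seed(123)
PopCV <- cv.tree(PopTree, FUN=prune.misclass, K=10) # cross-validate on misclassification error
plot(PopCV$size, PopCV$dev, type="b",
     xlab="Terminal nodes", ylab="CV misclassifications")
BestSize <- PopCV$size[which.min(PopCV$dev)] # size with the lowest CV error
PopPruned <- prune.misclass(PopTree, best=BestSize)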
For comparison with the unpruned tree, we can again look at the confusion matrix and misclassification rates of our CV (pruned) tree
CatDat2 <- data.frame(Obs=popDat$Pop, Pred=predict(PopPrune, popDat, type="class"))
table(CatDat2)
## Pred
## Obs A C E J S T
## A 16 6 17 1 0 5
## C 13 54 6 11 0 2
## E 11 9 54 3 0 9
## J 2 12 4 46 16 2
## S 0 6 2 22 42 8
## T 7 15 14 13 2 2
MisClass <- CatDat2 %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat2)
## [1] 0.5046296
Note that the misclassification rate is actually LOWER than the original tree, even though it has fewer branches.
Random forests offer more powerful predictions but at the cost of interpretability.
In R we can use the randomForest function from the library of the same name.
Remember that a Random Forest Model is an ensemble model built by combining many individual decision trees.
set.seed(123)
PopFor <- randomForest(Pop ~ ., data=popDat,
                       ntree=100, mtry=3, nodesize=5, importance=TRUE)
There are a lot of parameters in the ?randomForest help. Here are a few:
ntree - the number of trees. The default is 500, which is maybe a bit too high for our small dataset.
mtry - the number of features to include in each subtree. We use 3 since we have few features.
replace - sample with replacement (default = TRUE).
nodesize - the minimum size of terminal nodes. Since we have 6 populations, we keep this small (5) so that the trees can still distinguish them.
importance - this helps us understand which features are important, given that the model is a black box.
Let's take a look at the output.
PopFor
##
## Call:
## randomForest(formula = Pop ~ ., data = popDat, ntree = 100, mtry = 3, nodesize = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 48.84%
## Confusion matrix:
## A C E J S T class.error
## A 10 12 15 0 1 7 0.7777778
## C 7 54 11 6 4 4 0.3720930
## E 7 14 52 3 0 10 0.3953488
## J 0 10 1 48 15 8 0.4146341
## S 0 6 0 25 41 8 0.4875000
## T 3 15 5 8 6 16 0.6981132
Here we see the class-specific error rates, calculated across rows, as well as the overall error rate calculated from the confusion matrix as we did above.
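Note that the error rate reported above is an out-of-bag (OOB) estimate: each tree is evaluated on the observations left out of its bootstrap sample. For comparison, here is a quick sketch (not part of the original analysis) contrasting the OOB predictions with the much more optimistic in-sample predictions:
table(Obs=popDat$Pop, Pred=predict(PopFor)) # OOB predictions (no newdata supplied)
table(Obs=popDat$Pop, Pred=predict(PopFor, popDat, type="response")) # in-sample predictions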
We can also get an idea of how each trait predicts each group
PopFor$importance
## A C E J S
## Flwr07 0.001620796 0.004350114 -0.004928596 -0.007738685 0.01405064
## FVeg07 -0.001631295 0.014988185 0.085166544 0.013185965 0.02567157
## InfMass07 -0.001913708 0.006713826 0.016722010 0.021764600 0.01520955
## Fruits07 -0.005508583 0.011782168 0.015681197 0.016040450 0.01826225
## Flwr08 -0.007992189 -0.001738240 0.018285094 0.010560324 0.01729374
## FVeg08 0.009578825 0.004223542 0.079519446 0.034625217 0.03496456
## HVeg08 0.008657368 0.013508157 0.119803482 0.033634613 0.05813508
## InfMass08 -0.019358255 0.032548006 0.003622360 0.012096420 0.03415076
## Flwr09 0.003275487 0.017736985 0.012515930 0.014014024 0.02550537
## FVeg09 0.011723225 -0.002466410 0.055300191 0.045465054 0.02435035
## HVeg09 0.014442269 -0.006799589 0.086546894 0.055549458 0.04932667
## InfMass09 -0.006096868 -0.002123047 0.021587275 0.054214124 0.02796757
## Flwr10 -0.005039790 0.009952488 0.003826057 0.008124307 0.02330501
## FVeg10 -0.010482669 0.016395718 0.088346795 0.030986298 0.04923777
## HVeg10 -0.019523173 0.016453316 0.093492822 0.023046527 0.00708809
## InfMass10 0.003710101 0.035875783 0.010813311 0.040835638 0.02960191
## T MeanDecreaseAccuracy MeanDecreaseGini
## Flwr07 0.0410079763 0.005910561 10.564895
## FVeg07 0.0369287233 0.030946907 19.224241
## InfMass07 0.0010490946 0.011673132 10.390711
## Fruits07 0.0009674302 0.011306184 10.414331
## Flwr08 0.0396751488 0.012301069 14.447687
## FVeg08 0.0411409429 0.035188562 19.233688
## HVeg08 0.0450593101 0.049839229 26.390087
## InfMass08 0.0540984405 0.020789711 17.565571
## Flwr09 0.0120617109 0.015649941 14.256480
## FVeg09 0.0338967637 0.027956872 16.512516
## HVeg09 0.0257167692 0.039130558 20.526872
## InfMass09 -0.0014795238 0.018473507 14.803727
## Flwr10 0.0076865417 0.008779833 8.886772
## FVeg10 0.0365081833 0.038035191 20.446154
## HVeg10 0.0173974316 0.027198121 18.028531
## InfMass10 0.0130130210 0.024508780 15.818013
The columns A-T show how the accuracy of the model for that particular group is affected by removing each feature. These are averaged in the MeanDecreaseAccuracy
column. The higher the number, the more the accuracy of the model is impacted by this feature. We can look at the individual population columns to see how different features might be important for some populations more than others.
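A quick way to visualize these importance scores is the varImpPlot function from the randomForest package (an optional sketch, not used in the original text):
varImpPlot(PopFor, main="Feature importance for PopFor") # plots MeanDecreaseAccuracy and MeanDecreaseGini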
The above example uses bagging across the number of trees defined by ntree. Bagging ('bootstrap aggregating') just means that we aggregate the results of all of the different models equally.
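To make that idea concrete, here is a rough by-hand sketch of bagging with the tree function (randomForest does this internally, plus the random feature subsetting described above):
set.seed(123)
# Grow 25 trees, each on a bootstrap sample of the rows
Preds <- sapply(1:25, function(i){
  Boot <- popDat[sample(1:nrow(popDat), replace=TRUE), ]
  as.character(predict(tree(Pop ~ ., data=Boot), popDat, type="class"))
})
# 'Bag' the trees: majority vote across the 25 predictions for each plant
BagPred <- apply(Preds, 1, function(x) names(which.max(table(x))))
mean(BagPred == popDat$Pop) # in-sample proportion correctly classified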
Rather than simply averaging the models, we can impose different structures that affect how each subsampled tree contributes to the whole prediction. This kind of model building is called boosting
, which can be implemented with the gbm
function in R from the library of the same name.
There are several common/popular boosting methods, and this opens up a lot of possibilities for how the individual trees are weighted and combined into the final model.
set.seed(123)
PopBoost <- gbm(Pop ~ ., data=popDat,
                distribution="tdist",
                n.trees = 100, interaction.depth=2, cv.folds=12)
PopBoost
## gbm(formula = Pop ~ ., distribution = "tdist", data = popDat,
## n.trees = 100, interaction.depth = 2, cv.folds = 12)
## A gradient boosted model with tdist loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 81.
## There were 16 predictors of which 16 had non-zero influence.
distribution refers to the assumed distribution of the response variable, which is categorical rather than Gaussian, but here we treat it as continuous (a t-distribution).
summary(PopBoost)
## var rel.inf
## InfMass08 InfMass08 18.094556
## InfMass10 InfMass10 13.552674
## HVeg09 HVeg09 11.375573
## HVeg08 HVeg08 9.783843
## Flwr08 Flwr08 7.334747
## FVeg07 FVeg07 6.810870
## Flwr09 Flwr09 6.442719
## Flwr07 Flwr07 5.093869
## FVeg08 FVeg08 4.318357
## FVeg10 FVeg10 3.617387
## HVeg10 HVeg10 2.951587
## InfMass09 InfMass09 2.587825
## Fruits07 Fruits07 2.585792
## FVeg09 FVeg09 2.259377
## InfMass07 InfMass07 1.886983
## Flwr10 Flwr10 1.303841
The relative influence here shows how important each feature is to the accuracy of the model, ranked from highest to lowest.
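If we want to see how the top-ranked feature relates to the boosted prediction, gbm also provides partial dependence plots through its plot method (an optional sketch; the feature name and the 81 trees simply follow the output above):
plot(PopBoost, i.var="InfMass08", n.trees=81) # partial dependence on InfMass08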
CatDat3 <- data.frame(Obs=popDat$Pop, Pred=predict(PopBoost, popDat, type="response"))
## Using 81 trees...
head(CatDat3)
## Obs Pred
## 1 C 2.355361
## 2 C 2.262856
## 3 A 2.190231
## 4 C 2.448284
## 5 T 3.576273
## 6 C 2.953142
Notice that the predictions are fractional rather than categorical. The numbers represent the different categories numerically, which we can see with as.numeric
unique(CatDat3$Obs)
## [1] C A T E J S
## Levels: A C E J S T
unique(as.numeric(CatDat3$Obs))
## [1] 2 1 6 3 4 5
We can plot these to compare the observed categories against the predicted values.
CatDat3$ObsNum <- as.numeric(CatDat3$Obs)
qplot(Pred, ObsNum, data=CatDat3, alpha=I(0.3))
Or we can round
our predictions to find the closest category so that we can calculate our confusion matrix and misclassification rate.
CatDat4 <- data.frame(Obs=CatDat3$ObsNum, Pred=round(CatDat3$Pred,0))
table(CatDat4)
## Pred
## Obs 1 2 3 4 5
## 1 2 27 10 6 0
## 2 0 50 29 7 0
## 3 0 22 49 15 0
## 4 0 2 11 54 15
## 5 0 0 3 30 47
## 6 0 3 11 30 9
MisClass <- CatDat4 %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat4)
## [1] 0.5324074
All that extra work, but not much improvement in the model.
Now try running a similar analysis on the same data to look at predictions for Site and Region.
Then, try making a new variable by pasting Site and Region together to define a new column, and run the models on that column.
What phenotypes are most different among sites vs regions? How well can you predict the site and region based on its phenotype?
Try re-running one or more of the Decision Tree models on the two main PC axes rather than the features themselves.
See if you can improve the models by changing some of the pruning and cross-validation parameters.