In the RDA Tutorial we looked at discriminant analysis, a form of classification. In the SVM Tutorial we looked at support vectors as another form of classification.
In this tutorial, we look at another class of classification models called Decision Trees. These trees represent another type of supervised learning because we try to predict a response variable (\(Y\)) from several features (\(X_N\)). The ‘trees’ get their name from the bifurcating structure, which resembles a trunk (root) and branching pattern with nodes connecting pairs of branches.
Classification Trees are decision trees that predict categorical responses, while Regression Trees are decision trees that predict continuous responses. These are computationally very similar, and you will often see them grouped under the acronym CART (Classification and Regression Tree).
As with RDA and SVM, we can use cross validation to look at multiple models. However, the branching pattern of decision trees can change from one set of data to the next. This provides a challenge but also an opportunity to combine multiple (weak) models into a single, large (strong) model.
An Ensemble in machine learning is a group of similar models. Ensemble Learning is a general term for models that are built by combining other models, just as different instruments in a musical ensemble contribute different sounds to make a more complex sound.
Different decision tree models can be generated from the same dataset by cross-validation or random sampling of the input data. Combining many trees generated this way produces Random Forest models, which are among the most powerful models in Machine Learning!
Once we have multiple models, we can combine them in different ways.
When we build separate models based on subsets of our data, we can sample with or without replacement. We can subsample from the full set of features but include all observations – this is called Random Subspaces sampling. Or we can subsample both individual observations (rows) and feature subsets (columns) – this is called Random Patches sampling.
A Random Forest is generally just an ensemble of many decision trees created with many options for subsampling and combining models.
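To make these sampling schemes concrete, here is a minimal base-R sketch using the built-in iris data as a stand-in (the dataset and column choices are just for illustration, not part of this tutorial's analysis):
set.seed(123)
Features <- setdiff(names(iris), "Species") # candidate predictor columns
# Random Subspaces: keep every row, subsample the feature columns
SubSpace <- iris[ , c(sample(Features, 2), "Species")]
# Random Patches: bootstrap the rows AND subsample the feature columns
Rows <- sample(1:nrow(iris), replace=TRUE)
Patch <- iris[Rows, c(sample(Features, 2), "Species")]
Each subsample could then be used to grow its own decision tree before the predictions are combined.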
Load the usual plotting and data management functions:
library(ggplot2) # plotting library
library(dplyr) # data management
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
library(tree) # Used to build classification and regression trees
library(rpart) # Adds recursive partitioning for bootstrapping
library(randomForest) # For random forests
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm) # For boosting
## Loaded gbm 2.1.8
For this tutorial, we’ll work on the same data from Lythrum salicaria that we used in the PCA Tutorial.
LythrumDat <- read.csv("https://colauttilab.github.io/Data/ColauttiBarrett2013Data.csv", header=T)
str(LythrumDat)
## 'data.frame': 432 obs. of 23 variables:
## $ Ind : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Site : chr "1_BEF" "1_BEF" "1_BEF" "1_BEF" ...
## $ Row : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Pos : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Mat : chr "C13" "C11" "A16" "C20" ...
## $ Pop : chr "C" "C" "A" "C" ...
## $ Region : chr "NORTH" "NORTH" "NORTH" "NORTH" ...
## $ Flwr07 : int 39285 39293 39286 39305 39293 39291 39287 39293 39287 39303 ...
## $ FVeg07 : num 41 40 32.5 45 47.5 55 51.5 38 51 37.5 ...
## $ InfMass07: num 9.7 6.5 16 1.5 17.4 13.1 3 0.6 0 12.2 ...
## $ Fruits07 : int 625 329 1029 46 1121 504 122 15 0 896 ...
## $ Flwr08 : int 39613 39625 39610 39638 39616 39625 39614 39621 39617 39638 ...
## $ FVeg08 : num 79 117 70 94.5 97 ...
## $ HVeg08 : num 71 108 62.5 87.5 90 124 85 76.5 91 96.5 ...
## $ InfMass08: num 9.7 36.1 28.9 18.2 47.9 30.7 23.3 8.5 28.5 16.6 ...
## $ Flwr09 : int 39980 39993 NA 40000 39986 39995 39990 39993 39996 39997 ...
## $ FVeg09 : num 64 70 NA 84 66 91 80 62 58 99 ...
## $ HVeg09 : num 82 65.5 NA 72 65.5 82 76 60.5 56.5 93.5 ...
## $ InfMass09: num 11.7 10.1 NA 9.6 38.1 16.7 11.1 2.6 1.7 18.8 ...
## $ Flwr10 : int 40345 40350 NA 40369 40353 40357 40350 NA 40358 40357 ...
## $ FVeg10 : num 68.7 83.4 NA 69.9 70.1 97 105 NA 50.8 85.1 ...
## $ HVeg10 : num 66 96 NA 57.5 65.5 77 98 57 49.5 82 ...
## $ InfMass10: num 1.9 7.3 NA 5.7 27.9 4.9 5.3 5.5 1.5 15.3 ...
The important things to remember about the data: the Pop column identifies the population of origin, and the Site column identifies the common garden site where each plant was grown.
As with RDA and SVM, we should make sure that our categorical response variables are factors.
tDat <- LythrumDat %>%
  mutate(Site=as.factor(Site),
         Pop=as.factor(Pop))
Finally, we’ll break this into two datasets. One to predict population and the other to predict garden site.
popDat <- tDat %>%
  select(-c("Ind","Site","Row","Pos","Mat","Region"))
siteDat <- tDat %>%
  select(-c("Ind","Row","Pos","Mat","Pop","Region"))
Looking back at the PC projection figures in that tutorial, we can imagine drawing vertical lines to separate genotypes along PC1 or horizontal lines to separate growing sites along PC2. This is the idea with decision trees – find a threshold value that distinguishes the group(s) of interest.
Note: No need to transform/standardize the data!
A really nice feature of CART analysis is that it makes very few assumptions about the underlying distribution of data. That makes for a very robust analysis.
We can run a CART analysis to try to predict which population a plant comes from based on its phenotype:
PopTree <- tree(Pop ~ ., data=popDat)
PopTree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 190 663.900 J ( 0.105263 0.210526 0.168421 0.236842 0.173684 0.105263 )
## 2) FVeg07 < 45.75 89 257.200 E ( 0.168539 0.314607 0.348315 0.067416 0.000000 0.101124 )
## 4) HVeg10 < 64.3 44 90.090 E ( 0.272727 0.045455 0.590909 0.000000 0.000000 0.090909 )
## 8) FVeg10 < 57 19 7.835 E ( 0.052632 0.000000 0.947368 0.000000 0.000000 0.000000 ) *
## 9) FVeg10 > 57 25 61.060 A ( 0.440000 0.080000 0.320000 0.000000 0.000000 0.160000 )
## 18) FVeg07 < 30 5 5.004 E ( 0.000000 0.200000 0.800000 0.000000 0.000000 0.000000 ) *
## 19) FVeg07 > 30 20 44.890 A ( 0.550000 0.050000 0.200000 0.000000 0.000000 0.200000 )
## 38) Flwr07 < 39309.5 11 16.710 A ( 0.727273 0.090909 0.000000 0.000000 0.000000 0.181818 ) *
## 39) Flwr07 > 39309.5 9 19.100 E ( 0.333333 0.000000 0.444444 0.000000 0.000000 0.222222 ) *
## 5) HVeg10 > 64.3 45 112.900 C ( 0.066667 0.577778 0.111111 0.133333 0.000000 0.111111 )
## 10) Flwr07 < 39307.5 28 42.500 C ( 0.000000 0.714286 0.000000 0.214286 0.000000 0.071429 )
## 20) InfMass10 < 25.3 22 13.400 C ( 0.000000 0.909091 0.000000 0.090909 0.000000 0.000000 )
## 40) Fruits07 < 359 17 0.000 C ( 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 41) Fruits07 > 359 5 6.730 C ( 0.000000 0.600000 0.000000 0.400000 0.000000 0.000000 ) *
## 21) InfMass10 > 25.3 6 7.638 J ( 0.000000 0.000000 0.000000 0.666667 0.000000 0.333333 ) *
## 11) Flwr07 > 39307.5 17 45.550 C ( 0.176471 0.352941 0.294118 0.000000 0.000000 0.176471 )
## 22) InfMass07 < 6.05 7 15.110 A ( 0.428571 0.000000 0.285714 0.000000 0.000000 0.285714 ) *
## 23) InfMass07 > 6.05 10 17.960 C ( 0.000000 0.600000 0.300000 0.000000 0.000000 0.100000 )
## 46) Flwr09 < 40019 5 9.503 E ( 0.000000 0.200000 0.600000 0.000000 0.000000 0.200000 ) *
## 47) Flwr09 > 40019 5 0.000 C ( 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 3) FVeg07 > 45.75 101 287.200 J ( 0.049505 0.118812 0.009901 0.386139 0.326733 0.108911 )
## 6) InfMass10 < 13.9 36 108.900 C ( 0.138889 0.333333 0.000000 0.250000 0.194444 0.083333 )
## 12) HVeg08 < 80.75 10 23.370 A ( 0.500000 0.000000 0.000000 0.100000 0.100000 0.300000 )
## 24) Flwr08 < 39637.5 5 0.000 A ( 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ) *
## 25) Flwr08 > 39637.5 5 9.503 T ( 0.000000 0.000000 0.000000 0.200000 0.200000 0.600000 ) *
## 13) HVeg08 > 80.75 26 55.010 C ( 0.000000 0.461538 0.000000 0.307692 0.230769 0.000000 )
## 26) FVeg07 < 58.25 18 31.230 C ( 0.000000 0.666667 0.000000 0.166667 0.166667 0.000000 )
## 52) Flwr08 < 39637.5 12 13.500 C ( 0.000000 0.750000 0.000000 0.250000 0.000000 0.000000 ) *
## 53) Flwr08 > 39637.5 6 8.318 C ( 0.000000 0.500000 0.000000 0.000000 0.500000 0.000000 ) *
## 27) FVeg07 > 58.25 8 10.590 J ( 0.000000 0.000000 0.000000 0.625000 0.375000 0.000000 ) *
## 7) InfMass10 > 13.9 65 135.900 J ( 0.000000 0.000000 0.015385 0.461538 0.400000 0.123077 )
## 14) HVeg10 < 77 8 10.590 T ( 0.000000 0.000000 0.000000 0.000000 0.375000 0.625000 ) *
## 15) HVeg10 > 77 57 106.000 J ( 0.000000 0.000000 0.017544 0.526316 0.403509 0.052632 )
## 30) HVeg08 < 92 24 35.310 J ( 0.000000 0.000000 0.000000 0.750000 0.125000 0.125000 )
## 60) InfMass07 < 24.1 9 19.780 J ( 0.000000 0.000000 0.000000 0.333333 0.333333 0.333333 ) *
## 61) InfMass07 > 24.1 15 0.000 J ( 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 ) *
## 31) HVeg08 > 92 33 51.300 S ( 0.000000 0.000000 0.030303 0.363636 0.606061 0.000000 )
## 62) Flwr07 < 39297.5 8 6.028 J ( 0.000000 0.000000 0.000000 0.875000 0.125000 0.000000 ) *
## 63) Flwr07 > 39297.5 25 32.960 S ( 0.000000 0.000000 0.040000 0.200000 0.760000 0.000000 )
## 126) Fruits07 < 872 15 25.600 S ( 0.000000 0.000000 0.066667 0.333333 0.600000 0.000000 )
## 252) Fruits07 < 489.5 6 0.000 S ( 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ) *
## 253) Fruits07 > 489.5 9 16.860 J ( 0.000000 0.000000 0.111111 0.555556 0.333333 0.000000 ) *
## 127) Fruits07 > 872 10 0.000 S ( 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ) *
Note that the default output for PopTree is only part of the object, which you can see if you type str(PopTree).
The output above shows the structure of the bifurcating tree from the CART analysis. It’s a bit hard to understand the structure, but luckily we can use the basic plot
function in base R to visualize it and the text
function to add labels.
plot(PopTree)
text(PopTree, cex=0.7)
Use ?text
to see what the cex
parameter does and some other options for formatting.
It’s no ggplot, but it gets the job done. Now we can compare the figure to the output table to identify the nodes, starting with the root (top; FVeg07) and then moving down through its two daughter nodes (HVeg10 and InfMass10).
The bottom of the tree shows the predicted populations (A, C, E, J, S, T). These are called the tips of the tree.
We can also see that the lengths of the branches differ along the y-axis. Branch lengths are proportional to the reduction in deviance at each split, so longer branches represent stronger splits that do more of the work of classifying the observations. In this case, the root split on FVeg07 distinguishes most of the northern populations (A, C, E) from the southern ones (J, S, T).
We can use the summary
function to summarize the model and how well it performs
summary(PopTree)
##
## Classification tree:
## tree(formula = Pop ~ ., data = popDat)
## Variables actually used in tree construction:
## [1] "FVeg07" "HVeg10" "FVeg10" "Flwr07" "InfMass10" "Fruits07"
## [7] "InfMass07" "Flwr09" "HVeg08" "Flwr08"
## Number of terminal nodes: 22
## Residual mean deviance: 1.088 = 182.8 / 168
## Misclassification error rate: 0.2368 = 45 / 190
The top rows show the model structure. The last two rows show the performance of the model.
As with other ML models, we can look at the confusion matrix to test model performance.
CatDat <- data.frame(Obs=popDat$Pop, Pred=predict(PopTree, popDat, type="class"))
table(CatDat)
## Pred
## Obs A C E J S T
## A 16 0 11 18 0 0
## C 3 44 10 23 4 2
## E 2 1 48 34 1 0
## J 0 6 0 72 1 3
## S 0 3 0 53 19 5
## T 4 1 4 32 2 10
One interesting thing to note in this confusion matrix is that the populations are sorted alphabetically, which also corresponds to latitude of origin from north to south. We might therefore expect adjacent populations to be more similar to each other (e.g. A vs C, or E vs J) whereas more distant populations would be more different (e.g. A vs T). The data seem to support that hypothesis.
Note that we have 6 categories instead of the usual 2, so we can't simply talk about true positives and true negatives. Instead, correct classifications fall along the diagonal of the matrix and misclassifications fall in the off-diagonal cells.
Correct classifications are simply where the prediction matches the observation, shown along the diagonal. We can calculate the correct classification rate if we divide by the total.
Correct <- CatDat %>%
  filter(Obs==Pred)
nrow(Correct)/nrow(CatDat)
## [1] 0.4837963
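Equivalently, because the correct classifications sit on the diagonal of the confusion matrix, we could sum the diagonal of the table itself (a small alternative sketch):
ConfMat <- table(CatDat) # rows = observed, columns = predicted
sum(diag(ConfMat)) / sum(ConfMat) # correct classification rate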
To calculate the misclassification rate, we can subtract the value above from 1, or calculate it directly from the data.
MisClass <- CatDat %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat)
## [1] 0.5162037
Why is this so much higher than the summary?
Note the sample size in the summary, compared to the full dataset
nrow(popDat)
## [1] 432
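The summary above reported a sample size of 190, which suggests that rows containing missing values were not used when fitting the tree. A quick check (not part of the original code) of how many fully complete rows we have:
sum(complete.cases(popDat)) # rows with no missing values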
The difference seems to be that the model performs poorly for rows with missing data. A lot of the missing values are from plants that didn't reproduce or died before they could be measured. If we are confident that these should be 0 rather than NA, then we can replace the values in the dataset and re-run the analysis. We'll use a quick shortcut to replace all of the NA values with 0.
popDat[is.na(popDat)] <- 0
PopTree <- tree(Pop ~ ., data=popDat)
plot(PopTree)
text(PopTree, cex=0.7)
summary(PopTree)
##
## Classification tree:
## tree(formula = Pop ~ ., data = popDat)
## Variables actually used in tree construction:
## [1] "HVeg09" "InfMass08" "HVeg08" "FVeg10" "FVeg07" "Fruits07"
## [7] "InfMass10" "InfMass09" "HVeg10" "InfMass07"
## Number of terminal nodes: 18
## Residual mean deviance: 2.17 = 898.6 / 414
## Misclassification error rate: 0.4514 = 195 / 432
We see a different tree structure and the correct sample size, but not really any improvement in the misclassification rate.
We can use k-fold cross-validation with CART to see how the structure of the tree changes depending on the input data. If we compare multiple trees grown from different datasets we may find that some nodes are consistently good at prediction but others are unstable – they only work well for a few particular datasets. We can then remove or ‘prune’ those unstable branches to build a tree that is more robust and less prone to overfitting.
There is a simple function in the tree
package that does all of this at once.
PopPrune <- cv.tree(PopTree, k=24, FUN=prune)
plot(PopPrune)
text(PopPrune, cex=0.7)
Note that there are a few pruning options available that use different algorithms. You can learn more from the help functions:
?prune
?prune.misclass
?prune.rpart
?prune.tree
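For example, here is a minimal sketch (assuming the PopTree object fit above) that uses prune.misclass to compare cross-validated misclassification across tree sizes and then prunes to the best size:
set.seed(123)
PopCV <- cv.tree(PopTree, FUN=prune.misclass, K=10) # cross-validate on misclassification error
plot(PopCV$size, PopCV$dev, type="b",
     xlab="Terminal nodes", ylab="CV misclassifications")
BestSize <- PopCV$size[which.min(PopCV$dev)] # size with the lowest CV error
PopPruned <- prune.misclass(PopTree, best=BestSize)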
For comparison with the unpruned tree, we can again look at the confusion matrix and misclassification rates of our CV (pruned) tree
CatDat2 <- data.frame(Obs=popDat$Pop, Pred=predict(PopPrune, popDat, type="class"))
table(CatDat2)
## Pred
## Obs A C E J S T
## A 16 6 17 1 0 5
## C 13 54 6 11 0 2
## E 11 9 54 3 0 9
## J 2 12 4 46 16 2
## S 0 6 2 22 42 8
## T 7 15 14 13 2 2
MisClass <- CatDat2 %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat2)
## [1] 0.5046296
Note that the misclassification rate is actually LOWER than the original tree, even though it has fewer branches.
Random forests offer more powerful predictions but at the cost of interpretability.
In R we can use the randomForest function from the library of the same name.
Remember that a Random Forest Model is an ensemble model built by combining many individual decision trees.
set.seed(123)
PopFor <- randomForest(Pop ~ ., data=popDat,
                       ntree=100, mtry=3, nodesize=5, importance=TRUE)
There are a lot of parameters in the ?randomForest help. Here are a few:
ntree - the number of trees. The default is 500, which is maybe a bit too high for our small dataset.
mtry - the number of features to include in each subtree. We use 3 since we have few features.
replace - sample with replacement (default = TRUE).
nodesize - the minimum size of terminal nodes. Since we have 6 populations, we keep this small (5) so that the trees can still distinguish them.
importance - this helps us understand which features are important, given that the model is a black box.
Let's take a look at the output.
PopFor
##
## Call:
## randomForest(formula = Pop ~ ., data = popDat, ntree = 100, mtry = 3, nodesize = 5, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 48.84%
## Confusion matrix:
## A C E J S T class.error
## A 10 12 15 0 1 7 0.7777778
## C 7 54 11 6 4 4 0.3720930
## E 7 14 52 3 0 10 0.3953488
## J 0 10 1 48 15 8 0.4146341
## S 0 6 0 25 41 8 0.4875000
## T 3 15 5 8 6 16 0.6981132
Here we see the class-specific error rates, calculated across rows, as well as the overall error rate calculated from the confusion matrix as we did above.
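Note that the error rate reported above is an out-of-bag (OOB) estimate: each tree is evaluated on the observations left out of its bootstrap sample. For comparison, here is a quick sketch (not part of the original analysis) contrasting the OOB predictions with the much more optimistic in-sample predictions:
table(Obs=popDat$Pop, Pred=predict(PopFor)) # OOB predictions (no newdata supplied)
table(Obs=popDat$Pop, Pred=predict(PopFor, popDat, type="response")) # in-sample predictions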
We can also get an idea of how each trait predicts each group
PopFor$importance
## A C E J S
## Flwr07 0.001620796 0.004350114 -0.004928596 -0.007738685 0.01405064
## FVeg07 -0.001631295 0.014988185 0.085166544 0.013185965 0.02567157
## InfMass07 -0.001913708 0.006713826 0.016722010 0.021764600 0.01520955
## Fruits07 -0.005508583 0.011782168 0.015681197 0.016040450 0.01826225
## Flwr08 -0.007992189 -0.001738240 0.018285094 0.010560324 0.01729374
## FVeg08 0.009578825 0.004223542 0.079519446 0.034625217 0.03496456
## HVeg08 0.008657368 0.013508157 0.119803482 0.033634613 0.05813508
## InfMass08 -0.019358255 0.032548006 0.003622360 0.012096420 0.03415076
## Flwr09 0.003275487 0.017736985 0.012515930 0.014014024 0.02550537
## FVeg09 0.011723225 -0.002466410 0.055300191 0.045465054 0.02435035
## HVeg09 0.014442269 -0.006799589 0.086546894 0.055549458 0.04932667
## InfMass09 -0.006096868 -0.002123047 0.021587275 0.054214124 0.02796757
## Flwr10 -0.005039790 0.009952488 0.003826057 0.008124307 0.02330501
## FVeg10 -0.010482669 0.016395718 0.088346795 0.030986298 0.04923777
## HVeg10 -0.019523173 0.016453316 0.093492822 0.023046527 0.00708809
## InfMass10 0.003710101 0.035875783 0.010813311 0.040835638 0.02960191
## T MeanDecreaseAccuracy MeanDecreaseGini
## Flwr07 0.0410079763 0.005910561 10.564895
## FVeg07 0.0369287233 0.030946907 19.224241
## InfMass07 0.0010490946 0.011673132 10.390711
## Fruits07 0.0009674302 0.011306184 10.414331
## Flwr08 0.0396751488 0.012301069 14.447687
## FVeg08 0.0411409429 0.035188562 19.233688
## HVeg08 0.0450593101 0.049839229 26.390087
## InfMass08 0.0540984405 0.020789711 17.565571
## Flwr09 0.0120617109 0.015649941 14.256480
## FVeg09 0.0338967637 0.027956872 16.512516
## HVeg09 0.0257167692 0.039130558 20.526872
## InfMass09 -0.0014795238 0.018473507 14.803727
## Flwr10 0.0076865417 0.008779833 8.886772
## FVeg10 0.0365081833 0.038035191 20.446154
## HVeg10 0.0173974316 0.027198121 18.028531
## InfMass10 0.0130130210 0.024508780 15.818013
The columns A-T show how the accuracy of the model for that particular group is affected by removing each feature. These are averaged in the MeanDecreaseAccuracy
column. The higher the number, the more the accuracy of the model is impacted by this feature. We can look at the individual population columns to see how different features might be important for some populations more than others.
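A quick way to visualize these importance scores is the varImpPlot function from the randomForest package (an optional sketch, not used in the original text):
varImpPlot(PopFor, main="Feature importance for PopFor") # plots MeanDecreaseAccuracy and MeanDecreaseGini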
The above example uses bagging across the number of trees defined by ntree. Bagging ('bootstrap aggregating') just means that we aggregate the results of all of the different models equally.
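To make that idea concrete, here is a rough by-hand sketch of bagging with the tree function (randomForest does this internally, plus the random feature subsetting described above):
set.seed(123)
# Grow 25 trees, each on a bootstrap sample of the rows
Preds <- sapply(1:25, function(i){
  Boot <- popDat[sample(1:nrow(popDat), replace=TRUE), ]
  as.character(predict(tree(Pop ~ ., data=Boot), popDat, type="class"))
})
# 'Bag' the trees: majority vote across the 25 predictions for each plant
BagPred <- apply(Preds, 1, function(x) names(which.max(table(x))))
mean(BagPred == popDat$Pop) # in-sample proportion correctly classified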
Rather than simply averaging the models, we can impose different structures that affect how each subsampled tree contributes to the whole prediction. This kind of model building is called boosting
, which can be implemented with the gbm
function in R from the library of the same name.
There are several common/popular boosting methods, and this opens up a lot of possibilities for how the individual trees are weighted and combined into the final model.
set.seed(123)
PopBoost <- gbm(Pop ~ ., data=popDat,
                distribution="tdist",
                n.trees = 100, interaction.depth=2, cv.folds=12)
PopBoost
## gbm(formula = Pop ~ ., distribution = "tdist", data = popDat,
## n.trees = 100, interaction.depth = 2, cv.folds = 12)
## A gradient boosted model with tdist loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 81.
## There were 16 predictors of which 16 had non-zero influence.
distribution refers to the assumed distribution of the response variable, which is categorical rather than Gaussian, but here we treat it as continuous (a t-distribution).
summary(PopBoost)
## var rel.inf
## InfMass08 InfMass08 18.094556
## InfMass10 InfMass10 13.552674
## HVeg09 HVeg09 11.375573
## HVeg08 HVeg08 9.783843
## Flwr08 Flwr08 7.334747
## FVeg07 FVeg07 6.810870
## Flwr09 Flwr09 6.442719
## Flwr07 Flwr07 5.093869
## FVeg08 FVeg08 4.318357
## FVeg10 FVeg10 3.617387
## HVeg10 HVeg10 2.951587
## InfMass09 InfMass09 2.587825
## Fruits07 Fruits07 2.585792
## FVeg09 FVeg09 2.259377
## InfMass07 InfMass07 1.886983
## Flwr10 Flwr10 1.303841
The relative influence here shows how important each feature is to the accuracy of the model, ranked from highest to lowest.
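If we want to see how the top-ranked feature relates to the boosted prediction, gbm also provides partial dependence plots through its plot method (an optional sketch; the feature name and the 81 trees simply follow the output above):
plot(PopBoost, i.var="InfMass08", n.trees=81) # partial dependence on InfMass08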
CatDat3 <- data.frame(Obs=popDat$Pop, Pred=predict(PopBoost, popDat, type="response"))
## Using 81 trees...
head(CatDat3)
## Obs Pred
## 1 C 2.355361
## 2 C 2.262856
## 3 A 2.190231
## 4 C 2.448284
## 5 T 3.576273
## 6 C 2.953142
Notice that the predictions are fractional rather than categorical. The numbers represent the different categories numerically, which we can see with as.numeric
unique(CatDat3$Obs)
## [1] C A T E J S
## Levels: A C E J S T
unique(as.numeric(CatDat3$Obs))
## [1] 2 1 6 3 4 5
We can plot these to compare the observed categories against the predicted values.
CatDat3$ObsNum <- as.numeric(CatDat3$Obs)
qplot(Pred, ObsNum, data=CatDat3, alpha=I(0.3))
Or we can round
our predictions to find the closest category so that we can calculate our confusion matrix and misclassification rate.
CatDat4 <- data.frame(Obs=CatDat3$ObsNum, Pred=round(CatDat3$Pred,0))
table(CatDat4)
## Pred
## Obs 1 2 3 4 5
## 1 2 27 10 6 0
## 2 0 50 29 7 0
## 3 0 22 49 15 0
## 4 0 2 11 54 15
## 5 0 0 3 30 47
## 6 0 3 11 30 9
MisClass <- CatDat4 %>%
  filter(Obs!=Pred)
nrow(MisClass)/nrow(CatDat4)
## [1] 0.5324074
All that extra work, but not much improvement in the model.
Now try running a similar analysis on the same data to look at predictions for Site and Region.
Then, try making a new variable by pasting Site and Region together to define a new column, and run the models on that column.
What phenotypes are most different among sites vs regions? How well can you predict the site and region based on its phenotype?
Try re-running one or more of the Decision Tree models on the two main PC axes rather than the features themselves.
See if you can improve the models by changing some of the pruning and cross-validation parameters.