library(ggplot2)
library(ape)
library(ggtree)
library(vegan)
Create a directory called Data
inside of your working directory. Then downlod the zip file below and unzip it to your Data folder:
The folder should contain these files:
These data are from a meta-analysis of plant microbiome studies using 16S metabarcodes. Each sample is a plant that was sequenced for 16S to identify the bacteria associated with it.
Let’s take a look at the data:
This file contains information about the samples. It is essentially an index that links a specific sample code to the species that it was taken from.
<-read.csv("./Data/SampleInfo.csv") Samples
<- read.delim("./Data/OTU_table_QIIME1.txt", header=T, row.names="OTU_ID") OTU_table
Note that we used the parameter header=T
, which sets the first row as column names. We also use row.names=OTU_ID
, which sets the column names to the values in the column called OTU_ID
. What does this do? You’ll understand why we did this in a minute.
This is a simple table tab-delimited data table. Note that it is imported with the read.delim()
function in R, which translates a tab character \t
into columns. If we were to open it in a text file, the first four lines would look like this:
OTU_ID SRR4267992 SRR4289215 SRR5211360 SRR5211361 SRR5439762 SRR5439772 SRR4267987 SRR4289218 SRR4734815 SRR4708713 SRR5051401 SRR5051407 SRR5051999 SRR5052001 SRR5052006 SRR5051567 SRR5052003 SRR5051415 SRR5051568 ERR632209 SRR4267988 SRR4289217 SRR2087833 SRR2087830 SRR2087835 ERR1745180 ERR1745155 ERR1745154 ERR1745153 ERR1745151 ERR1745152 ERR1745156 ERR1745178 SRR2204344 SRR2204709 SRR2087832 SRR3073557 ERR632218 SRR5439753 SRR3742182 SRR3742176 SRR3742138 SRR3073553 SRR3073555 ERR632208 SRR5439754 SRR2087831 SRR2087834 SRR3073554 SRR5439751 SRR5439752 SRR5439770 SRR5439755 SRR5439761 SRR5051400 ERR632220 SRR5185934 SRR4010871 SRR5439756 SRR5439757 SRR5439759 SRR5439763 SRR5439764 SRR5439765 SRR5439766 SRR5439767 SRR5439768 SRR5439769 SRR5439773 SRR5439774 SRR5439771 SRR4265353 SRR4289222 ERR1745165 ERR1745169 SRR3073556 SRR3073558 SRR3073559 ERR1745158 ERR1745179 SRR5439758 SRR5439760 SRR5052000 SRR5052004 ERR1745177 SRR4010872 SRR4265352 SRR4289224 ERR1745175 ERR1745176 SRR4267986 SRR4289219 ERR631092 ERR632207 ERR632221 SRR5185925 SRR5439796 SRR3747548 SRR5439793 SRR2204159 SRR5439778 SRR5439782 SRR5439776 SRR5439789 SRR5439781 SRR5439786 SRR5439798 ERR1745170 ERR1745167 SRR4706101 SRR2064974 SRR5439779 SRR5439780 SRR5439795 SRR5439794 SRR4453815 SRR3742177 SRR3742180 SRR3742197 SRR3742190 SRR3742179 SRR3742194 SRR3742184 ERR1745163 SRR2064975 SRR5185927 SRR5185938 SRR5185933 SRR5185932 SRR5185935 SRR5185926 SRR5185924 SRR5185931 SRR5185936 SRR5185929 SRR5185939 SRR5185937 SRR5185940 SRR4263477 SRR4289223 SRR5185944 SRR5185942 SRR5185946 SRR5185943 SRR5185941 ERR1745182 ERR632219 ERR1745181 ERR1745159 ERR1745157 ERR1745183 SRR4267985 SRR4289220 SRR5439785 SRR5439788 SRR5439775 SRR5439777 SRR5439783 SRR5439787 SRR5439791 SRR5439792 SRR5439797 SRR5439784 SRR5439790 ERR1745171 ERR1745168 SRR5185930 SRR4734837 SRR3742191 SRR3742193 SRR5185947 ERR1745164 ERR1745166 SRR5185945 SRR2204707 SRR5185928 ERR592858 taxonomy
4479946 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__MND1; f__; g__; s__
4386156 0.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria; g__; s__
Once parsed, the data are easier to understand:
SRR4267992 | SRR4289215 | SRR5211360 | SRR5211361 | SRR5439762 | SRR5439772 | SRR4267987 | SRR4289218 | SRR4734815 | SRR4708713 | SRR5051401 | SRR5051407 | SRR5051999 | SRR5052001 | SRR5052006 | SRR5051567 | SRR5052003 | SRR5051415 | SRR5051568 | ERR632209 | SRR4267988 | SRR4289217 | SRR2087833 | SRR2087830 | SRR2087835 | ERR1745180 | ERR1745155 | ERR1745154 | ERR1745153 | ERR1745151 | ERR1745152 | ERR1745156 | ERR1745178 | SRR2204344 | SRR2204709 | SRR2087832 | SRR3073557 | ERR632218 | SRR5439753 | SRR3742182 | SRR3742176 | SRR3742138 | SRR3073553 | SRR3073555 | ERR632208 | SRR5439754 | SRR2087831 | SRR2087834 | SRR3073554 | SRR5439751 | SRR5439752 | SRR5439770 | SRR5439755 | SRR5439761 | SRR5051400 | ERR632220 | SRR5185934 | SRR4010871 | SRR5439756 | SRR5439757 | SRR5439759 | SRR5439763 | SRR5439764 | SRR5439765 | SRR5439766 | SRR5439767 | SRR5439768 | SRR5439769 | SRR5439773 | SRR5439774 | SRR5439771 | SRR4265353 | SRR4289222 | ERR1745165 | ERR1745169 | SRR3073556 | SRR3073558 | SRR3073559 | ERR1745158 | ERR1745179 | SRR5439758 | SRR5439760 | SRR5052000 | SRR5052004 | ERR1745177 | SRR4010872 | SRR4265352 | SRR4289224 | ERR1745175 | ERR1745176 | SRR4267986 | SRR4289219 | ERR631092 | ERR632207 | ERR632221 | SRR5185925 | SRR5439796 | SRR3747548 | SRR5439793 | SRR2204159 | SRR5439778 | SRR5439782 | SRR5439776 | SRR5439789 | SRR5439781 | SRR5439786 | SRR5439798 | ERR1745170 | ERR1745167 | SRR4706101 | SRR2064974 | SRR5439779 | SRR5439780 | SRR5439795 | SRR5439794 | SRR4453815 | SRR3742177 | SRR3742180 | SRR3742197 | SRR3742190 | SRR3742179 | SRR3742194 | SRR3742184 | ERR1745163 | SRR2064975 | SRR5185927 | SRR5185938 | SRR5185933 | SRR5185932 | SRR5185935 | SRR5185926 | SRR5185924 | SRR5185931 | SRR5185936 | SRR5185929 | SRR5185939 | SRR5185937 | SRR5185940 | SRR4263477 | SRR4289223 | SRR5185944 | SRR5185942 | SRR5185946 | SRR5185943 | SRR5185941 | ERR1745182 | ERR632219 | ERR1745181 | ERR1745159 | ERR1745157 | ERR1745183 | SRR4267985 | SRR4289220 | SRR5439785 | SRR5439788 | SRR5439775 | SRR5439777 | SRR5439783 | SRR5439787 | SRR5439791 | SRR5439792 | SRR5439797 | SRR5439784 | SRR5439790 | ERR1745171 | ERR1745168 | SRR5185930 | SRR4734837 | SRR3742191 | SRR3742193 | SRR5185947 | ERR1745164 | ERR1745166 | SRR5185945 | SRR2204707 | SRR5185928 | ERR592858 | taxonomy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
432284 | 31 | 31 | 0 | 0 | 3 | 5 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 22 | 22 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 291 | 19 | 0 | 0 | 0 | 0 | 0 | 1 | 291 | 1 | 5 | 0 | 16 | 8 | 1 | 4 | 9 | 0 | 227 | 48 | 414 | 2 | 1 | 2 | 6 | 3 | 24 | 7 | 1 | 10 | 6 | 7 | 11 | 3 | 30 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 4 | 450 | 10 | 10 | 0 | 0 | 3 | 3 | 259 | 18 | 224 | 0 | 19 | 0 | 147 | 0 | 391 | 0 | 0 | 9 | 15 | 0 | 9 | 0 | 0 | 0 | 25957 | 0 | 92 | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 254 | 88 | 79 | 0 | 3 | 0 | 0 | 0 | 10 | 9 | 166 | 9 | 35 | 0 | 18 | 18 | 10 | 583 | 21 | 183 | 5 | 0 | 346 | 0 | 0 | 0 | 0 | 18 | 18 | 4 | 0 | 1 | 5 | 1 | 0 | 1 | 1 | 0 | 12 | 4 | 0 | 0 | 10 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | k__Bacteria; p__Cyanobacteria; c__Chloroplast; oStreptophyta; f; g; s |
694558 | 0 | 0 | 91 | 108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 17 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 15 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; gBacillus; s |
543472 | 0 | 0 | 0 | 0 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Bradyrhizobiaceae; g; s |
938794 | 0 | 0 | 22 | 0 | 0 | 0 | 7 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 12 | 12 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Moraxellaceae; g__Acinetobacter; s__guillouiae |
Note that this is a 16.5MB file, so only a few lines are printed here. The first column contains the OTU_ID
– this is a unique ID number assigned to a unique ‘Operational Taxonomic Unit’ or OTU. The term OTU is common for microbes like bacteria and fungi where species concepts don’t work so well. Since we are inferring taxonomy from a single gene, all sequences that share the same sequence are considered part of the same OTU.
The next columns start with SRR
. These are the the individual samples that were sequenced. Each number in the SRR
column corresponds to the number of sequences for the OTU that were present in that sample.
The last column is called taxonomy
and contains the predicted taxonomic assignment for each OTU. Let’s remove these so that we aren’t mixing numeric and character data.
<-OTU_table[,-ncol(OTU_table)] OTU_data
Let’s see how many OTUs and samples are in the dataset:
dim(OTU_data)
## [1] 21201 177
How many OTUs? How many samples?
Let’s see how many total sequences are in this dataset:
<-rowSums(OTU_data)
Xsum(X)
## [1] 18080169
With > 18,000,000 reads (i.e. individual sequences) we are bound to get some reads that are just contamination. To reduce the dataset slightly we can remove any OTUs that don’t have more than one sequence in more than one sample:
<-rowSums(OTU_data) < 2
Dropsum(Drop)
## [1] 3956
This will allow us to drop almost 4,000 OTUs from the analysis.
<-OTU_data[!Drop,] OTU_red
To analyze the OTU table, we need to rearrange with the species across the top and sample sizes along the bottom. A simple way to do this is with the transpose function t()
. Since it’s a big file, let’s do it only on the first 3 rows and columns to see what happens:
1:3,1:3] OTU_red[
## SRR4267992 SRR4289215 SRR5211360
## 4479946 1 1 0
## 4386156 0 0 1
## 4479944 0 0 0
t(OTU_red[1:3,1:3])
## 4479946 4386156 4479944
## SRR4267992 1 0 0
## SRR4289215 1 0 0
## SRR5211360 0 1 0
That looks good, so let’s do it for the whole dataset:
<-as.data.frame(t(OTU_red)) OTUs
Now that we have samples as rows and OTUs as columns, how can we compare the samples?
Hint: How did we compare dragons with different traits?
(see the Dragon Phylogeny tutorial)
In an analogous fashion, we can calculate pairwise distances among our samples to create a distance matrix.
We can change our OTUs
dataframe into binary data by changing the read counts in each square to 1 or 0. Of course, we lose information of the relative number of sequences for each OTU, so we’ll look how to analyze that data later.
Transform to 1 if # reads >=1 and 0 otherwise. We could use a nested loop through each row and each column. You CAN do IT this WAY, but it may take a long time, so you might want to skip this code, but take some time to look at it carefully and make sure you understand what it is doing (or replace nrow(OTUs)
with a smaller number like 5):
<-OTUs
OTU_binfor (i in 1:nrow(OTU_bin)){
for (j in 1:ncol(OTU_bin)){
if(OTU_bin[i,j]>0){
<-1
OTU_bin[i,j]
}
} }
Loops are not computationally efficient – especially loops nested within loops. This becomes a problem when we have a lot of rows and columns to deal with. Since we are working with a numeric data.frame (i.e. all data in rows and columns are numeric), there is a nice shortcut:
<-OTUs[1:3,1:3]
X X
## 4479946 4386156 4479944
## SRR4267992 1 0 0
## SRR4289215 1 0 0
## SRR5211360 0 1 0
>0]<-100
X[X X
## 4479946 4386156 4479944
## SRR4267992 100 0 0
## SRR4289215 100 0 0
## SRR5211360 0 100 0
This won’t work with most data frames, because data frames often contain heterogenous data (e.g. sample ID, measurements, counts, binary, factor).
Now we apply it to the full dataset to create a binary matrix:
<-OTUs
OTU_bin>0]<-1 OTU_bin[OTU_bin
Did you have to wait?
Compare that compute time to a nested loop. Remember:
dim(OTUs)
## [1] 177 17245
That means the total number of cells to replace is:
nrow(OTUs)*ncol(OTUs)
## [1] 3052365
Just as we did in the Dragon Phylogeny tutorial, we can calculate the pairwise distance of this binary matrix.
<-dist(OTU_bin,method='binary') OTU_bin_dist
What will our matrix size be? How many elements?
We are comparing the sites (rows), so:
nrow(OTU_bin)
## [1] 177
nrow(OTU_bin) * nrow(OTU_bin)
## [1] 31329
Just as we did with our Dragon Phylogeny, we can cluster our samples by their similarity. For example, we can use the neighbor-joining method:
library(ape)
library(ggtree)
<-nj(OTU_bin_dist)
OTU_treeggtree(OTU_tree,layout="rectangular")
Note that this is NOT A PHYLOGENY – a phylogeny shows an hypothesis about evolutionary relationships among organisms. This is a neighbour-joining tree that clusters samples based on the similarity of their microbes.
To see if species cluster together, we can colour-code and annotate our tree using our Samples
data.frame object.
ggtree(OTU_tree,layout="rectangular") %<+% Samples +
geom_tiplab(aes(colour=Species)) +
theme(legend.position="right")
In addition to binary distances, we can include abundance information. In our case ‘abundance’ is inferred from the number of sequence reads – the idea being that the more abundant a bacterial species is in the sample, the more likely it is for its DNA to be sequenced multiple times.
The dist
function has other options for calculating distances:
?dist
A popular measure is the Bray-Curtis dissimilarity metric. The Bray-Curtis dissimilarity (BC) between two samples \(i\) and \(j\) is:
\[BC_{ij} = 1-\frac{2 \sum_{k=1}^{N} min(Sp_{i,k},Sp_{j,k})}{\sum_{k=1}^{N} Sp_{i,k}+\sum_{k=1}^{N} Sp_{j,k}}\] where \(Sp_k\) is the number of ‘reads’ for the species k. Fortunately, the vegan
package does this so we don’t have to calculate this manually.
library(vegan)
<- vegdist(OTUs,method="bray",binary=F)
OTU_dist<-nj(OTU_dist)
OTU_tree2ggtree(OTU_tree2,layout="rectangular") %<+% Samples +
geom_tiplab(aes(colour=Species)) +
theme(legend.position="right")
Notice the change in tree topology. Why?
In addition to tree-based clustering, we can use a Non-Metric Multidimensional Scaling (NMDS) to visualize similarity/differences among our samples. This method is a bit complicated, but you can see the Wikipedia page for details. The important thing to know is that it is an algorithm for plotting points so that two points that are similar end up close together on the graph. Usually, this is done in two dimensions to generate a bivariate plot, but there are some cases where you might want just one or 3 or more.
set.seed(13)
<-metaMDS(OTU_dist,k=2) # k = 2 dimensions NMDSdat
## Run 0 stress 0.2481601
## Run 1 stress 0.2448077
## ... New best solution
## ... Procrustes: rmse 0.06313684 max resid 0.1776276
## Run 2 stress 0.2476016
## Run 3 stress 0.2534164
## Run 4 stress 0.2528905
## Run 5 stress 0.2533469
## Run 6 stress 0.2452108
## ... Procrustes: rmse 0.00558734 max resid 0.071541
## Run 7 stress 0.2447915
## ... New best solution
## ... Procrustes: rmse 0.002684137 max resid 0.03323182
## Run 8 stress 0.2504992
## Run 9 stress 0.2478204
## Run 10 stress 0.2479272
## Run 11 stress 0.2510769
## Run 12 stress 0.2718408
## Run 13 stress 0.2572388
## Run 14 stress 0.2815646
## Run 15 stress 0.2522907
## Run 16 stress 0.256157
## Run 17 stress 0.2498118
## Run 18 stress 0.256801
## Run 19 stress 0.2447743
## ... New best solution
## ... Procrustes: rmse 0.005080067 max resid 0.04743662
## Run 20 stress 0.2452108
## ... Procrustes: rmse 0.008389072 max resid 0.07155341
## *** No convergence -- monoMDS stopping criteria:
## 19: stress ratio > sratmax
## 1: scale factor of the gradient < sfgrmin
It looks like our model did not converge after 20 runs, we can try increasing this:
set.seed(13)
<-metaMDS(OTU_dist,k=2,trymax = 100) NMDSdat
## Run 0 stress 0.2481601
## Run 1 stress 0.2448077
## ... New best solution
## ... Procrustes: rmse 0.06313684 max resid 0.1776276
## Run 2 stress 0.2476016
## Run 3 stress 0.2534164
## Run 4 stress 0.2528905
## Run 5 stress 0.2533469
## Run 6 stress 0.2452108
## ... Procrustes: rmse 0.00558734 max resid 0.071541
## Run 7 stress 0.2447915
## ... New best solution
## ... Procrustes: rmse 0.002684137 max resid 0.03323182
## Run 8 stress 0.2504992
## Run 9 stress 0.2478204
## Run 10 stress 0.2479272
## Run 11 stress 0.2510769
## Run 12 stress 0.2718408
## Run 13 stress 0.2572388
## Run 14 stress 0.2815646
## Run 15 stress 0.2522907
## Run 16 stress 0.256157
## Run 17 stress 0.2498118
## Run 18 stress 0.256801
## Run 19 stress 0.2447743
## ... New best solution
## ... Procrustes: rmse 0.005080067 max resid 0.04743662
## Run 20 stress 0.2452108
## ... Procrustes: rmse 0.008389072 max resid 0.07155341
## Run 21 stress 0.2485014
## Run 22 stress 0.2458439
## Run 23 stress 0.2526073
## Run 24 stress 0.2514402
## Run 25 stress 0.253503
## Run 26 stress 0.2522898
## Run 27 stress 0.2501302
## Run 28 stress 0.2476545
## Run 29 stress 0.251593
## Run 30 stress 0.2452109
## ... Procrustes: rmse 0.008385411 max resid 0.07155543
## Run 31 stress 0.2594707
## Run 32 stress 0.2797743
## Run 33 stress 0.2542847
## Run 34 stress 0.2480837
## Run 35 stress 0.2514054
## Run 36 stress 0.2511438
## Run 37 stress 0.2505249
## Run 38 stress 0.2582468
## Run 39 stress 0.2580423
## Run 40 stress 0.2570759
## Run 41 stress 0.2596559
## Run 42 stress 0.2917437
## Run 43 stress 0.2514907
## Run 44 stress 0.2512227
## Run 45 stress 0.258097
## Run 46 stress 0.2456734
## Run 47 stress 0.2644926
## Run 48 stress 0.2457242
## Run 49 stress 0.2451999
## ... Procrustes: rmse 0.007929362 max resid 0.0714391
## Run 50 stress 0.2448077
## ... Procrustes: rmse 0.005733739 max resid 0.04747065
## Run 51 stress 0.2534336
## Run 52 stress 0.2704354
## Run 53 stress 0.2544404
## Run 54 stress 0.2511885
## Run 55 stress 0.2505814
## Run 56 stress 0.2588626
## Run 57 stress 0.2447924
## ... Procrustes: rmse 0.002638983 max resid 0.0326045
## Run 58 stress 0.2571638
## Run 59 stress 0.2478662
## Run 60 stress 0.260839
## Run 61 stress 0.2518501
## Run 62 stress 0.2505426
## Run 63 stress 0.2514963
## Run 64 stress 0.2566792
## Run 65 stress 0.2511014
## Run 66 stress 0.2540744
## Run 67 stress 0.2466613
## Run 68 stress 0.2511703
## Run 69 stress 0.2551103
## Run 70 stress 0.2476687
## Run 71 stress 0.2517213
## Run 72 stress 0.2452107
## ... Procrustes: rmse 0.008385849 max resid 0.0715598
## Run 73 stress 0.2505126
## Run 74 stress 0.2578466
## Run 75 stress 0.2495588
## Run 76 stress 0.2513394
## Run 77 stress 0.2601173
## Run 78 stress 0.2541433
## Run 79 stress 0.2447909
## ... Procrustes: rmse 0.002668734 max resid 0.03263473
## Run 80 stress 0.2526854
## Run 81 stress 0.2503426
## Run 82 stress 0.2554527
## Run 83 stress 0.2505249
## Run 84 stress 0.2477735
## Run 85 stress 0.249047
## Run 86 stress 0.2515544
## Run 87 stress 0.2498213
## Run 88 stress 0.2470329
## Run 89 stress 0.2454711
## Run 90 stress 0.2571176
## Run 91 stress 0.2555098
## Run 92 stress 0.2502964
## Run 93 stress 0.2532716
## Run 94 stress 0.2447905
## ... Procrustes: rmse 0.002703492 max resid 0.03268134
## Run 95 stress 0.2468473
## Run 96 stress 0.2458538
## Run 97 stress 0.2509962
## Run 98 stress 0.2473709
## Run 99 stress 0.2525378
## Run 100 stress 0.252097
## *** No convergence -- monoMDS stopping criteria:
## 2: no. of iterations >= maxit
## 88: stress ratio > sratmax
## 10: scale factor of the gradient < sfgrmin
If it still hasn’t converged after 100 iterations, we can increase trymax
to try to get a better solution. Go ahead and try and then compare to the figure below.
Create data for plotting:
<-data.frame(NMDS1=NMDSdat$points[,1],
PDatNMDS2=NMDSdat$points[,2],
SampleID=row.names(OTUs))
Add species labels using the merge function:
<-merge(PDat,Samples,by="SampleID",all.x=T,all.y=F) PDat
And plot:
qplot(x=NMDS1,y=NMDS2,colour=Species,alpha=I(0.6),data=PDat)+theme_bw()