Setup

Note that this tutorial requires installing and running packages that your computer may restrict or flag for security reasons. In Windows, you can avoid a lot of problems by right-clicking RStudio and choosing ‘Run as Administrator’. In Unix/Windows, you want to make sure you have ‘root privileges’ when running RStudio. This can be a bit tricky for recent versions of MacOS. If you have problems with this tutorial, try restarting with admin/root privileges.

Before working through this tutorial, you should:

  1. Be familiar with NCBI’s GenBank
  2. Work through the DNA Barcodes tutorial
  3. Make sure the genbankr and Biostrings packages from Bioconductor (BiocManager) are installed:
library(BiocManager)
install("genbankr")
install("Biostrings")
install("ggtree")
install("annotate")
install("muscle") 

NOTE: If it asks Update all/some/none [a/s/n] – choose n or else R may spend a very long time updating. However, if you are having trouble, you may need to install some of these. Carefully read any warnings as you install and then load each package to see which dependencies you need to install. Note that some dependencies are CRAN packages, which are loaded with install.packages().

  1. Install the reshape2 library if needed:
install.packages("reshape2")
install.packages("rentrez")

HINT: Keep an eye out for library() commands to see what each package is used for

Genbank

The genbankr package from Bioconductor has some tools for pulling records from GenBank. GenBank is a genetic sequence database that is maintained by the National Institutes of Health (NIH) in the United States. These records are curated by a staff of NIH researchers, and most peer-reviewed journals will require you to submit your sequences to GenBank and/or another database for DNA sequence data. In the DNA sequencing tutorial we saw the Sequence Read Archive (SRA), which is a database for storing raw data from high-throughput sequences.

Explore Genbank

First we’ll pull an accession for the 16S gene of Borrelia bissettii – one of the spirochete bacteria that causes Lyme disease. This is a human-readable text format that you can access online – accession on the GenBank website.

You will probably find it helpful to have that link open in a web browser to compare with the output of the code below.

First, load the library for reading Genbank files, then import the file as an object we can use in R.

library(rentrez)
library(genbankr)
BbID<-GBAccession("NR_148750")
BbID
## GenBank Accession Number(s):  NR_148750
BbGbk<-readGenBank(BbID)
## No exons read from genbank file. Assuming sections of CDS are full exons

NOTE: If you get an error that says ‘namespace load failed’, look for a package name associated with the error and try using install() from the BiocManager library to install (or update if already installed); if that doesn’t work, try install.packages().

You may also get a warning message along the lines of No exons... Think about what this might mean – we’ll come back to it soon.

The GenBank record is imported as a GenBankRecord object, which was created by the readGenBank file and includes a number of useful ‘slices’. These slices can be addressed with @. Think of this the same way we can use $ to subset a column from a data.frame object, except here we can have more complicated data structures than a simple row x column format. We can see this if we look at the structure and class of the object:

class(BbGbk)
## [1] "GenBankRecord"
## attr(,"package")
## [1] "genbankr"

This tells us that GenBankRecord is an attribute of the genbankr package.

Recall that class() is a function that returns the ‘class’ of an object. Here are a few other quick examples:

MyString<-"ATGCTCGGACA"
MyVec<-rnorm(1000)
MyDat<-data.frame(x=MyVec)
class(MyString)
## [1] "character"
class(MyVec)
## [1] "numeric"
class(c(MyString,MyVec))
## [1] "character"
class(MyDat)
## [1] "data.frame"

So a GenBankRecord is a special type of object in R, just like string or numeric or data.frame are different kinds of objects in R. To understand the structure of a GenBankRecord, we can use the structure function:

str(BbGbk)

output not shown (too long)

There are a few functions we can use to subset important slices. Here are some examples:

cds

CDS stands for Coding Sequence. This is useful for genomic DNA and for RNA sequence data, where some of the sequence may not be translated into proteins. The coding sequence is the part of the DNA or RNA that ‘codes’ for proteins.

cds(BbGbk)
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 1 sequence from NR_148750.1 genome

Note the output here: 0 ranges and 0 metadata columns. This is telling us that there are no cds entries for this data file. This is consistent with the warning we got when we imported the record, which told us that R was assuming the entire DNA sequence was a coding sequence. In this case, we are looking at the DNA sequence that codes for the 16S subunit of the ribosomal RNA, which is 100% coding sequence.

transcripts

The transcripts function shows the RNA sequences that are transcribed from DNA. Some genes can produce multiple transcripts, through a process known as alternative splicing. Different transcripts spliced from the same DNA gene are called isoforms.

transcripts(BbGbk)
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 1 sequence from NR_148750.1 genome

Similar to above, there are no transcripts listed. This gene does not have alternative splicing, so the entire sequence is encoded as a transcript.

exons

Exons are the sequences of RNA or DNA that code for proteins.

exons(BbGbk)
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 1 sequence from NR_148750.1 genome

As expected, there are no exons since the entire sequence is non-coding.

genes

The genes function would return a list of sequences corresponding to genes. We might see this in a large DNA sequence that represents a genome or a large part of the chromosome. In that case, our sequence might be millions of bases long and the sequences of individual genes would occur here.

genes(BbGbk)
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 1 sequence from NR_148750.1 genome

otherFeatures

We can see other features of the gene with the otherFeature function.

otherFeatures(BbGbk)
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |        type           product     loctype
##          <Rle> <IRanges>  <Rle> | <character>       <character> <character>
##   [1]    DN127    1-1536      + |        rRNA 16S ribosomal RNA      normal
##   -------
##   seqinfo: 1 sequence from NR_148750.1 genome

Here we can see why we don’t get outputs for most of the above functions – we are looking at part of the 16S rRNA gene, which is a sequence of DNA that codes for RNA that combines with other RNA and proteins to make the ribosome. The entire DNA sequence codes for RNA, without alternative splicing (only 1 transcript), and this RNA becomes part of the ribosome, so it doesn’t code for proteins.

Of course, there are many other types of DNA sequences in GenBank, which can have more complicated functions with multiple transcript isoforms, or even many different genes (e.g. on a genome or plasmid). R can be a bit

Pairwise Alignments

We already saw in the DNA Barcodes tutorial how to use BLAST to find similar sequences. Let’s try with our genbank sequence (using the @sequence slice from our genbank object). It could take a few seconds to run, if it asks you to keep searching, then choose y until it completes.

Note that this may take up to a minute or two to run.

library(annotate)
BbGbkBLAST<-blastSequences(paste(BbGbk@sequence),as = 'data.frame',
                           hitListSize = 20, timeout = 600)

Note: If you get a warning that says 'blastSequences' returned 0 matches try re-running with a smaller hitListSize and/or longer timeout.

Here we are using BLAST from the command line. The reason it can take a while to run, is that we are interacting with the NCBI server, which is searching through its entire database to find matching sequences.

Note that the timeout parameter specifies how long to let the query run before giving up. In some cases with complex DNA sequences, this can take a long time to run, so you will want to increase your timeout. This will prevent BLAST from quitting before finishing the search.

The hitListSize specifies the number of hits – sequence matches – to keep from the output.

?blastSequences

The output of BLAST gives a set of pairwise alignments between the ‘query’ sequence that we submitted and each of the matches from the database, or ‘hits’.

Now that we have several matching sequences, we might want to align them so that we can see which bases are similar or different among the sequences.

Multiple Alignments

The ape package is commonly used in phylogenetics in R, and has a number of useful tools for comparing sequences.

library(ape)

First, let’s make a simple vector of accession numbers from the BLAST results above and make them into a simple data.frame object with two columns:

BbHitsDF<-data.frame(ID=BbGbkBLAST$Hit_accession,Seq=BbGbkBLAST$Hsp_hseq,
                     stringsAsFactors = FALSE)

Take a minute to think about what we are doing here. We are specifying an ID column, which is the Hit_accessions from our BLAST output. We are storing the sequences as another column, and we are keeping everything as strings instead of converting to factors. Remember that we often use strings to define factors like treatment, individual, population, etc. It’s handy that the data.frame() function will automatically convert those strings into factors for our analysis, but in this case we don’t want to convert the accession IDS or sequences.

Let’s take a look at the lengths of the sequences:

BbGbkBLAST$Hit_len
##  [1] "1536"   "900755" "900755" "906707" "908512" "908512" "908512" "909921"
##  [9] "910728" "909921" "903660" "903660" "903660" "910736" "909995" "909995"
## [17] "909995" "902191" "902191" "902191" "922801" "922801" "922801" "910724"
## [25] "1522"   "906449" "906449" "906449" "906449" "905638" "905638" "905638"
## [33] "905638" "902487" "902487" "902487" "906948" "906948" "906948" "903654"
## [41] "903654" "903654" "902789" "902789" "902789"

Note the number of large sequences (>900Kb). What could these be?

The entire rRNA gene is only about 1,500b, whereas the entire Borrelia burgdorferi genome is 910,725b. Therefore, these are likely hits to genome assemblies for this species. We can read in the sequences from GenBank using the read.Genbank() command from the ape library, but this could take a while since the entire sequences would be loaded for each. To demonstrate, let’s just look at the first three.

NOTE: This can take a few minutes to run, depending on your computer

BbHitSeqs<-read.GenBank(BbGbkBLAST$Hit_accession[1:3])

Let’s take a look at the species:

attr(BbHitSeqs,"species")
## [1] "Borreliella_bissettii_DN127" "Borreliella_bissettii_DN127"
## [3] "Borreliella_bissettii_DN127"

Note the species name is bissettii rather than burgdorferi. This is due to some taxonomic disagreement, but these are just different names for the same genetic species.

Take a minute to contrast the structure of the objects created by read.GenBank() and blastSequences(), above. These are different data objects:

str(BbGbkBLAST)
## 'data.frame':    45 obs. of  28 variables:
##  $ Iteration_iter-num : chr  "1" "1" "1" "1" ...
##  $ Iteration_query-ID : chr  "Query_3933" "Query_3933" "Query_3933" "Query_3933" ...
##  $ Iteration_query-def: chr  "No definition line" "No definition line" "No definition line" "No definition line" ...
##  $ Iteration_query-len: chr  "1536" "1536" "1536" "1536" ...
##  $ Iteration_hits     : chr  "1gi|1230874590|ref|NR_148750.1|Borreliella bissettii DN127 16S ribosomal RNA, complete sequenceNR_1487501536127"| __truncated__ "1gi|1230874590|ref|NR_148750.1|Borreliella bissettii DN127 16S ribosomal RNA, complete sequenceNR_1487501536127"| __truncated__ "1gi|1230874590|ref|NR_148750.1|Borreliella bissettii DN127 16S ribosomal RNA, complete sequenceNR_1487501536127"| __truncated__ "1gi|1230874590|ref|NR_148750.1|Borreliella bissettii DN127 16S ribosomal RNA, complete sequenceNR_1487501536127"| __truncated__ ...
##  $ Iteration_stat     : chr  "79991036677239413458000.410.6250.78" "79991036677239413458000.410.6250.78" "79991036677239413458000.410.6250.78" "79991036677239413458000.410.6250.78" ...
##  $ Hit_num            : chr  "1" "2" "2" "3" ...
##  $ Hit_id             : chr  "gi|1230874590|ref|NR_148750.1|" "gi|342222012|gb|CP002746.1|" "gi|342222012|gb|CP002746.1|" "gi|218164353|gb|CP001205.1|" ...
##  $ Hit_def            : chr  "Borreliella bissettii DN127 16S ribosomal RNA, complete sequence" "Borreliella bissettii DN127 chromosome, complete genome" "Borreliella bissettii DN127 chromosome, complete genome" "Borrelia burgdorferi ZS7, complete genome" ...
##  $ Hit_accession      : chr  "NR_148750" "CP002746" "CP002746" "CP001205" ...
##  $ Hit_len            : chr  "1536" "900755" "900755" "906707" ...
##  $ Hsp_num            : chr  "1" "1" "2" "1" ...
##  $ Hsp_bit-score      : chr  "2771.26" "2771.26" "269.087" "2757.74" ...
##  $ Hsp_score          : chr  "3072" "3072" "297" "3057" ...
##  $ Hsp_evalue         : chr  "0" "0" "1.86887e-66" "0" ...
##  $ Hsp_query-from     : chr  "1" "1" "1100" "1" ...
##  $ Hsp_query-to       : chr  "1536" "1536" "1509" "1536" ...
##  $ Hsp_hit-from       : chr  "1" "444940" "442850" "442210" ...
##  $ Hsp_hit-to         : chr  "1536" "443405" "442447" "440675" ...
##  $ Hsp_query-frame    : chr  "1" "1" "1" "1" ...
##  $ Hsp_hit-frame      : chr  "1" "-1" "-1" "-1" ...
##  $ Hsp_identity       : chr  "1536" "1536" "321" "1533" ...
##  $ Hsp_positive       : chr  "1536" "1536" "321" "1533" ...
##  $ Hsp_gaps           : chr  "0" "0" "24" "0" ...
##  $ Hsp_align-len      : chr  "1536" "1536" "419" "1536" ...
##  $ Hsp_qseq           : chr  "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ "CAACCCTTGTTATCTGTTACCAGCATGTAATGGTGGGGACTCAGATAAGACTGCCGGTGATAAGTCGGAGGAAGGTGAGGATGACGTCAAATCATCATGGCCCTTATGTCC"| __truncated__ "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ ...
##  $ Hsp_hseq           : chr  "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ "CAACTCTTGTTATC-GTTACCACTATGTAATGATAGGGATTTGTATAAGATTTTGGCT-ATAAAGCGGAGG---GTAAGGCTTGTGTTTAAG--TTATTGCTCTTATGTCC"| __truncated__ "ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGT"| __truncated__ ...
##  $ Hsp_midline        : chr  "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"| __truncated__ "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"| __truncated__ "|||| ||||||||| |||||||  |||||||| | |||| |   |||||| |   | | ||||  ||||||   || ||| |   ||  ||   | || || |||||||||"| __truncated__ "|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||"| __truncated__ ...
str(BbHitSeqs)
## List of 3
##  $ NR_148750: raw [1:1536] 88 18 88 88 ...
##  $ CP002746 : raw [1:900755] 18 88 88 88 ...
##  $ CP002746 : raw [1:900755] 18 88 88 88 ...
##  - attr(*, "class")= chr "DNAbin"
##  - attr(*, "description")= chr [1:3] "NR_148750.1 Borreliella bissettii DN127 16S ribosomal RNA, complete sequence" "CP002746.1 Borreliella bissettii DN127 chromosome, complete genome" "CP002746.1 Borreliella bissettii DN127 chromosome, complete genome"
##  - attr(*, "species")= chr [1:3] "Borreliella_bissettii_DN127" "Borreliella_bissettii_DN127" "Borreliella_bissettii_DN127"

Note that this may also take a while to run.

Different functions often return the same data, but in different formats. Using the data can require changing it’s format, as we’ll see when we try to align the sequences.

MUSCLE

Now that we have sequences we can try to align them. MUSCLE (MUltiple Sequence Comparison by Log-Expectation) is a popular alignment program. To use it, we have to understand what the input type looks like.

library(muscle)
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'muscle'
## The following object is masked from 'package:ape':
## 
##     muscle
?muscle
## starting httpd help server ...
##  done

You will most likely be prompted with two help entries for MUSCLE, make sure you choose the one from the muscle package and not the ape package. Each package uses a different type of input data.

DNASbin to DNAStringSet object

Looking at the help file, it looks like we need an object of class DNAStringSet.To understand what this class of object looks like, we can type ?XStringSet and scroll down to the bottom of the help file to take a look at the examples.

However, the previous object that we had created with the ape package is a DNAbin object (check the ape package to determine how these two objects differ). We will use dplyr to convert it to a DNAStringSet for an alignment. Let’s look at an example: The built in woodmouse data object from the ape package is a DNAbin object, which we can dissect to see how it works.

First, let’s take the DNAbin and convert to a string-based object and see it’s dimensions

data(woodmouse)
dim(as.character(woodmouse))
## [1]  15 965

It’s a 15 x 965 object, not unlike a data.frame object:

as.character(woodmouse)[1:10,1:10]
##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## No305   "n"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No304   "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No306   "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0906S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0908S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0909S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0910S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0912S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No0913S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"  
## No1103S "a"  "t"  "t"  "c"  "g"  "a"  "a"  "a"  "a"  "a"

So it just looks like a data.frame, with row names specifying the sequence name and columns for each base in the sequence. Let’s now convert to a DNAStringSet:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following object is masked from 'package:XVector':
## 
##     slice
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Biostrings)
WoodmouseDNAstring <- woodmouse %>% as.character %>% lapply(.,paste0,collapse="") %>% unlist %>% DNAStringSet

Examine the DNAStringSet object you just created

class(WoodmouseDNAstring)
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
str(WoodmouseDNAstring)
## Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
##   ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
##   .. .. ..@ xp_list                    :List of 1
##   .. .. .. ..$ :<externalptr> 
##   .. .. ..@ .link_to_cached_object_list:List of 1
##   .. .. .. ..$ :<environment: 0x000000002200b340> 
##   ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
##   .. .. ..@ group          : int [1:14475] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..@ start          : int [1:14475] 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. ..@ width          : int [1:14475] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..@ NAMES          : NULL
##   .. .. ..@ elementType    : chr "ANY"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ elementType    : chr "DNAString"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

Now Let’s try with our data:

as.character(BbHitSeqs)

Output not shown (too long)

This might work as input, but one big problem with aligning multiple sequences is that big sequences can take a very long time to align, especially if there are more than one. And we have some big sequences:

length(BbHitSeqs$CP002746)
## [1] 900755

In contrast, the sequences from our earlier BLAST hit only keep the part of the ‘hit’ that aligns with the ‘query’:

BbHitsDF[BbHitsDF$ID=="CP002746",]
##         ID
## 2 CP002746
## 3 CP002746
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Seq
## 2 ATAACGAAGAGTTTGATCCTGGCTTAGAACTAACGCTGGCAGTGCGTCTTAAGCATGCAAGTCAAACGGGATGTAGCAATACATTCAGTGGCGAACGGGTGAGTAACGCGTGGATGATCTACCTATGAGATGGGGATAACTATTAGAAATAGTAGCTAATACCGAATAAAGTCAATTAATTTGTTAATTGATGAAAGGAAGCCTTTAAAGCTTCGCTTGTAGATGAGTCTGCGTCTTATTAGCTAGTTGGTAGGGTAAATGCCTACCAAGGCGATGATAAGTAACCGGCCTGAGAGGGTGAACGGTCACACTGGAACTGAGATACGGTCCAGACTCCTACGGGAGGCAGCAGCTAAGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCGACACTGCGTGAATGAAGAAGGTCGAAAGATTGTAAAATTCTTTTATAAATGAGGAATAAGCTTTGTAGGAAATGACAAAGTGATGACGTTAATTTATGAATAAGCCCCGGCTAATTACGTGCCAGCAGCCGCGGTAATACGTAAGGGGCGAGCGTTGTTCGGGATTATTGGGCGTAAAGGGTGAGTAGGCGGATATATAAGTCTATGCATAAAATACCACAGCTCAACTGTGGACCTATGTTGGAAACTATATGTCTAGAGTCTGATAGAGGAAGTTAGAATTTCTGGTGTAAGGGTGGAATCTGTTGATATCAGAAAGAATACCGGAGGCGAAGGCGAACTTCTGGGTCAAGACTGACGCTGAGTCACGAAAGCGTAGGGAGCAAACAGGATTAGATACCCTGGTAGTCTACGCTGTAAACGATGCACACTTGGTGTTAACTAAAAGTTAGTACCGAAGCTAACGTGTTAAGTGTGCCGCCTGGGGAGTATGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTGACATATATAGGATATAGTTAGAGATAATTATTCCCCGTTTGGGGTCTATATACAGGTGCTGCATGGTTGTCGTCAGCTCGTGCTGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTTATCTGTTACCAGCATGTAATGGTGGGGACTCAGATAAGACTGCCGGTGATAAGTCGGAGGAAGGTGAGGATGACGTCAAATCATCATGGCCCTTATGTCCTGGGCTACACACGTGCTACAATGGCCTGTACAAAGCGAGGCGAAACAGTGATGTGAAGCAAAACGCATAAAGCAGGTCTCAGTCCGGATTGAAGTCTGAAACTCGACTTCATGAAGTTGGAATCGCTAGTAATCGTATATCAGAATGATACGGTGAATACGTTCTCGGGCCTTGTACACACCGCCCGTCACACCACCCGAGTTGAGGATACCCGAAGCTATTATTCTAACCCGTAAGGGAGGAAGGTATTTAAGGTATGTTTAGTGAGGGGGGTGAAGTCGTAACAAGGTAGCCGTACTGGAAAGTGCGGCTGGATCACCTCCTTT
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CAACTCTTGTTATC-GTTACCACTATGTAATGATAGGGATTTGTATAAGATTTTGGCT-ATAAAGCGGAGG---GTAAGGCTTGTGTTTAAG--TTATTGCTCTTATGTCCTACGC---ACATGTT-TACAATGTCCTGTACAAATCGAAGAGAAACGGTGATTTGAAGCAAAA-GCTTAAAGCGGGTTTCAGCATGGACTGCATTTTGAAACTCGACTTCATGAAGTTGTAATCGCTAGTAATCGTATATTGGGATGATACGATGAGTACGTTCTCAGGCACTTGTACACCGCTTGCCCGTTACATCGTCCGAGTATGGGGATGGTTCTCCGAAG-TATTATCCTAACCCGTAAGGGA-GAAGGTATTTAAGGTCCGTTTGGGGAGAGAAATGAGGTCGTAACAA-GAAGCTCGTACT

The next step is to convert to a new object using our sequence. We can do this with the pipe function %>% from the dplyr package.

BbHitsDNAstring <- BbHitsDF$Seq %>% # Start with the sequences
  as.character %>% # Be sure to convert to strings
  lapply(.,paste0,collapse="") %>% # Collapse each sequence to a single string
  unlist %>% # Flatten list to a vector
  DNAStringSet # Convert vector to DNAStringSet object

We should also give each sequence a unique names. Since some are repeated (hits to different parts of the same genome), we can add an index number

names(BbHitsDNAstring)<-paste(1:nrow(BbHitsDF),BbHitsDF$ID,sep="_")

Now we can run muscle() on our DNAStringSet object:

library(muscle)
BbAlign<-muscle::muscle(stringset=BbHitsDNAstring, quiet=T)
## Warning in file.remove(tempIn, tempOut): cannot remove file 'C:
## \Users\rob_c\AppData\Local\Temp\Rtmpg501zx\file446c7dafbd8.afa', reason
## 'Permission denied'

Note we want the muscle funcrion from the muscle package, not the one from the ape package.

Inspect

The Alignment file is a DNAMultipleAlignment object from the Biostrings package we can take a look at the alignment:

if we want to look at the whole alingment we can use de function detail()

BbAlign
## DNAMultipleAlignment with 45 rows and 1548 columns
##       aln                                                   names               
##  [1] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 1_NR_148750
##  [2] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 2_CP002746
##  [3] -------------------------...------------------------- 3_CP002746
##  [4] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 4_CP001205
##  [5] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 5_CP031412
##  [6] -------------------------...------------------------- 6_CP031412
##  [7] -------------------------...------------------------- 7_CP031412
##  [8] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 8_CP019767
##  [9] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 9_CP019916
##  ... ...
## [37] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 37_CP059009
## [38] -------------------------...------------------------- 38_CP059009
## [39] ATGACTAAGGGTTTGGCCCTTACCC...------------------------- 39_CP059009
## [40] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 40_CP017201
## [41] -------------------------...------------------------- 41_CP017201
## [42] -------------------------...------------------------- 42_CP017201
## [43] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 43_CP003866
## [44] -------------------------...------------------------- 44_CP003866
## [45] ATGACTAAGGGTTTGGCCCTTACCC...------------------------- 45_CP003866

One big problem here is the number of large gaps. These will create a big problem for the analysis. Let’s try removing sequences with too many gaps (denoted by - in the DNAStringSet file):

SeqLen<-as.numeric(lapply(BbHitsDNAstring,length))
library(ggplot2)
qplot(SeqLen)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the distribution, it looks like 1000-1500bp is a good cutoff:

We can subset our DNAStringSet object and re run the alignment

KeepSeq<-SeqLen>1000
BbSubset<-BbHitsDNAstring[KeepSeq,]

We should re run the alignment as there are some gaps in all of the sequences – these were created by aligning other sequences that had insertions that our new subset of sequences don’t have. Therefore we should redo our alignment. Even if we didn’t have gaps it would be a good idea to do this, since wonky sequences can throw off our alignment algorithms.

BbSubAlign<-muscle(BbSubset,quiet=T)
## Warning in file.remove(tempIn, tempOut): cannot remove file 'C:
## \Users\rob_c\AppData\Local\Temp\Rtmpg501zx\file446c63fa832.afa', reason
## 'Permission denied'
BbSubAlign
## DNAMultipleAlignment with 20 rows and 1536 columns
##       aln                                                   names               
##  [1] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 1_NR_148750
##  [2] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 2_CP002746
##  [3] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 4_CP001205
##  [4] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 5_CP031412
##  [5] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 8_CP019767
##  [6] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 9_CP019916
##  [7] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 10_CP019844
##  [8] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 11_CP074054
##  [9] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 14_CP005925
##  ... ...
## [12] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 21_CP002312
## [13] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 24_AE000783
## [14] --------GAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACC------ 25_NR_102956
## [15] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 26_CP028861
## [16] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 30_CP018744
## [17] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 34_CP007564
## [18] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 37_CP059009
## [19] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 40_CP017201
## [20] ATAACGAAGAGTTTGATCCTGGCTT...AAAGTGCGGCTGGATCACCTCCTTT 43_CP003866

let’s double check the sequence lenght:

SeqLen2<-as.numeric(lapply(BbSubset,length))
qplot(SeqLen2)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Analyze

If the alignments look reasonable, then we can look at the evolutionary relationships among the sequences. The first step is to make a pairwise distance matrix. That’s an n x n symmetrical matrix (where n = # sequences) of distances among sequences. The distance is a function of the number of differences and the model of evolution (e.g. 3rd codon substitutions evolve faster than 1st codon).

Distance matrix

The dist.dna() function from the ape package estimates a pairwise distance matrix from the sequence data. A model must be specified. There are lots of models to choose from, as outlined in the help ?dist.dna.

Since we are using the ape package again we have to convert our DNAMultipleAlignment object to a DNABin object using the as.DNAbin() function

BBSubAlignBin <- as.DNAbin(BbSubAlign)
BbDM<-dist.dna(BBSubAlignBin, model="K80")
class(BbDM)
## [1] "dist"
length(BbDM)
## [1] 190

We need to rearrange from an n x n matrix to a ‘linear’ matrix, which we can do using reshape() in baseR or the more intuitive melt() from the reshape2 package. To understand what this means, compare the structure of the data before and after.

BbDMmat<-as.matrix(BbDM)
dim(BbDMmat)
## [1] 20 20
View(BbDMmat)
library(reshape2)
PDat<-melt(BbDMmat)
dim(PDat)
## [1] 400   3
View(PDat)
ggplot(data = PDat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

Visualizing the matrix

The blue is ugly and hard to see the range of similarity, let’s try a custom palette:

ggplot(data = PDat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()+scale_fill_gradientn(colours=c("white","blue","green","red"))

We can see there are two sequences that are very different from the rest. It’s hard to read the labels, but we could inspect the linearized matrix for values > 0.2:

head(PDat[PDat$value>0.2,])
## [1] Var1  Var2  value
## <0 rows> (or 0-length row.names)

But it’s a bit confusing to find the two that sequences that are most repeated. An alternative is to adjust the x-axis labels on the graph so that we can read them.

ggplot(data = PDat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()+scale_fill_gradientn(colours=c("white","blue","green","red")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

We might want to remove these so that we can see the variation among the other sequences better. We could use grep to remove them from the matrix, but an easier way is just to replace the large values with NA:

PDat2<-PDat
PDat2$value[PDat2$value>0.2]<-NA
ggplot(data = PDat2, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()+scale_fill_gradientn(colours=c("white","blue","green","red")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Tree building

The distance matrix is the main component needed to build a phylogenetic tree. There are several options but they basically fall into three categories:

  1. The Neighbour-Joining (NJ) approach
  2. The Minimum Evolution (ME) approach
  3. The Hierarchical Clustering approach

We’ll just look at NJ for now but you should practice running all three methods.

BbTree<-nj(BbDM)
str(BbTree)
## List of 4
##  $ edge       : int [1:37, 1:2] 21 21 21 22 23 24 25 26 27 28 ...
##  $ edge.length: num [1:37] 1.32e-03 -9.57e-09 9.57e-09 9.57e-09 1.28e-08 ...
##  $ tip.label  : chr [1:20] "1_NR_148750" "2_CP002746" "4_CP001205" "5_CP031412" ...
##  $ Nnode      : int 18
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"
class(BbTree)
## [1] "phylo"

Now we have a phylogenetic tree! We can’t plot it using ggplot2, but there is a package called ggtree that will let us do that

library(ggtree)
## ggtree v3.2.1  For help: https://yulab-smu.top/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## 1. Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 2. Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution. 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194
## 3. Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:Biostrings':
## 
##     collapse
## The following object is masked from 'package:ape':
## 
##     rotate
## The following object is masked from 'package:IRanges':
## 
##     collapse
## The following object is masked from 'package:S4Vectors':
## 
##     expand
ggtree(BbTree)
## Warning: The tree contained negative edge lengths. If you want to ignore the edges,
## you can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.

Now we can see those two sequences that are very different from the rest, making everything else hard to see. The branch lengths in the above graph are based on the pairwise distance matrix, so the two very different sequences are being ‘pushed’ out onto their own branch. This is a phenomenon called long branch attraction (LBA).

We have a couple of options to deal with this.

First, we can remove the data from the two very different sequences. We could try

BbDM2<-BbDM
BbDM2[BbDM2>0.2]<-NA
BbTree2<-njs(BbDM2)

This doesn’t work because the distance matrix was calculated from the original alignment data. So we have to go all the way back to our subsetted sequences and removing the two problematic sequences, then re-run the alignment, distance matrix calculation, and tree build functions:

KeepSeq2<-grep("63_|66_",names(BbSubset),invert=T)
BbSubset2<-BbSubset[KeepSeq2,]
BbSubAlign2<-muscle(BbSubset2,quiet=T)
## Warning in file.remove(tempIn, tempOut): cannot remove file 'C:
## \Users\rob_c\AppData\Local\Temp\Rtmpg501zx\file446c25fe2dd4.afa', reason
## 'Permission denied'
BbDM2<-dist.dna(as.DNAbin(BbSubAlign2), model="K80")
BbTree2<-nj(BbDM2)
ggtree(BbTree2)
## Warning: The tree contained negative edge lengths. If you want to ignore the edges,
## you can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.

Alternatively, we can remove the branch length info to focus on the relationships:

ggtree(BbTree,branch.length='none')

Finally, we can play around with ggtree to change the appearance and aBb some useful information:

Try layout= 'rectangular', 'slanted', 'fan', 'circular', 'radial', 'equal_angle' or 'daylight'

ggtree(BbTree2,layout="rectangular") + geom_tiplab()
## Warning: The tree contained negative edge lengths. If you want to ignore the edges,
## you can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.

ggtree(BbTree2,layout="circular")
## Warning: The tree contained negative edge lengths. If you want to ignore the edges,
## you can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.

If you are having trouble reading the labels, try exporting to a pdf file, adjusting the width and height as needed:

pdf(width=8,height=4)
  ggtree(BbTree2,layout="radial") + geom_tiplab()
dev.off()

Saving Trees

Tree files are just text files that can be saved with the write.tree() function from the ape library, typically with a .tre extension to indicate what it is, even though it will open with any text editor:

write.tree(BbTree2,"Borrelia_burg_tree.tre")

Next Level

More information on ggtree() on the ggplot website

Once you are comfortable with this tutorial, there are additional steps to have a robust phylogentic analysis. There is a nice short tutorial with additional R tools on this page from the Molecular Ecologist Website

If you are looking at populations where different individuals can have the same allele (e.g. mitochondrial genes in humans), then you might want to produce a haplotype network from your distance matrix. There is a nice tutorial on this by John B Horne