Install the Bioconductor library.
Bioconductor is a package, or more accurately a group of packages in R for bioinformatics analysis. For more information, check out the bioconductor website. There are now so many components to Bioconductor, that it isn’t practical to install the whole thing at once. Instead, we install the Bioconductor Manager package, then the specific bioconductor sub-packages that we are going to work with: sangerseqR
and annotate
.
NOTE: When it asks Update all/some/none [a/s/n]
– choose n
or else R may spend a very long time updating.
install.packages("BiocManager")
library(BiocManager)
install(c("sangerseqR","annotate"))
Note,
install()
is a function from theBioconductor
package, whereasinstall.packages()
is a base function in R
Fasta files are just regular text files with a .fa
or .fasta
extension on the filename to indicate that the text follows a specific format. Go ahead and try opening a file with a text editor and see what it looks like. The text format is used to store any kind of sequence including protein sequences, coding and non-coding DNA sequences (CDS), RNA sequences. A fasta file may contain more than one sequence record, with each record having the same format:
> Header information goes here
Sequence data
For example:
> gene_0002737 length=268; type=dna
TCGCGGTGGCAGATTGCATCGAGAGAGCGCGCTCCAGGGAGCCTTTCTGCGTATGAAAT
GGCTGCGGCAACGCTCCCGGTTCTAGTAGATGCGACCAGGTGCGACCTGATTGCAGCTC
TGGTTTTTTGCCAGCTGTTCCACGACGCGCGGACCCTTGCGTATGATCAAGGTACGCAT
AGTCTATAATGAGTTAAACGCACACGGAAGAGGATTGCAGGCCAGACAAATCGCTAGGA
TACTATATGGGGACTCCAAGTGTAAAATACGC
> gene_0002738 length=361; type=dna
CCGTCAGCAGAGGGTAGCCATGTTCTAGTGCGGAGCGCCCATTGTGCACTCGCTTAGTA
AAGAACCTAAGCCACGGTCGGCGGAGGCTTACGAAGGGAATACCATGGAAATGGCATCC
TAGTTTTGCAGTCCGGTGCTCTCCTGGGTTTCTCGTGCACCAGGGGATTAGTGCTAGTC
ACCCGTAACGTCGTCGACCAGCGCGCTACCCCTACAAGCCCTTTTGGAGGTTCACAACC
TACGAAAAACAGCGCTTATTGTTCGACGCCCCCGCAGTCGTCCGTATGAACTTTCAAGC
AGATGCCCTGATGGATTTGAAGGCCGTGGGCGTAACCTTGCCGGCTACTATATGTGTCG
AAATGTA
> gene_0002739 length=1021; type=dna
AGTCTCCGGAGCAAGCGCGCATCTATATACGAAAGTCCTCCTCCTGTCTAATGCATCAC
GAGTATTCCGGTGCGTCCCGGGATGACCACCTCCCTTGTCACTGCTAGCGTTGTCTGTT
TGCGCGCAGTTCAGGACTGTTGTCGCGGCTCTGCGGCTCGAGAGACCTTATTAGCTATG
GGATAGACTAGCTTATGCGAACTCCCCCATTTCATTACGTATCATCATCCTGGGAGACG
GGCTAGCTGTCGTCGACCAGCAGTATTGAAGTCAGGCCAACACAAATAAACTAAATTCC
GAGTAGGAGTTAACCTGTACTCTTCTTGTATAGAGGATCAGATCCTATGCTGTTCGATC
TTGGCAGTTTACCTAGTCTGCAACCTTACGTAGATTGGATACGAATCACCCAAATGGAG
AATACGAATGAGATACTCGATCAGCTAATCGATGGATAGTCGCGGCTGATTGTCACCCA
GTCAAAGGCGAATCTTAGAGGTATACACCTATCAATTCATGCGACGTGCCTGGAGCCAG
GAGTTCGCGGAACGACTAACTTGAACTACACCGCCGGGTATTCTGGACTGCAGGAGGTG
GCTTTTGACACCATTTTGTCGTCTGATTACACTGCTCTAACTCGCTACCTAAAGCGCTG
TTAATGACTAATTCTAGCCACTGATTCTGAATAAAGCAACCAGACAGCCCACGACGAGG
TCCTATACGGGGATGATCAAGGTACTGATAAATTTGTTTGCGCAACGTATCCCTAGTAA
CGTGCTATAGCTTTGTGTACACCACCAGAATCATAATTGGTGGTAAATCCGGATCATTA
CGTTTGCTAACAGGTGCTGAAACCAGCCCCGCGTTACGTAAAGTAAGCACAAGTGCAAT
GGTCCAACTCGACCTGCTCGAGAAGTAGTCAATATTGAAATAGAGATCCAGCCCTTTAA
CACGATTCCTGGCAGAAGAAAGAGTAGCCAGTCAGCGTCCAGCCAGGTAGATGCCCTAA
TCTGATAGTGATTCAAGT
All of the above examples have only A, G, C or T, but sometimes you might want to include degenerate (a.k.a. ambiguous) bases. For example, you may have a mixture of genotypes in a sample, or you may have an ambiguous reading from a sequencer. There is a standard set of codes used by the International Union of Pure and Applied Chemistry (IUPAC) that is useful for sequence data:
Base | Code |
---|---|
A | Adenine |
G | Guanine |
C | Cytosine |
T | Thymine |
R | Purine (A or G) |
Y | Pyramidine (C or T) |
S | G or C |
N | Any/unknown (A, T, G or C) |
This tutorial works with real sequence data collected as part of a 2018 field course Field Methods in Ecological and Environmental Genomics. It demonstrates the use of ‘DNA barcodes’ for species identification. Each student collected and sequenced an unknown sample using primers targeting the rbcL and ITS loci. Read the protocol for more details.
Draw a flowchart of the protocol and compare with your group members. Imagine you are trying to give a general overview for a textbook figure. Try to distill the protocol down to its most essential elements.
Let’s make a folder called Data in your project working directory. Look for the link to 1Ipos_F_P1815443_064.ab1
and then download it to a folder called DNA_Barcoding in your Data directory. Note the file code:
First, inspect the sequencing statistics. What data do you think each of the columns are showing?
SeqStats<-read.csv("./Data/BarcodePlateStats.csv")
head(SeqStats)
## Well Chromatogram Sample DNA.Type Submission.Lot
## 1 A01 1r1_F_P1815443_015.ab1 1r1_F PURIFIED_PCR 213604
## 2 A02 1r2_F_P1815443_016.ab1 1r2_F PURIFIED_PCR 213604
## 3 A03 1r4_F_P1815443_031.ab1 1r4_F PURIFIED_PCR 213604
## 4 A04 1r5_F_P1815443_032.ab1 1r5_F PURIFIED_PCR 213604
## 5 A05 1I1_F_P1815443_047.ab1 1I1_F PURIFIED_PCR 213604
## 6 A06 1I2_F_P1815443_048.ab1 1I2_F PURIFIED_PCR 213604
## Project Primer DNA.Type...Template.Size Comment Length Score
## 1 PCRFieldCourse2018 rFW Purified PCR-750 NA 608 46.4194
## 2 PCRFieldCourse2018 rFW Purified PCR-750 NA 670 62.5612
## 3 PCRFieldCourse2018 rFW Purified PCR-750 NA 663 62.7753
## 4 PCRFieldCourse2018 rFW Purified PCR-750 NA 671 62.0477
## 5 PCRFieldCourse2018 IFW Purified PCR-550 NA 435 51.9425
## 6 PCRFieldCourse2018 IFW Purified PCR-550 NA 34 30.0000
## Capilary Ok
## 1 15 TRUE
## 2 16 TRUE
## 3 31 TRUE
## 4 32 TRUE
## 5 47 TRUE
## 6 48 FALSE
Next, we will use an R package called sangerseqR
inspect the chromatograph, sometimes called the ‘trace’ file. To make sense of the chromatograph you should review how Sanger sequencing works in the sequencing slides.
The sangerseqR
package is also part of the bioconductor and can be used to work with sanger sequence data.
Load the library and read in the sequence for the target sample:
library(sangerseqR)
## Warning: no function found corresponding to methods exports from 'XVector' for:
## 'concatenateObjects'
ITS<-read.abif("./Data/DNA_Barcoding/1Ipos_F_P1815443_064.ab1")
We convert the ab1 file to a Sangerseq object using the sangerseq()
function and inspect the structure of the data file
ITSseq <- sangerseq(ITS)
str(ITSseq)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
## ..@ primarySeqID : chr "From ab1 file"
## ..@ primarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 1244
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ secondarySeqID: chr "From ab1 file"
## ..@ secondarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 0
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ traceMatrix : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ peakPosMatrix : num [1:1244, 1:4] 194 204 214 224 235 245 256 267 279 290 ...
## ..@ peakAmpMatrix : logi [1, 1:4] NA NA NA NA
This is an example of a class object. You can see the @
to denote different elements of the object. This is similar to the way that $
denotes a column in a data.frame object. These are called ‘slots’ and can be used to subset the class object the same way $
can subset a data.frame
object.
ITSseq@primarySeqID
## [1] "From ab1 file"
ITSseq@primarySeq
## 1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAGGC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Here is an overview of the main slot elements:
Now we can graph the trace.
chromatogram(ITSseq, width = 250, height = 2, showcalls = "both")
## Warning in chromatogram(ITSseq, width = 250, height = 2, showcalls = "both"):
## Secondary basecalls missing
You can use the trim5
and trim3
parameters to trim the unsequenced base pairs. We can use it with the width
parameter to ‘zoom in’. Let’s look closer at the early part of the sequence:
chromatogram(ITSseq, width = 50, height = 2, trim3=1100, showcalls = "both")
## Warning in chromatogram(ITSseq, width = 50, height = 2, trim3 = 1100, showcalls
## = "both"): Secondary basecalls missing
What do you notice about the early part of the sequence?
For comparison, go back and try the same code with a ‘messy’ sequence, like 1I2_F_P1815443_048.ab1
.
Play around with the parameters until you have a good sequence. You may want to use the filename="graph.pdf"calcac
parameter to save the graph to a pdf file that makes it easier to zoom in and inspect the graph.
A quick and easy method is to use the MakeBaseCalls()
functions to ‘call’ the base pairs based on the strongest signals. This cuts out all the ambiguous peaks that probably represent noise in the fluoresence detectors.
SeqX<-makeBaseCalls(ITSseq)
print(SeqX)
## Number of datapoints: 14869
## Number of basecalls: 676
##
## Primary Basecalls: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAACGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTTGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTCTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAAAGCCCAGCTTGAAAATCGGATGTCTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACACCCCCTTTGTACATAGCTGGTAATCCCATCGTCCACACTTTTCATTTTCCAATAAAAATCGCACTTCCGCCCCCCGGCTTCTTCGGTGGAACTCCCGGTTGAGATTACTCAAATGCCTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATAATATCTTTTTATCTTTAACCCCCCTTTAATCCCCATTGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
##
## Secondary Basecalls: CTGTACGAGTCTGACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAAAGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTCGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTTTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAGAGCCCAGCTTGAAAATCGGATGTTTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACATCCCCTTTGTACAAAGCTGGTAATCCCATCGTCCACACTGTTCATGTTCCAATAAAATTCGCACTACCGCCCCCCTCCTTCTTCGGGGGAACTCCCGGTTGAGATTTCTCAAATGCTTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATATTATCTTTTTATCTTTAACCCCCCTTTAATCCCCATGGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
Note that there is a primary and secondary sequence. Sometimes these are the same, but they can be different at a few locations if there is more than one peak at the same location in the chromatogram.
What might cause more than one peak?
BLAST stand for Basic Local Alignment Search Tool, and there is a nice web interface for it at the address https://blast.ncbi.nlm.nih.gov/Blast.cgi
Take a moment to note the different types of BLAST in the figure below. Why do you think there are so many different types?
Figure credit: Shawn O'Neil: A Primer for Computational Biology
Recall that nucleotides translate to proteins using a three-letter tRNA code. This means that a any particular DNA sequence, like:
ATGCTGCTAGAT
Can be translated into protein in three reading frames:
ATGCTGCTAGAT
TGCTGCTAGAT
GCTGCTAGAT
That gives 3x possible translations of the same DNA sequence.
Where do the other 3x come from?
First take the reverse complement of the sequence, and then do the same thing:
ATCTAGCAGCAT
TCTAGCAGCAT
CTAGCAGCAT
We are using genes with fairly low mutation rates, so BLASTn
is appropriate for finding a species-level match.
You can simply copy and paste the sequence above into the sequence search box and you are ready to go. However, this quickly becomes impractical when you have more than a few sentences. That’s were the command line of R (or Python or Unix) comes in handy. The annotate
package from Bioconductor has some tools for conducting a BLAST search in R.
Then load the library:
library(annotate)
## Warning: package 'XML' was built under R version 3.5.2
If we wanted to automate the process, we should create a script that works for one sequence, then put it into a for loop to iterate through multiple sequences. First, pull together all of the lines of code that we need to get to the sequence and put them together in one place:
ITS<-read.abif("./Data/DNA_Barcoding/1Ipos_F_P1815443_064.ab1") # Read
ITSseq <- sangerseq(ITS) # Extract
SeqX<-makeBaseCalls(ITSseq) # Call
TEST: How could you modify and adapt these three lines of code to loop through each file in the ./Data/DNA_Barcoding folder?
Hint: You can use
list.files()
to return a vector of names of files in a particular folder.
At this point, we have a string, but it’s not yet the right format for the BLAST search. We have to get it down to a single sequence in FASTA format. Looking at the structure of the SeqX
S4 class:
str(SeqX)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
## ..@ primarySeqID : chr "sangerseq package primary basecalls"
## ..@ primarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 676
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ secondarySeqID: chr "sangerseq package secondary basecalls"
## ..@ secondarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 676
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ traceMatrix : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ peakPosMatrix : num [1:676, 1:4] 196 NA NA NA 236 NA NA 282 NA 304 ...
## ..@ peakAmpMatrix : num [1:676, 1:4] 77 0 0 0 338 0 0 201 0 9 ...
We can get the primary sequence with the slice:
SeqX@primarySeq
## 676-letter "DNAString" instance
## seq: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTC...TGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
compare with the original ‘raw’ sequence:
ITSseq@primarySeq
## 1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAGGC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Note the difference in length.
Now we just pass the ‘trimmed’ sequence to our BLAST function. It will take a few seconds to run in the online database. It may seem like a long time, but think about what is happening here. Your sequence is uploaded to the BLAST interface, where it is then compared to several millions to billions of sequences.
SeqXBlastDF<-blastSequences(paste(SeqX@primarySeq),as='data.frame')
## estimated response time 39 seconds
NOTE: The sequence(s) input into the BLAST search is(are) called the query sequence(s)
Looking at the number of rows and the headings gives shows a lot of detail in the BLAST results:
nrow(SeqXBlastDF)
## [1] 10
head(SeqXBlastDF)
(Output is very long; not shown here)
The object is the first 10 sequences that matched our ‘query’ sequence. Some of the key parameters are outlined on the BLAST glossary. The main columns are:
Inspecting the scores and evalues shows that all 10 hits are good matches. We may want to consider increasing the number of hits to retain by modifying parameters in the blastSequences()
function.