Setup

Install the Bioconductor library.

Bioconductor is a package, or more accurately a group of packages in R for bioinformatics analysis. For more information, check out the bioconductor website. There are now so many components to Bioconductor, that it isn’t practical to install the whole thing at once. Instead, we install the Bioconductor Manager package, then the specific bioconductor sub-packages that we are going to work with: sangerseqR and annotate.

NOTE: When it asks Update all/some/none [a/s/n] – choose n or else R may spend a very long time updating.

install.packages("BiocManager")
library(BiocManager)
install(c("sangerseqR","annotate")) 

Note, install() is a function from the Bioconductor package, whereas install.packages() is a base function in R

FASTA

Fasta files are just regular text files with a .fa or .fasta extension on the filename to indicate that the text follows a specific format. Go ahead and try opening a file with a text editor and see what it looks like. The text format is used to store any kind of sequence including protein sequences, coding and non-coding DNA sequences (CDS), RNA sequences. A fasta file may contain more than one sequence record, with each record having the same format:

  • Row 1: > Header information goes here
  • Row 2+: Sequence data
  • 1+ BLANK LINE: Separates individual sequences (records)

For example:

> gene_0002737 length=268; type=dna
TCGCGGTGGCAGATTGCATCGAGAGAGCGCGCTCCAGGGAGCCTTTCTGCGTATGAAAT
GGCTGCGGCAACGCTCCCGGTTCTAGTAGATGCGACCAGGTGCGACCTGATTGCAGCTC
TGGTTTTTTGCCAGCTGTTCCACGACGCGCGGACCCTTGCGTATGATCAAGGTACGCAT
AGTCTATAATGAGTTAAACGCACACGGAAGAGGATTGCAGGCCAGACAAATCGCTAGGA
TACTATATGGGGACTCCAAGTGTAAAATACGC

> gene_0002738 length=361; type=dna
CCGTCAGCAGAGGGTAGCCATGTTCTAGTGCGGAGCGCCCATTGTGCACTCGCTTAGTA
AAGAACCTAAGCCACGGTCGGCGGAGGCTTACGAAGGGAATACCATGGAAATGGCATCC
TAGTTTTGCAGTCCGGTGCTCTCCTGGGTTTCTCGTGCACCAGGGGATTAGTGCTAGTC
ACCCGTAACGTCGTCGACCAGCGCGCTACCCCTACAAGCCCTTTTGGAGGTTCACAACC
TACGAAAAACAGCGCTTATTGTTCGACGCCCCCGCAGTCGTCCGTATGAACTTTCAAGC
AGATGCCCTGATGGATTTGAAGGCCGTGGGCGTAACCTTGCCGGCTACTATATGTGTCG
AAATGTA

> gene_0002739 length=1021; type=dna
AGTCTCCGGAGCAAGCGCGCATCTATATACGAAAGTCCTCCTCCTGTCTAATGCATCAC
GAGTATTCCGGTGCGTCCCGGGATGACCACCTCCCTTGTCACTGCTAGCGTTGTCTGTT
TGCGCGCAGTTCAGGACTGTTGTCGCGGCTCTGCGGCTCGAGAGACCTTATTAGCTATG
GGATAGACTAGCTTATGCGAACTCCCCCATTTCATTACGTATCATCATCCTGGGAGACG
GGCTAGCTGTCGTCGACCAGCAGTATTGAAGTCAGGCCAACACAAATAAACTAAATTCC
GAGTAGGAGTTAACCTGTACTCTTCTTGTATAGAGGATCAGATCCTATGCTGTTCGATC
TTGGCAGTTTACCTAGTCTGCAACCTTACGTAGATTGGATACGAATCACCCAAATGGAG
AATACGAATGAGATACTCGATCAGCTAATCGATGGATAGTCGCGGCTGATTGTCACCCA
GTCAAAGGCGAATCTTAGAGGTATACACCTATCAATTCATGCGACGTGCCTGGAGCCAG
GAGTTCGCGGAACGACTAACTTGAACTACACCGCCGGGTATTCTGGACTGCAGGAGGTG
GCTTTTGACACCATTTTGTCGTCTGATTACACTGCTCTAACTCGCTACCTAAAGCGCTG
TTAATGACTAATTCTAGCCACTGATTCTGAATAAAGCAACCAGACAGCCCACGACGAGG
TCCTATACGGGGATGATCAAGGTACTGATAAATTTGTTTGCGCAACGTATCCCTAGTAA
CGTGCTATAGCTTTGTGTACACCACCAGAATCATAATTGGTGGTAAATCCGGATCATTA
CGTTTGCTAACAGGTGCTGAAACCAGCCCCGCGTTACGTAAAGTAAGCACAAGTGCAAT
GGTCCAACTCGACCTGCTCGAGAAGTAGTCAATATTGAAATAGAGATCCAGCCCTTTAA
CACGATTCCTGGCAGAAGAAAGAGTAGCCAGTCAGCGTCCAGCCAGGTAGATGCCCTAA
TCTGATAGTGATTCAAGT

IUPAC Codes

All of the above examples have only A, G, C or T, but sometimes you might want to include degenerate (a.k.a. ambiguous) bases. For example, you may have a mixture of genotypes in a sample, or you may have an ambiguous reading from a sequencer. There is a standard set of codes used by the International Union of Pure and Applied Chemistry (IUPAC) that is useful for sequence data:

Base Code
A Adenine
G Guanine
C Cytosine
T Thymine
R Purine (A or G)
Y Pyramidine (C or T)
S G or C
N Any/unknown (A, T, G or C)

Background

This tutorial works with real sequence data collected as part of a 2018 field course Field Methods in Ecological and Environmental Genomics. It demonstrates the use of ‘DNA barcodes’ for species identification. Each student collected and sequenced an unknown sample using primers targeting the rbcL and ITS loci. Read the protocol for more details.

Draw a flowchart of the protocol and compare with your group members. Imagine you are trying to give a general overview for a textbook figure. Try to distill the protocol down to its most essential elements.

Download

Let’s make a folder called Data in your project working directory. Look for the link to 1Ipos_F_P1815443_064.ab1 and then download it to a folder called DNA_Barcoding in your Data directory. Note the file code:

  • 1-4 is a group number
  • I tells us it’s the ITS gene (r for rbcL)
  • F and R tell us whether the gene was sequenced with the Forward or Reverse primer.
  • The P1815443 is a code for the specific project run
  • The final 3-digit number is the well code (from a 96-well plate)
  • The .ab1 is the file type, which we’ll explore below

Quality Control

First, inspect the sequencing statistics. What data do you think each of the columns are showing?

SeqStats<-read.csv("./Data/BarcodePlateStats.csv")
head(SeqStats)
##   Well           Chromatogram Sample     DNA.Type Submission.Lot
## 1  A01 1r1_F_P1815443_015.ab1  1r1_F PURIFIED_PCR         213604
## 2  A02 1r2_F_P1815443_016.ab1  1r2_F PURIFIED_PCR         213604
## 3  A03 1r4_F_P1815443_031.ab1  1r4_F PURIFIED_PCR         213604
## 4  A04 1r5_F_P1815443_032.ab1  1r5_F PURIFIED_PCR         213604
## 5  A05 1I1_F_P1815443_047.ab1  1I1_F PURIFIED_PCR         213604
## 6  A06 1I2_F_P1815443_048.ab1  1I2_F PURIFIED_PCR         213604
##              Project Primer DNA.Type...Template.Size Comment Length   Score
## 1 PCRFieldCourse2018    rFW         Purified PCR-750      NA    608 46.4194
## 2 PCRFieldCourse2018    rFW         Purified PCR-750      NA    670 62.5612
## 3 PCRFieldCourse2018    rFW         Purified PCR-750      NA    663 62.7753
## 4 PCRFieldCourse2018    rFW         Purified PCR-750      NA    671 62.0477
## 5 PCRFieldCourse2018    IFW         Purified PCR-550      NA    435 51.9425
## 6 PCRFieldCourse2018    IFW         Purified PCR-550      NA     34 30.0000
##   Capilary    Ok
## 1       15  TRUE
## 2       16  TRUE
## 3       31  TRUE
## 4       32  TRUE
## 5       47  TRUE
## 6       48 FALSE
  • Well – The well (i.e. location) on the 96-well plate
  • Chromatogram – File name of the .ab1 file
  • Sample – code for the sample (see above)
  • DNA Type – purified PCR product or distilled water
  • Submission Lot – A code number for the entire submission group
  • Project – A name for the sequencing project
  • Primer – The name of the sequencing primer used (I for ITS, r for rbcL, followd by F or R for forward or reverse, respectively)
  • DNA Type – Type of sample, with template size in # of base pairs
  • Comment
  • Length – Length of the sequence
  • Score – Overall quality score
  • Capilary – Number of the capillary used in electrophoresis
  • Ok – Did the sample pass quality check?

Next, we will use an R package called sangerseqR inspect the chromatograph, sometimes called the ‘trace’ file. To make sense of the chromatograph you should review how Sanger sequencing works in the sequencing slides.

The sangerseqR package is also part of the bioconductor and can be used to work with sanger sequence data.

Load the library and read in the sequence for the target sample:

library(sangerseqR)
## Warning: no function found corresponding to methods exports from 'XVector' for:
## 'concatenateObjects'
ITS<-read.abif("./Data/DNA_Barcoding/1Ipos_F_P1815443_064.ab1")

We convert the ab1 file to a Sangerseq object using the sangerseq() function and inspect the structure of the data file

ITSseq <- sangerseq(ITS)
str(ITSseq)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
##   ..@ primarySeqID  : chr "From ab1 file"
##   ..@ primarySeq    :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 1244
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ secondarySeqID: chr "From ab1 file"
##   ..@ secondarySeq  :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 0
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ traceMatrix   : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ peakPosMatrix : num [1:1244, 1:4] 194 204 214 224 235 245 256 267 279 290 ...
##   ..@ peakAmpMatrix : logi [1, 1:4] NA NA NA NA

This is an example of a class object. You can see the @ to denote different elements of the object. This is similar to the way that $ denotes a column in a data.frame object. These are called ‘slots’ and can be used to subset the class object the same way $ can subset a data.frame object.

ITSseq@primarySeqID
## [1] "From ab1 file"
ITSseq@primarySeq
##   1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAGGC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Here is an overview of the main slot elements:

  • primarySeqID – Identification of the primary sequence
  • primarySeq – The most likely sequence based on the florescence pattern
  • secondarySeqID – Secondary sequence ID. Secondary calls may not be present, but can occur when there is a signal for more than one base pair
  • secondarySeq – The secondary sequence
  • traceMatrix – A matrix containing the normalized signals for A,C,G,T.
  • peakPosMatrix – A matrix containing the position of the maximum peak values for each base.
  • peakAmpMatrix A matrix containing the maximum peak amplitudes for each base.

Now we can graph the trace.

chromatogram(ITSseq, width = 250, height = 2, showcalls = "both")
## Warning in chromatogram(ITSseq, width = 250, height = 2, showcalls = "both"):
## Secondary basecalls missing

You can use the trim5 and trim3 parameters to trim the unsequenced base pairs. We can use it with the width parameter to ‘zoom in’. Let’s look closer at the early part of the sequence:

chromatogram(ITSseq, width = 50, height = 2, trim3=1100, showcalls = "both")
## Warning in chromatogram(ITSseq, width = 50, height = 2, trim3 = 1100, showcalls
## = "both"): Secondary basecalls missing

What do you notice about the early part of the sequence?

For comparison, go back and try the same code with a ‘messy’ sequence, like 1I2_F_P1815443_048.ab1.

Play around with the parameters until you have a good sequence. You may want to use the filename="graph.pdf"calcac parameter to save the graph to a pdf file that makes it easier to zoom in and inspect the graph.

A quick and easy method is to use the MakeBaseCalls() functions to ‘call’ the base pairs based on the strongest signals. This cuts out all the ambiguous peaks that probably represent noise in the fluoresence detectors.

SeqX<-makeBaseCalls(ITSseq)
print(SeqX)
## Number of datapoints: 14869
## Number of basecalls: 676
## 
## Primary Basecalls: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAACGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTTGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTCTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAAAGCCCAGCTTGAAAATCGGATGTCTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACACCCCCTTTGTACATAGCTGGTAATCCCATCGTCCACACTTTTCATTTTCCAATAAAAATCGCACTTCCGCCCCCCGGCTTCTTCGGTGGAACTCCCGGTTGAGATTACTCAAATGCCTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATAATATCTTTTTATCTTTAACCCCCCTTTAATCCCCATTGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
## 
## Secondary Basecalls: CTGTACGAGTCTGACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAAAGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTCGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTTTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAGAGCCCAGCTTGAAAATCGGATGTTTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACATCCCCTTTGTACAAAGCTGGTAATCCCATCGTCCACACTGTTCATGTTCCAATAAAATTCGCACTACCGCCCCCCTCCTTCTTCGGGGGAACTCCCGGTTGAGATTTCTCAAATGCTTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATATTATCTTTTTATCTTTAACCCCCCTTTAATCCCCATGGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT

Note that there is a primary and secondary sequence. Sometimes these are the same, but they can be different at a few locations if there is more than one peak at the same location in the chromatogram.

What might cause more than one peak?

BLAST

BLAST stand for Basic Local Alignment Search Tool, and there is a nice web interface for it at the address https://blast.ncbi.nlm.nih.gov/Blast.cgi

Take a moment to note the different types of BLAST in the figure below. Why do you think there are so many different types?

Figure credit: Shawn O'Neil: A Primer for Computational Biology

Recall that nucleotides translate to proteins using a three-letter tRNA code. This means that a any particular DNA sequence, like:

ATGCTGCTAGAT

Can be translated into protein in three reading frames:

Frame 1:

ATGCTGCTAGAT

Frame 2:

TGCTGCTAGAT

Frame 3:

GCTGCTAGAT

That gives 3x possible translations of the same DNA sequence.

Where do the other 3x come from?

First take the reverse complement of the sequence, and then do the same thing:

Frame 4:

ATCTAGCAGCAT

Frame 5:

TCTAGCAGCAT

Frame 6:

CTAGCAGCAT

BLASTn

We are using genes with fairly low mutation rates, so BLASTn is appropriate for finding a species-level match.

You can simply copy and paste the sequence above into the sequence search box and you are ready to go. However, this quickly becomes impractical when you have more than a few sentences. That’s were the command line of R (or Python or Unix) comes in handy. The annotate package from Bioconductor has some tools for conducting a BLAST search in R.

Then load the library:

library(annotate)
## Warning: package 'XML' was built under R version 3.5.2

If we wanted to automate the process, we should create a script that works for one sequence, then put it into a for loop to iterate through multiple sequences. First, pull together all of the lines of code that we need to get to the sequence and put them together in one place:

ITS<-read.abif("./Data/DNA_Barcoding/1Ipos_F_P1815443_064.ab1") # Read
ITSseq <- sangerseq(ITS) # Extract
SeqX<-makeBaseCalls(ITSseq) # Call 

TEST: How could you modify and adapt these three lines of code to loop through each file in the ./Data/DNA_Barcoding folder?

Hint: You can use list.files() to return a vector of names of files in a particular folder.

At this point, we have a string, but it’s not yet the right format for the BLAST search. We have to get it down to a single sequence in FASTA format. Looking at the structure of the SeqX S4 class:

str(SeqX)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
##   ..@ primarySeqID  : chr "sangerseq package primary basecalls"
##   ..@ primarySeq    :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 676
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ secondarySeqID: chr "sangerseq package secondary basecalls"
##   ..@ secondarySeq  :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x000000001f6a7208> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 676
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ traceMatrix   : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ peakPosMatrix : num [1:676, 1:4] 196 NA NA NA 236 NA NA 282 NA 304 ...
##   ..@ peakAmpMatrix : num [1:676, 1:4] 77 0 0 0 338 0 0 201 0 9 ...

We can get the primary sequence with the slice:

SeqX@primarySeq
##   676-letter "DNAString" instance
## seq: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTC...TGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT

compare with the original ‘raw’ sequence:

ITSseq@primarySeq
##   1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAGGC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Note the difference in length.

Now we just pass the ‘trimmed’ sequence to our BLAST function. It will take a few seconds to run in the online database. It may seem like a long time, but think about what is happening here. Your sequence is uploaded to the BLAST interface, where it is then compared to several millions to billions of sequences.

SeqXBlastDF<-blastSequences(paste(SeqX@primarySeq),as='data.frame')
## estimated response time 39 seconds

NOTE: The sequence(s) input into the BLAST search is(are) called the query sequence(s)

Looking at the number of rows and the headings gives shows a lot of detail in the BLAST results:

nrow(SeqXBlastDF)
## [1] 10
head(SeqXBlastDF)

(Output is very long; not shown here)

The object is the first 10 sequences that matched our ‘query’ sequence. Some of the key parameters are outlined on the BLAST glossary. The main columns are:

  • Hit_accession – The accession code for each match
  • Hit_len – The length of the matching sequence, in number of base pairs
  • Hit_def – The name of the accession matching our query sequence
  • Hsp_score – A score representing the strength of the match. HSP stands for ‘High-scoring Segment Pair’. The details of how the score is calculated are a bit complicated, but the key is that the higher number represents a stronger match.
  • Hsp_evalue – An e-value is another way to measure the strength of the hit. It represents the probability of getting a match just by chance alone. Therefore smaller values generally indicate stronger matches.
  • Hsp_gaps – The number of gaps between the query sequence and its match.
  • Hsp_qseq – The query sequence (same for each hit)
  • Hsp_hseq – The ‘hit’ sequence

Inspecting the scores and evalues shows that all 10 hits are good matches. We may want to consider increasing the number of hits to retain by modifying parameters in the blastSequences() function.