Gel

Here is a picture of a gel testing the success of the PCR amplification from the 2018 field course. Note that a lot of bands seem to be missing, due to problems with the extraction, PCR, or loading of the samples. Gel Image

Quality Control

First, inspect the sequencing statistics. What data do you think each of the columns are showing?

  • Well
  • Chromatogram
  • Sample
  • DNA Type
  • Submission Lot
  • Project
  • Primer
  • DNA Type - Template Size
  • Comment
  • Length
  • Score
  • Capilary
  • Ok

Next, we will use an R package called sangerseqR inspect the chromatograph, sometimes called the ‘trace’ file. To make sense of the chromatograph you should review how Sanger Sequencing works.

The sangerseqR package is part of the bioconductor package for R, which has a lot of great tools for analyzing ‘omics’ data. For more information, check out the bioconductor website.

There are a lot of packages, so we’ll just install the sangerseqR package rather than trying to install everything in bioconductor.

source("https://bioconductor.org/biocLite.R")
biocLite(c("sangerseqR"))

Now load the library and read in the sequence for your sample:

library(sangerseqR)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
rbcL<-read.abif("./Data/DNA_Barcoding/ITS/1Ipos_F_P1815443_064.ab1")

We convert the ab1 file to a Sangerseq object using the sangerseq() function and inspect the structure of the data file

rbcLseq <- sangerseq(rbcL)
str(rbcLseq)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
##   ..@ primarySeqID  : chr "From ab1 file"
##   ..@ primarySeq    :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x0000000022598c78> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 1244
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ secondarySeqID: chr "From ab1 file"
##   ..@ secondarySeq  :Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   .. .. ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. .. .. ..@ xp                    :<externalptr> 
##   .. .. .. .. ..@ .link_to_cached_object:<environment: 0x0000000022598c78> 
##   .. .. ..@ offset         : int 0
##   .. .. ..@ length         : int 0
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   ..@ traceMatrix   : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ peakPosMatrix : num [1:1244, 1:4] 194 204 214 224 235 245 256 267 279 290 ...
##   ..@ peakAmpMatrix : logi [1, 1:4] NA NA NA NA

This is an example of a ‘class object’. You can see the @ to denote different elements of the object. This is similar to the way that $ denotes a column in a data.frame object. These are called ‘slots’ and can be used to subset the class object the same way $ can subset the data.fram object.

rbcLseq@primarySeqID
## [1] "From ab1 file"
rbcLseq@primarySeq
##   1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAG...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Here is an overview of the main slot elements:

  • primarySeqID – Identification of the primary sequence
  • primarySeq – The most likely sequence based on the florescence pattern
  • secondarySeqID – Secondary sequence ID. Secondary calls may not be present, but can occur when there is a signal for more than one base pair
  • secondarySeq – The secondary sequence
  • traceMatrix – A matrix containing the normalized signals for A,C,G,T.
  • peakPosMatrix – A matrix containing the position of the maximum peak values for each base.
  • peakAmpMatrix A matrix containing the maximum peak amplitudes for each base.

Now we can graph the trace.

chromatogram(rbcLseq, width = 200, height = 2, showcalls = "both")
## Warning in chromatogram(rbcLseq, width = 200, height = 2, showcalls =
## "both"): Secondary basecalls missing

You can use the trim5 and trim3 parameters to trim the unsequenced base pairs. We can use it with the width parameter to 1 ‘or to ’zoom in’. We can alsoLet’s look at the early part of the sequence:

chromatogram(rbcLseq, width = 50, height = 2, trim3=1100, showcalls = "both")
## Warning in chromatogram(rbcLseq, width = 50, height = 2, trim3 = 1100,
## showcalls = "both"): Secondary basecalls missing

What do you notice about the early part of the sequence?

Play around with the parameters until you have a good sequence. You may want to use the filename=graph.pdf parameter to save the graph to a pdf file that makes it easier to zoom in and inspect the graph.

A quick and easy method is to use the MakeBaseCalls() functions to ‘call’ the base pairs based on the strongest signals.

makeBaseCalls(rbcLseq)
## Number of datapoints: 14869
## Number of basecalls: 676
## 
## Primary Basecalls: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAACGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTTGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTCTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAAAGCCCAGCTTGAAAATCGGATGTCTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACACCCCCTTTGTACATAGCTGGTAATCCCATCGTCCACACTTTTCATTTTCCAATAAAAATCGCACTTCCGCCCCCCGGCTTCTTCGGTGGAACTCCCGGTTGAGATTACTCAAATGCCTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATAATATCTTTTTATCTTTAACCCCCCTTTAATCCCCATTGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
## 
## Secondary Basecalls: CTGTACGAGTCTGACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAAAGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTCGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTTTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAGAGCCCAGCTTGAAAATCGGATGTTTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACATCCCCTTTGTACAAAGCTGGTAATCCCATCGTCCACACTGTTCATGTTCCAATAAAATTCGCACTACCGCCCCCCTCCTTCTTCGGGGGAACTCCCGGTTGAGATTTCTCAAATGCTTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATATTATCTTTTTATCTTTAACCCCCCTTTAATCCCCATGGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT

BLAST

BLAST stand for Basic Local Alignment Search Tool, and there is a nice web interface for it at the address https://blast.ncbi.nlm.nih.gov/Blast.cgi

Take a moment to note the different types of BLAST. Why do you think there are so many different types?

We are using genes with fairly low mutation rates, so BLASTn is appropriate for finding a species-level match.

Simply copy and paste the sequence above into the sequence search box and you are ready to go!