Here is a picture of a gel testing the success of the PCR amplification from the 2018 field course. Note that a lot of bands seem to be missing, due to problems with the extraction, PCR, or loading of the samples.
A list of sequences is below. Let’s start by looking for the link to 1Ipos_F_P1815443_064.ab1
and then download it to your working directory. Note the file code:
First, inspect the sequencing statistics. What data do you think each of the columns are showing?
Next, we will use an R package called sangerseqR
inspect the chromatograph, sometimes called the ‘trace’ file. To make sense of the chromatograph you should review how Sanger Sequencing works.
The sangerseqR
package is part of the bioconductor package for R, which has a lot of great tools for analyzing ‘omics’ data. For more information, check out the bioconductor website.
There are a lot of packages, so we’ll just install the sangerseqR package rather than trying to install everything in bioconductor.
source("https://bioconductor.org/biocLite.R")
biocLite(c("sangerseqR"))
Now load the library and read in the sequence for your sample:
library(sangerseqR)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
rbcL<-read.abif("./Data/DNA_Barcoding/ITS/1Ipos_F_P1815443_064.ab1")
We convert the ab1 file to a Sangerseq object using the sangerseq()
function and inspect the structure of the data file
rbcLseq <- sangerseq(rbcL)
str(rbcLseq)
## Formal class 'sangerseq' [package "sangerseqR"] with 7 slots
## ..@ primarySeqID : chr "From ab1 file"
## ..@ primarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x0000000022598c78>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 1244
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ secondarySeqID: chr "From ab1 file"
## ..@ secondarySeq :Formal class 'DNAString' [package "Biostrings"] with 5 slots
## .. .. ..@ shared :Formal class 'SharedRaw' [package "XVector"] with 2 slots
## .. .. .. .. ..@ xp :<externalptr>
## .. .. .. .. ..@ .link_to_cached_object:<environment: 0x0000000022598c78>
## .. .. ..@ offset : int 0
## .. .. ..@ length : int 0
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ traceMatrix : int [1:14869, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ peakPosMatrix : num [1:1244, 1:4] 194 204 214 224 235 245 256 267 279 290 ...
## ..@ peakAmpMatrix : logi [1, 1:4] NA NA NA NA
This is an example of a ‘class object’. You can see the @
to denote different elements of the object. This is similar to the way that $
denotes a column in a data.frame object. These are called ‘slots’ and can be used to subset the class object the same way $
can subset the data.fram object.
rbcLseq@primarySeqID
## [1] "From ab1 file"
rbcLseq@primarySeq
## 1244-letter "DNAString" instance
## seq: CGGTNANGAGTCTTTGACGCAGTTGCGCCCCAAG...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Here is an overview of the main slot elements:
Now we can graph the trace.
chromatogram(rbcLseq, width = 200, height = 2, showcalls = "both")
## Warning in chromatogram(rbcLseq, width = 200, height = 2, showcalls =
## "both"): Secondary basecalls missing
You can use the trim5
and trim3
parameters to trim the unsequenced base pairs. We can use it with the width
parameter to 1 ‘or to ’zoom in’. We can alsoLet’s look at the early part of the sequence:
chromatogram(rbcLseq, width = 50, height = 2, trim3=1100, showcalls = "both")
## Warning in chromatogram(rbcLseq, width = 50, height = 2, trim3 = 1100,
## showcalls = "both"): Secondary basecalls missing
What do you notice about the early part of the sequence?
Play around with the parameters until you have a good sequence. You may want to use the filename=graph.pdf
parameter to save the graph to a pdf file that makes it easier to zoom in and inspect the graph.
A quick and easy method is to use the MakeBaseCalls()
functions to ‘call’ the base pairs based on the strongest signals.
makeBaseCalls(rbcLseq)
## Number of datapoints: 14869
## Number of basecalls: 676
##
## Primary Basecalls: CGGTCTGAGTCTTACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAACGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTTGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTCTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAAAGCCCAGCTTGAAAATCGGATGTCTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACACCCCCTTTGTACATAGCTGGTAATCCCATCGTCCACACTTTTCATTTTCCAATAAAAATCGCACTTCCGCCCCCCGGCTTCTTCGGTGGAACTCCCGGTTGAGATTACTCAAATGCCTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATAATATCTTTTTATCTTTAACCCCCCTTTAATCCCCATTGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
##
## Secondary Basecalls: CTGTACGAGTCTGACGCAGTTGCGCCCCATGCCTTCTGGCCGAGGGCACGTCTGCCTGGGTGTCACAAATCGTCGTCCCCCAATCCTCATGAGGATATGGGACGGAAGCTGGTCTCCCGTGTTGTACCGAAAGCGGTTGGCCAAAATCCGAGCTAAGGACGCCAGGAGCGTCTCGACATGCGGTGGTGAATTCAAGCCTCTTGATTTTGTCGGTCGCTCTTGTCCGGAAGCTTTTGATGACCCAAAGTCCTCAATGCGACCCCAGGTCAGGCGGGATCACCCGCTGAGTTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACTAGGATTCCCTTAGTAACGGCGAGCGAACCGGGAAGAGCCCAGCTTGAAAATCGGATGTTTTTGGCGTTCGAATTGTAGTCTGAGAGCGTCATTTTGGCGGCTCATGTGCAACATCCCCTTTGTACAAAGCTGGTAATCCCATCGTCCACACTGTTCATGTTCCAATAAAATTCGCACTACCGCCCCCCTCCTTCTTCGGGGGAACTCCCGGTTGAGATTTCTCAAATGCTTCCCAATATCCTATCTTGGTTTCTAATTCGAGGTATATTATCTTTTTATCTTTAACCCCCCTTTAATCCCCATGGCTTTATCTCTGTGGCTTAATTACCCCCCCTCTCTGGCTTCCCTTT
BLAST stand for Basic Local Alignment Search Tool, and there is a nice web interface for it at the address https://blast.ncbi.nlm.nih.gov/Blast.cgi
Take a moment to note the different types of BLAST. Why do you think there are so many different types?
We are using genes with fairly low mutation rates, so BLASTn is appropriate for finding a species-level match.
Simply copy and paste the sequence above into the sequence search box and you are ready to go!