Note: Installation of packages may take several minutes.
See transcriptomics presentation
A typical transcriptome analysis will begin with raw sequences that must be processed and aligned to a reference genome. This is typically done on a high-performance computer, so we will start with our aligned sequences for this tutorial.
Transcriptome sequencing is very often used to compare expression profiles across samples, under the reasonable assumption that the number of reads (sequences) is proportional to the abundance of RNA in the initial sample. Thus, we can identify genes that are upregulated (higher abundance) or downregulated (lower abundance) relative to a control or other reference sample(s). This comparison is called a differential expression analysis.
For example, we might compare genes or isoforms in different experimental treatments or test groups. In addition to comparing two samples, we can also look at changes in gene expression over time. Examples might include expression profiling to identify how genes differ among samples or respond to different external stimula. For example:
There are multiple tools for analysis of RNA expression data such as the Tuxedo pipeline covered in this tutorial, which requires a reference genome.
Programs such as kallisto and salmon need prior transcriptome asssemblies, but are extremely fast.
Other pipelines like Trinity do not need a reference genome or a priori transcriptome information.
Steps 1 to 4 are not coverd in this tutorial. Instead we will focus on counting and visualizing differentially expressed genes using DESeq2. Most of the material is taken from the DESeq2 vignette. If you are interested in learning about the upstream sequence processing, see this Colautti Lab alignment tutorial.
NOTE: When asked to update all/some/none? [a/s/n] choose: n
BiocManager::install("IRanges")
# Main analysis
BiocManager::install("DESeq2")
# for heatmap visualization
BiocManager::install("pheatmap")
# for GO analysis
BiocManager::install("org.Hs.eg.db")
library(DESeq2)
We will be using a prepared dataset from Recount containing RNAseq data on human liver transcriptomes. We will be using this data to compare expression profiles between human males and females. The source paper for this data is Blekman et al. 2010
The counts and treatment information is all included in the RData file as an ExpressionSet
object – a class of object in R created by the ExpressionSet()
function.
Make a new project and create a folder called Data
. Then, download the file and savit it to the Data
folder in your local working directory. This data is an S4 object, so we import it with the load
function – this is the equivalent of read.csv
for a data.frame object.
load("Data/gilad_eset.RData")
Note that we aren’t assigning an object in this case. However, something was added to our R environment. We can use the ls()
function to see:
ls()
## [1] "gilad.eset"
We can also look in the ‘Environment’ tab in R Studio.
Let’s take a look at the structure:
str(gilad.eset)
## Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
## ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
## .. .. ..@ name : chr ""
## .. .. ..@ lab : chr ""
## .. .. ..@ contact : chr ""
## .. .. ..@ title : chr ""
## .. .. ..@ abstract : chr ""
## .. .. ..@ url : chr ""
## .. .. ..@ pubMedIds : chr ""
## .. .. ..@ samples : list()
## .. .. ..@ hybridizations : list()
## .. .. ..@ normControls : list()
## .. .. ..@ preprocessing : list()
## .. .. ..@ other : list()
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 2
## .. .. .. .. .. ..$ : int [1:3] 1 0 0
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ assayData :<environment: 0x00000000259e6ea0>
## ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 3 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr [1:3] NA NA NA
## .. .. ..@ data :'data.frame': 6 obs. of 3 variables:
## .. .. .. ..$ sample.id : Factor w/ 6 levels "SRX014818and9",..: 1 2 3 4 5 6
## .. .. .. ..$ num.tech.reps: num [1:6] 2 2 2 2 2 2
## .. .. .. ..$ gender : Factor w/ 2 levels "F","M": 1 1 1 2 2 2
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 1 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr NA
## .. .. ..@ data :'data.frame': 52580 obs. of 1 variable:
## .. .. .. ..$ gene: Factor w/ 52580 levels "ENSG00000000003",..: 1 2 3 4 5 6 7 8 9 10 ...
## .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ annotation : chr(0)
## ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
## .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
## .. .. .. ..$ labelDescription: chr(0)
## .. .. ..@ data :'data.frame': 6 obs. of 0 variables
## .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
## .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. .. .. ..@ .Data:List of 1
## .. .. .. .. .. ..$ : int [1:3] 1 1 0
## ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
## .. .. ..@ .Data:List of 4
## .. .. .. ..$ : int [1:3] 2 13 0
## .. .. .. ..$ : int [1:3] 2 12 1
## .. .. .. ..$ : int [1:3] 1 3 0
## .. .. .. ..$ : int [1:3] 1 0 0
Alternatively, click the litle bubble next to gilad.eset
in the ‘Environment’ tab. Note the large file size – why do you think it’s so big (what are the main data stored here)?
All the different slots beginning with @
contain information about the experiment. You can access many (but not all) of these directly by applying the name as a function. For example, look for @phenoData
, which has a subslot called @dimLabels
, which itself has two levels: sampleNames
and sampleColumns
. Try running these as functions:
phenoData(gilad.eset)
## An object of class 'AnnotatedDataFrame'
## sampleNames: SRX014818and9 SRX014820and1 ... SRX014828and9 (6
## total)
## varLabels: sample.id num.tech.reps gender
## varMetadata: labelDescription
dimLabels(gilad.eset)
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dimLabels' for signature '"ExpressionSet"'
sampleNames(gilad.eset)
## [1] "SRX014818and9" "SRX014820and1" "SRX014822and3" "SRX014824and5"
## [5] "SRX014826and7" "SRX014828and9"
This contains the names of our samples, but what do these names mean? The SRX code suggests it’s a short-read sequence dataset, but why are there two numbers (e.g. SRX01418and9). Some details in the paper might give us a hint.
We can also find the experimental data in the phenoData
slot.
pData(gilad.eset)
## sample.id num.tech.reps gender
## SRX014818and9 SRX014818and9 2 F
## SRX014820and1 SRX014820and1 2 F
## SRX014822and3 SRX014822and3 2 F
## SRX014824and5 SRX014824and5 2 M
## SRX014826and7 SRX014826and7 2 M
## SRX014828and9 SRX014828and9 2 M
Note that each sample has a number 2 for the column called ‘num.tech.reps’. This is probably the number of technical replicates, which means that the same sample was prepared twice, and then pooled (mixed together) before sequencing.
Note also that the first three samples are females and the last three are males.
The transcript counts can be viewed with exprs
.
head(exprs(gilad.eset))
## SRX014818and9 SRX014820and1 SRX014822and3 SRX014824and5
## ENSG00000000003 60 60 16 9
## ENSG00000000005 0 0 0 0
## ENSG00000000419 25 9 15 15
## ENSG00000000457 32 19 21 31
## ENSG00000000460 1 3 0 5
## ENSG00000000938 6 7 8 4
## SRX014826and7 SRX014828and9
## ENSG00000000003 56 37
## ENSG00000000005 0 0
## ENSG00000000419 26 11
## ENSG00000000457 28 28
## ENSG00000000460 1 1
## ENSG00000000938 4 1
What does each column represent? What about each row?
To analyze and graph the data using the DESeq2
package, we have to change the data format from an ExpressionSet
to a DESeqDataSet
object and identify the factor in the experimental design that we want to compare for our differential expression analysis. This requires a few steps.
First, convert the ExpressionSet
to a SummarizedExperiment
expr_se <- SummarizedExperiment(exprs(gilad.eset))
head(expr_se)
## class: SummarizedExperiment
## dim: 6 6
## metadata(0):
## assays(1): ''
## rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460
## ENSG00000000938
## rowData names(0):
## colnames(6): SRX014818and9 SRX014820and1 ... SRX014826and7
## SRX014828and9
## colData names(0):
Second, we associate the count data with each of the treatments based on the sample.id:
colData(expr_se) <- DataFrame(pData(gilad.eset))
Third, convert from SummarizedExperiment
to DESeq
expr_de <- DESeqDataSet(expr_se, design = ~ gender)
## renaming the first element in assays to 'counts'
Note: it is also possible to use design = ~1 above and set the design afterwards with design(expr_de). This could be useful if you had different comparisons you wanted to make on the same data.
Finally, we are ready to analyze the data by running the DESeq
function on the DESeqDataSet
object. The model below will take a few seconds to run. The details of this analysis are a bit complicated, but worth reading in detail if you want to do this for a project or publication (use ?DESeq
and look up the references therein).
The analysis is complicated because sequencing is essentially a process of random sampling, so two genes may look like they are differentially expressed between two samples just because one sample has more overall RNA or because one sample has another gene ‘hogging’ all the sequence reads. So it’s not possible to compare one gene across samples without also considering all the other genes in those samples.
de_results <- DESeq(expr_de)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Once the analysis is done, we can extract a table of results using the results()
function. This is useful for other downstream analyses including visualization.
DEdat <- results(de_results)
head(DEdat)
## log2 fold change (MLE): gender M vs F
## Wald test p-value: gender M vs F
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 38.4861655660068 -0.636080405143175 0.6338736763103
## ENSG00000000005 0 NA NA
## ENSG00000000419 16.531492591005 -0.0459820068355181 0.609659174149594
## ENSG00000000457 26.4013679241789 0.1718183111848 0.476002672833249
## ENSG00000000460 1.95259420862927 0.805281830848979 1.60261038773118
## ENSG00000000938 5.09740492589779 -1.324866931362 0.943224228567802
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 -1.00348133849905 0.315628676476054 0.99997903848561
## ENSG00000000005 NA NA NA
## ENSG00000000419 -0.0754224799448935 0.939878573738684 0.99997903848561
## ENSG00000000457 0.36096081175786 0.718128742512602 0.99997903848561
## ENSG00000000460 0.502481349811428 0.615328968180959 0.99997903848561
## ENSG00000000938 -1.40461503345148 0.16013578379046 0.99997903848561
We can select the genes that are most different between samples from the DESeq2 results above and investigate their function. For example, genes with padj less than 10%.
DE_list <- subset(DEdat, padj < 0.1)
DE_list
## log2 fold change (MLE): gender M vs F
## Wald test p-value: gender M vs F
## DataFrame with 4 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000049239 610.15410116007 1.48873274180955 0.314138613529478
## ENSG00000110244 252.915548249407 -2.32420844706005 0.541939365028577
## ENSG00000143546 192.604210422822 -3.0667559246525 0.703937091501114
## ENSG00000244734 94.9769083984912 -2.9350691228982 0.704277431397419
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000049239 4.73909502904789 2.14674816076066e-06 0.0225794971548806
## ENSG00000110244 -4.28868725366997 1.79732237088683e-05 0.0630141223232922
## ENSG00000143546 -4.35657669084148 1.32112417442602e-05 0.0630141223232922
## ENSG00000244734 -4.16748995786288 3.07972075559841e-05 0.0809812572684602
To figure out what genes these are, we need the annotated reference genome. Genomic annotations on model organisms like humans have been prepared and packaged for R so they can be easily downloaded from Bioconductor.
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
##
geneinfo <- AnnotationDbi::select(org.Hs.eg.db, keys=rownames(DE_list),
columns=c("ENSEMBL","SYMBOL","GENENAME"),
keytype="ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
geneinfo
## ENSEMBL SYMBOL
## 1 ENSG00000049239 H6PD
## 2 ENSG00000110244 APOA4
## 3 ENSG00000143546 S100A8
## 4 ENSG00000244734 HBB
## GENENAME
## 1 hexose-6-phosphate dehydrogenase/glucose 1-dehydrogenase
## 2 apolipoprotein A4
## 3 S100 calcium binding protein A8
## 4 hemoglobin subunit beta
While DESeq2 gives a list of differentially expressed (DE) genes, it is rather difficult to detect patterns and meaningful DE genes because there are many DE genes and the names don’t tell us how the different genes are related. Also, because we are looking at so many genes, our padj values take a big penalty, which can make DE genes difficult to detect. A common way to deal with this is to somehow bin genes into groups.
Gene Ontology or GO is a system for classifying genes by their function. A GO analysis can be used to discover which groups or types of genes are differentially expressed. A GO analysis in R can be performed using the topGO
or goseq
packages.
We can visualize the log-fold change in expression with a MA plot. An MA plot compares the log-ratio (or log-fold change or M) to the average normalized reads.
plotMA(DEdat)
The points in red have p < 0.005.
We can visualize differential expression across samples with a heatmap.
library(pheatmap)
# need transformed values
vsd <- vst(de_results, blind=FALSE) # Variance stabilizing transformation
rld <- rlog(de_results, blind=FALSE) # Controls for rows with few counts
ntd <- normTransform(de_results) # Transforms data for plotting
top20 <- order(rowMeans(counts(de_results,normalized=TRUE)),
decreasing=TRUE)[1:20] # Average counts for each gene and Select top 20 most abundant genes
pheatmap(assay(ntd)[top20,])
Note how the samples and genes are clusted by similar expression profiles, similar to the metabarcoding tutorial.
We can also add extra annotation on the plot for the treatments, in this case gender.
annot_df <- data.frame(gender = colData(expr_se)$gender, row.names = row.names(colData(expr_se)))
pheatmap(assay(ntd)[top20,], annotation_col = annot_df)
We can also use a PCA in ggplot2 to look at how similar samples are to each other. This is similar in principal to the NMDS analysis we did in the metabarcoding tutorial.
plotPCA(vsd, intgroup="gender")
That looks suspiciously like a ggplot2 formatted graph. Let’s try modifying it with a ggplot layer:
library(ggplot2)
plotPCA(vsd, intgroup="gender") + theme_bw()