Introduction

Note: Installation of packages may take several minutes.

A typical transcriptome analysis will begin with raw sequences that must be processed and aligned to a reference genome. This is typically done on a high-performance computer, so we will start with our aligned sequences for this tutorial.

Transcriptome sequencing is very often used to compare expression profiles across samples, under the reasonable assumption that the number of reads (sequences) is proportional to the abundance of RNA in the initial sample. Thus, we can identify genes that are upregulated (higher abundance) or downregulated (lower abundance) relative to a control or other reference sample(s). This comparison is called a differential expression analysis.

For example, we might compare genes or isoforms in different experimental treatments or test groups. In addition to comparing two samples, we can also look at changes in gene expression over time. Examples might include expression profiling to identify how genes differ among samples or respond to different external stimula. For example:

genes involved in cancer progression
plants responses to drought, herbivory, or other stresses
genes affected by a knockout mutation

Differential Expression Analysis

Extract RNA from a sample
Enrich for sequences of interest (e.g. ribosome depletion)
Synthesize cDNA (coding DNA) from RNA using reverse transcriptase
Align reads to a reference genome
Count the number of reads in genic regions (i.e. parts of the genome with annotated genes)
Perform statistical tests comparing read “counts” for differential expression between samples

There are multiple tools for analysis of RNA expression data such as the Tuxedo pipeline covered in this tutorial, which requires a reference genome.

Programs such as kallisto and salmon need prior transcriptome asssemblies, but are extremely fast.

Other pipelines like Trinity do not need a reference genome or a priori transcriptome information.

Steps 1 to 4 are not coverd in this tutorial. Instead we will focus on counting and visualizing differentially expressed genes using DESeq2. Most of the material is taken from the DESeq2 vignette. If you are interested in learning about the upstream sequence processing, see this Colautti Lab alignment tutorial.

Downloading and examining data

NOTE: When asked to update all/some/none? [a/s/n] choose: n

BiocManager::install("IRanges")
# Main analysis
BiocManager::install("DESeq2")
# for heatmap visualization
BiocManager::install("pheatmap")
# for GO analysis
BiocManager::install("org.Hs.eg.db")

library(DESeq2)

We will be using a prepared dataset from Recount containing RNAseq data on human liver transcriptomes. We will be using this data to compare expression profiles between human males and females. The source paper for this data is Blekman et al. 2010

The counts and treatment information is all included in the RData file as an ExpressionSet object – a class of object in R created by the ExpressionSet() function.

Make a new project and create a folder called Data. Then, download the file and savit it to the Data folder in your local working directory. This data is an S4 object, so we import it with the load function – this is the equivalent of read.csv for a data.frame object.

load("Data/gilad_eset.RData")

Note that we aren’t assigning an object in this case. However, something was added to our R environment. We can use the ls() function to see:

ls()

## [1] "gilad.eset"

We can also look in the ‘Environment’ tab in R Studio.

Let’s take a look at the structure:

str(gilad.eset)

## Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
##   ..@ experimentData   :Formal class 'MIAME' [package "Biobase"] with 13 slots
##   .. .. ..@ name             : chr ""
##   .. .. ..@ lab              : chr ""
##   .. .. ..@ contact          : chr ""
##   .. .. ..@ title            : chr ""
##   .. .. ..@ abstract         : chr ""
##   .. .. ..@ url              : chr ""
##   .. .. ..@ pubMedIds        : chr ""
##   .. .. ..@ samples          : list()
##   .. .. ..@ hybridizations   : list()
##   .. .. ..@ normControls     : list()
##   .. .. ..@ preprocessing    : list()
##   .. .. ..@ other            : list()
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 2
##   .. .. .. .. .. ..$ : int [1:3] 1 0 0
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ assayData        :<environment: 0x00000000259e6ea0> 
##   ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  3 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr [1:3] NA NA NA
##   .. .. ..@ data             :'data.frame':  6 obs. of  3 variables:
##   .. .. .. ..$ sample.id    : Factor w/ 6 levels "SRX014818and9",..: 1 2 3 4 5 6
##   .. .. .. ..$ num.tech.reps: num [1:6] 2 2 2 2 2 2
##   .. .. .. ..$ gender       : Factor w/ 2 levels "F","M": 1 1 1 2 2 2
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ featureData      :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  1 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr NA
##   .. .. ..@ data             :'data.frame':  52580 obs. of  1 variable:
##   .. .. .. ..$ gene: Factor w/ 52580 levels "ENSG00000000003",..: 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. ..@ dimLabels        : chr [1:2] "featureNames" "featureColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ annotation       : chr(0) 
##   ..@ protocolData     :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
##   .. .. ..@ varMetadata      :'data.frame':  0 obs. of  1 variable:
##   .. .. .. ..$ labelDescription: chr(0) 
##   .. .. ..@ data             :'data.frame':  6 obs. of  0 variables
##   .. .. ..@ dimLabels        : chr [1:2] "sampleNames" "sampleColumns"
##   .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. .. .. ..@ .Data:List of 1
##   .. .. .. .. .. ..$ : int [1:3] 1 1 0
##   ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slot
##   .. .. ..@ .Data:List of 4
##   .. .. .. ..$ : int [1:3] 2 13 0
##   .. .. .. ..$ : int [1:3] 2 12 1
##   .. .. .. ..$ : int [1:3] 1 3 0
##   .. .. .. ..$ : int [1:3] 1 0 0

Alternatively, click the litle bubble next to gilad.eset in the ‘Environment’ tab. Note the large file size – why do you think it’s so big (what are the main data stored here)?

All the different slots beginning with @ contain information about the experiment. You can access many (but not all) of these directly by applying the name as a function. For example, look for @phenoData, which has a subslot called @dimLabels, which itself has two levels: sampleNames and sampleColumns. Try running these as functions:

phenoData(gilad.eset)

## An object of class 'AnnotatedDataFrame'
##   sampleNames: SRX014818and9 SRX014820and1 ... SRX014828and9 (6
##     total)
##   varLabels: sample.id num.tech.reps gender
##   varMetadata: labelDescription

dimLabels(gilad.eset)

## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dimLabels' for signature '"ExpressionSet"'

sampleNames(gilad.eset)

## [1] "SRX014818and9" "SRX014820and1" "SRX014822and3" "SRX014824and5"
## [5] "SRX014826and7" "SRX014828and9"

This contains the names of our samples, but what do these names mean? The SRX code suggests it’s a short-read sequence dataset, but why are there two numbers (e.g. SRX01418and9). Some details in the paper might give us a hint.

We can also find the experimental data in the phenoData slot.

pData(gilad.eset)

##                   sample.id num.tech.reps gender
## SRX014818and9 SRX014818and9             2      F
## SRX014820and1 SRX014820and1             2      F
## SRX014822and3 SRX014822and3             2      F
## SRX014824and5 SRX014824and5             2      M
## SRX014826and7 SRX014826and7             2      M
## SRX014828and9 SRX014828and9             2      M

Note that each sample has a number 2 for the column called ‘num.tech.reps’. This is probably the number of technical replicates, which means that the same sample was prepared twice, and then pooled (mixed together) before sequencing.

Note also that the first three samples are females and the last three are males.

The transcript counts can be viewed with exprs.

head(exprs(gilad.eset))

##                 SRX014818and9 SRX014820and1 SRX014822and3 SRX014824and5
## ENSG00000000003            60            60            16             9
## ENSG00000000005             0             0             0             0
## ENSG00000000419            25             9            15            15
## ENSG00000000457            32            19            21            31
## ENSG00000000460             1             3             0             5
## ENSG00000000938             6             7             8             4
##                 SRX014826and7 SRX014828and9
## ENSG00000000003            56            37
## ENSG00000000005             0             0
## ENSG00000000419            26            11
## ENSG00000000457            28            28
## ENSG00000000460             1             1
## ENSG00000000938             4             1

What does each column represent? What about each row?

DESeq2 analysis

To analyze and graph the data using the DESeq2 package, we have to change the data format from an ExpressionSet to a DESeqDataSet object and identify the factor in the experimental design that we want to compare for our differential expression analysis. This requires a few steps.

First, convert the ExpressionSet to a SummarizedExperiment

expr_se <- SummarizedExperiment(exprs(gilad.eset))
head(expr_se)

## class: SummarizedExperiment 
## dim: 6 6 
## metadata(0):
## assays(1): ''
## rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460
##   ENSG00000000938
## rowData names(0):
## colnames(6): SRX014818and9 SRX014820and1 ... SRX014826and7
##   SRX014828and9
## colData names(0):

Second, we associate the count data with each of the treatments based on the sample.id:

colData(expr_se) <- DataFrame(pData(gilad.eset))

Third, convert from SummarizedExperiment to DESeq

expr_de <- DESeqDataSet(expr_se, design = ~ gender)

## renaming the first element in assays to 'counts'

Note: it is also possible to use design = ~1 above and set the design afterwards with design(expr_de). This could be useful if you had different comparisons you wanted to make on the same data.

Finally, we are ready to analyze the data by running the DESeq function on the DESeqDataSet object. The model below will take a few seconds to run. The details of this analysis are a bit complicated, but worth reading in detail if you want to do this for a project or publication (use ?DESeq and look up the references therein).

The analysis is complicated because sequencing is essentially a process of random sampling, so two genes may look like they are differentially expressed between two samples just because one sample has more overall RNA or because one sample has another gene ‘hogging’ all the sequence reads. So it’s not possible to compare one gene across samples without also considering all the other genes in those samples.

de_results <- DESeq(expr_de)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Once the analysis is done, we can extract a table of results using the results() function. This is useful for other downstream analyses including visualization.

DEdat <- results(de_results)
head(DEdat)

## log2 fold change (MLE): gender M vs F 
## Wald test p-value: gender M vs F 
## DataFrame with 6 rows and 6 columns
##                         baseMean      log2FoldChange             lfcSE
##                        <numeric>           <numeric>         <numeric>
## ENSG00000000003 38.4861655660068  -0.636080405143175   0.6338736763103
## ENSG00000000005                0                  NA                NA
## ENSG00000000419  16.531492591005 -0.0459820068355181 0.609659174149594
## ENSG00000000457 26.4013679241789     0.1718183111848 0.476002672833249
## ENSG00000000460 1.95259420862927   0.805281830848979  1.60261038773118
## ENSG00000000938 5.09740492589779     -1.324866931362 0.943224228567802
##                                stat            pvalue             padj
##                           <numeric>         <numeric>        <numeric>
## ENSG00000000003   -1.00348133849905 0.315628676476054 0.99997903848561
## ENSG00000000005                  NA                NA               NA
## ENSG00000000419 -0.0754224799448935 0.939878573738684 0.99997903848561
## ENSG00000000457    0.36096081175786 0.718128742512602 0.99997903848561
## ENSG00000000460   0.502481349811428 0.615328968180959 0.99997903848561
## ENSG00000000938   -1.40461503345148  0.16013578379046 0.99997903848561

baseMean – this is the average gene expression for the two groups.
log2FoldChange – raise this number to the exponent 2 to get the fractional difference. For example, -2.32 means a ratio of 2^(-2.32) or 0.2 (Male vs Female), so the gene would be expressed higher in the male than the female.
lfcSE – this is the standard error of the estimate of the log2FoldChange
pvalue – this is the probability of getting the observed log2FoldChange just by randomly sampling sequences. This depends on the magnitude of the difference relative to the total number of sequences.
padj – this is an adjusted p-value. The p-value must be adjusted to account for the fact that we are looking at many genes. For example, if we have 10,000 genes, so we expect to get P < 0.05 about 5% of the time, so we would get 500 genes with P < 0.05 just by chance (i.e. 0.05 * 10,000).

We can select the genes that are most different between samples from the DESeq2 results above and investigate their function. For example, genes with padj less than 10%.

DE_list <- subset(DEdat, padj < 0.1)
DE_list

## log2 fold change (MLE): gender M vs F 
## Wald test p-value: gender M vs F 
## DataFrame with 4 rows and 6 columns
##                         baseMean    log2FoldChange             lfcSE
##                        <numeric>         <numeric>         <numeric>
## ENSG00000049239  610.15410116007  1.48873274180955 0.314138613529478
## ENSG00000110244 252.915548249407 -2.32420844706005 0.541939365028577
## ENSG00000143546 192.604210422822  -3.0667559246525 0.703937091501114
## ENSG00000244734 94.9769083984912  -2.9350691228982 0.704277431397419
##                              stat               pvalue               padj
##                         <numeric>            <numeric>          <numeric>
## ENSG00000049239  4.73909502904789 2.14674816076066e-06 0.0225794971548806
## ENSG00000110244 -4.28868725366997 1.79732237088683e-05 0.0630141223232922
## ENSG00000143546 -4.35657669084148 1.32112417442602e-05 0.0630141223232922
## ENSG00000244734 -4.16748995786288 3.07972075559841e-05 0.0809812572684602

To figure out what genes these are, we need the annotated reference genome. Genomic annotations on model organisms like humans have been prepared and packaged for R so they can be easily downloaded from Bioconductor.

library(org.Hs.eg.db)

## Loading required package: AnnotationDbi

##

geneinfo <- AnnotationDbi::select(org.Hs.eg.db, keys=rownames(DE_list),
                   columns=c("ENSEMBL","SYMBOL","GENENAME"), 
                   keytype="ENSEMBL")

## 'select()' returned 1:1 mapping between keys and columns

geneinfo

##           ENSEMBL SYMBOL
## 1 ENSG00000049239   H6PD
## 2 ENSG00000110244  APOA4
## 3 ENSG00000143546 S100A8
## 4 ENSG00000244734    HBB
##                                                   GENENAME
## 1 hexose-6-phosphate dehydrogenase/glucose 1-dehydrogenase
## 2                                        apolipoprotein A4
## 3                          S100 calcium binding protein A8
## 4                                  hemoglobin subunit beta

While DESeq2 gives a list of differentially expressed (DE) genes, it is rather difficult to detect patterns and meaningful DE genes because there are many DE genes and the names don’t tell us how the different genes are related. Also, because we are looking at so many genes, our padj values take a big penalty, which can make DE genes difficult to detect. A common way to deal with this is to somehow bin genes into groups.

Gene Ontology or GO is a system for classifying genes by their function. A GO analysis can be used to discover which groups or types of genes are differentially expressed. A GO analysis in R can be performed using the topGO or goseq packages.

Visualization

MA Plot

We can visualize the log-fold change in expression with a MA plot. An MA plot compares the log-ratio (or log-fold change or M) to the average normalized reads.

plotMA(DEdat)

The points in red have p < 0.005.

Heatmap

We can visualize differential expression across samples with a heatmap.

library(pheatmap)
# need transformed values
vsd <- vst(de_results, blind=FALSE) # Variance stabilizing transformation
rld <- rlog(de_results, blind=FALSE) # Controls for rows with few counts
ntd <- normTransform(de_results) # Transforms data for plotting
top20 <- order(rowMeans(counts(de_results,normalized=TRUE)),
                decreasing=TRUE)[1:20] # Average counts for each gene and Select top 20 most abundant genes
pheatmap(assay(ntd)[top20,])

Note how the samples and genes are clusted by similar expression profiles, similar to the metabarcoding tutorial.

We can also add extra annotation on the plot for the treatments, in this case gender.

annot_df <- data.frame(gender = colData(expr_se)$gender, row.names = row.names(colData(expr_se)))

pheatmap(assay(ntd)[top20,], annotation_col = annot_df)

PCA Plot

We can also use a PCA in ggplot2 to look at how similar samples are to each other. This is similar in principal to the NMDS analysis we did in the metabarcoding tutorial.

plotPCA(vsd, intgroup="gender")

That looks suspiciously like a ggplot2 formatted graph. Let’s try modifying it with a ggplot layer:

library(ggplot2)
plotPCA(vsd, intgroup="gender") + theme_bw()

RNA-seq Tutorial

Yihan Wu

February 14, 2019