NOTE: This is being updated with newer programs (e.g. HISAT2 instead of BOWTIE)
In this tutorial you will learn about the TUXEDO SUITE of tools for sequence alignment and RNA expression analysis. This involves several tools:
Let’s begin by mapping reads from our Illumina MiSeq short reads to our chloroplast reference genome. Start by loading the BOWTIE2 package and then taking a look at the help file.
$ use bowtie2
$ bowtie2 --help
Take a second to think about the computation required to align a few hundred million short reads, each requiring a match of a few hundred bases long, to a reference geneome that may be hundreds of millions to billions of bases long. To keep memory requirements from getting out of control (i.e. GB instead of TB), Bowtie uses something called the Burrows Wheeler Transform LINK.
BOWTIE2 also requires the reference genome to be indexed. Indices for well-studied organisms are available online, but in our case we’ll have to create an index for our reference chloroplast genome. This can take a while for larger genomes but only has to be done once to create the index files. Since several files are created, it is helpful to keep the genome and all the associated indices in a separate folder, which should already been created in the tuxedo.tar.gz file:
$ ls -l ./Ap_genome
total 791
-rw-r--r-- 1 hpc3183 hpcg1540 400153 Feb 20 08:44 Ap_cpDNA_annotation.gff
-rw-r--r-- 1 hpc3183 hpcg1540 153259 Feb 20 08:44 Ap_cpDNA_genome.fa
Note the fasta file with the .fa extension, which contains the draft chloroplast (cp) sequence as a single contig.
To index the cp genome, run the bowtie2-build program. Since the genome is small, we could do this from the command line. However, for most genomes (100 Mb+) you would want to put this into a bash file, so let’s do that.
Create a new bash file called genindex.sh text file with your favorite text editor:
nano genindex.sh
Now enter the usual header of the bash script:
#! /bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q abaqus.q
#$ -l qname=abaqus.q
#$ -m e
Then choose the names of the output files and tell the scheduler to load the BOWTIE2 commands:
#$ -e genindex.err
#$ -o genindex.out
use bowtie2
Exit and save, then take a quick look at the bowtie2-build -h file. You’ll see that, at a minimum, we need to specify the location of the genome file, and set a file prefix for the output (index) files. It’s important to remember that the cp genome is in a folder called Ap_genome. Since we set -cwd in the bash file, we have to be careful to either use the path that is relative to the directory that we will submit the script from, OR use an absolute path. Let’s re-open the genindex.sh script and add the relative path in the bash script:
bowtie2-build ./Ap_genome/Ap_cpDNA_genome.fa ./Ap_genome/Ap_cpDNA_genome
Exit and save changes to the bash script. In this case, we want to be sure to submit the script from the parent directory that contains the Ap_genome directory. Be sure to exit and save your changes. Then, submit the script using qsub:
$ qsub genindex.sh
After the job is completed, there should be output files (.err and .out), which we can inspect to make sure that everything ran smoothly. Running logs with any warning and error messages generated by the programs we run in our bash script will appear here:
$ nano genindex.out
$ nano genindex.err
There should also be several new index files in our genome folder:
$ ls -l ./Ap_genome
Now that we have an indexed reference genome, we can use BOWTIE2 to align our MiSeq DNA sequences to it. We’ll do this for the same MiSeq data that we used in our de novo genome assembly tutorial. Let’s start with another bash script that we’ll call MiSeqAlign.sh:
$ nano MiSeqAlign.sh
Enter the usual first few lines, with custom output file names, then load the BOWTIE2 commands:
#! /bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q abaqus.q
#$ -l qname=abaqus.q
#$ -m e
#$ -e MiSeqAlign.err
#$ -o MiSeqAlign.out
use bowtie2
Now we add the bowtie2 command, specifying:
We’ll also break up one long line into several lines using the backslash, to make the script more readable:
bowtie2 -x ./Ap_genome/Ap_cpDNA_genome \
-1 ../deNovo/MiSeqF.fq -2 ../deNovo/MiSeqR.fq \
-S MiSeqAligned.sam
Note that we are using the short-read sequences from the de novo tutorial, which in a different directory. Using the ../ tells the scheduler to go to the parent of the current directory, then the deNovo/ tells it to go into the deNovo folder.
We won’t run this script yet because it will take a little while to run. Even with the small genome size, the index, and the B-W transform, the run time is still long.
Why does this script take a while to run?
Many common programs that work with next-generation sequencing data involve iterations that repeat the same function or analysis over and over again (e.g. aligning a bunch of sequences). Many of these programs can take advantage of multiple processors to decrease the runtime.
For example, let’s search the bowtie2 help file for a parameter that allows us to specify the number of threads (computer processors). If you have a very long help file, sometimes you can search for a string using the grep command:
$ bowtie2 --help | grep 'thread'
Note the vertical line | or pipe character in the command above. This is a very useful UNIX command that says ‘take whatever is output from the command on the left side of the pipe and use it as input for whatever is on the right side’. So in this case, it is taking the output from the help command and inputting it into the grep command to search for the string ‘thread’.
The grep command uses Regular Expressions or REGEX, which is a very powerful set of syntax rules for searching text. We don’t have time to go into this in detail, but there are many tutorials online. I’ve also written one for R: LINK
After running the above command you should see this line:
-p/--threads <int> number of alignment threads to launch (1)
If you look at the output from the command above, you’ll see a single line from the help file, telling us that we can use -p or –threads as a parameter in the bowtie2 command to run in parallel. If you recall from the de novo assembly tutorial, we do this by adding a line to the header of the bash script (#$ -pe shm.pe 8), where 8 is the number of threads we want to use. Then we use $NSLOTS with the -p parameter in the actual script. Do this and then use qsub to submit the script. It may take a few minutes to run.
Run time may be 20+ min. Take the time to ask questions.
When it is done running, use ls -l to inspect the output. You’ll see a very large SAM file (>1GB):
-rw-r--r-- 1 hpc3183 hpcg1540 1047343144 Feb 20 09:46 MiSeqAligned.sam
We were introduced to SAM files in the de novo assembly tutorial, but let’s take a more detailed look at this file format. Recall the warning from the de novo assembly tutorial: if we try to use a standard text editor we are going to use up a lot of memory and may crash our computer. Instead, we should us head, tail or less to inspect the file. Let’s use head to look at the first few lines:
$ less MiSeqAligned.sam
The SAM file is just like a large spreadsheet, with tabs delimiting the columns. You can use the right arrow to scroll across to get a better sense of this.
The first three lines begin with the @ symbol and specify the header of the file:
The fourth line shows the data for the first alignment:
The last set of columns are optional TAGS. For more detail, see the official SAM specification info. LINK
In the de novo assembly tutorial, we talked about the difference between text-based SAM file and the binary BAM version. While SAM files are human-readable, BAM files are much smaller and faster for the computer to work with. We can convert them using the SAMTOOLS program:
$ use samtools
$ samtools sort MiSeqAligned.sam > MiSeqAligned.bam
It will take a few minutes to run. After it is complete, compare the file sizes.
How much smaller is the BAM file compared to the SAM file?
The SAMTOOLS program is actually a collection of several very useful programs (aka tools), just like the PICARD program we looked at briefly in the de novo assembly tutorial. We can use these to inspect our alignment. The flagstat tool gives us a report based on the FLAG column in the SAM file (Col #2)
$ samtools flagstat MiSeqAligned.bam
1599134 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1552016 + 0 mapped (97.05% : N/A)
1599134 + 0 paired in sequencing
799567 + 0 read1
799567 + 0 read2
1520076 + 0 properly paired (95.06% : N/A)
1537634 + 0 with itself and mate mapped
14382 + 0 singletons (0.90% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Note that we have no real errors since we were aligning sequences that were already filtered for quality and alignment to the chloroplast genome.
Another thing we might want to do is to index our BAM file so that it can be used with a genome browswer like IGV LINK. This creates a BAI or ‘bam index’ file with the .bai extension.
$ samtools index MiSeqAligned.bam
The alignments can’t be viewed directly on the cluster, but the following files can be copied to a computer with a visual interface for use with a genome browser like IGV:
. When opened with IGV, it looks something like this:
Our coverage is very high so it’s easier to see if we scroll down to the end of our aligned sequences:
Each little grey bar shows the alignment and orientation of one of the short reads. Coloured vertical bars show SNP variants relative to the reference genome.
Now that we have some experience aligning sequences, let’s get some RNA sequence data from the NCBI sequence read archive (SRA) database. Searching for ‘chloroplast RNA sequence’yields a few hits, one of which is an experiment using enriched transcriptome sequences from Arabidopsis thaliana. This species is a well-studied ’model organism’ that also happens to be in the same taxonomic family (Brassicaceae) as Alliaria petiolata – the source of our cpDNA genome! Here is a link to the BioProject info: LINK
The word enriched usually implies some sort of lab method(s) that increases the representation of something of interest. In this case, whole-RNA extractions would contain RNA transcripts from the nucleus and the mitochondrion as well as the chloroplast. Based on the project description, it looks like the authors used centrifugation with a glucose gradient to increase the number of chloroplast organelles in the sample. Then, a second step used streptavidin-coated magnetic beads used to reduce the amount of ribosomal RNA (rRNA) from the RNA sample.
The chloroplast transcripts come from two genotypes, the wild-type col0 line and an rnc3/4 double mutant. In this mutant, the nonfunctional RNC3 and RNC4 genes affect transcription regulation, so we should see a difference in the transcript profiles.
The sequences are very large files that we have been downloaded as files with prefix At followed by the genotype code col0 or rnc34, then the SRA ID code and .fq extension:
ls -l *.fq
-rw-r--r--. 1 hpc3183 hpcg1540 4651870426 Feb 21 10:05 At_col0_SRR1722713.fq
-rw-r--r--. 1 hpc3183 hpcg1540 5078852188 Feb 21 10:06 At_rnc34_SRR1724089.fq
How big are these files?
To download these sequences from the ncbi SRA database, I used the fastq-dump commmand from the sra-tools program. LINK
Let’s align them to our chloroplast genome. Take a look at the alignRNA.sh script:
#! /bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q abaqus.q
#$ -l qname=abaqus.q
#$ -m e
#$ -e alignRNA.err
#$ -o alignRNA.out
#$ -pe shm.pe 8
use tophat2
tophat2 -p $NSLOTS -o col0Aligned \
./Ap_genome/Ap_cpDNA_genome ./At_col0_SRR1722713.fq
tophat2 -p $NSLOTS -o rnc34Aligned \
./Ap_genome/Ap_cpDNA_genome ./At_rnc34_SRR1724089.fq
We won’t run it because it takes a while, but let’s consider what it would do if we ran it. Note in the first tophat2 command:
The second tophat2 command is similar but uses the rnc3/4 RNASeq data and an output directory with the rnc34.
The output creates a BAM file called accepted_hits.bam inside each output directory for each of the RNA-Seq experiments. To avoid confusion, let’s rename these:
$ mv ./col0Aligned/accepted_hits.bam ./col0Aligned/col0Aligned.bam
$ mv ./rnc34Aligned/accepted_hits.bam ./rnc34Aligned/rnc34Aligned.bam
Next we should should index the files, just like we did for the genome fasta file earlier with the genindex.sh script. In fact, we can simply copy and then modify this script, rather than writing a new one from scratch:
$ cp genindex.sh RNAindex.sh
$ nano RNAindex.sh
Inside the text editor, we need to change the error (-e) and output (-o) files in the bash header:
#$ -e RNAindex.err
#$ -o RNAindex.out
Then we change the bowtie2 command to samtools to index the RNA BAM file, and repeat for each genotype:
use samtools
samtools index ./col0Aligned/col0Aligned.bam
samtools index ./rnc34Aligned/rnc34Aligned.bam
Go ahead and submit the script using qsub.
Take some time to explore files in the col0Aligned and rnc34Aligned directories.
What happens if you inspect the BAM and BAI files with the head or tail command?
Now that we have aligned the RNASeq data to the genome and indexed the reads, we need to construct a master annotation of all the RNA transcript sites. The number of sequences that map to each of these locations (i.e. read depth) is an estimate of the expression level:
Number of RNA transcript sequences = rate of transcription - rate of degredation.
As usual, we should start by looking at the help file
use cufflinks
cufflinks --help
As you can see, there are a lot of options here. Some of the options are very important and specific to the details of the RNASeq library prep methods. Different methods have different biases in representation, which are corrected for by many of these parameters. Of particular note is the library type, which varies with the method of RNA library preparation. One of the most popular is Illumina’s TruSeq library prep; this is also the default for cufflinks (fr-unstranded). PDF LINK
First we need to assemble the RNA sequences, many of which may be longer than our sequencing length of 100bp. We need to do this separately for both of the genotypes, then use cuffmerge to merge and map to the reference genome, before we can compare the transcription differences. Examine the file cufflinks.sh:
cufflinks -p $NSLOTS -o cuffcol0 ./col0Aligned/col0Aligned.bam
mv ./cuffcol0/transcripts.gtf cuffcol0/col0.gtf
Notice that the cufflinks command occurs twice; once for each genotype. The mv command again renames the output into something more descriptive
This is followed by two lines that create a text file containing the paths to the two gtf files. This is used as input in the final command, which annotates the transcript data with the genome:
cuffmerge -s ./Ap_genome/ RNAseq.txt
This script takes a while to run but produces an annotated genome that be viewed in a genome browser along with coverage variation from the RNASeq data:
The top panel shows the reference genome, with the red box indicating the part that is ‘zoomed in’. The second panel shows the transcript levels for the rnc3/4 genotype. The third panel shows transcript levels for the col0 genotype. The final panel shows the annotation for all of the mapped transcripts.