This tutorial uses the compressed data file RNAseqTutorial.tar.gz (distributed in biol432.tar). Make sure you are in the directory containing this file when you go through the tutorial.
There are multiple tools for the analysis of RNA expression data. The Tuxedo pipeline, which is covered in this tutorial, requires a reference genome.
Other pipelines such as Trinity do not need any a priori knowledge.
Extremely fast software programs such as kallisto and salmon are better for large datasets but need pre-existing transcriptome information.
For this tutorial, we will use a small RNA dataset to walk through the steps of the pipeline. Then we'll analyze a full dataset for visualization in R.
Usually, differential expression experiments are run to compare the expression of genes or isoforms between different tissues, cell types, or individuals in different environments. The goal is to identify which genes are differentially expressed between the groups. The Tuxedo pipeline only compares two groups at a time, but other software can analyze multiple groups.
The toy data set in this walkthrough comes from The Arabidopsis Information Resource (TAIR) and contains RNA-seq data for Arabidopsis thaliana plants at two time points during development: Day 8 and Day 16 after germination.
To keep memory consumption down and avoid affecting the performance of the login node, it is good practice to ask for an allocation. 8 gigabytes should be enough to perform the tutorial.
$ salloc --mem 8G
Next, decompress the tar.gz file:
$ tar -xzvf RNAseqTutorial.tar.gz
This will produce a directory called tutorial.
Move into this directory:
$ cd tutorial
You can list the files and their details with ls -al.
$ ls -al
drwxr-xr-x 3 hpc3183 hpcg1540 4096 Aug 2 15:21 .
drwxr-x--- 13 hpc3183 hpcg1540 4096 Aug 2 15:15 ..
-rw-r--r-- 1 hpc3183 hpcg1540 500073 May 15 17:06 athal_chr.fa
-rw-r--r-- 1 hpc3183 hpcg1540 94110 May 15 17:06 athal_genes.gtf
-rw-r--r-- 1 hpc3183 hpcg1540 10263308 May 15 17:06 Day16.fastq
-rw-r--r-- 1 hpc3183 hpcg1540 11255870 May 15 17:06 Day8.fastq
drwxr-xr-x 4 hpc3183 hpcg1540 4096 May 17 16:14 final_output
The final_output directory contains the results of each step in the walkthrough.
Clusters of DNA sequenced on Illumina machines produce "short reads": nucleotide sequences of 75-250 base pairs in length. Each base has an associated quality score, and the sequences are usually stored together with their quality scores in a file format called FASTQ.
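These quality scores are Phred-scaled: each character in the quality line encodes the estimated probability that the corresponding base call is wrong,
\[ Q = -10 \log_{10} P(\mathrm{error}), \]
so a score of Q30, for example, corresponds to an error probability of 1 in 1000.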
Inside each *.fastq or *.fq file, every read is stored as a four-line record: a header line beginning with @, the nucleotide sequence, a separator line beginning with +, and a line of per-base quality scores.
We can look at one of the FASTQ files with the head command, which displays the first 10 lines of a file by default. Most commands accept a --help flag (for example head --help) that prints usage information.
Inspect the first few lines of the Day8.fastq file:
$ head Day8.fastq
You should see lines like the following:
@SRR1660397.66 HWI-ST992:140:C2134ACXX:1:1101:5890:2171 length=50
GTTCAAAGATGAACTAGAAGACAGAAATACTGTTCAGGAGTATGGCTGTC
+
@@@DDDDDHHHHHGGGGIDHGCBHIIIFHIHIH?GHIJFG?DFGGIJGFG
@SRR1660397.770 HWI-ST992:140:C2134ACXX:1:1101:15403:2280 length=50
CTCAAACCAACGAAGCAAACGCTAATACTCTTCGAACAAACCTAGACCAA
+
CCCFFFFFGGHHHJJJJJIJJJJJJJJJJIJJJIJIIJJJJJJJJJJJJI
@SRR1660397.948 HWI-ST992:140:C2134ACXX:1:1101:20965:2381 length=50
CCGGTGACAACTTTACCGATGAAAGCTTCAACGGTGAAATCTCCACACAA
The files here have already been cleaned and trimmed for quality. In general, raw sequences should be checked for low-quality bases, and adapter sequences should be trimmed if necessary, before alignment. The usual steps are to assess read quality, trim adapters and low-quality ends, and then re-check the quality.
The sister to the FASTQ format is the FASTA format, which stores sequence information without any quality scores. FASTA is usually used for reference sequences such as genomes or amino acid (protein) sequences.
We can peek at the reduced reference "genome" file athal_chr.fa with the less command.
$ less athal_chr.fa
The above command should produce a screen starting with
>4 CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2009-02-02
AGTACATGTTTAATCGTGTGTGCTTATCAATATGCAACTTTGTGGTCTCTTATATGCATTCTGCTACTTTGT
less only shows one screenful at a time. You can scroll through the file with the up and down arrow keys, the space bar, or your mouse. To exit the reader, press Q on the keyboard.
The last important file to look at is athal_genes.gtf. This file describes where genomic features such as genes, exons, and introns are located along the genome, based on past RNA sequencing experiments. The gene transfer format (GTF) has nine tab-separated fields per line (seqname, source, feature, start, end, score, strand, frame, attribute) and is a more tightly specified version of the general feature format (.gff).
In this tutorial, we will be using several pieces of software on the cluster (Bowtie2, TopHat, Cufflinks, and SAMtools), plus R with ggplot2 for visualization later on.
On the cluster, load the necessary modules. Some of the modules that we are loading (nixpkgs/16.09 and gcc/5.4.0) allow us to load an older version of TopHat, tophat/2.1.1. Sometimes, in order to run certain pipelines or to replicate analyses from published papers, we need to revert to older versions of a package. Two common cases are when we get an error with a newer, less polished version of the software, and when we want to reproduce a pipeline by using the same software version.
$ module load bowtie2
$ module load nixpkgs/16.09
$ module load gcc/5.4.0
$ module load tophat/2.1.1
$ module load cufflinks
$ module load samtools
To keep the output of the pipeline organized, make directories from the main tutorial directory. Copy the entire box below into the command line to create an indexes directory and a results directory with sub-folders.
mkdir indexes
mkdir results
mkdir results/Tophat
mkdir results/Tophat/Test8
mkdir results/Tophat/Test16
mkdir results/Cufflinks
mkdir results/Cufflinks/Test8
mkdir results/Cufflinks/Test16
mkdir results/Cuffmerge
mkdir results/Cuffdiff
The aim of the pipeline is to reconstruct transcripts and estimate their abundance by mapping short reads back to the genome and counting the number of overlapping reads.
The first step is to index the genome to make it faster to search. We will first move the genome file athal_chr.fa into the indexes directory.
$ mv athal_chr.fa indexes
We can then move into the indexes directory and index the FASTA file with bowtie2.
$ cd indexes
$ bowtie2-build athal_chr.fa ./athal_chr
We can then move back out of the directory. The following commands will map the short reads in Day8.fastq to the genome using the transcriptome information from athal_genes.gtf and put the output into the Test8 directory inside the results directory.
$ cd ..
$ tophat2 -o results/Tophat/Test8 -G athal_genes.gtf --transcriptome-index indexes indexes/athal_chr Day8.fastq
The command should only take a few minutes at most to run. The process should be repeated with the other FASTQ file.
$ tophat2 -o results/Tophat/Test16 -G athal_genes.gtf --transcriptome-index indexes indexes/athal_chr Day16.fastq
We can move into the results folder to look at the output produced.
$ cd results/Tophat/Test8
$ ls
There are .bed and .bam files along with other files in the directory. *.bed files are used with genome viewers such as the Integrative Genomics Viewer and UCSC Genome Browser.
The .bam file is a binary version of a .sam file (Sequence Alignment/Map). To view the header and the first alignment of accepted_hits.bam:
$ samtools view -h accepted_hits.bam | head -4
Only the reads inside accepted_hits.bam will be used in the following steps.
The first three lines begin with the @ symbol and make up the header of the file.
The fourth line is the first alignment record; its eleven mandatory, tab-separated fields are, in order: QNAME (read name), FLAG, RNAME (reference sequence), POS, MAPQ (mapping quality), CIGAR, RNEXT, PNEXT, TLEN, SEQ (the read sequence), and QUAL (the base qualities).
The columns after these mandatory fields are optional TAGs. For more detail, see the official SAM format specification.
Return to the main directory
$ cd ../../..
The next step takes the mapped short reads and assembles them into longer transcripts.
$ cufflinks -o results/Cufflinks/Test8 -g athal_genes.gtf results/Tophat/Test8/accepted_hits.bam
$ cufflinks -o results/Cufflinks/Test16 -g athal_genes.gtf results/Tophat/Test16/accepted_hits.bam
See the results by moving into the results folder. There are four output files: two .gtf files and two fpkm_tracking files. The transcripts.gtf file records which gene each assembled transcript is most likely associated with, along with its abundance measured in FPKM (fragments per kilobase of transcript per million mapped fragments). Because genes differ in length, longer genes yield more fragments than shorter genes even when they are expressed at the same level; FPKM corrects for this.
$ cd results/Cufflinks/Test16
$ ls
$ head transcripts.gtf
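As a reminder (this is the standard definition, not a value computed from this dataset), FPKM normalizes the raw fragment count by both transcript length and sequencing depth:
\[ \mathrm{FPKM} \;=\; \frac{\text{fragments mapped to the transcript}}{\frac{\text{transcript length}}{10^{3}} \times \frac{\text{total mapped fragments}}{10^{6}}} \]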
Right now, we have transcript information for two different samples, but we need to compare them. To do this, we will use Cuffmerge to merge the two transcripts.gtf files into one. First, move back into the main directory.
$ cd ../../..
Now we will make a text file called merge.txt, which will contain the paths to the *.gtf files we want to merge using Cuffmerge.
$ nano merge.txt
Paste the following into merge.txt
results/Cufflinks/Test8/transcripts.gtf
results/Cufflinks/Test16/transcripts.gtf
Save the file and exit. (Ctrl+X and follow prompts at bottom of screen)
Cuffmerge is installed as part of the Cufflinks module, so we can use the command directly.
$ cuffmerge -g athal_genes.gtf -o results/Cuffmerge merge.txt
We can look at the output produced by cuffmerge. There should be a log directory and a merged.gtf file.
$ ls results/Cuffmerge/
$ less results/Cuffmerge/merged.gtf
Our last step on the cluster is to test for differential expression between the samples.
$ cuffdiff -o results/Cuffdiff results/Cuffmerge/merged.gtf results/Tophat/Test8/accepted_hits.bam results/Tophat/Test16/accepted_hits.bam
The results should be in results/Cuffdiff
. The *.diff files contain the results of tests for differences in expression of genes, transcripts and other features.
We can examine the gene_exp.diff file. Significance is indicated by the last column, called significant. We can also search for all significant rows with a grep command.
$ cd results/Cuffdiff
$ head gene_exp.diff
$ grep -i "yes" gene_exp.diff
The next step in the Tuxedo pipeline is visualization. There is a dedicated package called cummeRbund for analyzing the results of Cuffdiff and the other steps; however, setting it up is time-consuming, so we will use more common R packages instead.
The next steps are easier to do in R Studio on your laptop.
Download gene_exp.diff to your computer and put it into your working directory. You can do this in MobaXTerm, or using a free file transfer protocol (FTP) program like FileZilla. Or, you can just download the file from this link: gene_exp.diff
In R, your working directory can be found with the command getwd(). Alternatively, you can set your working directory to the folder containing the file with setwd(path_to_your_file).
Read in the file. The columns in gene_exp.diff are separated by tabs, so the separator is "\t". The file also has column names, so we set header to TRUE. Replace the path in read.table() with your own.
gene_diff <- read.table("Data/gene_exp.diff", sep = "\t", header = TRUE)
We can examine the data frame.
head(gene_diff)
## test_id gene_id gene locus sample_1 sample_2 status
## 1 XLOC_000001 XLOC_000001 AT4G19200 4:109-722 q1 q2 OK
## 2 XLOC_000002 XLOC_000002 AT4G19210 4:1199-5217 q1 q2 OK
## 3 XLOC_000003 XLOC_000003 AT4G19230 4:21378-23961 q1 q2 OK
## 4 XLOC_000004 XLOC_000004 AT4G19250 4:31503-32190 q1 q2 NOTEST
## 5 XLOC_000005 XLOC_000005 AT4G19260 4:33295-36262 q1 q2 NOTEST
## 6 XLOC_000006 XLOC_000006 AT4G19305.1 4:45251-45710 q1 q2 NOTEST
## value_1 value_2 log2.fold_change. test_stat p_value q_value significant
## 1 152652.000 12075.90 -3.660040 -2.975600 0.01255 0.146417 no
## 2 40327.100 20620.90 -0.967647 -0.604066 0.55940 0.975825 no
## 3 444.199 2310.71 2.379060 1.595950 0.11505 0.805350 no
## 4 0.000 0.00 0.000000 0.000000 1.00000 1.000000 no
## 5 0.000 0.00 0.000000 0.000000 1.00000 1.000000 no
## 6 0.000 0.00 0.000000 0.000000 1.00000 1.000000 no
summary(gene_diff)
## test_id gene_id gene locus
## Length:132 Length:132 Length:132 Length:132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## sample_1 sample_2 status value_1
## Length:132 Length:132 Length:132 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Mode :character Median : 196.7
## Mean : 8981.0
## 3rd Qu.: 3612.7
## Max. :249268.0
##
## value_2 log2.fold_change. test_stat p_value
## Min. : 0.00 Min. : -Inf Min. :-4.443330 Min. :0.00005
## 1st Qu.: 0.00 1st Qu.:-0.5865 1st Qu.:-0.001481 1st Qu.:0.54679
## Median : 40.37 Median : 0.0000 Median : 0.000000 Median :0.94830
## Mean : 4903.61 Mean : NaN Mean :-0.106537 Mean :0.74266
## 3rd Qu.: 4013.09 3rd Qu.: 0.1374 3rd Qu.: 0.029255 3rd Qu.:1.00000
## Max. :109982.00 Max. : Inf Max. : 1.595950 Max. :1.00000
## NA's :4
## q_value significant
## Min. :0.00175 Length:132
## 1st Qu.:0.97583 Class :character
## Median :0.99085 Mode :character
## Mean :0.93772
## 3rd Qu.:1.00000
## Max. :1.00000
##
There are 14 columns and 132 rows.
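You can confirm this with a quick check:
dim(gene_diff)   # 132 rows, 14 columns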
The "status" column contains 62 NOTEST entries and 70 OK entries (you can check this with table(gene_diff$status)). NOTEST means there weren't enough aligned reads to perform a statistical test, so these rows should be filtered out of our analysis.
gene_diff <- gene_diff[gene_diff$status == "OK", ]
Now there are only 70 rows in gene_diff.
We can look at the changes between samples by graphing the log2.fold_change.
hist(gene_diff$log2.fold_change.)
This is \(\log_2(\frac{Sample2}{Sample1})\): we divide the expression value of one sample by that of the other and take the base-2 logarithm, so negative values represent higher expression in the denominator (Sample 1) and positive values represent higher expression in the numerator (Sample 2).
Sample 1 (q1) is the first BAM file we passed to cuffdiff above (Test8), and Sample 2 (q2) is the second (Test16).
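As a quick sanity check of the sign convention (illustrative numbers, not values from our data):
log2(20 / 10)   # Sample 2 twice Sample 1: +1
log2(10 / 20)   # Sample 2 half of Sample 1: -1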
The histogram shows a peak near 0. Some genes increased in expression and some decreased. However, if we look at the column indicating significance, only two genes are significant, which we already know from our grep command.
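The same check can be done directly in R:
# Rows that cuffdiff flagged as significant (the same genes the grep command found)
gene_diff[gene_diff$significant == "yes", ]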
A common visualization used for differential expression is a heat map. We can make our own using the 5 genes with the lowest p-values.
top_genes <- head(gene_diff[order(gene_diff$p_value), ], n = 5L)
We can unpack the command above by running what is inside each bracket. First, order() returns the row indices that would sort p_value in ascending order.
order(gene_diff$p_value)
## [1] 53 68 44 54 17 1 40 21 61 3 38 52 29 64 28 46 23 5 37 8 7 24 70 25 4
## [26] 30 18 22 10 67 48 15 59 31 2 32 43 16 35 63 69 26 60 33 58 27 66 9 41 56
## [51] 14 42 51 62 57 50 39 19 12 11 47 36 65 49 55 6 34 45 13 20
Then gene_diff is rearranged based on that ordering of p_value, and we take only the first 5 rows of the resulting data frame using the head command.
A heat map requires a matrix with column and row names, so we have to modify our top_genes data frame.
First, we take only the columns we need: gene, value_1, and value_2.
mat_gene <- top_genes[, c("gene", "value_1", "value_2")]
Now we make the gene in each row the row name and drop the gene column. Then we convert the data frame into a matrix.
row.names(mat_gene) <- mat_gene$gene
mat_gene <- mat_gene[, -1]
mat_gene <- as.matrix(mat_gene)
mat_gene
## value_1 value_2
## AT4G19810 5187.590 0.000
## AT4G20240 6453.970 0.000
## AT4G19430 0.000 1947.840
## AT4G19820 974.736 0.000
## AT4G19840 77050.900 419.423
Now we can make our heat map. If you are running R on your desktop, a graphical window will pop up. On the cluster, if there is no image forwarding, use the commented-out commands to save the plot to a file, and then download the file to view the image.
# jpeg("heatmap.jpg")
heatmap(mat_gene, Rowv = NA, Colv = NA, cexCol = 1, cexRow = 1, main="Heatmap", col=rev(terrain.colors(8)))
# dev.off()
Another common plot to make is a volcano plot, named so because of the shape that is produced by the points.
We will use our entire data set. First plot the points and add a title.
gene_diff <- read.table("Data/gene_exp.diff", sep = "\t", header = TRUE)
# jpeg("volcano_plot.jpg")
with(gene_diff, plot(log2.fold_change., -log10(p_value)))
title(main = "Volcano Plot")
Note: use ?with to read about the with() function and see some examples.
We can color the points that have status OK.
with(subset(gene_diff, status=="OK"), points(log2.fold_change., -log10(p_value), col="green"))
We can also add a legend.
legend("topright", pch = 1, col = c("black", "green"), legend = c("No Test", "Tested"))
# dev.off()
Here is the code to create the volcano plot together.
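(This simply collects the commands used above into one block.)
gene_diff <- read.table("Data/gene_exp.diff", sep = "\t", header = TRUE)
# jpeg("volcano_plot.jpg")
with(gene_diff, plot(log2.fold_change., -log10(p_value)))
title(main = "Volcano Plot")
with(subset(gene_diff, status == "OK"), points(log2.fold_change., -log10(p_value), col = "green"))
legend("topright", pch = 1, col = c("black", "green"), legend = c("No Test", "Tested"))
# dev.off()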
Here is another volcano plot for a different analysis (Garlic Mustard, Alliaria petiolata). With this data you can really see why it’s called a volcano plot.
Make sure you understand these figures. The x-axis shows \(\log_2(\frac{Sample2}{Sample1})\), as described above. The y-axis is \(-\log_{10}(p)\), the base-10 logarithm of the p-value multiplied by negative 1, so a larger value indicates a smaller p-value. For example, p = 0.001 becomes 3, p = 0.0001 becomes 4, and so on.
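You can verify these values directly in R:
-log10(c(0.05, 0.001, 0.0001))
## [1] 1.30103 3.00000 4.00000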
kallisto, because it doesn’t require time-consuming mapping to the genome, is much faster than alignment-based software such as TopHat. For kallisto, a transcriptome fasta file is needed.
We can put all the commands into a script to be run on the cluster in a few minutes.
Important Note: To run this new script you will have to start a new CAC session, since some of the modules that we used for tophat (nixpkgs/16.09 and gcc/5.4.0) conflict with the 2020 standard environment StdEnv/2020. You could try to track every module that needs to be unloaded and reloaded in order to run the new script; however, it is often faster and easier to quit the current session, open a new terminal window, restart your CAC session, and load only the necessary modules.
First, start up a bash script.
$ nano kallisto.sh
In any script, the first line is a "shebang" line that specifies which shell or interpreter to use. Here, we are using the bash shell.
#! /bin/bash
Now we can add more options for the scheduler to use when running this job.
#SBATCH --partition=standard
#SBATCH -c 1
#SBATCH --mem 2G
#SBATCH -t 00:05:00
The directives above request 1 core with 2 gigabytes of memory for this job, and give the job a time limit of 5 minutes.
We will download the A. thaliana transcriptome from the National Center for Biotechnology Information (NCBI). You can search for "Arabidopsis thaliana" at the top of the NCBI homepage and click the "Genome" link under the "Genome" section of the results page. Copy the transcript download link beside "Download sequences in FASTA format" in the box at the top of the page.
The link is ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_rna.fna.gz.
Add the following line to the script to download the file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_rna.fna.gz
kallisto also requires an index. We will load kallisto and make the index using the downloaded file.
module load gcc/9.3.0
module load StdEnv/2020
module load kallisto/0.46.1
kallisto index --index=A_thaliana GCF_000001735.3_TAIR10_rna.fna.gz
kallisto requires the output directories to exist beforehand, so we make a results directory and a sub-directory for each FASTQ file.
mkdir kallisto_output
mkdir kallisto_output/Day8
mkdir kallisto_output/Day16
Then we add the commands for counting transcripts.
kallisto quant --single -l 200.0 -s 30.0 -i A_thaliana Day8.fastq -o kallisto_output/Day8
kallisto quant --single -l 200.0 -s 30.0 -i A_thaliana Day16.fastq -o kallisto_output/Day16
In the end, the script kallisto.sh will look like this:
#!/bin/bash
#SBATCH --job-name=kallisto
#SBATCH --time 00:30:00
#SBATCH -c 1
#SBATCH --mem=2G
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_rna.fna.gz
module load gcc/9.3.0
module load StdEnv/2020
module load kallisto/0.46.1
kallisto index --index=A_thaliana GCF_000001735.3_TAIR10_rna.fna.gz
mkdir kallisto_output
mkdir kallisto_output/Day8
mkdir kallisto_output/Day16
kallisto quant --single -l 200.0 -s 30.0 -i A_thaliana Day8.fastq -o kallisto_output/Day8
kallisto quant --single -l 200.0 -s 30.0 -i A_thaliana Day16.fastq -o kallisto_output/Day16
We can submit kallisto.sh to the scheduler with:
$ sbatch kallisto.sh
and check on its status using
$ squeue -u your_account
When it is finished, each output folder will contain three files.
$ cd kallisto_output/Day8
$ ls
abundance.h5 abundance.tsv run_info.json
abundance.tsv contains the estimated counts for each transcript. To find all the lines where the estimated count is above zero, we can use awk, another UNIX tool for processing text.
$ awk '($4 > 0 ) ' abundance.tsv > check
This line uses the awk command to search the text file abundance.tsv using the condition inside the quotation marks. awk is similar in spirit to grep, or to gsub in R, but more powerful. The important thing to know is that $4 refers to the fourth field (column) of the file. We keep the rows where this value is greater than zero ($4 > 0), and redirect the output to a file called check using >.
We can count the lines in check with the wc command, and then look inside the file.
$ wc -l check
$ head check
If you find that the column names are gone, try modifying the awk command:
$ awk '(NR==1) || ($4 > 0 ) ' abundance.tsv > check
This command keeps line one (NR==1) OR (||) any line where the fourth field is greater than zero ($4 > 0) from abundance.tsv, and writes the result to a file called check.
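The same filtering can also be done in R once abundance.tsv has been copied to your working directory. This is just a sketch; it assumes the standard column names that kallisto writes (target_id, length, eff_length, est_counts, tpm):
# Read the kallisto abundance table and keep transcripts with estimated counts above zero
ab <- read.table("abundance.tsv", header = TRUE, sep = "\t")
head(ab[ab$est_counts > 0, ])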
Actual statistical analysis of kallisto output is conducted using the sleuth package in R.
https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html
The next steps would be to use the sleuth package to explore the output created by kallisto. The reason we split the work this way is that kallisto runs very fast, but the command line is not well suited to visualizing data, while R is great for data visualization but comparatively slow.
Using sleuth you can investigate treatment differences (aka 'conditions') with multiple replicate samples per treatment, and look at overall expression differences across samples for all of the sequenced genes rather than focusing on specific genes.
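We will not run sleuth here (our toy dataset has only one sample per time point, and sleuth also expects kallisto to have been run with bootstraps, e.g. -b 100), but as a rough sketch of what a replicated analysis could look like, with hypothetical sample names and paths:
library(sleuth)
# Hypothetical sample table: one row per sample, with the path to its kallisto output
s2c <- data.frame(
  sample    = c("Day8_rep1", "Day8_rep2", "Day16_rep1", "Day16_rep2"),
  condition = c("Day8", "Day8", "Day16", "Day16"),
  path      = c("kallisto_output/Day8_rep1", "kallisto_output/Day8_rep2",
                "kallisto_output/Day16_rep1", "kallisto_output/Day16_rep2"),
  stringsAsFactors = FALSE
)
so <- sleuth_prep(s2c, ~condition)          # read in the kallisto results
so <- sleuth_fit(so, ~condition, "full")    # model with the treatment effect
so <- sleuth_fit(so, ~1, "reduced")         # null model without it
so <- sleuth_lrt(so, "reduced", "full")     # likelihood ratio test
res <- sleuth_results(so, "reduced:full")
head(res[order(res$qval), ])                # transcripts with the smallest q-values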