Biotechvana

RNASEQ - STEP BY STEP MODE USAGE

2.1 - INTRODUCTION

This section explains how to use RNASeq in Step-by-Step mode. In this mode, the RNA-seq analyses of a protocol are run as a workflow in which each step can be completed independently of the others, and in which options and parameters are declared before the job is launched. The workflow is organized into an intuitive menu with one tab per step; each tab opens a drop-down list of the third-party command line interface (CLI) tools available for that step. Every CLI tool has its own interface with fields to declare the inputs, the outputs, and the parameters and options. In this way, the Step-by-Step mode supports different state-of-the-art protocols for differential expression analysis, with or without a reference genome.

For example, you can first run a sample quality analysis (FastQC) to obtain a report on the quality and nature of your raw reads. You may then run the sample preprocessing steps (sequence trimming, adapter removal, etc.) one by one to see how each “cleaning” step affects the quality of your samples. Next, you can upload the processed (clean) reads together with your reference genome for mapping, obtain the sam/bam files needed by the subsequent differential expression and enrichment analyses, and so on. The current RNASeq Step-by-Step menu covers two protocols:

Tophat/Hisat2 & Cufflinks Protocol (for those cases when a reference genome and an annotation GTF/GFF file are available)
Mapping & Counting Protocol (for those cases when a GTF/GFF file is not available)

2.2 - TOPHAT/HISAT2 & CUFFLINKS PROTOCOL

Recommended for RNA-seq studies when a reference genome and an annotation GTF file are available, this protocol is based on Tophat (Trapnell et al. 2012; Kim et al. 2013) and Hisat2 (Kim, Langmead, and Salzberg 2015) for reference mapping; the Cufflinks package, CummeRbund included, for differential gene expression (Trapnell et al. 2012; Goff, Trapnell, and Kelley 2013); and the GOseq package for GO enrichment and metabolic pathway analysis (Young et al. 2010). This workflow runs on the server, so the user must first upload the fastq files, the reference fasta file, the GTF file and any other files needed for the analysis, either via the FTP browser or via any other FTP client linked to the user account on the server. To use this analysis protocol, go to:

[ Transcripts Protocols → Step-by-Step → Tophat/Hisat2 & Cufflinks Protocol ]

A new submenu will appear (Fig. 5) in the workspace listing the steps of the RNA-seq analysis (preprocessing, mapping, transcriptome assembly, differential expression test and GOSeq for enrichment) as provided by the Tophat & Cufflinks Protocol.

Figure 5: Submenu to follow the Tophat & Cufflinks protocol in Step-by-Step mode.

2.2.1 - QUALITY ANALYSIS AND PREPROCESSING

Raw data preprocessing is necessary to prepare the fastq libraries for mapping. It involves several steps in which the raw reads are trimmed, cleaned or modified to remove adapter remains, sequencing artifacts, contamination and/or low-quality sequences. The “preprocessing” drop-down submenu (Fig. 6) provides access to the tools for quality analysis, demultiplexing, and sequence trimming and cleaning, including those for adapter removal and for filtering out low-quality sequences. The following sections provide a brief description of each preprocessing tool.

Figure 6: Preprocessing drop down submenu listing the preprocessing options enabled in the RNA application.

- Quality Analysis: FastQC

FastQC (Andrews 2016) provides a simple way to perform quality checks on raw sequence data. It offers a modular set of analyses that can indicate whether your data contain potential artifacts that require “cleaning” before any further analysis. A FastQC quality check produces a complete sequence quality report that provides hints on what form of filtering and processing your sample requires, if any. For example, overrepresented sequences that correspond to the adapters used during sequencing may need to be removed from your fastq files; these are listed in the “overrepresented sequences” section of the FastQC report. Further details on FastQC can be found in the FastQC manual at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

To run FastQC go to [ Preprocessing → Quality Analysis → FastQC ] and follow Fig. 7.

Figure 7: Using the GPRO interface for FastQC.
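
To give an idea of what a per-position quality check measures, the following is a minimal Python sketch, not FastQC's own code. It assumes that quality strings use the Phred+33 encoding; the function name and the toy reads are illustrative only.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    quality_strings: iterable of FASTQ quality lines.
    offset: ASCII offset of the encoding (33 is an assumption here).
    """
    totals = []  # sum of Phred scores seen at each position
    counts = []  # number of reads that cover each position
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
example = mean_quality_per_position(["III#", "II##"])
print(example)
```

A drop in the mean score towards the end of the reads, as in this toy example, is the kind of pattern that suggests quality trimming before mapping.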

- Demultiplex: FastqMidCleaner

FastqMidCleaner sorts and splits sequencing reads from fastq files into separate files according to predefined molecular identifiers (MIDs).

To run FastqMidCleaner go to [ Preprocessing → Demultiplex → FastqMidCleaner ] and follow Fig. 8.

Figure 8: Using the GPRO interface for FastqMidCleaner.
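
The idea behind MID-based demultiplexing can be sketched in a few lines of Python. This is an illustration only and not the FastqMidCleaner implementation: it assumes that the MID sits at the very start of each read, that matches are exact, and that the MID is stripped after assignment. The sample names and MID sequences are hypothetical.

```python
from collections import defaultdict


def demultiplex(records, mids):
    """Group FASTQ records by the MID found at the start of each read.

    records: iterable of (header, sequence, quality) tuples.
    mids: dict mapping a sample name to its MID sequence (hypothetical values).
    Reads that match no MID are collected under "unassigned".
    """
    bins = defaultdict(list)
    for header, seq, qual in records:
        for sample, mid in mids.items():
            if seq.startswith(mid):
                # Strip the MID from the sequence and from its quality string.
                bins[sample].append((header, seq[len(mid):], qual[len(mid):]))
                break
        else:
            # No MID matched this read.
            bins["unassigned"].append((header, seq, qual))
    return dict(bins)


reads = [
    ("@r1", "ACGTTTGA", "IIIIIIII"),
    ("@r2", "TGCAGGCC", "IIIIIIII"),
    ("@r3", "GGGGGGGG", "IIIIIIII"),
]
demux = demultiplex(reads, {"sampleA": "ACGT", "sampleB": "TGCA"})
```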

- Trimming & Cleaning: Cutadapt

Cutadapt (Martin 2011) finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequences from your sequencing reads. For more information on Cutadapt, see its manual at https://cutadapt.readthedocs.io/en/stable/guide.html

To run Cutadapt go to [ Preprocessing→ Trimming & Cleaning → Cutadapt ] and follow Fig. 9.

Figure 9: Using the GPRO interface for Cutadapt.
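
As a rough illustration of 3' adapter removal, the Python sketch below cuts a read at the first exact occurrence of the adapter. It is a deliberate simplification: Cutadapt itself tolerates mismatches and partial adapter occurrences, which this sketch does not attempt. The function name and the sequences are made up for the example.

```python
def trim_3prime_adapter(seq, qual, adapter):
    """Remove the first exact occurrence of `adapter` and everything after it.

    Returns the (sequence, quality) pair unchanged when the adapter is absent.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


# Toy read: 8 bases of insert followed by a 10-base adapter.
trimmed = trim_3prime_adapter("ACGTACGTAGATCGGAAG", "I" * 18, "AGATCGGAAG")
untouched = trim_3prime_adapter("ACGTACGT", "I" * 8, "AGATCGGAAG")
```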

- Trimming & Cleaning: Prinseq

Prinseq (Schmieder and Edwards 2011) can be used to filter, reformat, or trim your sequencing reads. For further information, see the Prinseq manual at http://prinseq.sourceforge.net.

To run Prinseq go to [Preprocessing→Trimming & Cleaning → prinseq ] and follow Fig. 10.

Figure 10: Using the GPRO interface for Prinseq.

- Trimming & Cleaning: Trimmomatic

Trimmomatic (Bolger et al. 2014) is a trimming tool for paired-end and single-end reads obtained via Illumina’s NGS technology that can perform a variety of trimming tasks. For more information, see the Trimmomatic manual at http://www.usadellab.org/cms/?page=trimmomatic.

To run Trimmomatic go to [ Preprocessing → Trimming & Cleaning → trimmomatic ] and follow Fig. 11.

Figure 11: Using the GPRO interface for Trimmomatic.

- Trimming & Cleaning: FastxToolKit

FASTX-Toolkit (Hannon Lab 2016) is a collection of tools for the preprocessing of Fasta/Fastq files. It includes the following:

FASTA Formatter: For the formatting of FASTA files.
FASTA Clipping Histogram: Creates a Linker Clipping Information Histogram.
FASTA Nucleotides Changer: Converts sequences from DNA to RNA and vice versa in FASTA files.
FASTQ Quality Chart: Plots Solexa Quality BoxPlots.
FASTQ Quality Filter: Filters FASTQ files.
FASTQ to FASTA: Converts fastq files into fasta files.
FASTX Artifacts Filter: Filters for artifacts in FASTA/FASTQ files.
FASTX Barcode Splitter: Reads FASTA/FASTQ file and splits it into several smaller files based on barcode matching.
FASTX Clipper: Clip adapter from FASTA/FASTQ files.
FASTX Collapser: Collapses FASTA/FASTQ files.
FASTX Nucleotide Distribution: Plots FASTA/Q Nucleotide Distribution.
FASTX Renamer: Renames sequences from FASTA/FASTQ files.
FASTX Reverse Complement: Creates the reverse complement of FASTA/FASTQ files.
FASTX Statistics: Generates statistics from FASTA/FASTQ files. If a FASTA file is given, only nucleotide distribution is calculated and no quality info is provided.
FASTX Trimmer: Trims sequences from FASTA/FASTQ files.

For more information, see the Fastx-Toolkit manual at http://hannonlab.cshl.edu/fastx_toolkit/.

To run any of the Fastx-Toolkit tools go to [ Preprocessing → Trimming & Cleaning → Fastx-Toolkit ] and follow Fig. 12.

Figure 12: Example of GPRO interface for a tool (FASTX collapser) of Fastx-Toolkit.
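
Several of the Fastx-Toolkit operations listed above are simple sequence transformations. The Python sketch below shows what three of them do conceptually (reverse complement, DNA to RNA conversion, and FASTQ to FASTA conversion). The function names are illustrative and are not the Fastx-Toolkit commands themselves.

```python
# Translation table for complementing nucleotides; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dna_to_rna(seq):
    """Replace thymine with uracil, preserving case."""
    return seq.replace("T", "U").replace("t", "u")


def fastq_to_fasta(header, seq):
    """Turn one FASTQ record into a FASTA entry (the qualities are dropped)."""
    return ">" + header.lstrip("@") + "\n" + seq + "\n"


rc = reverse_complement("ATGCN")
rna = dna_to_rna("ATGT")
fasta = fastq_to_fasta("@r1", "ACGT")
```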

- PrepSeq: FastqCollapser

FastqCollapser is used to remove duplicate reads from fastq files based on their sequence content.

To run FastqCollapser go to [ Preprocessing → PrepSeq → FastqCollapser ] and proceed as shown in Fig. 13.

Figure 13: Using the GPRO interface for FastqCollapser.
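
Duplicate removal based on sequence content can be pictured with the following Python sketch. It is only an illustration of the idea: it assumes that the first record seen for each distinct sequence is the one kept, which may differ from how FastqCollapser chooses a representative read.

```python
def collapse_duplicates(records):
    """Keep the first record seen for each distinct sequence.

    records: iterable of (header, sequence, quality) tuples.
    """
    seen = set()
    unique = []
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq, qual))
    return unique


reads = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "ACGT", "####"),  # same sequence as r1, so it is dropped
    ("@r3", "TTTT", "IIII"),
]
collapsed = collapse_duplicates(reads)
```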

- PrepSeq: FastqIntersect

FastqIntersect is a script that compares two paired-end files that have been preprocessed independently and edits both so that they keep only the reads present in both files (mate reads), in the same order. This tool is used when, after running a preprocessing tool on each file individually, the number of reads in one file no longer matches the number in the other. Assembly and mapping processes require both files to contain the same reads in the same order. Please note that both Prinseq and Trimmomatic already intersect reads when the ‘pair end files’ box is ticked. Thus, FastqIntersect only needs to be run either when the ‘pair end files’ box has not been selected or when Cutadapt has been used, since this tool does not implement intersecting functions.

To run FastqIntersect go to [ Preprocessing→ PrepSeq → FastqIntersect ] and follow Fig. 14.

Figure 14: Using the GPRO interface for FastqIntersect.
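
The intersection of two independently preprocessed paired-end files can be sketched in Python as follows. This is not the FastqIntersect script itself. It assumes that mates share a read name that differs only by a trailing /1 or /2, and that the order of the forward file is the one to preserve.

```python
def read_id(header):
    """Return the read name without the mate suffix (/1 or /2) or any comment."""
    name = header.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def intersect_pairs(forward, reverse):
    """Keep only reads whose mate is present in the other file.

    forward, reverse: lists of (header, sequence, quality) tuples.
    Both outputs follow the order of the forward file.
    """
    reverse_by_id = {read_id(h): (h, s, q) for h, s, q in reverse}
    kept_fwd, kept_rev = [], []
    for h, s, q in forward:
        rid = read_id(h)
        if rid in reverse_by_id:
            kept_fwd.append((h, s, q))
            kept_rev.append(reverse_by_id[rid])
    return kept_fwd, kept_rev


fwd = [("@r1/1", "AAAA", "IIII"), ("@r2/1", "CCCC", "IIII"), ("@r3/1", "GGGG", "IIII")]
rev = [("@r3/2", "TTTT", "IIII"), ("@r1/2", "ACGT", "IIII")]  # r2 lost its mate
kept_fwd, kept_rev = intersect_pairs(fwd, rev)
```

After the call, both lists contain the same number of reads in matching order, which is the condition that mapping tools expect from paired-end input.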

2.2.2 - MAPPING

Mapping is required to align the preprocessed reads of the fastq libraries on the reference genome. The mapping drop-down submenu provides access to two mapping tools: Tophat and Hisat2.

- Mapping : Tophat

TopHat (Trapnell et al. 2012; Kim et al. 2013) aligns RNA-Seq reads to a reference genome and identifies exon-exon splice junctions. For more information, see the TopHat manual at https://ccb.jhu.edu/software/tophat/manual.shtml.

To run Tophat go to [ Mapping → Tophat ] and follow Fig. 15.

Figure 15: Using the GPRO interface for TopHat.

- Mapping : Hisat2

HISAT2 (Kim, Langmead, and Salzberg 2015) is a tool for mapping sequencing reads (both DNA and RNA). For more information, see the HISAT2 manual at https://ccb.jhu.edu/software/hisat2/manual.shtml.

To run Hisat2 go to [ Mapping → Hisat2 ] and follow Fig. 16.

Figure 16: Using the GPRO interface for Hisat2.

2.2.3 - TRANSCRIPTOME ASSEMBLY

Through this interface, we combine the Cufflinks and Cuffcompare tools (Trapnell et al. 2012) to assemble the set of transcript isoforms contained in each bam/sam file obtained after mapping and then quantify the expression of the transcripts, obtaining extra metrics. The interface also calls Gffread (https://github.com/gpertea/gffread) to reconstruct the fasta sequences of each transcriptome sample. Optionally, it can also call Cuffmerge (also part of Cufflinks) to merge two or more transcriptome assemblies into a consensus assembly. For more information, see the Cufflinks manual at http://cole-trapnell-lab.github.io/cufflinks/cuffmerge/index.html.

- Transcriptome Assembly: Cufflinks

To run the Cufflinks protocol for transcriptome assembly go to [ Transcriptome Assembly → Cufflinks ] and follow Fig. 17.

Figure 17: Using the GPRO interface for Cufflinks.

2.2.4 - TESTING DIFFERENTIAL EXPRESSION

Differential gene expression analyses in RNASeq are performed using Cuffdiff, a tool of the Cufflinks package (Trapnell et al. 2012) that finds significant changes in gene expression, splicing, and promoter use. CummeRbund (Goff et al. 2013) is also implemented in RNASeq’s Cuffdiff interface so that metrics can be obtained from the results provided by Cufflinks. For more information, see the Cufflinks manual at http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html.

- Differential Expression Test: Cuffdiff

To run Cuffdiff go to [Diff Expression Test→ Cuffdiff] and follow Fig. 18.

Figure 18: Using the GPRO interface for Cuffdiff.

2.2.5 - GOSEQ

This interface provides access to GOseq (Young et al. 2010), a tool for detecting the over- or under-representation of Gene Ontology (GO) categories, as well as other user-defined categories (e.g. metabolic pathways), in the results file obtained from a differential expression analysis. For more information, see the GOseq manual at https://bioconductor.org/packages/release/bioc/html/goseq.html.

- GoSeq: GoSeq

GOseq offers two work modes, ‘auto’ and ‘custom’, which can be selected at the top of the ‘Input’ dialog in the GOseq interface. In ‘auto’ mode, enrichment analyses are based on natively supported genomes (that is, those listed in Ensembl), while the ‘custom’ mode is used for non-native genomes and transcriptomes. In the latter case, the user must prepare the necessary input material according to the results of the differential expression analysis.

To run GOseq go to [ GOseq → GOseq ] and follow Fig. 19.

Figure 19: Using the GPRO interface for GOseq.
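
To illustrate what over-representation of a category means, the Python sketch below computes a plain hypergeometric tail probability. It is a simplification and not the GOseq method: GOseq additionally corrects for gene length bias, which this sketch ignores. The function name and the toy numbers are illustrative.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_genes, n_in_category, n_de, n_de_in_category):
    """Probability of seeing at least `n_de_in_category` category members
    among `n_de` differentially expressed genes drawn from `n_genes`,
    of which `n_in_category` belong to the category.
    """
    upper = min(n_in_category, n_de)
    total = comb(n_genes, n_de)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_de - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / total


# Toy case: 10 genes, 4 in the category, 3 DE genes, 2 of them in the category.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 2)
print(p_value)
```

A small p-value suggests that the category contains more differentially expressed genes than expected by chance.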

2.3 - MAPPING & COUNTING PROTOCOL

The mapping & counting protocol is the second type of RNA-seq analysis workflow implemented in the current RNASeq version. This protocol implements analysis steps that are very similar to the ones included in the Tophat/Hisat2 & Cufflinks protocol, differing from it in the way transcripts are quantified and the way differential expression analysis is performed. The mapping & counting protocol is oriented to the analysis of non-native genomes (those where a GTF reference is not available) and transcriptomes that have been reconstructed de novo where a consensus transcriptome is used as a reference.

To access the mapping & counting protocol, go to [ Transcripts Protocols → Step-by-Step Mode → Mapping & Counting Protocol ]. A new submenu will then appear in the workspace, organizing the steps required to perform an RNA-seq analysis without a reference, as follows: [ Preprocessing → Mapping → Post processing → Diff Expression Analysis → Goseq ].

2.3.1 - PREPROCESSING

The tools and interfaces provided for data preprocessing are the same as those included in the Tophat & Cufflinks protocol. To see a full description of these tools, please refer to Section 2.2.1 of this manual.

2.3.2 - MAPPING

The protocol for mapping & counting includes three different mapping tools, namely Bowtie2 (Langmead and Salzberg 2012), BWA (Li and Durbin 2010) and Hisat2 (Kim, Langmead, and Salzberg 2015).

- Mapping: Bowtie2

Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. For further details on this tool, see the Bowtie2 manual at http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml.

To run Bowtie2 go to [ Mapping → Bowtie2 ] and follow Fig. 20.

Figure 20: Using the GPRO interface for Bowtie2.

- Mapping : BWA

BWA is a software package for the mapping of low-divergent sequences against large reference genomes. It consists of three different algorithms, namely BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is specifically designed for Illumina sequence reads up to 100bp in length, while the other two are implemented for sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share features such as long-read support and split alignment. BWA-MEM, however, is generally recommended for high-quality queries, as it is faster and more accurate. For more information, please refer to the BWA manual at http://bio-bwa.sourceforge.net/bwa.shtml.

To run BWA go to [ Mapping → Bwa ] and follow Fig. 21.

Figure 21: GPRO interface for BWA.

- Mapping : Hisat2

The interface provided for mapping with Hisat2 is the same as the one described for the Tophat & Cufflinks protocol. For more details, please refer to Section 2.2.2 of this manual.

2.3.3 - POST-PROCESSING

The mapping & counting workflow takes its name from the quantification step that is required before the differential expression analysis starts. This step, which is based on clustering and counting, is implemented via two different tools, namely Corset (Davidson and Oshlack 2017) and HTseq (Anders et al. 2014). A description of both tools is provided in the following sections.

- Postprocessing : Corset

Corset has been traditionally used to obtain gene-level counts of de novo-obtained transcriptome assemblies. To do so, Corset uses the reads that have been mapped to the transcriptome to hierarchically cluster them according to the proportion of shared reads and their expression patterns. Subsequently, the clusters and gene-level counts for each sample are reported. The output generated by Corset is the input that will be required by the counting-based tools EdgeR and DESeq to later perform the differential expression analysis. For more information, please refer to the Corset manual at https://github.com/Oshlack/Corset/wiki/InstallingRunningUsage.

To run Corset go to [ Postprocessing → Corset ] and follow Fig. 22.

Figure 22: Using the GPRO interface for Corset.
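
The notion of clustering transcripts by their proportion of shared reads can be pictured with a simple distance measure. The Python sketch below is a hypothetical illustration and is not taken from the Corset source: it assumes that the distance is one minus the fraction of reads shared relative to the transcript with fewer reads, and it ignores the expression pattern criterion that Corset also applies.

```python
def shared_read_distance(reads_a, reads_b):
    """Distance between two transcripts based on the reads mapped to each.

    reads_a, reads_b: sets of read identifiers.
    Returns 0.0 when the smaller set is fully shared and 1.0 when
    nothing is shared (assumed definition, for illustration only).
    """
    if not reads_a or not reads_b:
        return 1.0
    shared = len(reads_a & reads_b)
    return 1.0 - shared / min(len(reads_a), len(reads_b))


t1 = {"r1", "r2", "r3", "r4"}
t2 = {"r3", "r4"}   # all of its reads also map to t1
t3 = {"r1", "r9"}   # half of its reads also map to t1
t4 = {"r5"}         # shares nothing with t1

d12 = shared_read_distance(t1, t2)
d13 = shared_read_distance(t1, t3)
d14 = shared_read_distance(t1, t4)
```

Under this assumed measure, transcripts at a small distance (such as t1 and t2) would be candidates for the same cluster, and their reads would contribute to a single gene-level count.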

- Postprocessing : Htseq

For this post-processing option, RNASeq makes use of the ‘Htseq-count’ option of the Htseq tool. This tool counts those reads that are mapped to genomic features (exons, genes, etc.), reporting a counts file at the selected feature level (in RNA-Seq, typically genes). The output generated by Htseq-count will be the input required by the tools EdgeR and DESeq to perform the differential gene expression analysis. For more information, please refer to the Htseq manual at https://htseq.readthedocs.io/en/release_0.9.1/count.html.

To run Htseq go to [ Postprocessing → Htseq ] and follow Fig. 23.

Figure 23: Using the GPRO interface for Htseq.
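
Counting reads per feature can be sketched in Python as follows. This is a much simplified illustration and not the Htseq-count implementation: it treats every read and every gene as a single interval, ignores strand and spliced alignments, and assumes 0-based, half-open coordinates. The special counters `__no_feature` and `__ambiguous` mirror the names that Htseq-count reports; the gene names and coordinates are made up.

```python
def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap exactly one gene.

    alignments: list of (chrom, start, end) tuples, one per read.
    genes: dict mapping gene_id to (chrom, start, end).
    Reads overlapping no gene or more than one gene go to special counters.
    """
    counts = {gene_id: 0 for gene_id in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in alignments:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 150, 300)}
alignments = [
    ("chr1", 110, 140),  # overlaps geneA only
    ("chr1", 160, 190),  # overlaps both genes
    ("chr1", 250, 280),  # overlaps geneB only
    ("chr2", 110, 140),  # different chromosome
    ("chr1", 400, 450),  # outside any gene
]
gene_counts = count_reads_per_gene(alignments, genes)
```

The resulting table of counts per gene has the shape of the input that count-based tools such as EdgeR and DESeq expect, with one such column per sample.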

2.3.4 - TESTING DIFFERENTIAL EXPRESSION

The RNASeq mapping & counting workflow implements two interfaces for two alternative count-based tools for differential expression analysis: EdgeR (Robinson, McCarthy, and Smyth 2010) and DESeq (Love, Huber, and Anders 2014).

- Diff Expression Analyses : DESeq

DESeq estimates variance-mean dependences in sequencing read count data to then test for differential gene expression using a model that is based on the negative binomial distribution. For more information please refer to the DESeq manual at https://bioconductor.org/packages/3.8/bioc/html/DESeq.html.

To run DESeq go to [ Diff Expression Analysis → DESeq ] and follow Fig. 24.

Figure 24: Using the GPRO interface for DESeq.
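
The variance-mean dependence mentioned above can be illustrated with the standard negative binomial parameterisation, in which the variance grows with the square of the mean. The Python sketch below only evaluates that relationship; it is not DESeq's estimation procedure, which is more involved, and the dispersion values used are arbitrary.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion.

    With dispersion 0 the variance equals the mean, as in a Poisson model.
    """
    return mean + dispersion * mean ** 2


poisson_like = nb_variance(10, 0.0)
overdispersed = nb_variance(10, 0.25)
```

The larger the dispersion, the more the counts of a gene are expected to vary between replicates, and the stronger the evidence needed to call that gene differentially expressed.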

- Diff Expression Analyses : EdgeR

EdgeR performs differential gene expression analyses using an array of statistical methods that are based on negative binomial distributions including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. For more information please refer to the EdgeR manual at https://bioconductor.org/packages/release/bioc/html/edgeR.html.

To run EdgeR go to [ Diff Expression Analysis → EdgeR ] and follow Fig. 25.

Figure 25: Using the GPRO interface for EdgeR.

2.3.5 - GOSEQ

The interface provided for enrichment analysis with GOseq is the same as the one detailed in the Tophat & Cufflinks protocol. For more details, please refer to Section 2.2.5 of this manual.