Biotechvana

DENOVOSEQ - STEP BY STEP MODE USAGE

2.1 - INTRODUCTION

This section explains how to use DeNovoSeq in Step-by-Step mode. This mode lets the user run all the DeNovoSeq analyses of a protocol as a workflow in which each step can be completed independently of the others and in which options and parameters are declared before launching the job. The workflow is organized into an intuitive menu with one tab per step and, within each tab, a drop-down list of the third-party command line interface (CLI) software available for that step. Every CLI tool has its own interface with fields to declare the inputs, the outputs, and the parameters and options. In this way, the Step-by-Step mode supports different state-of-the-art protocols for de novo assembly and annotation of genomes, transcriptomes, metagenomes, and metatranscriptomes. For example, you can first run a sample quality analysis (FastQC) and obtain a report on the quality and nature of your raw reads. You may then run the sample preprocessing steps (sequence trimming, adapter removal, etc.) one by one to see how each “cleaning” step affects the quality of your samples. Next, you may submit your processed (clean) reads for de novo assembly to obtain the set of contigs or scaffolds, and so on through the following steps (gene prediction, annotation and functional analysis).

The current Step-by-Step menu includes one protocol:

DeNovo Protocols (to assemble short nucleotide sequences into longer ones without the use of a reference genome)

2.2 - DENOVO PROTOCOLS

This protocol is based on de novo assembly, the methodology used in studies that aim to characterize genomes or transcriptomes about which nothing is known (i.e. for the first time). For this task, DeNovoSeq provides distinct interface solutions to manage the most common de novo assemblers to build high-quality transcriptomes and/or genomes, as well as additional tools to improve the quality and accuracy of the assemblies obtained, extract the genes/ORFs, and annotate and characterize them. This workflow runs on the server, meaning that the user must first upload the fastq files and any other files needed for the analysis via the FTP browser or any other FTP client linked to the user account on the server. To make use of this analysis protocol, go to:

[DeNovo Protocols → Step-by-Step → DeNovo Protocols]

A new submenu will appear in the workspace (Fig. 5) listing the four steps (Preprocessing, DeNovo Assembly, Gene Prediction, Annotation) of the step-by-step workflow implemented by DeNovoSeq.

Figure 5: Submenu to follow a de novo analysis protocol in Step-by-step mode

2.2.1 - QUALITY ANALYSIS AND PREPROCESSING

Raw data preprocessing is necessary to prepare the fastq libraries for assembly or mapping. It involves several steps in which the raw reads are trimmed, cleaned, or modified to remove adapter remains, sequencing artifacts, contamination and/or low-quality sequences. The “Preprocessing” drop-down submenu (Fig. 6) provides access to tools for quality analysis, demultiplexing, and sequence trimming and cleaning, including adapter removal and filtering of low-quality sequences. The following sections briefly describe each preprocessing tool.

Figure 6: Preprocessing options enabled in DeNovoSeq.

- Quality Analysis: FastQC

FastQC (Andrews 2016) provides a simple way to perform quality checks on raw sequence data. It provides a modular set of analyses that can indicate whether your data contain potential artifacts that require “cleaning” before beginning any analyses. After performing a FastQC quality check, you will obtain a complete sequence quality report that provides hints on what kind of filtering and processing your sample requires, if any. For example, overrepresented sequences that correspond to the adapters used during sequencing may need to be removed from your fastq files. These are shown in the “overrepresented sequences” section of the FastQC report. Further details on FastQC can be found in the FastQC manual at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

To run FastQC go to [Preprocessing → Quality Analysis → FastQC] and follow Fig. 7.

Figure 7: Using the GPRO interface for FastQC.
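
As a rough illustration of the kind of per-position quality summary that a quality report contains, the following minimal Python sketch decodes FASTQ quality strings and averages the scores at each read position. It assumes the Phred+33 encoding (an assumption, since the encoding of your files is not stated here), and the function names are hypothetical; this is not how FastQC itself is implemented.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_position(qualities, offset=33):
    """Average the Phred score at each read position over all reads.

    Reads may have different lengths; each position is averaged over
    the reads that are long enough to cover it.
    """
    longest = max(len(q) for q in qualities)
    means = []
    for pos in range(longest):
        scores = [ord(q[pos]) - offset for q in qualities if pos < len(q)]
        means.append(sum(scores) / len(scores))
    return means


# Two toy quality strings: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
example_means = mean_quality_per_position(["II5", "I5"])
```

A drop in the mean quality towards the 3' end of the reads is the typical pattern that suggests quality trimming before assembly.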

- Demultiplex: FastqMidCleaner

FastqMidCleaner sorts and splits sequencing reads from fastq files into separate files according to predefined molecular identifiers (MIDs).

To run FastqMidCleaner go to [Preprocessing → Demultiplex → FastqMidCleaner] and follow Fig. 8.

Figure 8: Using the GPRO interface for FastqMidCleaner.
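
The idea of splitting reads by MID can be sketched in a few lines of Python. The sketch below assumes that the MID sits at the start of each read and must match exactly, and it removes the MID from the read after assignment; these are illustrative assumptions, as are the sample names and MID sequences, and the actual behaviour of FastqMidCleaner may differ.

```python
from collections import defaultdict


def demultiplex(records, mids):
    """Group FASTQ records by the MID found at the start of each read.

    records: iterable of (header, sequence, quality) tuples.
    mids: dict mapping a sample name to its MID sequence (hypothetical values).
    Returns a dict: sample name -> list of records with the MID removed.
    Reads that match no MID are collected under "unassigned".
    """
    bins = defaultdict(list)
    for header, seq, qual in records:
        for sample, mid in mids.items():
            if seq.startswith(mid):
                # Strip the MID from both the sequence and its quality string.
                bins[sample].append((header, seq[len(mid):], qual[len(mid):]))
                break
        else:
            bins["unassigned"].append((header, seq, qual))
    return dict(bins)


reads = [
    ("@r1", "ACGTTTGGA", "IIIIIIIII"),
    ("@r2", "TGCAGGCCA", "IIIIIIIII"),
    ("@r3", "GGGGGGGGG", "IIIIIIIII"),
]
result = demultiplex(reads, {"sample_A": "ACGT", "sample_B": "TGCA"})
```

Each key of the resulting dictionary would correspond to one output fastq file.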

- Trimming & Cleaning: Cutadapt

Cutadapt (Martin 2011) finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequences from your sequencing reads. For more information on Cutadapt see the Cutadapt manual at https://cutadapt.readthedocs.io/en/stable/guide.html

To run Cutadapt go to [Preprocessing → Trimming & Cleaning → Cutadapt] and follow Fig. 9.

Figure 9: Using the GPRO interface for Cutadapt.
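
To give an intuition of what adapter removal does to a read, here is a deliberately simplified Python sketch that removes a 3' adapter and everything after it. It only handles exact matches (plus an adapter truncated by the end of the read), whereas Cutadapt itself uses error-tolerant alignment; the function name and the `min_overlap` parameter are hypothetical.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter (exact match only) and everything after it.

    If the full adapter is not found, look for a partial adapter
    (at least min_overlap bases) at the very end of the read.
    Returns the trimmed (sequence, quality) pair.
    """
    pos = seq.find(adapter)
    if pos == -1:
        # Try progressively shorter adapter prefixes at the read end.
        longest = min(len(adapter), len(seq)) - 1
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter detected, read left untouched
    return seq[:pos], qual[:pos]


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used for illustration only
full = trim_3prime_adapter("TTTTAGATCGGAAGAGCAAA", "I" * 20, adapter)
partial = trim_3prime_adapter("ACGTACGTAGATCGGAAG", "I" * 18, adapter)
clean = trim_3prime_adapter("ACGTACGT", "I" * 8, adapter)
```

Note that the quality string is always trimmed together with the sequence so that the FASTQ record stays consistent.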

- Trimming & Cleaning: Prinseq

Prinseq (Schmieder and Edwards 2011) can filter, reformat, or trim your sequencing reads. For further information, see the Prinseq manual at http://prinseq.sourceforge.net.

To run Prinseq go to [Preprocessing → Trimming & Cleaning → Prinseq] and follow Fig. 10.

Figure 10: Using the GPRO interface for Prinseq.

- Trimming & Cleaning: Trimmomatic

Trimmomatic (Bolger et al. 2014) is a trimming tool for paired-end and single-end reads obtained via Illumina’s NGS technology that can perform a variety of trimming tasks. For more information see the Trimmomatic manual at "http://www.usadellab.org/cms/?page=trimmomatic".

To run Trimmomatic go to [Preprocessing → Trimming & Cleaning → Trimmomatic] and follow Fig. 11.

Figure 11: Using the GPRO interface for Trimmomatic.

- Trimming & Cleaning: FastxToolKit

FASTX-Toolkit (Hannon Lab 2016) is a set of preprocessing tools for Fasta/Fastq files:

FASTA Formatter: For the formatting of FASTA files.
FASTA Clipping Histogram: Creates a Linker Clipping Information Histogram.
FASTA Nucleotides Changer: Converts sequences from DNA to RNA and vice versa in FASTA files.
FASTQ Quality Chart: Plots Solexa Quality BoxPlots.
FASTQ Quality Filter: Filters FASTQ files.
FASTQ to FASTA: Converts fastq files into fasta files.
FASTX Artifacts Filter: Filters for artifacts in FASTA/FASTQ files.
FASTX Barcode Splitter: Reads FASTA/FASTQ file and splits it into several smaller files based on barcode matching.
FASTX Clipper: Clips adapters from FASTA/FASTQ files.
FASTX Collapser: Collapses FASTA/FASTQ files.
FASTX Nucleotide Distribution: Plots FASTA/Q Nucleotide Distribution.
FASTX Renamer: Renames sequences from FASTA/FASTQ files.
FASTX Reverse Complement: Creates the reverse complement of FASTA/FASTQ files.
FASTX Statistics: Generates statistics from FASTA/FASTQ files. If a FASTA file is given, only the nucleotide distribution is calculated and no quality information is provided.
FASTX Trimmer: Trims sequences from FASTA/FASTQ files.

For more information, see the Fastx-Toolkit manual at "http://hannonlab.cshl.edu/fastx_toolkit/".

To run any of the Fastx-Toolkit tools go to [Preprocessing → Trimming & Cleaning → Fastx-Toolkit] and follow Fig. 12.

Figure 12: Example of GPRO interface for a tool (FASTXTOOLKIT: FASTQ to FASTA) of Fastx-Toolkit.

- PrepSeq: FastqCollapser

FastqCollapser is used to remove duplicate reads from fastq files based on their sequence content.

To run FastqCollapser go to [Preprocessing → PrepSeq → FastqCollapser] and proceed as shown in Fig. 13.

Figure 13: Using the GPRO interface for FastqCollapser.
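
Removing duplicates by sequence content can be pictured with the following minimal Python sketch. It keeps the first record seen for each distinct sequence; which copy the real tool keeps, and how it treats quality values, is not stated here, so that choice is an assumption of the example.

```python
def collapse_duplicates(records):
    """Keep only the first record for each distinct read sequence.

    records: iterable of (header, sequence, quality) tuples.
    The original order of the retained reads is preserved.
    """
    seen = set()
    unique = []
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq, qual))
    return unique


reads = [
    ("@r1", "ACGT", "IIII"),
    ("@r2", "TTTT", "IIII"),
    ("@r3", "ACGT", "5555"),  # same sequence as r1, so it is dropped
]
collapsed = collapse_duplicates(reads)
```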

- PrepSeq: FastqIntersect

FastqIntersect is a script that compares two paired-end files that have been preprocessed independently and edits them so that only the reads present in both files (mate reads) are kept, in the same order. This tool is used when, after running a preprocessing tool on each file individually, the number of reads in one file no longer matches the other. This matters because assembly/mapping processes require that the two files match in the number and order of reads. Please note that both Prinseq and Trimmomatic already have a function to intersect reads, enabled by ticking the ‘pair end files’ box. Thus, FastqIntersect only needs to be run when the ‘pair end files’ box has not been selected, or when Cutadapt has been used, since this tool does not implement intersecting functions.

To run FastqIntersect go to [Preprocessing → PrepSeq → FastqIntersect] and follow Fig. 14.

Figure 14: Using the GPRO interface for FastqIntersect.
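
The intersection of two independently preprocessed mate files can be sketched as follows in Python. The example assumes that mates share the same read name, optionally followed by a "/1" or "/2" suffix or by a space-separated comment; this naming convention and the function names are assumptions made for illustration, not a description of the FastqIntersect script itself.

```python
def read_id(header):
    """Return the read name shared by both mates.

    '@read1/1' -> 'read1' and '@read1 1:N:0' -> 'read1' (assumed conventions).
    """
    name = header.lstrip("@").split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def intersect_pairs(forward, reverse):
    """Keep only reads whose mate survives in the other file.

    Both outputs follow the order of the forward file, so that the
    n-th forward read and the n-th reverse read are always mates.
    """
    reverse_by_id = {read_id(rec[0]): rec for rec in reverse}
    kept_forward, kept_reverse = [], []
    for rec in forward:
        rid = read_id(rec[0])
        if rid in reverse_by_id:
            kept_forward.append(rec)
            kept_reverse.append(reverse_by_id[rid])
    return kept_forward, kept_reverse


forward = [("@a/1", "AAA", "III"), ("@b/1", "CCC", "III"), ("@c/1", "GGG", "III")]
reverse = [("@c/2", "TTT", "III"), ("@a/2", "ACG", "III")]  # mate of b was filtered out
kept_fwd, kept_rev = intersect_pairs(forward, reverse)
```

After the intersection both lists have the same length and the same read order, which is the condition required by assemblers and mappers.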

2.2.2 - DENOVO ASSEMBLY

De novo assembly is the methodology most commonly used to characterize new genomic and transcriptomic sequences. For genomes, de novo assemblers join short or long nucleotide NGS reads and piece them together into longer sequences called contigs without the use of a reference sequence and with no prior knowledge of their length, layout and/or composition. Contigs can be postprocessed to resolve sequence gaps (indeterminations normally represented as “Xs” or “Ns”) and/or to assemble the contigs into larger units called scaffolds. For transcriptomes, de novo assemblers assemble the NGS reads into transcripts (also without the use of a reference sequence). DeNovoSeq provides distinct interface solutions to manage state-of-the-art assemblers for these two kinds of approaches.

Input configuration file:

DeNovoSeq requires an input configuration file to create a new assembly project and guide the assembly process. To this end, go to [DeNovo Protocols → Input configuration file] to access an interface where you can either download a previously existing configuration file with the experimental settings or create a new one (indicating the number and type of fastq libraries, type of sequencing, insert size, etc.) as shown in Figure 15.

Figure 15: DeNovoSeq interface for configuring the input file needed by assemblers.

ASSEMBLY:

DeNovoSeq implements interface solutions for six assemblers, most of them based on de Bruijn graphs: two of them assemble transcriptomes and the other four focus on genome assembly.
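
For readers unfamiliar with de Bruijn graphs, the following toy Python sketch shows the basic construction: every k-mer found in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a didactic simplification with a hypothetical function name; real assemblers add error correction, coverage tracking, graph simplification and much more.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from a list of reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix to its suffix. Returns a dict: node -> sorted list of successors.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping reads; walking the graph AC -> CG -> GT -> TA -> AC
# spells out the sequence they share.
toy_graph = de_bruijn_graph(["ACGTA", "CGTAC"], k=3)
```

Contigs correspond to unambiguous paths through such a graph, which is why repeats and sequencing errors (which create branches) make assembly harder.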

- Transcriptomes: Oases

Oases (Schulz et al., 2012) is a de novo transcriptome assembler powered by the Velvet assembler core that resolves transcripts from short and long sequencing reads in the absence of any genomic reference. For more information, see the Oases manual at "https://www.ebi.ac.uk/~zerbino/oases/OasesManual.pdf".

To run Oases go to [De novo assembly → Assembly → Transcriptomes → Oases] and follow Fig. 16.

Figure 16: DeNovoSeq interface to assemble short and long transcriptome reads into transcripts using Oases.

- Transcriptomes: SOAPdenovo-Trans

SOAPdenovo-Trans (Luo et al., 2012) is a de novo transcriptome assembler adapted from the SOAPdenovo framework to resolve transcripts (including alternative splicing and different expression levels) from short sequencing reads in the absence of any genomic reference. For more details, see the SOAPdenovo-Trans manual at "https://github.com/aquaskyline/SOAPdenovo-Trans".

To run SOAPdenovo-Trans go to [De novo assembly → Assembly → Transcriptomes → SOAPdenovo-Trans] and follow Fig. 17.

Figure 17: DeNovoSeq interface to assemble short transcriptome reads into transcripts using SOAPdenovo-Trans.

- Genomes: Velvet

Velvet (Zerbino and Birney, 2008) is a de novo genome assembler that takes short read sequences and resolves high quality contigs. For more information, see the Velvet manual at "https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf".

To run Velvet go to [De novo assembly → Assembly → Genomes → Velvet] and follow Fig. 18.

Figure 18: DeNovoSeq interface to assemble short and long genome reads into new contigs using Velvet.

- Genomes: SOAPdenovo2

SOAPdenovo2 (Luo et al., 2012) is an assembler designed to assemble Illumina GA short reads. SOAPdenovo2 reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, and improves gap closing. See the SOAPdenovo2 manual at "https://github.com/aquaskyline/SOAPdenovo2" for more information.

To run SOAPdenovo2 go to [De novo assembly → Assembly → Genomes → SOAPdenovo2] and follow Fig. 19.

Figure 19: DeNovoSeq interface to assemble short genome reads into contigs using SOAPdenovo2.

- Genomes: CANU

CANU (Koren et al., 2017) is a fork of the Celera Assembler designed for high-noise single-molecule sequencing such as the PacBio RSII or Oxford Nanopore MinION. CANU is a hierarchical assembly pipeline that runs in four steps:

Detect overlaps in high-noise sequences
Generate corrected sequence consensus
Trim corrected sequences
Assemble trimmed corrected sequences

See "http://canu.readthedocs.io/en/latest/tutorial.html" for more information.

To run CANU go to [De novo assembly → Assembly → Genomes → CANU] and follow Fig. 20.

Figure 20: DeNovoSeq interface to assemble long genome reads into contigs using CANU.

- Genomes: SPAdes

SPAdes (Bankevich et al., 2012) is an assembler recommended for reconstructing small genomes (bacterial, fungal and others). SPAdes supports paired-end reads, mate-pairs and unpaired reads. See the SPAdes manual at "http://cab.spbu.ru/software/spades/" for more information.

To run SPAdes go to [De novo assembly → Assembly → Genomes → SPAdes] and follow Fig. 21.

Figure 21: DeNovoSeq interface to assemble bacterial genome reads into new contigs using SPAdes.

GAP FILLING:

- Gap filling: GapCloser

GapCloser (Luo et al., 2012) is a tool that closes gaps using the abundant pair-to-pair relationship of short reads. See the GapCloser manual at "https://vcru.wisc.edu/simonlab/bioinformatics/programs/soap/GapCloser_Manual.pdf" for more information.

To run GapCloser go to [De novo assembly → Gap filling → GapCloser] and follow Fig. 22.

Figure 22: DeNovoSeq interface for GapCloser.

SCAFFOLDING:

The results of de novo assemblies are usually fragmented sets of genomic sequences (contigs) that can be re-ordered, edited and joined into larger sequences called scaffolds using paired-end information. DeNovoSeq implements interfaces for two alternative scaffolders: BESST (Sahlin et al., 2014) and OPERA (Gao et al., 2011).

- Scaffolding: BESST

BESST (Sahlin et al., 2014) is a scaffolder that includes several tools to build a “contig graph” from the available assembly information and to obtain scaffolds and accurate gap size information from this graph. See the BESST manual at "https://github.com/ksahlin/BESST" for more information.

To run BESST go to [De novo assembly → Scaffolding → BESST] and follow Fig. 23.

Figure 23: DeNovoSeq interface to manage the BESST scaffolder.

- Scaffolding: OPERA-LG long reads

OPERA (Gao et al., 2011) is a scaffolder that uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project. See the OPERA manual at "https://sourceforge.net/projects/operasf/files/OPERA-LG%20version%202.0.6/" for more information.

To run OPERA go to [De novo assembly → Scaffolding → OPERA-LG long reads] and follow Fig. 24.

Figure 24: DeNovoSeq interface to manage the OPERA scaffolder.

2.2.3 - GENE PREDICTION

Gene prediction refers to the set of methodologies and tools used to identify the genomic regions encoding for genes (protein-coding and non-coding) and other regulatory and functional elements. For prediction of prokaryotic genes, DeNovoSeq provides an ORF finder script also available in the SeqEditor application of the GPRO suite. For eukaryotic genomes, we provide an interface within DeNovoSeq to run AUGUSTUS (Stanke et al., 2008).

- Find ORFs:

Find ORFs searches for ORFs simultaneously in one or more fasta files with multiple sequences; the user only needs to specify a minimum length and the reading frames to scan (forward and/or reverse). Detected ORFs can then be selected and exported, or translated and exported as protein sequences. Find ORFs also exports annotation files with the coordinates of the ORFs.

To run Find ORFs go to [Gene Prediction → Prokaryotes → Find ORFs] and follow Fig. 25.

Figure 25: DeNovoSeq interface for using Find ORFs.
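
The logic of an ORF search with a minimum length over the six reading frames can be illustrated with the short Python sketch below. It assumes that an ORF runs from an ATG start codon to the first in-frame stop codon (TAA, TAG or TGA) and reports 0-based coordinates on the scanned strand; these definitions and the function names are assumptions for the example and may not match the criteria used by the Find ORFs script.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_length=30):
    """Find ORFs (ATG ... stop, stop included) in the six reading frames.

    Returns tuples (strand, frame, start, end, orf_sequence), where frame
    is 1, 2 or 3 and start/end are 0-based, end-exclusive coordinates on
    the scanned strand. ORFs shorter than min_length bases are skipped.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_length:
                        orfs.append((strand, frame + 1, start, end,
                                     strand_seq[start:end]))
                    start = None  # look for the next ORF in this frame
    return orfs


example_orfs = find_orfs("CCATGAAATTTGGGTAACC", min_length=12)
```

The coordinates returned for each ORF are the kind of information that an annotation file with ORF positions would contain.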

- AUGUSTUS

AUGUSTUS (Stanke et al., 2008) is an ab initio program that predicts genes in eukaryotic genome sequences based on a Generalized Hidden Markov Model (GHMM), a probabilistic model of a sequence and its intron-exon gene structure. The implementation of AUGUSTUS in DeNovoSeq is a workflow with three steps: “Training”, “Hints preparation” and “Prediction”. See the AUGUSTUS manual at "http://bioinf.uni-greifswald.de/augustus/" for more information.

To run the Training step of AUGUSTUS go to [Gene prediction → Augustus → Training] and follow Fig. 26.

Figure 26: DeNovoSeq interface to train AUGUSTUS for predicting gene structures in your genome.

Users can also incorporate hints on the gene structure from extrinsic sources to improve the accuracy of the gene prediction. Some examples of allowed hints follow:

Proteins

If available, you can use a fasta file with refseq sequences to create protein hints.

Transposon and repeats

If available, use the output (.out extension) provided by RepeatMasker. For more details about RepeatMasker and how to install and run it, see http://www.repeatmasker.org

RNAseq

If you have RNAseq data aligned to your genome, you can use the bam files to create hints from the RNAseq data. Note, however, that for now DeNovoSeq only accepts bam files generated by Bowtie or Tophat.

ESTs

If available, you can use a fasta file with ESTs or transcript sequences to create transcriptome hints.

If you have the material needed to create the hints, go to [Gene prediction → Augustus → Hints] and follow Fig. 27.

Figure 27: DeNovoSeq interface to create the hints with AUGUSTUS

Once the training step has been completed and the (optional) hints have been created, users can run the Prediction step by going to [Gene prediction → Augustus → Prediction] and following Fig. 28.

Figure 28: DeNovoSeq interface to perform the Gene prediction using AUGUSTUS.

2.2.4 - ANNOTATION

In de novo protocols, annotation is the step (normally the final one) aimed at identifying the function, domains and biological roles of a set of predicted genes and transcripts (coding and non-coding). Annotation is normally performed via sequence-to-sequence or sequence-to-profile comparison to find statistically significant homologies between your query sequences and a RefSeq database. DeNovoSeq lets the user perform the annotation either manually or automatically using widely used tools for automatic annotation: the NCBI BLAST package (Altschul et al., 1990) and HMMER3 (Mistry et al., 2013).

NCBI-BLAST

NCBI BLAST (Altschul et al., 1990) is a software package that finds regions of local similarity between a query sequence (or a fasta file with several query sequences) and the subject sequences or RefSeq database searched. The package implements different tools to compare nucleotide or protein sequences to sequence databases and calculate the statistical significance of matches:

blastp searches protein subject databases using protein queries
blastn searches nucleotide subject databases using nucleotide queries
blastx searches protein subject databases using translated nucleotide queries
tblastn searches nucleotide subject databases using protein queries
tblastx searches translated nucleotide databases using translated nucleotide queries

For more information see the NCBI-BLAST manual at "https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs".
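
The choice among the five programs listed above depends only on the type of the query and the type of the subject database, which can be summarized in a small lookup. The Python sketch below is just a memory aid with a hypothetical helper name; it does not run BLAST.

```python
BLAST_PROGRAMS = {
    # (query type, database type): program
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick the BLAST program for a query/database type combination.

    translate_both=True selects tblastx, which compares translated
    nucleotide queries with a translated nucleotide database.
    """
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs nucleotide queries and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


# Predicted transcripts searched against a protein database:
program = choose_blast_program("nucleotide", "protein")
```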

The step-by-step mode of DeNovoSeq presents a specific interface for each step of a typical BLAST analysis organized in the Annotation tab, as follows:

Format BLAST databases
Import RefSeq databases
BLAST search with fasta file query
Process BLAST outputs
Retrieve sequences from BLAST outputs

- Format BLAST databases

To perform a BLAST analysis the subject database must be formatted in BLAST format. For this task, DeNovoSeq implements an interface that accepts both protein and nucleotide fasta files as input formatting them for BLAST.

To do this, go to [Annotation → NCBI-BLAST → Format databases] and follow Fig. 29.

Figure 29: Using the GPRO interface for formatting blast subjects.
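
For users who want to reproduce this step outside the interface, the sketch below builds the equivalent command line for `makeblastdb`, the database formatter of the NCBI BLAST+ package. Whether DeNovoSeq calls this program (rather than the legacy `formatdb`) is an assumption, and the file and database names are placeholders. The Python code only assembles the argument list and does not execute anything.

```python
def build_makeblastdb_command(fasta_path, molecule, db_name):
    """Build the argument list to format a fasta file as a BLAST database.

    molecule: "protein" or "nucleotide".
    The returned list can be passed to subprocess.run() on a machine
    where BLAST+ is installed.
    """
    db_types = {"protein": "prot", "nucleotide": "nucl"}
    if molecule not in db_types:
        raise ValueError("molecule must be 'protein' or 'nucleotide'")
    return ["makeblastdb", "-in", fasta_path,
            "-dbtype", db_types[molecule], "-out", db_name]


# Placeholder file names, for illustration only.
command = build_makeblastdb_command("proteins.fasta", "protein", "my_proteins")
```

To actually format the database, the list could be run with `subprocess.run(command, check=True)`.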

- Import RefSeq databases

Big databases such as those provided by NCBI (NR, RefSeq) or UniProt (Swiss-Prot) are difficult to process and occupy a significant amount of disk space on your PC, or in your user account if you are working on the server side.

For users working on the server side, we provide a centralized repository of big databases whose links can be imported into the user account by going to [Annotation → NCBI-BLAST → Import BLAST databases] and proceeding as indicated in Fig. 30.

Figure 30: Using the GPRO interface to format fasta files as NCBI-BLAST reference databases or for importing them precompiled.

- BLAST search with fasta file query

DeNovoSeq provides a frame for the conventional web-based interface of NCBI-BLAST, allowing users to run fast searches against their databases and multifasta files using one or multiple sequence queries and to obtain the typical alignment output provided by BLAST. Annotation of multifasta files with multiple sequences (for example genomes, transcriptomes or proteomes) may take a significant time to complete and deliver the results (hours or perhaps days depending on the number of sequences per file).

To execute the BLAST search for single queries, go to [Annotation → NCBI-BLAST → BLAST search with one query] and follow the instructions for single queries in Fig. 31.

Figure 31: Using the GPRO interface to search blast-formatted databases with one or more queries using NCBI-BLAST package.

- Process BLAST output

The output of the BLAST search delivered by DeNovoSeq for multifasta files is a set of XML files (one per sequence queried against the subject database) containing all the matches found in the database for each query. This means that if a multifasta file annotated via BLAST has 25000 sequences, the BLAST search will report 25000 XML files. DeNovoSeq also provides an interface for an internal GPRO script for automatic annotation. This script processes the XML outputs provided by BLAST and writes them into a human-readable annotation file in CSV format. The interface for this script provides filtering parameters to define an e-value cut-off, filter redundant matches, extract a given number of best hits per query, and more.

To execute the Process BLAST output script, go to [Annotation → NCBI-BLAST → Process BLAST output] and follow Fig. 32.

Figure 32: Using the GPRO interface for processing BLAST outputs and extracting annotation files from BLAST results.
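
The conversion from BLAST XML to a filtered CSV annotation table can be sketched as follows in Python. The example assumes the standard BLAST XML layout (elements such as `Iteration`, `Hit` and `Hsp_evalue`) and uses a tiny hand-written XML sample with made-up hits; the function names, the column layout and the default cut-off are hypothetical and do not describe the internal GPRO script.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Minimal hand-written sample in BLAST XML layout (made-up hits).
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_query-def>seq1</Iteration_query-def><Iteration_hits>
<Hit><Hit_id>P3</Hit_id><Hit_def>protein three</Hit_def><Hit_hsps>
<Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>P1</Hit_id><Hit_def>protein one</Hit_def><Hit_hsps>
<Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>P2</Hit_id><Hit_def>protein two</Hit_def><Hit_hsps>
<Hsp><Hsp_evalue>1e-10</Hsp_evalue></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""


def blast_xml_to_rows(xml_text, max_evalue=1e-5, best_hits=1):
    """Return [query, hit_id, hit_def, evalue] rows for the best hits.

    Hits above max_evalue are discarded; the remaining hits of each
    query are sorted by e-value and only the first best_hits are kept.
    """
    rows = []
    for iteration in ET.fromstring(xml_text).iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hits = []
        for hit in iteration.iter("Hit"):
            evalues = [float(h.findtext("Hsp_evalue")) for h in hit.iter("Hsp")]
            if evalues and min(evalues) <= max_evalue:
                hits.append((min(evalues), hit.findtext("Hit_id"),
                             hit.findtext("Hit_def")))
        hits.sort(key=lambda item: item[0])  # lowest e-value first
        for evalue, hit_id, hit_def in hits[:best_hits]:
            rows.append([query, hit_id, hit_def, evalue])
    return rows


def rows_to_csv(rows):
    """Serialize annotation rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "hit_id", "hit_def", "evalue"])
    writer.writerows(rows)
    return buffer.getvalue()


annotation_rows = blast_xml_to_rows(SAMPLE_XML, max_evalue=1e-5, best_hits=2)
annotation_csv = rows_to_csv(annotation_rows)
```

In a real run the same function would be applied to each XML file produced by the search and the rows concatenated into a single CSV file.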

- Retrieve sequences from BLAST outputs

DeNovoSeq also provides a parsing script that extracts sequences (either from the query file or from the subject database) according to the results of the BLAST search and creates a new fasta file containing only the retrieved sequences. The interface for this script also provides filtering parameters to extract either full sequences or just the core of the query or subject sequence that aligns, i.e. the High-scoring Segment Pair (HSP). This last mode also allows extractions to be extended “n” nucleotides upstream and downstream of the HSP core.

To access the script for retrieving sequences from query or subject multifasta files according to the BLAST output, go to [Annotation → NCBI-BLAST → Retrieve sequences from BLAST outputs] and follow the instructions for this script in Fig. 33.

Figure 33: Using the GPRO interface for processing BLAST outputs and extracting sequences files from BLAST results.
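
Extracting the HSP core with an optional extension of "n" positions on each side amounts to a clamped slice, as in the Python sketch below. It assumes 1-based, inclusive HSP coordinates on the forward strand (the convention used in BLAST reports); the function name is hypothetical and hits on the reverse strand are not handled.

```python
def extract_hsp(sequence, hsp_start, hsp_end, flank=0):
    """Return the HSP region of a sequence, extended by `flank` positions.

    hsp_start and hsp_end are 1-based, inclusive coordinates with
    hsp_start <= hsp_end. The extension is clamped to the sequence ends.
    """
    start = max(hsp_start - 1 - flank, 0)
    end = min(hsp_end + flank, len(sequence))
    return sequence[start:end]


subject = "AAACCCGGGTTT"
core = extract_hsp(subject, 4, 6)               # HSP core only
extended = extract_hsp(subject, 4, 6, flank=2)  # two extra positions per side
```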

HMMER3

HMMER3 (Mistry et al., 2013) is a software package for searching sequence homologs by comparing protein or nucleotide sequence queries against a user-made database of profile Hidden Markov Models (HMMs), or vice versa. HMM profiles are probabilistic models that capture position-specific information about the evolutionary changes that have occurred at each position of a set of aligned sequences (i.e. a multiple alignment).

The step-by-step mode of DeNovoSeq presents different interfaces to manage HMMER for creating and editing HMM databases from multiple alignments, creating consensus sequences or for performing comparative analyses. See the manual of HMMER at "http://eddylab.org/software/hmmer/Userguide.pdf" for more information.

- Create HMMER databases

DeNovoSeq provides an interface that calls a small pipeline executing the HMMER commands hmmbuild, hmmcalibrate and hmmpress to, respectively:

Construct HMMs from multiple alignment files in fasta format
Calibrate the HMM profiles.
Generate a Majority Rule Consensus (MRC) Sequence

To execute the hmmbuild-hmmcalibrate-hmmpress pipeline, go to [Annotation → HMMER → Create HMMER databases] and follow the instructions in Fig. 34 for creating HMMs.

Figure 34: Using the GPRO interface for building HMMER databases.
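
As an intuition for what a Majority Rule Consensus (MRC) sequence is, the Python sketch below takes the most frequent character in each column of a multiple alignment and skips columns where a gap is the majority. This is a simplified, column-counting illustration with a hypothetical function name; HMMER derives its consensus from the profile HMM itself, so the result may differ from what the pipeline produces.

```python
from collections import Counter


def majority_rule_consensus(alignment, gap="-"):
    """Build a consensus from the most common character of each column.

    alignment: list of aligned sequences of equal length.
    Columns whose most common character is a gap are omitted.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:
            consensus.append(residue)
    return "".join(consensus)


toy_alignment = ["ACG-T", "ACGAT", "TCG-T"]
consensus = majority_rule_consensus(toy_alignment)
```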

- Edit HMMER databases

To edit HMM databases, DeNovoSeq provides an interface that calls the HMMER commands hmmalign and hmmemit to, respectively:

Add and align new sequences to the HMM database and then update the HMM
Generate an updated MRC sequence

To execute the hmmalign-hmmemit pipeline, go to [Annotation → HMMER → Edit HMMER databases] and follow the instructions in Fig. 35 for editing and updating HMMs.

Figure 35: Using the GPRO interface for editing HMMER databases.

- HMMER search fasta file query

DeNovoSeq provides a frame for a web-based interface to run fast searches against the HMM databases using one or multiple sequence queries and obtain the typical alignment output provided by HMMER. Three kinds of HMM searches are allowed:

Protein queries against a HMM profile database using the “hmmscan” command of HMMER
Protein HMM queries vs a protein database using the “hmmsearch” command
DNA sequence queries against a database of DNA HMMs using the “nhmmscan” command

For annotation of multifasta files, DeNovoSeq provides a specific interface that lets users run multiple HMMER3 searches as a background process. For large query files the interface also lets the user divide the input file into multiple subqueries to speed up the process.

To execute a HMMER search for multifasta file queries, go to [Annotation → HMMER → HMMER search with fasta file query] and follow the instructions for multifasta file searches in Fig. 36.

Figure 36: Using the GPRO interface for searching against the HMM databases using one or multiple sequence queries and obtaining the typical alignment output provided by HMMER.
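
Dividing a large multifasta query into subqueries can be pictured with the following Python sketch, which parses fasta text and splits the records into contiguous chunks that could then be searched in parallel. The function names and the chunking strategy are assumptions for illustration; how the DeNovoSeq interface splits the input is not described here.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, n_parts):
    """Split records into at most n_parts contiguous chunks of similar size."""
    if not records:
        return []
    n_parts = max(1, n_parts)
    size = -(-len(records) // n_parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


fasta_text = ">s1\nAAA\nCCC\n>s2\nGGG\n>s3\nTTT\n"
records = parse_fasta(fasta_text)
subqueries = split_records(records, 2)
```

Each chunk keeps the original record order, so the per-chunk results can simply be concatenated at the end.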