DENOVOSEQ - STEP BY STEP MODE USAGE


2.1 - INTRODUCTION

This section explains how to use DeNovoSeq in Step-by-Step mode. In this mode, the user runs the DeNovoSeq analyses of a protocol as a workflow in which each step can be completed independently of the others and in which options and parameters are declared before the job is launched. The workflow is organized into a menu with one tab per step; each tab has a drop-down list of the third-party command line interface (CLI) tools available for that step. Every CLI tool has its own interface with fields to declare the inputs, the outputs, and the parameters and options. In this way, the Step-by-Step mode supports different state-of-the-art protocols for differential expression analysis with or without a reference genome. For example, you can first run a sample quality analysis (FastQC) and obtain a report on the quality and nature of your raw reads. You may then run the sample preprocessing steps (sequence trimming, adapter removal, etc.) one by one to see how each “cleaning” step affects the quality of your samples. Next, you may upload your processed (clean) reads for mapping along with your reference genome, obtain sam/bam files in the following steps for differential expression and enrichment analyses, and so on. The current DeNovoSeq Step-by-Step menu covers one protocol, described below.


2.2 - DENOVO PROTOCOLS

This protocol is based on de novo assembly, the methodology most commonly used to characterize genomes or transcriptomes for the first time, when nothing is known about them. For this task, DeNovoSeq provides distinct interfaces to manage the most common de novo assemblers, e.g. Oases (Schulz et al., 2012) or CANU (Koren et al., 2017), to build high-quality de novo transcriptomes and/or genomes, as well as additional tools for gap filling (GapCloser; Luo et al., 2012) and scaffolding (BESST; Sahlin et al., 2014) to improve the quality and accuracy of genome assemblies. This workflow runs on the server, so the user must first upload the fastq files and any other files needed for the analysis via the FTP browser or any other FTP client linked to the user account on the server. To use this analysis protocol, go to:

[DeNovo Protocols → Step-by-Step → DeNovo Protocols]

A new submenu will appear in the workspace (Fig. 5) listing the four steps (Preprocessing, DeNovo assembly, Gene Prediction, Annotation) of the step-by-step workflow implemented by DeNovoSeq.

Figure 5

Figure 5: Submenu to follow a de novo analysis protocol in Step-by-step mode


2.2.1 - QUALITY ANALYSIS AND PREPROCESSING

Raw data preprocessing is necessary to prepare the fastq libraries for assembly. It involves several steps in which the raw reads are trimmed, cleaned or modified to remove adapter remains, sequencing artifacts, contaminants and/or low-quality sequences. The “Preprocessing” drop-down submenu (Fig. 6) provides access to the tools for quality analysis, demultiplexing, and sequence trimming and cleaning, including those for adapter removal and for filtering out low-quality sequences. The following sections briefly describe each preprocessing tool.

Figure 6

Figure 6: Preprocessing drop down submenu listing the preprocessing options enabled in the DeNovoSeq application.


- Quality Analysis: FastQC

FastQC (Andrews 2016) provides a simple way to perform quality checks on raw sequence data. It offers a modular set of analyses that can indicate whether your data contain potential artifacts that require “cleaning” before beginning any analyses. After a FastQC quality check, you will obtain a complete sequence quality report that provides hints on what form of filtering and processing your sample requires, if any. For example, overrepresented sequences that correspond to the adapters used during sequencing may need to be removed from your fastq files; these are shown in the “overrepresented sequences” section of the FastQC report. Further details can be found in the FastQC manual at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

To run FastQC go to [Preprocessing → Quality Analysis → FastQC] and follow Fig. 7.


Figure 7

Figure 7: Using the GPRO interface for FastQC.


- Demultiplex: FastqMidCleaner

FastqMidCleaner sorts and splits sequencing reads from fastq files into separate files according to predefined molecular identifiers (MIDs).

To run FastqMidCleaner go to [Preprocessing → Demultiplex → FastqMidCleaner] and follow Fig. 8.


Figure 8

Figure 8: Using the GPRO interface for FastqMidCleaner.
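
The idea of splitting reads by MID can be illustrated with a short Python sketch. It is not the FastqMidCleaner implementation: the function names, the sample-to-MID table, the assumption that the MID sits at the very start of each read, and the choice to clip the MID from the read are all illustrative assumptions.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from well-formed FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line is not needed
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def demultiplex(records, mids):
    """Group reads by the MID found at the start of each read.

    mids maps a sample name to its MID sequence (hypothetical values).
    Assumption: the MID is clipped from the read and its quality string.
    Reads that match no MID are collected under 'unassigned'.
    """
    bins = defaultdict(list)
    for header, seq, qual in records:
        for sample, mid in mids.items():
            if seq.startswith(mid):
                bins[sample].append((header, seq[len(mid):], qual[len(mid):]))
                break
        else:
            bins["unassigned"].append((header, seq, qual))
    return dict(bins)
```

Each list in the returned dictionary would then be written to its own fastq file, one per sample.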



- Trimming & Cleaning: Cutadapt

Cutadapt (Martin 2011) finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequences from your sequencing reads. For more information on Cutadapt, see its manual at https://cutadapt.readthedocs.io/en/stable/guide.html

To run Cutadapt go to [Preprocessing → Trimming & Cleaning → Cutadapt] and follow Fig. 9.


Figure 9

Figure 9: Using the GPRO interface for Cutadapt.



- Trimming & Cleaning: Prinseq

Prinseq (Schmieder and Edwards 2011) can be used to filter, reformat, or trim your sequencing reads. For further information see Prinseq manual at http://prinseq.sourceforge.net.

To run Prinseq go to [Preprocessing → Trimming & Cleaning → Prinseq] and follow Fig. 10.


Figure 10

Figure 10: Using the GPRO interface for Prinseq.



- Trimming & Cleaning: Trimmomatic

Trimmomatic (Bolger et al. 2014) is a trimming tool for paired-end and single-end reads obtained via Illumina’s NGS technology that can perform a variety of trimming tasks. For more information see the Trimmomatic manual at http://www.usadellab.org/cms/?page=trimmomatic.

To run Trimmomatic go to [Preprocessing → Trimming & Cleaning → Trimmomatic] and follow Fig. 11.


Figure 11

Figure 11: Using the GPRO interface for Trimmomatic.


- Trimming & Cleaning: FastxToolKit

FASTX-Toolkit (Hannon Lab 2016) is a collection of tools for preprocessing Fasta/Fastq files.

For more information, see the FASTX-Toolkit manual at http://hannonlab.cshl.edu/fastx_toolkit/.

To run any of the FASTX-Toolkit tools go to [Preprocessing → Trimming & Cleaning → Fastx-Toolkit] and follow Fig. 12.


Figure 12

Figure 12: Example of GPRO interface for a tool (FASTXTOOLKIT: FASTQ to FASTA) of Fastx-Toolkit.


- PrepSeq: FastqCollapser

FastqCollapser is used to remove duplicate reads from fastq files based on their sequence content.

To run FastqCollapser go to [Preprocessing → PrepSeq → FastqCollapser] and follow Fig. 13.


Figure 13

Figure 13: Using the GPRO interface for FastqCollapser.
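
Removing duplicates by sequence content can be sketched in a few lines of Python. This is a hedged illustration rather than the FastqCollapser code: the function name is hypothetical, and keeping the first read seen for each sequence is an assumption about how ties are resolved.

```python
def collapse_reads(records):
    """Keep the first read seen for each distinct sequence.

    records is an iterable of (header, sequence, quality) tuples.
    Returns the unique reads in input order and a per-sequence count.
    """
    counts = {}
    unique = []
    for header, seq, qual in records:
        if seq not in counts:
            counts[seq] = 0
            unique.append((header, seq, qual))
        counts[seq] += 1
    return unique, counts
```

The counts can be useful to report how many copies each collapsed read stood for.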



- PrepSeq: FastqIntersect

FastqIntersect is a script that compares two paired-end files that have been preprocessed independently and edits both files so that they keep only the reads present in both (mate reads), in the same order. This tool is used when independent preprocessing of each file leaves the two files with different numbers of reads, because assembly and mapping processes require the files to match in the number and order of reads. Please note that both Prinseq and Trimmomatic already provide a function to intersect reads, enabled by ticking the ‘pair end files’ box. Thus, FastqIntersect only needs to be run either when the ‘pair end files’ box has not been selected or when Cutadapt has been used, since this tool does not implement intersecting functions.

To run FastqIntersect go to [Preprocessing → PrepSeq → FastqIntersect] and follow Fig. 14.


Figure 14

Figure 14: Using the GPRO interface for FastqIntersect.
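
The intersection of two independently preprocessed mate files can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FastqIntersect script itself: the function names are hypothetical, and mates are assumed to share a read name that differs only by a trailing "/1" or "/2" or by a comment after the first space.

```python
def read_id(header):
    """Return the read name without the mate suffix or header comment.

    Assumption: headers look like '@name/1' or '@name 1:N:0:...'.
    """
    name = header.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def intersect_pairs(forward, reverse):
    """Keep only reads whose mate survives in the other file.

    forward and reverse are lists of (header, sequence, quality) tuples.
    Both outputs follow the order of the forward file, so mates end up
    at the same position in each list.
    """
    rev_by_id = {read_id(rec[0]): rec for rec in reverse}
    kept_fwd = [rec for rec in forward if read_id(rec[0]) in rev_by_id]
    kept_rev = [rev_by_id[read_id(rec[0])] for rec in kept_fwd]
    return kept_fwd, kept_rev
```

Holding one file in a dictionary keeps the sketch short; for very large libraries a streaming approach over sorted files would use less memory.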



2.2.2 - DENOVO ASSEMBLY

De novo assembly is the step for the reconstruction and annotation of new genomes and transcriptomes without the aid of a reference genome and with no prior knowledge of their length, layout and/or composition. It is the methodology most commonly used to characterize new or partially characterized sequences. De novo assemblers join short or long nucleotide sequences into longer sequences called contigs without using a reference sequence. For genome assemblies, contigs can be post-processed to improve the quality and accuracy of the draft assembly: to resolve sequence gaps (undetermined positions, normally represented as “Xs” or “Ns”) and/or to assemble the contigs into larger units called scaffolds. For these three tasks, DeNovoSeq provides distinct interfaces to manage different assemblers, gap filling tools and scaffolders.

Input configuration file:

DeNovoSeq requires an input configuration file to guide the assembly process. You can either use an existing configuration file or create a new one.

To create a new input configuration file go to [DeNovo Protocols → Input configuration file] and follow Fig. 15.


Figure 15

Figure 15: Using the GPRO interface for configuring an input file for DeNovo assemblers.


ASSEMBLY:

- Transcriptomes: Oases

Oases (Schulz et al., 2012) is a de novo transcriptome assembler built on the Velvet assembler core that aims to resolve transcripts from short and long sequencing reads in the absence of any genomic reference. For more information, see the Oases manual at "https://www.ebi.ac.uk/~zerbino/oases/OasesManual.pdf".

To run Oases go to [De novo assembly → Assembly → Transcriptomes → Oases] and follow Fig. 16.


Figure 16

Figure 16: Using the GPRO interface to assemble short and long transcriptome reads into new transcripts with Oases.


- Transcriptomes: SOAPdenovo-Trans

SOAPdenovo-Trans (Luo et al., 2012) is a de novo transcriptome assembler adapted from the SOAPdenovo framework to resolve transcripts (taking into account alternative splicing and different expression levels) from short sequencing reads in the absence of any genomic reference. For more details, see the SOAPdenovo-Trans manual at "https://github.com/aquaskyline/SOAPdenovo-Trans".

To run SOAPdenovo-Trans go to [De novo assembly → Assembly → Transcriptomes → SOAPdenovo-Trans] and follow Fig. 17.


Figure 17

Figure 17: Using the GPRO interface to assemble short transcriptome reads into new transcripts with SOAPdenovo-Trans.


- Genomes: Velvet

Velvet (Zerbino and Birney, 2008) is a de novo genome assembler that takes short read sequences and resolves high quality contigs. For more information, see the Velvet manual at "https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf".

To run Velvet go to [De novo assembly → Assembly → Genomes → Velvet] and follow Fig. 18.


Figure 18

Figure 18: Using the GPRO interface to assemble short and long genome reads into new contigs with Velvet.


- Genomes: SOAPdenovo2

SOAPdenovo2 (Luo et al., 2012) is an assembler designed to assemble Illumina GA short reads. SOAPdenovo2 aims to reduce memory consumption in graph construction, resolve repeat regions in contig assembly, increase coverage and length in scaffold construction, and improve gap closing. See the SOAPdenovo2 manual at "https://github.com/aquaskyline/SOAPdenovo2" for more information.

To run SOAPdenovo2 go to [De novo assembly → Assembly → Genomes → SOAPdenovo2] and follow Fig. 19.


Figure 19

Figure 19: Using the GPRO interface to assemble short genome reads into new contigs with SOAPdenovo2.


- Genomes: CANU

CANU (Koren et al., 2017) is a fork of the Celera Assembler designed for high-noise single-molecule sequencing such as the PacBio RSII or Oxford Nanopore MinION. CANU is a hierarchical assembly pipeline that runs in four steps, which are described in the tutorial at "http://canu.readthedocs.io/en/latest/tutorial.html".

To run CANU go to [De novo assembly → Assembly → Genomes → CANU] and follow Fig. 20.


Figure 20

Figure 20: Using the GPRO interface to assemble long genome reads into new contigs with CANU.


- Genomes: SPAdes

SPAdes (Bankevich et al., 2012) is an assembler specifically recommended to reconstruct bacterial genomes (both single-cell MDA and standard isolates), fungal and other small genomes. SPAdes supports paired-end reads, mate-pairs and unpaired reads. See the SPAdes manual at "http://cab.spbu.ru/software/spades/" for more information.

To run SPAdes go to [De novo assembly → Assembly → Genomes → SPAdes] and follow Fig. 21.


Figure 21

Figure 21: Using the GPRO interface to assemble bacterial genome reads into new contigs with SPAdes.


GAP FILLING:

- Gap filling: GapCloser

Due to low sequence coverage or repetitive elements, assemblies reconstructed de novo often show sequence and/or fragment “gaps” represented as stretches of uncharacterized nucleotides (N). Some of these gaps can be closed by re-processing latent information in the raw reads. GapCloser (Luo et al., 2012) closes gaps that emerge during scaffolding by SOAPdenovo or another assembler, using the abundant pair relationships of short reads. See the GapCloser manual at "https://vcru.wisc.edu/simonlab/bioinformatics/programs/soap/GapCloser_Manual.pdf" for more information.

To run GapCloser go to [De novo assembly → Gap filling → GapCloser] and follow Fig. 22.


Figure 22

Figure 22: Using the GPRO interface to manage GapCloser.
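
Before and after gap filling it can be helpful to list where the N stretches are. The following Python sketch locates them in a single sequence; it is only an inspection helper under the assumption that gaps are written as runs of "N" or "n", and the function name is hypothetical. It does not reproduce anything GapCloser does internally.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end, length) for each run of N/n in the sequence.

    Coordinates are 1-based and inclusive. Runs shorter than min_length
    are ignored.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            gaps.append((match.start() + 1, match.end(), length))
    return gaps
```

Comparing the output for a scaffold before and after gap filling gives a quick count of how many gaps were closed or shortened.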


SCAFFOLDING:

- Scaffolding: BESST

BESST (Sahlin et al., 2014) is a software package for scaffolding genomic assemblies. It includes several tools to build a “contig graph” from the available assembly information and to obtain scaffolds and accurate gap size information from this graph. See the BESST manual at "https://github.com/ksahlin/BESST" for more information.

To run BESST go to [De novo assembly → Scaffolding → BESST] and follow Fig. 23.


Figure 23

Figure 23: Using the GPRO interface to manage BESST scaffolder.


- Scaffolding: OPERA-LG long reads

OPERA (Gao et al., 2011) is a scaffolder based on an exact algorithm that minimizes the discordance of scaffolds with the information provided by paired-end, mate-pair or long reads. OPERA uses this read information to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project. See the OPERA manual at "https://sourceforge.net/projects/operasf/files/OPERA-LG%20version%202.0.6/" for more information.

To run OPERA go to [De novo assembly → Scaffolding → OPERA-LG long reads] and follow Fig. 24.


Figure 24

Figure 24: Using the GPRO interface to manage OPERA scaffolder.


2.2.3 - GENE PREDICTION

Gene prediction refers to the set of methodologies used to identify the regions of genomic DNA that encode genes (protein-coding and non-coding) as well as other regulatory regions and functional elements. For prokaryotic genes, we provide an ORF finder script that is also available in the SeqEditor app of the GPRO suite. For eukaryotic genomes, we provide an interface within DeNovoSeq to run AUGUSTUS.

- Find ORFs:

Find ORFs searches for ORFs simultaneously in one or more fasta files with multiple sequences; you only need to specify a minimum length and the reading frames to scan (forward and/or reverse). Detected ORFs can subsequently be selected and exported, or translated and exported as protein sequences. The tool also exports annotation files with the coordinates of the ORFs.

To run Find ORFs go to [Gene Prediction → Prokaryotes → Find ORFs] and follow Fig. 25.


Figure 25

Figure 25: Using the GPRO interface to manage Find ORFs.
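
The kind of search performed by an ORF finder can be sketched in Python as below. This is an illustration under stated assumptions and not the GPRO script: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, the standard stop codons TAA, TAG and TGA are used, and all function and parameter names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=30, strands=("+", "-")):
    """Return ORFs as (strand, frame, start, end, orf_sequence) tuples.

    Assumption: an ORF starts at the first ATG of a frame and ends at the
    next in-frame stop codon (included). start and end are 1-based
    positions on the strand that was scanned; frame is 1, 2 or 3.
    """
    seq = seq.upper()
    orfs = []
    for strand in strands:
        template = seq if strand == "+" else reverse_complement(seq)
        for frame in range(3):
            start = None
            for pos in range(frame, len(template) - 2, 3):
                codon = template[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    if pos + 3 - start >= min_length:
                        orfs.append((strand, frame + 1, start + 1, pos + 3,
                                     template[start:pos + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Restricting `strands` to `("+",)` or `("-",)` mirrors the choice between forward and reverse reading frames.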


- AUGUSTUS

AUGUSTUS is an ab initio program that predicts genes from eukaryotic genome sequences based on a Generalized Hidden Markov Model, a probabilistic model of a sequence and its intron-exon gene structure (Stanke et al., 2008). The workflow implemented in DeNovoSeq comprises three steps: “Training”, “Hints preparation” and “Prediction”. See the AUGUSTUS manual at "http://bioinf.uni-greifswald.de/augustus/" for more information.

To complete the Training step go to [Gene prediction → Augustus → Training] and follow Fig. 26.


Figure 26

Figure 26: Using the GPRO interface to train AUGUSTUS for predicting gene structures in your genome.


AUGUSTUS lets you incorporate hints on the gene structure coming from extrinsic sources.

If you have the material needed to create the hints, go to [Gene prediction → Augustus → Hints] and follow Fig. 27.


Figure 27

Figure 27: Using the GPRO interface to create the hints with AUGUSTUS.


Once the training has been performed (and the hints created, if you have them), you can run the final step, Prediction, by going to [Gene prediction → Augustus → Prediction] and following Fig. 28.


Figure 28

Figure 28: Using the GPRO interface to perform the Gene prediction with AUGUSTUS.


2.2.4 - ANNOTATION

In de novo protocols, annotation is the step (normally the final one) aimed at identifying the function, domains and biological roles of a set of predicted genes and transcripts (coding and non-coding). Annotation is normally performed via sequence-to-sequence or sequence-to-profile alignment to find statistically significant homologies between your query sequences and a RefSeq database. DeNovoSeq lets the user perform the annotation either manually or automatically using some of the most widely used tools for automatic annotation, including the NCBI BLAST package (Altschul et al., 1990) and HMMER3 (Mistry et al., 2013).

NCBI-BLAST

NCBI BLAST (Altschul et al., 1990) is a software package that finds regions of local similarity between a query sequence (or a fasta file with several query sequences) and the subject sequences or reference database searched by the query or queries. The package implements different tools to compare nucleotide or protein sequences to sequence databases and to calculate the statistical significance of the matches:

  1. blastp searches protein subject databases using protein queries
  2. blastn searches nucleotide subject databases using nucleotide queries
  3. blastx searches protein subject databases using translated nucleotide queries
  4. tblastn searches nucleotide subject databases using protein queries
  5. tblastx searches translated nucleotide subject databases using translated nucleotide queries

For more information see the NCBI-BLAST manual at "https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs".
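
The choice among the five programs listed above follows directly from the type of query and subject database. A small Python lookup makes the rule explicit; the function name and the "protein"/"nucleotide" labels are illustrative assumptions, while the program names are those of the BLAST package.

```python
BLAST_PROGRAMS = {
    # (query type, database type, translated comparison) -> program
    ("protein", "protein", False): "blastp",
    ("nucleotide", "nucleotide", False): "blastn",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast_program(query_type, db_type, translated=None):
    """Pick the BLAST program for a query/database combination.

    If translated is None, translation is assumed only when the query
    and database types differ; pass translated=True with two nucleotide
    inputs to select tblastx.
    """
    if translated is None:
        translated = query_type != db_type
    try:
        return BLAST_PROGRAMS[(query_type, db_type, translated)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} vs {db_type} "
            f"(translated={translated})"
        ) from None
```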

The step-by-step mode of DeNovoSeq presents a specific interface for each step of a typical BLAST analysis organized in the Annotation tab, as follows:

- Format BLAST databases

To perform a BLAST analysis, the subject database must be in BLAST format. For this task, DeNovoSeq implements an interface that accepts both protein and nucleotide fasta files as input and formats them for BLAST.

To do this go to [Annotation → NCBI-BLAST → Format databases] and follow Fig. 29.


Figure 29

Figure 29: Using the GPRO interface for formatting blast subjects.
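
For readers who want to see what formatting a database looks like outside the interface, the sketch below builds a command line for `makeblastdb` from the BLAST+ suite. Whether DeNovoSeq calls this program internally is an assumption; the helper name and file names are hypothetical, and the sketch only assembles the argument list without running it.

```python
def makeblastdb_command(fasta_path, seq_type, db_name):
    """Build a makeblastdb argument list for a protein or nucleotide fasta.

    Assumption: the BLAST+ makeblastdb program is the formatter in use.
    seq_type must be 'protein' or 'nucleotide'.
    """
    dbtype = {"protein": "prot", "nucleotide": "nucl"}[seq_type]
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]
```

The returned list can be passed to `subprocess.run` on a machine where BLAST+ is installed.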


- Import RefSeq databases

Large databases such as those provided by NCBI (NR, RefSeq) or UniProt (Swiss-Prot) are difficult to process and occupy a significant part of the disk space on your PC, or in your user account if you are working on the server side.

For users working on the server side, we provide a centralized repository of large databases whose links can be imported into the user account by going to [Annotation → NCBI-BLAST → Import BLAST databases] and following Fig. 30.


Figure 30

Figure 30: Using the GPRO interface to format fasta files as NCBI-BLAST reference databases or for importing them precompiled.


- BLAST search with fasta file query

DeNovoSeq provides a frame for the conventional web-based interface of NCBI-BLAST that allows users to run fast searches against their databases and multifasta files using one or multiple sequence queries, obtaining the typical alignment output provided by BLAST. Annotation of multifasta files with multiple sequences (for example genomes, transcriptomes or proteomes) may take a significant time to complete and deliver the results (hours or perhaps days, depending on the number of sequences per file).

To execute the BLAST search for single queries go to [Annotation → NCBI-BLAST → BLAST search with one query] and follow the instructions for single queries in Fig. 31.


Figure 31

Figure 31: Using the GPRO interface to search blast-formatted databases with one or more queries using NCBI-BLAST package.


- Process BLAST output

The output of the BLAST search delivered by DeNovoSeq for multifasta files is a set of XML files (one per sequence queried against the subject database), each containing all matches found in the database for that query. This means that if a multifasta file annotated via BLAST has 25000 sequences, the BLAST search will report 25000 XML files. DeNovoSeq also provides an interface for an internal GPRO script for automatic annotation. This script processes the XML outputs provided by BLAST and prints them into a human-readable annotation file in CSV format. The interface for this script provides filtering parameters to define an e-value cut-off, filter redundant matches, extract a given number of best hits per query, and more.

To execute the process BLAST output script, go to [Annotation → NCBI-BLAST → Process BLAST output] and follow Fig. 32.


Figure 32

Figure 32: Using the GPRO interface for processing BLAST outputs and extracting annotation files from BLAST results.
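
The conversion from BLAST XML to a CSV annotation table can be sketched with the Python standard library. This is not the GPRO script: the function names, column names and default cut-offs are assumptions, and each hit is scored by its first HSP only. The element names (`Iteration`, `Hit`, `Hsp_evalue`, and so on) are those of the standard BLAST XML output.

```python
import csv
import io
import xml.etree.ElementTree as ET


def blast_xml_to_rows(xml_text, max_evalue=1e-5, best_hits=1):
    """Return the best hits per query from a BLAST XML document.

    Hits whose first HSP has an e-value above max_evalue are dropped;
    the remaining hits are sorted by e-value and truncated to best_hits.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        kept = []
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP of this hit
            if hsp is None:
                continue
            evalue = float(hsp.findtext("Hsp_evalue"))
            if evalue <= max_evalue:
                kept.append((query, hit.findtext("Hit_id"),
                             hit.findtext("Hit_def"), evalue,
                             float(hsp.findtext("Hsp_bit-score"))))
        kept.sort(key=lambda row: row[3])
        rows.extend(kept[:best_hits])
    return rows


def rows_to_csv(rows):
    """Serialize annotation rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query", "subject_id", "subject_description",
                     "evalue", "bit_score"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With one XML file per query, the rows from each file would be accumulated and written once at the end.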


- Retrieve sequences from BLAST outputs

DeNovoSeq also provides a parsing script that extracts sequences (from either the query file or the subject database) according to the results of the BLAST search and creates a new fasta file containing only the retrieved sequences. The interface for this script also provides filtering parameters to extract either full sequences or just the core of the query or subject sequence that aligns, i.e. the High-scoring Segment Pair (HSP). This last mode also allows extractions that extend “n” nucleotides upstream and downstream of the HSP core.

To access the script for retrieving sequences from query or subject multifasta files according to the BLAST output, go to [Annotation → NCBI-BLAST → Retrieve sequences from BLAST outputs] and follow Fig. 33.


Figure 33

Figure 33: Using the GPRO interface for processing BLAST outputs and extracting sequences files from BLAST results.
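
Extracting the HSP core with “n” flanking positions reduces to a clamped slice. The Python sketch below assumes 1-based inclusive coordinates as reported by BLAST; the function name is hypothetical, and reverse-strand hits are only normalized to ascending coordinates, not reverse-complemented.

```python
def extract_hsp(sequence, hsp_start, hsp_end, flank=0):
    """Return the HSP region of a sequence plus flank positions on each side.

    hsp_start and hsp_end are 1-based and inclusive; if start > end the
    two are swapped. The window is clipped at the sequence ends.
    """
    low, high = sorted((hsp_start, hsp_end))
    begin = max(low - 1 - flank, 0)
    end = min(high + flank, len(sequence))
    return sequence[begin:end]
```

For example, an HSP at positions 4 to 6 of "AAACCCGGGTTT" with `flank=2` yields "AACCCGG".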


HMMER3

HMMER3 (Mistry et al., 2013) is a software package for searching sequence homologs that compares protein or nucleotide sequence queries against a user-made database of Hidden Markov Model (HMM) profiles (or vice versa). HMM profiles are probabilistic models that capture position-specific information, from a set of aligned sequences (i.e. a multiple alignment), about the evolutionary changes that occurred at each alignment position.

The step-by-step mode of DeNovoSeq presents different interfaces to manage HMMER for creating and editing HMM databases from multiple alignments, creating consensus sequences or for performing comparative analyses. See the manual of HMMER at "http://eddylab.org/software/hmmer/Userguide.pdf" for more information.

- Create HMMER databases

DeNovoSeq provides an interface to call a small pipeline executing the HMMER commands hmmbuild, hmmcalibrate and hmmpress to respectively:

  1. Construct HMMs from multiple alignment files in fasta format
  2. Calibrate the HMM profiles.
  3. Generate a Majority Rule Consensus (MRC) Sequence

To execute the hmmbuild-hmmcalibrate-hmmpress pipeline go to [Annotation → HMMER → Create HMMER databases] and follow the instructions for creating HMMs in Fig. 34.


Figure 34

Figure 34: Using the GPRO interface for building HMMER databases.


- Edit HMMER databases

For editing HMM databases, DeNovoSeq provides an interface that calls the HMMER commands hmmalign and hmmemit to respectively:

  1. Add and align new sequences to the HMM database and then update the HMM
  2. Generate an updated MRC sequence

To execute the hmmalign-hmmemit pipeline go to [Annotation → HMMER → Edit HMMER databases] and follow the instructions for editing and updating HMMs in Fig. 35.


Figure 35

Figure 35: Using the GPRO interface for editing HMMER databases.


- HMMER search fasta file query

DeNovoSeq provides a frame for a web-based interface to run fast searches against the HMM databases using one or multiple sequence queries, obtaining the typical alignment output provided by HMMER. Three kinds of HMM searches are allowed:

  1. Protein queries against a HMM profile database using the “hmmscan” command of HMMER
  2. Protein HMM queries vs a protein database using the “hmmsearch” command
  3. DNA sequence queries against a database of DNA HMMs using the “nhmmscan” command

For annotation of multifasta files, DeNovoSeq provides a specific interface that allows users to run multiple HMMER3 searches as a background process. For large query files, the interface also lets the user divide the input file into multiple subqueries to speed up the process.

To execute a HMMER search for multifasta file queries, go to [Annotation → HMMER → HMMER search with fasta file query] and follow the instructions for multifasta file searches in Fig. 36.


Figure 36

Figure 36: Using the GPRO interface for searching against the HMM databases using one sequence or multiples queries and obtaining the typical alignment output provided by HMMER.
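
Dividing a large multifasta query into subqueries can be sketched as follows. This Python example is an illustration, not the DeNovoSeq implementation: the function names are hypothetical, and splitting into contiguous, near-equal blocks of records is an assumption about how the division is done.

```python
def read_fasta(text):
    """Parse fasta text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, n_parts):
    """Split records into at most n_parts contiguous, near-equal blocks."""
    n_parts = max(1, min(n_parts, len(records)))
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts
```

Each block would be written to its own fasta file and searched independently, and the outputs concatenated afterwards.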







