DENOVOSEQ - FUNCTIONAL GENE ANNOTATION


Functional annotation of predicted genes and transcripts (coding and non-coding) is one of the most important steps of a de novo genome project. The process is mostly based on sequence-to-sequence alignment of query sequences against a reference database of nucleotides or proteins in order to find statistically significant homologies that allow you to identify the function, domains and biological role of a sequence. Performing a functional annotation of transcriptomes and genomes by manual means is very laborious and tedious, especially for sequencing projects that comprise thousands of sequences. For this reason, protocols for functional annotation of full sets of predicted genes and/or transcripts are usually automated. DeNovoSeq implements interface solutions to call three of the most widely used tools for automatic annotation: the NCBI BLAST package (Altschul et al., 1990), HMMER3 (Mistry et al., 2013) and InterProScan (Jones et al., 2014; Finn et al., 2017).

5.1 - FORMAT DATABASE

To perform a functional annotation with BLAST and HMMER3 you need the query file with the gene or transcript predictions previously obtained in the assembly and gene prediction processes (the latter in the case of genomes), as well as a reference database of sequences or models. Before executing your analyses, you need to compile your reference material in the appropriate BLAST or HMMER format so that it can be used as the subject for your query files.


Compilation of BLAST sequence databases


To create and format BLAST databases, DeNovoSeq implements an interface that calls the database formatting tool of the NCBI BLAST package to create the reference database. BLAST accepts both protein and nucleotide fasta files. To format a protein or nucleotide sequence database for a BLAST search with DeNovoSeq go to


            [“Annotation → Format databases → Compile BLAST databases”]

Basic Procedure

For NCBI (NR, RefSeq) or Uniprot (NR, RefSeq, Swissprot) databases, the interface lets you import them as precompiled databases if you already have them stored on your server (see also the GPRO site for the server-side dependencies of DeNovoSeq). To compile these databases do the following:

  1. Choose “Import pre-compiled database” and, from the drop-down menu “Select database” below it, select the reference database you want to compile in BLAST format
  2. Drag the folder you want to use as database directory from the FTP browser to the “Output folder” field associated with the block to import precompiled databases
  3. Click on the compile databases button

To format any protein or nucleotide sequence database in fasta format (the fasta format is mandatory) do the following:

  1. Choose the option “Compile a FASTA multiple sequence file”
  2. Drag the sequence file/s you want to compile in BLAST format
  3. Drag the folder you want to use as database directory from the FTP browser to the “Output folder” field associated with the block to compile a “FASTA multiple sequence file”
  4. Select the sequence type of your database (nucleotide or protein)
  5. Click on the compile databases button

You will get a confirmation message that the job has been launched. If you do not, review all options again. If an input field is invalid or missing, an error icon appears beside the field (hover the mouse over it to see the error message).
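As an illustration of what compiling a fasta file amounts to on the server, the sketch below builds a `makeblastdb` command line from the BLAST+ package. That DeNovoSeq uses this tool with these options is an assumption, and the function name, file names and database naming scheme are hypothetical. The command is only assembled, not executed.

```python
from pathlib import Path


def makeblastdb_command(fasta_file, output_folder, sequence_type):
    """Build (without running) a makeblastdb command for one FASTA file."""
    # makeblastdb names the two database types 'nucl' and 'prot'.
    dbtype = {"nucleotide": "nucl", "protein": "prot"}[sequence_type]
    # Hypothetical naming choice: call the database after the FASTA file.
    db_prefix = Path(output_folder) / Path(fasta_file).stem
    return [
        "makeblastdb",
        "-in", str(fasta_file),
        "-dbtype", dbtype,
        "-out", db_prefix.as_posix(),
    ]


command = makeblastdb_command("proteins.fasta", "blast_db", "protein")
print(" ".join(command))
```

Passing an unknown sequence type raises a `KeyError`, which mirrors the interface refusing to launch a job when a field is invalid.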

Create or manage HMM databases for HMMER


HMMER3 (Mistry et al., 2013) compares sequences against HMM profiles or vice versa. If you are interested in comparing sequences against a reference database of Hidden Markov Model (HMM) profiles, you must create the HMMs if you do not already have them. DeNovoSeq implements an interface that calls the different HMMER3 tools to create HMMs using protein or nucleotide multiple alignments as input and to build a database with these HMMs. In addition, DeNovoSeq allows you to edit and manage HMMs using other HMMER3 tools. To this end, the interface to create and/or manage HMMs presents two sub-interfaces: “Create HMMs” and “Update/Edit HMMs”.

To create HMMs with DeNovoSeq go to


            [“Annotation → Format databases → Create or manage HMMs → Create HMMs”]


Basic Procedure

The sub-interface “Create HMMs” calls the HMMER3 tools hmmbuild and hmmpress to create HMMs, and hmmemit to create a Majority-Rule Consensus (MRC) sequence per HMM. To use this interface, do the following:

  1. Drag the multiple alignment file/s (one per protein or gene family) from which you want to create the HMM database into the input field “Upload a multiple alignment” (the tool accepts fasta and/or Stockholm alignment format).
  2. Check the alignment type option to declare which type of data the multiple alignments contain. It can be protein or nucleotide multiple alignments in fasta format
  3. If you also want to create a consensus sequence with the tool hmmemit, check the option “Generate Majority-Rule Consensus (MRC) sequence”
  4. Drag the folder you want to use as database directory from the FTP browser to the “Output folder” field
  5. Declare whether you want to merge all created HMM profiles into a single database of HMM profiles or prefer to store each HMM in a separate file.
  6. Click on the Run button

You will get a confirmation message that the job has been launched. If you do not, review all options again. If an input field is invalid or missing, an error icon appears beside the field (hover the mouse over it to see the error message).
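The three tools named in this procedure can be sketched as command lines for a single alignment. This is a minimal sketch that assumes one HMM file per family; the function name, file names and the `_mrc.fasta` suffix are hypothetical, and the exact options DeNovoSeq passes are not documented here. The commands are assembled but not run.

```python
from pathlib import Path


def create_hmm_commands(alignment_file, output_folder, alignment_type,
                        make_consensus=True):
    """Return the HMMER3 command lines for one multiple alignment."""
    # hmmbuild can be told the alphabet explicitly instead of guessing it.
    alphabet = {"protein": "--amino", "nucleotide": "--dna"}[alignment_type]
    family = Path(alignment_file).stem
    hmm_file = (Path(output_folder) / f"{family}.hmm").as_posix()
    commands = [
        # hmmbuild [options] <hmmfile_out> <msafile>
        ["hmmbuild", alphabet, hmm_file, str(alignment_file)],
        # hmmpress indexes an HMM file so that hmmscan can search it
        ["hmmpress", hmm_file],
    ]
    if make_consensus:
        consensus = (Path(output_folder) / f"{family}_mrc.fasta").as_posix()
        # hmmemit -c emits a majority-rule consensus sequence
        commands.append(["hmmemit", "-c", "-o", consensus, hmm_file])
    return commands


for command in create_hmm_commands("kinase.sto", "hmm_db", "protein"):
    print(" ".join(command))
```

If you choose to merge all profiles into a single database, the individual HMM files would typically be concatenated into one file before hmmpress is run on it.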


The sub-interface “Update/Edit HMMs” calls the HMMER3 tools hmmalign and hmmemit, allowing you to update HMM files one by one (i.e. the input file must always be a single HMM file). To update or edit an HMM go to


            [“Annotation → Format databases → Create or manage HMMs → Update/Edit HMMs”]


Basic Procedure

  1. Drag the HMM file you want to edit into the input field “Upload HMM”
  2. Check the HMM type option to declare which type of data the HMM contains. It can be protein or nucleotide data.
  3. If you want to align new sequences to the HMM profile using hmmalign, paste the new sequences you want to add to the HMM into the input field “Add your fasta sequences”. Fasta format is required
  4. Check the Sequence type option to declare which type of sequence data is going to be aligned to the HMM
  5. Check the option “Generate Majority-Rule consensus (MRC) sequence” if you also want to create a new MRC sequence associated with the edited HMM profile
  6. Drag the folder for the HMM output from the FTP browser to the “Output folder” field
  7. Click on the Run button

You will get a confirmation message that the job has been launched. If you do not, review all options again. If an input field is invalid or missing, an error icon appears beside the field (hover the mouse over it to see the error message).
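The alignment step of this procedure can be sketched as an hmmalign command line. The sketch assumes that the pasted sequences have been saved to a fasta file; the function name, file names and the `_updated.sto` suffix are hypothetical. How DeNovoSeq derives the edited profile from the resulting alignment is not described here, so that step is left out.

```python
from pathlib import Path


def hmmalign_command(hmm_file, fasta_file, output_folder, sequence_type):
    """Build (without running) an hmmalign command for a single HMM file."""
    alphabet = {"protein": "--amino", "nucleotide": "--dna"}[sequence_type]
    aligned = (Path(output_folder) / f"{Path(hmm_file).stem}_updated.sto").as_posix()
    # hmmalign [options] <hmmfile> <seqfile>
    # The alignment is written in Stockholm format by default.
    return ["hmmalign", alphabet, "-o", aligned, str(hmm_file), str(fasta_file)]


print(" ".join(hmmalign_command("kinase.hmm", "new_seqs.fasta", "hmm_out", "protein")))
```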

5.2 - BLAST SEARCH

NCBI BLAST (Altschul et al., 1990) is a software package that finds regions of local similarity between sequences. The package implements different tools that compare nucleotide or protein sequences to sequence databases and calculate the statistical significance of the matches: a) blastp searches protein subject databases using protein queries; b) blastn searches nucleotide subject databases using nucleotide queries; c) blastx searches protein subject databases using translated nucleotide queries; d) tblastn searches translated nucleotide subject databases using protein queries; e) tblastx searches translated nucleotide databases using translated nucleotide queries. See the BLAST manual at https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs for more information.
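The choice among the five programs follows directly from the type of the query and of the subject database. The small lookup below sketches that rule; the function and its `translated_search` parameter are hypothetical helpers for illustration and are not part of BLAST or DeNovoSeq.

```python
def choose_blast_program(query_type, subject_type, translated_search=False):
    """Pick a BLAST program from the query and subject sequence types.

    Both types are either "protein" or "nucleotide". When both are
    nucleotide, translated_search selects tblastx instead of blastn.
    """
    if query_type == "protein" and subject_type == "protein":
        return "blastp"
    if query_type == "nucleotide" and subject_type == "protein":
        return "blastx"
    if query_type == "protein" and subject_type == "nucleotide":
        return "tblastn"
    if query_type == "nucleotide" and subject_type == "nucleotide":
        return "tblastx" if translated_search else "blastn"
    raise ValueError(f"unknown sequence types: {query_type!r}, {subject_type!r}")


print(choose_blast_program("nucleotide", "protein"))
```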

To run BLAST with DeNovoSeq go to

            [“Annotation → BLAST → Run BLAST”]


Basic Procedure

The interface provided by DeNovoSeq to call BLAST is divided into three blocks: “Input Configuration”, “Process Output” and “Sequence Retrieval options”. The first block (“Input Configuration”) is provided for uploading the query and subject files you want to compare via BLAST search and for configuring the options and parameters of the BLAST search as follows:

  1. Drag one or more fasta query files into the fasta box of the input field. If you upload multiple files, as many BLAST processes as query files will be executed. Bear in mind, however, not to run more BLAST processes than the computational capacity of your server allows (i.e. RAM and number of processors; each BLAST search uses roughly one thread)
  2. Drag the database folder where your blast subject database is available into “Database folder” field
  3. Select the program you are going to use (blastp, blastn, blastx, tblastn or tblastx) in options
  4. If desired set other blast options such as “evalue cutoff”, “matrix”, “genetic code” and “complexity filters”, which are also available in the input configuration block
  5. Drag an output directory into “Output folder” field available in the input configuration block
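The settings in this block correspond roughly to the options of a BLAST+ command line, with one command per query file. The sketch below assumes XML output (`-outfmt 5`), since the next block processes XML; the function name, the output file naming and the default e-value of 1e-5 are hypothetical, and the commands are assembled without being run.

```python
from pathlib import Path


def blast_commands(program, query_files, database, output_folder,
                   evalue=1e-5, threads=1):
    """Build one BLAST+ command per query file (not executed)."""
    commands = []
    for query in query_files:
        xml_out = (Path(output_folder) / f"{Path(query).stem}.{program}.xml").as_posix()
        commands.append([
            program,
            "-query", str(query),
            "-db", str(database),
            "-evalue", str(evalue),
            "-outfmt", "5",          # 5 = BLAST XML
            "-out", xml_out,
            "-num_threads", str(threads),
        ])
    return commands


for command in blast_commands("blastp", ["a.fasta", "b.fasta"],
                              "blast_db/proteins", "results"):
    print(" ".join(command))
```

Because each query file yields its own process, the number of query files multiplied by `threads` should stay within the number of processors available on the server.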

The second block (“Process Output”) processes the XML outputs provided by blast and prints them into a human-readable annotation file in CSV format. For configuring the options and parameters in this block do as follows:

  1. If you want to run this step, check the Run process output step tab. If not, leave it blank
  2. Here you do not need to provide an input folder, as the input field is filled automatically when you declare the input and output folders in the first block of the interface
  3. If you checked to run the process output step, you can set more options to filter the BLAST csv output file, such as an evalue cutoff to filter the annotations and the number of best hits per query
  4. If you are analyzing protein queries, you can opt to automatically include GO annotations and KEGG pathways by checking the field “include GO terms”. Remember that this option works only with protein sequences
  5. Select whether or not you want to filter redundant matches. The interface gives the option to filter per query, per subject, or not to filter redundant matches at all. By default, no filter is applied
  6. If you are working with multiple BLAST searches (meaning that you uploaded multiple query files) and these belong to the same genome or transcriptome project, you have the option to merge all csv outputs into a single one
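The two filters mentioned in step 3 (an evalue cutoff and a number of best hits per query) can be sketched on a small annotation table. The column names `query`, `subject` and `evalue` and the sample rows are invented for illustration; the actual layout of the DeNovoSeq csv file may differ.

```python
import csv
import io
from collections import defaultdict


def filter_hits(rows, evalue_cutoff, best_hits_per_query):
    """Keep hits at or below the cutoff, then the best N hits per query."""
    by_query = defaultdict(list)
    for row in rows:
        if float(row["evalue"]) <= evalue_cutoff:
            by_query[row["query"]].append(row)
    kept = []
    for hits in by_query.values():
        # Lower e-values are better, so sort ascending before truncating.
        hits.sort(key=lambda row: float(row["evalue"]))
        kept.extend(hits[:best_hits_per_query])
    return kept


sample = """query,subject,evalue
q1,sA,1e-50
q1,sB,1e-5
q1,sC,2.0
q2,sA,1e-20
"""
rows = list(csv.DictReader(io.StringIO(sample)))
best = filter_hits(rows, evalue_cutoff=1e-3, best_hits_per_query=1)
print([(row["query"], row["subject"]) for row in best])
```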

The third block (“Sequence retrieval options”) lets you create a new sequence file according to the results provided by BLAST. The configuration of this block is as follows:

  1. If you want to run this step, check the Run retrieval step tab. If not, leave it blank
  2. Here you do not need to provide an input folder, as the input field is filled automatically when you declare the input and output folders in the first block of the interface
  3. If you checked to run the process output step, select whether you want to extract the sequences from the query file or the subject file
  4. Drag into the Sequence retrieval options the query file or the subject file from which you want to extract the sequences
  5. Indicate whether you want to extract the full sequences from the selected file or the sequences trimmed according to the alignment coordinates provided by the BLAST csv output.
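The difference between full and trimmed retrieval in step 5 can be illustrated with a small helper. BLAST reports alignment coordinates as 1-based, inclusive positions, so they need converting before slicing; the function name is hypothetical, and this sketch does not reverse-complement hits on the minus strand.

```python
def extract_sequence(sequence, start, end, trimmed=True):
    """Return the full sequence or the part covered by an alignment.

    start and end are 1-based, inclusive coordinates as reported by BLAST.
    """
    if not trimmed:
        return sequence
    # Nucleotide hits on the minus strand are reported with start > end.
    low, high = sorted((start, end))
    return sequence[low - 1:high]


print(extract_sequence("MKTAYIAKQR", 3, 6))
```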

Finally, click on the Run button and DeNovoSeq will execute the three configured steps automatically, one after another. You will get a confirmation message that the job has been launched. If you do not, review all options again.

5.3 - HMMER SEARCH

HMMER3 (Mistry et al., 2013) compares protein or nucleotide sequence queries against HMM profile subject databases (or vice versa) to search for sequence homologs. The current version of DeNovoSeq allows three kinds of HMMER comparisons. If you compare protein sequence queries against a database of protein HMM profiles, DeNovoSeq will call “hmmscan”. If you compare protein HMM queries against a protein sequence database, DeNovoSeq will call “hmmsearch”. If you compare DNA sequence queries against a database of DNA HMM profiles, DeNovoSeq will call “nhmmscan”. See the HMMER3 manual at http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf for more information.
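The three supported comparisons can be written as a lookup from query and subject type to the HMMER3 program. The type labels and the function name below are hypothetical; only the three combinations described above are accepted.

```python
def choose_hmmer_program(query_type, subject_type):
    """Pick the HMMER3 program for a supported query/subject combination."""
    programs = {
        ("protein sequence", "protein HMM"): "hmmscan",
        ("protein HMM", "protein sequence"): "hmmsearch",
        ("nucleotide sequence", "nucleotide HMM"): "nhmmscan",
    }
    try:
        return programs[(query_type, subject_type)]
    except KeyError:
        raise ValueError(
            f"unsupported comparison: {query_type!r} against {subject_type!r}"
        ) from None


print(choose_hmmer_program("protein sequence", "protein HMM"))
```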

To run HMMER with DeNovoSeq go to

            [“Annotation → HMM → Run HMMER”]


Basic Procedure

The interface provided by DeNovoSeq to call HMMER3 is divided into two blocks: “Input Configuration” and “Process Output”. The first block (“Input Configuration”) is provided for uploading the query and subject files you want to compare via HMMER search. Do the following:

  1. Drag your input sequence files from the FTP browser into the input field and declare the data type and format of your input file/s. Data can be protein or nucleotide, and the accepted format is either a sequence fasta file or HMMs.
  2. Check the field Query type to declare whether your input file contains protein sequences, nucleotide sequences or protein HMMs
  3. Drag the folder containing the subject database for the HMMER search from the FTP browser into the input field “Database folder” and declare the data type and format of your subject file/s. Again, data can be protein or nucleotide, and the accepted format is either a sequence fasta file or HMMs
  4. Check the field Subject type to declare whether your database file contains protein sequences, nucleotide sequences or protein HMMs
  5. Drag an output directory for your results into “Output folder” field
  6. Set the evalue threshold in “Options”
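One detail worth seeing in a sketch is that the position of the query and subject files on the command line depends on the program: hmmscan and nhmmscan take the profile database first, whereas hmmsearch takes the query profile first. The function name and file names below are hypothetical, and using `--tblout` for a tabular result is an assumption about how the search is run.

```python
def hmmer_command(program, query_file, subject_file, table_out, evalue=10.0):
    """Build (without running) a HMMER3 search command.

    The default evalue of 10.0 matches the HMMER3 default for -E.
    """
    options = ["-E", str(evalue), "--tblout", table_out]
    if program in ("hmmscan", "nhmmscan"):
        # <program> [options] <hmmdb> <seqfile>: subject profiles come first
        return [program, *options, subject_file, query_file]
    if program == "hmmsearch":
        # hmmsearch [options] <hmmfile> <seqdb>: query profile comes first
        return [program, *options, query_file, subject_file]
    raise ValueError(f"unsupported program: {program!r}")


print(" ".join(hmmer_command("hmmscan", "proteins.fasta", "hmm_db/pfam.hmm",
                             "results/hits.tbl", evalue=1e-5)))
```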

The second block (“Process Output”) processes the outputs provided by HMMER3 and prints them into a human-readable annotation file in CSV format. To configure the options and parameters provided by this block do as follows:

  1. Select postprocessing options. You can set more options to filter the HMMER csv output file such as evalue cutoff to filter the annotations and number of best hits per query
  2. Finally, click on the Run HMM analysis button to execute the analysis

You will get a confirmation message that the job has been launched. If you do not, review all options again.
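The evalue filtering offered in the “Process Output” block can be sketched on the per-target table that hmmscan and hmmsearch write with `--tblout`, where the target name, query name and full-sequence E-value are the first, third and fifth whitespace-separated columns. This sketch assumes that table format (the nhmmscan table has a different layout); the sample lines are invented, and the way DeNovoSeq actually produces its csv file is not documented here.

```python
def parse_tblout(lines, evalue_cutoff):
    """Parse hmmscan/hmmsearch --tblout lines, keeping hits under a cutoff."""
    hits = []
    for line in lines:
        # Comment and header lines start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        # 18 fixed columns, then a free-text description that may hold spaces.
        fields = line.split(maxsplit=18)
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= evalue_cutoff:
            hits.append({"query": query, "target": target, "evalue": evalue})
    return hits


sample = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "RVT_1 PF00078.30 seq1 - 1.2e-30 105.3 0.0 1.5e-30 105.0 0.0 1.0 1 0 0 1 1 1 1 Reverse transcriptase",
    "rve PF00665.29 seq1 - 0.5 12.1 0.0 0.7 11.7 0.0 1.2 1 0 0 1 1 0 0 Integrase core domain",
]
for hit in parse_tblout(sample, evalue_cutoff=1e-3):
    print(hit["query"], hit["target"], hit["evalue"])
```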







Biotechvana © 2015