MetaCLADE

A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

Overview

Biochemical and regulatory pathways have, until recently, been studied and modelled within a single cell type, organism, or species. This view is being dramatically changed by whole-microbiome sequencing studies, which reveal the role of symbiotic microbial populations in fundamental biochemical functions. This new landscape requires reconstructing biochemical and regulatory pathways at the community level in a given environment. To understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, one needs to relate genetic information to these factors quantitatively. This requires evaluating, as precisely as possible, the quantity of gene protein sequences or transcripts associated with a given pathway. The goal is to estimate the abundance of protein domains precisely, and also to recognise their weak presence or absence.

We introduce MetaCLADE, a novel profile-based domain annotation pipeline built on the multi-source domain annotation strategy. It annotates domains directly on reads and improves the identification of the catalog of functions in a microbiome. MetaCLADE can be applied to either metagenomic or metatranscriptomic datasets.

Reference

  • A. Ugarte, R. Vicedomini, J.S. Bernardes, and A. Carbone. "A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling." Microbiome, 2018 6:149. https://doi.org/10.1186/s40168-018-0532-2


MetaCLADE v2

Compared to the original version, MetaCLADE v2 provides an improved software implementation and adds the possibility to run MetaCLADE in an SGE computing environment. Moreover, it is accompanied by a new and improved version of the model library (see below).

The latest version of MetaCLADE v2 is available at the following git repository:

System requirements, installation instructions, and usage information are available on the corresponding page.

MetaCLADE v2 model library

The latest version of MetaCLADE works on an improved model library (currently based on Pfam32) and is no longer compatible with the older library. The procedure to build the model library includes the following major improvements:

  • Pfam FULL was used, instead of SEED, to cluster the initial set of representative sequences.
  • The selection of representative sequences (spanning the tree of life) follows a new strategy which allows, for example, multiple sequences to be selected from the same species if they show enough divergence.
  • Homologous sequences used to build the models are sought using the HH-suite (hhblits) on the UniClust30 database.
  • The library is now completely based on profile HMMs for both clade-centered models (CCMs) and single-consensus models (SCMs).

The model library can be downloaded at the following address:

Installation instructions are included in the MetaCLADE v2 README file.



MetaCLADE v1

Download

System requirements

  • MetaCLADE has been developed under a Unix operating system.
  • The bash environment should be installed.
  • Python 2.7 is required for this package.

Software requirements

Model library

In order to run MetaCLADE, the CLADE library must be downloaded from here. Let [MetaCLADE_DIR] be the directory of MetaCLADE. The library should be extracted in the following two directories:

[MetaCLADE_DIR]/data/models/pssms/
[MetaCLADE_DIR]/data/models/hmms/
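
Before creating jobs, it may be worth checking that both library directories are in place. The following is a minimal sketch, not part of MetaCLADE: the function name missing_library_dirs is hypothetical, and the sketch uses a current Python 3 interpreter rather than the Python 2.7 that MetaCLADE v1 itself requires.

```python
import os


def missing_library_dirs(metaclade_dir):
    """Return the expected model-library directories that do not exist.

    `metaclade_dir` plays the role of [MetaCLADE_DIR]. This helper is a
    hypothetical convenience and is not shipped with MetaCLADE.
    """
    expected = [
        os.path.join(metaclade_dir, "data", "models", sub)
        for sub in ("pssms", "hmms")
    ]
    return [path for path in expected if not os.path.isdir(path)]
```

An empty list means both directories were found; otherwise the returned paths show where the library still needs to be extracted.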

Test databases

The datasets used to evaluate MetaCLADE are available at the following links:


1. MetaCLADE configuration/parameters

You can unpack the archive using the command

tar -xf MetaCLADE-1.1.tar.gz

It is then advisable to add the MetaCLADE main directory to your PATH environment variable (if it is not already there) by adding the following line to your ~/.bashrc

export PATH=[MetaCLADE_DIR]:"${PATH}"

where [MetaCLADE_DIR] is MetaCLADE's installation directory.

Finally, in order to create MetaCLADE jobs you must create a Run configuration file (see below) and run the following command:

metaclade --run-cfg [Run configuration file]

Input file preprocessing

Before running MetaCLADE on the input FASTA file you should build a BLAST database. You can either set the CREATE_BLASTDB parameter to True in the Run configuration file (see below) or you can manually run the following command:

makeblastdb -dbtype prot -in /path/to/sequence/database/CDS.faa

Run configuration file example (mandatory)

Lines starting with a semicolon are treated as comments and ignored. All paths should be absolute.

[Parameters]
DATASET_NAME = CDS
FASTA_FILE = /path/to/sequence/database/CDS.faa
NUMBER_OF_JOBS = 32
;CREATE_BLASTDB = True
;WORKING_DIR = /path/to/a/custom/working/directory
;TMP_DIR = /path/to/a/custom/temporary/directory
;DOMAINS_LIST = /path/to/a/custom/model.list

A custom working directory (where jobs and results are saved) can be set with the WORKING_DIR parameter (the default value is the directory from which the metaclade command has been called). A custom temporary directory can be set using the TMP_DIR parameter (the default is a temp subdirectory in the working directory). If you want to restrict the annotation to a subset of domains, you can provide a file containing one domain identifier per line to the DOMAINS_LIST parameter.
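
When many datasets have to be annotated, the Run configuration file can be generated rather than written by hand. The sketch below is one possible way to do it with Python's standard configparser module. It is not part of MetaCLADE: build_run_config is a hypothetical helper, it is written for Python 3, and it assumes that metaclade accepts the "KEY = value" layout shown in the example above.

```python
import configparser
import io


def build_run_config(dataset_name, fasta_file, number_of_jobs, **optional):
    """Return the text of a Run configuration file.

    Optional parameters (CREATE_BLASTDB, WORKING_DIR, TMP_DIR, DOMAINS_LIST)
    are written only when passed as keyword arguments.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names in upper case
    parser["Parameters"] = {
        "DATASET_NAME": dataset_name,
        "FASTA_FILE": fasta_file,
        "NUMBER_OF_JOBS": str(number_of_jobs),
    }
    for key, value in optional.items():
        parser["Parameters"][key] = str(value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Example: the text can then be saved to a file and passed to --run-cfg.
config_text = build_run_config(
    "CDS", "/path/to/sequence/database/CDS.faa", 32, CREATE_BLASTDB=True
)
```

Setting optionxform to str matters here because configparser lower-cases option names by default, whereas the parameters above are written in upper case.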


MetaCLADE configuration file example (optional)

Optionally, a MetaCLADE configuration file can be provided to metaclade with the parameter --metaclade-cfg. This file can be used to set custom paths to the PSI-BLAST/HMMER/Python executables or to the MetaCLADE model library.

Lines starting with a semicolon are ignored. All paths should be absolute.

[Programs]
;PSIBLAST_DIR = /home/ncbi-blast-2.7.1+/bin/
;HMMER_DIR = /home/hmmer-3.2.1/bin/
;PYTHON_DIR = /home/python-2.7.15/bin

[Models]
;PSSMS_DIR = /home/MetaCLADE/data/models/pssms
;HMMS_DIR = /home/MetaCLADE/data/models/hmms

2. MetaCLADE jobs

By default jobs are created in:

[WORKING_DIR]/[DATASET_NAME]/jobs/

Each (numbered) folder in this directory represents a step of the pipeline and contains several .sh files (depending on the value assigned to the NUMBER_OF_JOBS parameter):

[DATASET_NAME]_0.sh
[DATASET_NAME]_1.sh
[DATASET_NAME]_2.sh
...

Jobs must be run in the following order:

[WORKING_DIR]/[DATASET_NAME]/jobs/1_model_search/
[WORKING_DIR]/[DATASET_NAME]/jobs/2_arff_files/
[WORKING_DIR]/[DATASET_NAME]/jobs/3_mclade_eval/
[WORKING_DIR]/[DATASET_NAME]/jobs/4_best_domains/
[WORKING_DIR]/[DATASET_NAME]/jobs/5_final_prediction/

In the first three directories you can find a `submit.sh` file that contains the `qsub` command to submit each job to the queue system of an SGE environment. This file can be used (or adapted for other HPC environments) to submit all jobs at each step.
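
Where no queue system is available, the job scripts can in principle be run one after another on a single machine. The following sketch only illustrates the step ordering described above; it is not part of MetaCLADE. The names list_job_scripts and run_plan are hypothetical, and the sketch assumes that each job script can be executed directly with bash and that the scripts within one step are independent of each other.

```python
import os
import subprocess

# Pipeline steps, in the order in which they must be run.
STEP_DIRS = [
    "1_model_search",
    "2_arff_files",
    "3_mclade_eval",
    "4_best_domains",
    "5_final_prediction",
]


def list_job_scripts(working_dir, dataset_name):
    """Return [(step, [script paths])] following the required step order.

    Only files named [DATASET_NAME]_<n>.sh are collected, so submit.sh
    is left out.
    """
    jobs_dir = os.path.join(working_dir, dataset_name, "jobs")
    plan = []
    for step in STEP_DIRS:
        step_dir = os.path.join(jobs_dir, step)
        scripts = sorted(
            os.path.join(step_dir, name)
            for name in os.listdir(step_dir)
            if name.startswith(dataset_name + "_") and name.endswith(".sh")
        )
        plan.append((step, scripts))
    return plan


def run_plan(plan):
    """Run all scripts of a step before moving on to the next step."""
    for step, scripts in plan:
        for script in scripts:
            # check=True stops the pipeline as soon as one job fails.
            subprocess.run(["bash", script], check=True)
```

Calling run_plan(list_job_scripts(working_dir, "CDS")) would then execute the five steps sequentially. This is much slower than submitting the jobs to a cluster, so it is mainly useful for small test datasets.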


3. MetaCLADE results

By default results are stored in:

[WORKING_DIR]/[DATASET_NAME]/results/

Each (numbered) folder in this directory contains the results after each step of the pipeline.

Once all steps have been run, the final annotation is saved in the file:

[WORKING_DIR]/[DATASET_NAME]/results/5_final_prediction/final_prediction.mclade

It is a tab-separated values (TSV) file whose lines represent annotations.
Each annotation has the following 10 fields:

  • E-value
  • Score
  • Model identifier
  • Model start
  • Model end
  • Domain identifier (i.e., Pfam accession number)
  • Sequence identifier
  • Sequence start
  • Sequence end
  • Prediction probability
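
As an illustration of how this file can be consumed, the sketch below reads the ten fields in the order listed above and counts annotations per domain. It is not part of MetaCLADE and makes several assumptions: the names Annotation, parse_annotations and count_domains are hypothetical, the file is assumed to have no header line, coordinates are assumed to be integers, and E-value, score and probability are assumed to be parseable as floating-point numbers.

```python
import csv
from collections import Counter, namedtuple

# Field order follows the list above; the attribute names are illustrative.
Annotation = namedtuple("Annotation", [
    "e_value", "score", "model_id", "model_start", "model_end",
    "domain_id", "seq_id", "seq_start", "seq_end", "probability",
])


def parse_annotations(lines):
    """Yield one Annotation per 10-field tab-separated line.

    Lines with a different number of fields (e.g. blank lines) are skipped.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 10:
            continue
        yield Annotation(
            float(row[0]), float(row[1]), row[2], int(row[3]), int(row[4]),
            row[5], row[6], int(row[7]), int(row[8]), float(row[9]),
        )


def count_domains(annotations):
    """Count how many annotations were reported for each domain identifier."""
    return Counter(a.domain_id for a in annotations)
```

In practice, parse_annotations can be given an open file handle on final_prediction.mclade, and the resulting counts give a first rough view of domain abundance in the dataset.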

Licence

The MetaCLADE program has been developed under the CeCILL 2.1 licence.

Contacts

For questions, comments, or suggestions, feel free to contact Alessandra Carbone, Riccardo Vicedomini, or Ari Ugarte.

Last Update: 26 April 2021