High performance domain identification in proteins reached with the agreement of many profiles and domain co-occurrence



Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of how to identify domains in proteins that have highly diverged. Using large computing power, we demonstrate that the limits in annotation reached by current methods can be bypassed. A new strategy has been designed to tackle the problem through a novel exploitation of the large amount of data available. It is based on the observation that many structural and functional protein constraints are not globally conserved across all species but might be locally conserved in separate clades:

  1. probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences,
  2. a decision-making protocol combines the multiple outcomes of the models,
  3. a multi-criteria optimization algorithm finds the most likely protein architecture.

Clade-centered models, being particularly close to actual protein sequences, have been shown to be more specific and functionally predictive than the broader Pfam family models. Based on them, we could produce highly accurate annotations of P. falciparum protein sequences on a scale not previously possible. The method, applicable to any genome, opens new avenues of investigation to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age.



The CLADE package can be downloaded here (150 GB).

To unpack the archives see


Results obtained on P. falciparum: AllDomains.xls


System requirements:

  • CLADE has been developed under a Unix operating system.
  • The bash environment should be installed.
  • The Perl executable should be accessible from the PATH environment variable.
  • Your libstdc++ library must support C++11.
  • It is strongly recommended to use a High Performance Computing environment. Scripts for SGE are provided in the package.
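
Before launching a long run, it can help to check the requirements above automatically. The following is a minimal sketch, not part of the CLADE package: the function name check_clade_requirements and its messages are illustrative, and it assumes that CLADE_DIR (used in the commands further down) is meant to point at the unpacked package directory. It does not test C++11 support.

```shell
#!/usr/bin/env bash
# Preflight sketch: report which of the listed requirements look satisfied.
# Hypothetical helper, not shipped with CLADE.

check_clade_requirements() {
    local missing=0

    # Perl must be reachable through PATH.
    if ! command -v perl >/dev/null 2>&1; then
        echo "missing: perl not found in PATH"
        missing=1
    fi

    # CLADE_DIR is used throughout the commands in this document.
    if [ -z "${CLADE_DIR:-}" ]; then
        echo "missing: CLADE_DIR is not set"
        missing=1
    elif [ ! -d "$CLADE_DIR" ]; then
        echo "missing: CLADE_DIR does not point to a directory: $CLADE_DIR"
        missing=1
    fi

    # SGE is recommended, not required, so this is only a note.
    if ! command -v qsub >/dev/null 2>&1; then
        echo "note: qsub not found; the SGE scripts will not be usable"
    fi

    return "$missing"
}
```

Calling `check_clade_requirements && echo "ready"` prints "ready" only when no required item is reported missing.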


Software requirements:


Configuring CLADE environment variables:

Edit the file CLADE/files/run/env_var/ to update the environment variables.



How to execute CLADE:

CLADE consists of three main steps:

  1. Scanning CLADE library
  2. Running SVM
  3. Running DAMA

    1. Scanning CLADE library

    This step requires a FASTA file with the protein sequences to be annotated. Each protein must be identified by a unique ID and its NCBI taxon code, separated by a tab.
    See an example at CLADE/databases/test/proteins.fasta
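
A malformed header is easy to miss in a large input file, so a quick check before scanning may save a failed run. The sketch below is not part of CLADE. It assumes that each header line has the form ">ID<TAB>TAXON" and that the taxon code is a purely numeric NCBI taxonomy identifier; the function name check_fasta_headers is hypothetical.

```shell
# Sketch: validate FASTA headers of the form ">ID<TAB>TAXON".
# Assumptions (not confirmed by the CLADE documentation): the header starts
# with ">" and the taxon code consists of digits only.
check_fasta_headers() {
    awk -F '\t' '
        /^>/ {
            id = substr($1, 2)                 # drop the leading ">"
            if (NF != 2 || id == "" || $2 !~ /^[0-9]+$/) {
                printf "line %d: expected >ID<TAB>TAXON\n", NR
                bad = 1
            } else if (id in seen) {
                printf "line %d: duplicate ID %s\n", NR, id
                bad = 1
            }
            seen[id] = 1
        }
        END { exit bad }                       # non-zero if any header failed
    ' "$1"
}
```

For example, `check_fasta_headers proteins.fasta` prints one message per offending header and returns a non-zero status if any was found.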

    # preparing executable files
    # see $CLADE_DIR/files/run/search/ for parameter details.
    cd $CLADE_DIR/files/log
    bash $CLADE_DIR/files/run/search/ \
    $CLADE_DIR/files/run/env_var/ \
    $CLADE_DIR/pfamLists/pfam_27.full \
    $CLADE_DIR/files/log/ \
    search \
    100 \
    $CLADE_DIR/databases/test/proteins.fasta \
    $CLADE_DIR/models/pssms/FAMILY_FACTOR.tar.gz \
    $CLADE_DIR/models/hmms/FAMILY_FACTOR.hmm \
    $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/ \
    $CLADE_DIR/pfamLists/used/FAMILY_FACTOR/ \
    1
    # Attention: Do NOT delete or replace the words FAMILY_FACTOR in the command line above.
    # This is a keyword and it will be replaced with real domain Ids coming from
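
To make the role of the FAMILY_FACTOR keyword concrete, here is an illustration of placeholder expansion in general. It is not the code CLADE runs: the function name expand_family_template is hypothetical, and the sketch assumes that domain IDs arrive one per line on standard input and contain no characters that are special to sed (such as "/" or "&").

```shell
# Illustration only: expand a FAMILY_FACTOR template once per family ID.
# CLADE's own scripts may read and substitute the IDs differently.
expand_family_template() {
    local template=$1 family
    while IFS= read -r family; do
        [ -n "$family" ] || continue           # skip blank lines
        printf '%s\n' "$template" | sed "s/FAMILY_FACTOR/$family/g"
    done
}
```

With IDs written in the style of Pfam accessions, `printf 'PF00001\nPF00002\n' | expand_family_template 'models/hmms/FAMILY_FACTOR.hmm'` yields one path per ID, which is why the keyword must be left untouched in the command.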

    The executable files are placed in $CLADE_DIR/files/log/. If you are using SGE, run:

    bash $CLADE_DIR/scriptsCluster/ \ \
    search \
    $CLADE_DIR/files/log/out/ \

    Result files will be saved in $CLADE_DIR/results/domainsPfam/. They are required for the second step.
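
Since the second step depends on these files, it may be worth confirming that every family produced output before moving on. The sketch below is not part of CLADE. It assumes one subdirectory per family under the results directory, each expected to hold a non-empty result_resume.txt, as the FAMILY_FACTOR paths in this document suggest; the function name list_families_missing_results is hypothetical.

```shell
# Sketch: list family directories that have no (or an empty) result_resume.txt.
# Assumption: one subdirectory per family under the given results directory.
list_families_missing_results() {
    local results_dir=$1 dir
    for dir in "$results_dir"/*/; do
        [ -d "$dir" ] || continue              # glob matched nothing
        if [ ! -s "${dir}result_resume.txt" ]; then
            basename "$dir"                    # print the family name only
        fi
    done
}
```

Running `list_families_missing_results "$CLADE_DIR/results/domainsPfam"` prints nothing when every family directory holds a non-empty result file.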


    2. Running SVM

    Before running the SVM, we need to create the meta-features (attributes):

    # preparing executable files
    # see $CLADE_DIR/files/run/ensemble/ for parameter details.
    cd $CLADE_DIR/files/log
    bash $CLADE_DIR/files/run/ensemble/ \
    $CLADE_DIR/files/run/env_var/ \
    $CLADE_DIR/pfamLists/pfam_27.full \
    $CLADE_DIR/files/log/ \
    att \
    100 \
    $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/result_resume.txt \
    $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/ \
    $CLADE_DIR/pfamLists/used/FAMILY_FACTOR/ \
    $CLADE_DIR/ensemble/taxonPath/FAMILY_FACTOR.taxon \
    $CLADE_DIR/databases/proteins.fasta \
    $CLADE_DIR/results/att/FAMILY_FACTOR.att \
    1 \
    $CLADE_DIR/databases/pfam/ \
    $CLADE_DIR/taxon/

    The executable files are placed in $CLADE_DIR/files/log/. If you are using SGE, run:

    bash $CLADE_DIR/scriptsCluster/ \ \
    att \
    $CLADE_DIR/files/log/out/ \

    Attribute files will be saved in $CLADE_DIR/results/att.

    Running SVM

    # preparing executable files
    # see $CLADE_DIR/files/run/ensemble/ for parameter details.
    cd $CLADE_DIR/files/log
    bash $CLADE_DIR/files/run/ensemble/ \
    $CLADE_DIR/files/run/env_var/ \
    $CLADE_DIR/pfamLists/pfam_27.full \
    $CLADE_DIR/files/log/ \
    svm \
    100 \
    $CLADE_DIR/ensemble/attFiltre/pos.neg/FAMILY_FACTOR.att \
    $CLADE_DIR/ensemble/att/neg.att \
    $CLADE_DIR/ensemble/att/pos.att \
    2 \
    $CLADE_DIR/results/att/FAMILY_FACTOR.att \
    $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/ \
    $CLADE_DIR/results/domainsPfam/ \
    $CLADE_DIR/ensemble/svmVein.cutOFF \
    $CLADE_DIR/databases/pfam/

    The executable files are placed in $CLADE_DIR/files/log/. If you are using SGE, run:

    bash $CLADE_DIR/scriptsCluster/ \ \
    svm \
    $CLADE_DIR/files/log/out/ \


    3. Running DAMA

    qsub -V -S /bin/bash -N damaPer -e $CLADE_DIR/files/log/out \
    -o $CLADE_DIR/files/log/out/ $CLADE_DIR/files/run/archs/ \
    $CLADE_DIR/files/run/env_var/ \
    $CLADE_DIR/results/domainsPfam/ \
    $CLADE_DIR/databases/pfam/ \
    $CLADE_DIR/databases/pfam/pfam.knownArch \
    $CLADE_DIR/databases/pfam/pfam.overlapping \
    0.001 0 \

    The domain architecture predictions will be saved in $CLADE_DIR/results/archs.txt.



    The CLADE program has been developed under the CeCILL licence (see LICENCE).



    For questions, comments, or suggestions feel free to contact Alessandra Carbone or Juliana S. Bernardes.



    If you use CLADE, please cite:

    • J.S. Bernardes, C. Vaquero, G. Zaverucha and A. Carbone. (2016) Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence. PLoS Computational Biology, 12(7):e1005038.
    • J.S. Bernardes, F.R.J. Vieira, G. Zaverucha and A. Carbone. (2015) A multi-objective approach accurately resolves protein domain architectures. Bioinformatics, Advance Access, Oct. 2015.

    Last Update Jan. 2015