Traditional protein annotation methods describe known domains with probabilistic models that represent the consensus among homologous domain sequences. However, when the relevant signals become too weak to be captured by a global consensus, annotation attempts fail. Here we address the fundamental question of how to identify domains in highly diverged proteins. Using large-scale computing power, we demonstrate that the limits in annotation reached by current methods can be bypassed. We designed a new strategy based on the observation that many structural and functional protein constraints are not globally conserved across all species but may be locally conserved in separate clades. The strategy tackles the problem through a novel exploitation of the large amount of data available.
Clade-centered models, being particularly close to actual protein sequences, have been shown to be more specific and functionally predictive than the broader Pfam family models. Based on them, we obtained highly accurate annotations of P. falciparum protein sequences on a scale not previously possible. The method, applicable to any genome, opens new avenues of investigation into evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age.
The CLADE package can be downloaded here (150 GB).
To unpack the archives, see README.txt.
Results obtained on P. falciparum: AllDomains.xls
To update the environment variables, edit the file CLADE/files/run/env_var/environment_variables.sh:
export CLADE_DIR=$YOUR_DIRECTORY/CLADE
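Since every command below depends on CLADE_DIR, it can be worth checking the variable before launching anything. The following is a minimal sketch, not part of the CLADE package: the function name check_clade_dir is hypothetical and it only tests that the variable is set and points to an existing directory.

```bash
# Hypothetical helper (not shipped with CLADE): verify that CLADE_DIR is set
# and points to an existing directory before running any step.
check_clade_dir() {
    if [ -z "${CLADE_DIR:-}" ]; then
        echo "CLADE_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$CLADE_DIR" ]; then
        echo "CLADE_DIR does not exist: $CLADE_DIR" >&2
        return 1
    fi
    echo "CLADE_DIR OK: $CLADE_DIR"
}
```

Assuming environment_variables.sh contains only export lines, one could source it in the current shell and then call check_clade_dir.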
CLADE consists of three main steps:
It requires a FASTA file with the protein sequences to be annotated. Each protein must be identified by a unique ID and its NCBI taxon code, separated by a tab.
See an example at CLADE/database/test/proteins.fasta
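The header requirement can be checked before starting a long run. The sketch below is not part of CLADE: it assumes each header line has the form >PROTEIN_ID<TAB>TAXON_CODE and that the taxon code is a numeric NCBI taxonomy identifier. The function name check_fasta_headers is hypothetical, and the example file shipped with the package remains the reference for the exact layout.

```bash
# Hypothetical check (not shipped with CLADE). Assumed header layout:
#   >PROTEIN_ID<TAB>NCBI_TAXON_CODE
# Prints malformed or duplicated headers and returns non-zero if any are found.
check_fasta_headers() {
    awk -F '\t' '
        /^>/ {
            id = substr($1, 2)              # drop the leading ">"
            if (NF != 2 || id == "" || $2 !~ /^[0-9]+$/) {
                printf "line %d: malformed header: %s\n", NR, $0
                bad = 1
            } else if (id in seen) {
                printf "line %d: duplicate ID: %s\n", NR, id
                bad = 1
            }
            seen[id] = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A call such as check_fasta_headers proteins.fasta prints nothing and returns 0 when every header matches the assumed layout.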
#preparing executable files
#see $CLADE_DIR/files/run/search/scann_models.sh for parameter details.
cd $CLADE_DIR/files/log
bash $CLADE_DIR/files/run/search/scann_models.sh \
  $CLADE_DIR/files/run/env_var/environment_variables.sh \
  $CLADE_DIR/pfamLists/pfam_27.full \
  $CLADE_DIR/files/log/ \
  search \
  100 \
  $CLADE_DIR/databases/test/proteins.fasta \
  $CLADE_DIR/models/pssms/FAMILY_FACTOR.tar.gz \
  $CLADE_DIR/models/hmms/FAMILY_FACTOR.hmm \
  $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/ \
  $CLADE_DIR/pfamLists/used/FAMILY_FACTOR/list.txt.final \
  1
#Attention: Do NOT delete or replace the word FAMILY_FACTOR in the command line above.
#This is a keyword and it will be replaced with real domain IDs coming from
#$CLADE_DIR/pfamLists/pfam_27.full
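To make the role of the FAMILY_FACTOR keyword concrete, here is an illustrative sketch of placeholder expansion. It is not the code used by scann_models.sh: the function name expand_family_factor is hypothetical, and the sketch assumes a list file with one domain ID per line, which may not match the actual format of pfam_27.full.

```bash
# Illustration only (not the CLADE implementation): print one copy of a
# command template per domain ID, with FAMILY_FACTOR replaced by that ID.
# Assumes one ID per line and IDs free of "|" and "&" (e.g. PF00001).
expand_family_factor() {
    template=$1
    list_file=$2
    while IFS= read -r family || [ -n "$family" ]; do
        [ -n "$family" ] || continue        # skip empty lines
        printf '%s\n' "$template" | sed "s|FAMILY_FACTOR|$family|g"
    done < "$list_file"
}
```

With a list containing PF00001 and PF00002, the template models/hmms/FAMILY_FACTOR.hmm would expand to models/hmms/PF00001.hmm and models/hmms/PF00002.hmm, one per line.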
The executable files are written to $CLADE_DIR/files/log/. If you are using SGE, run:
bash $CLADE_DIR/scriptsCluster/qsub.sh \
  search_runFamily.sh \
  search \
  $CLADE_DIR/files/log/out/ \
  > clade_search.sh
bash clade_search.sh
Result files will be saved in $CLADE_DIR/results/domainsPfam/. They are required for the second step.
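Because the attribute step reads result_resume.txt from each family directory, it may help to list the families for which that file is absent once the search jobs have finished. The sketch below is not part of CLADE: the function name missing_search_results is hypothetical, it assumes a list file with one domain ID per line, and a missing file may simply mean that a family produced no hits rather than that a job failed.

```bash
# Hypothetical helper (not shipped with CLADE): print the domain IDs whose
# result directory has no non-empty result_resume.txt.
#   $1: results directory, e.g. $CLADE_DIR/results/domainsPfam
#   $2: list file with one domain ID per line (assumed format)
missing_search_results() {
    results_dir=$1
    list_file=$2
    while IFS= read -r family || [ -n "$family" ]; do
        [ -n "$family" ] || continue        # skip empty lines
        [ -s "$results_dir/$family/result_resume.txt" ] || printf '%s\n' "$family"
    done < "$list_file"
}
```

An empty output means every listed family has a non-empty result file.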
Before running the SVM, we need to create the meta-features (attributes):
#preparing executable files
#see $CLADE_DIR/files/run/ensemble/createAttributeVector_test.sh for
#parameter details.
cd $CLADE_DIR/files/log
bash $CLADE_DIR/files/run/ensemble/createAttributeVector_test.sh \
  $CLADE_DIR/files/run/env_var/environment_variables.sh \
  $CLADE_DIR/pfamLists/pfam_27.full \
  $CLADE_DIR/files/log/ \
  att \
  100 \
  $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/result_resume.txt \
  $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/result_resume.txt.best \
  $CLADE_DIR/pfamLists/used/FAMILY_FACTOR/list.txt.final \
  $CLADE_DIR/ensemble/taxonPath/FAMILY_FACTOR.taxon \
  $CLADE_DIR/databases/proteins.fasta \
  $CLADE_DIR/results/att/FAMILY_FACTOR.att \
  1 \
  $CLADE_DIR/databases/pfam/pfam.domains \
  $CLADE_DIR/taxon/
The executable files are written to $CLADE_DIR/files/log/. If you are using SGE, run:
bash $CLADE_DIR/scriptsCluster/qsub.sh \
  att_runFamily.sh \
  att \
  $CLADE_DIR/files/log/out/ \
  > clade_att.sh
bash clade_att.sh
Attribute files will be saved in $CLADE_DIR/results/att.
Running SVM
#preparing executable files
#see $CLADE_DIR/files/run/ensemble/predict_svm.sh for
#parameter details.
cd $CLADE_DIR/files/log
bash $CLADE_DIR/files/run/ensemble/predict_svm.sh \
  $CLADE_DIR/files/run/env_var/environment_variables.sh \
  $CLADE_DIR/pfamLists/pfam_27.full \
  $CLADE_DIR/files/log/ \
  svm \
  100 \
  $CLADE_DIR/ensemble/attFiltre/pos.neg/FAMILY_FACTOR.att \
  $CLADE_DIR/ensemble/att/neg.att \
  $CLADE_DIR/ensemble/att/pos.att \
  2 \
  $CLADE_DIR/results/att/FAMILY_FACTOR.att \
  $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR/result_resume.txt.best.avg \
  $CLADE_DIR/results/domainsPfam/FAMILY_FACTOR.domains \
  $CLADE_DIR/ensemble/svmVein.cutOFF \
  $CLADE_DIR/databases/pfam/pfam.domains
The executable files are written to $CLADE_DIR/files/log/. If you are using SGE, run:
bash $CLADE_DIR/scriptsCluster/qsub.sh \
  svm_runFamily.sh \
  svm \
  $CLADE_DIR/files/log/out/ \
  > clade_svm.sh
bash clade_svm.sh
qsub -V -S /bin/bash -N damaPer -e $CLADE_DIR/files/log/out \
  -o $CLADE_DIR/files/log/out/ $CLADE_DIR/files/run/archs/runDAMA.sh \
  $CLADE_DIR/files/run/env_var/environment_variables.sh \
  $CLADE_DIR/results/domainsPfam/ \
  $CLADE_DIR/databases/pfam/pfam.domains \
  $CLADE_DIR/databases/pfam/pfam.knownArch \
  $CLADE_DIR/databases/pfam/pfam.overlapping \
  0.001 0 \
  $CLADE_DIR/results/archs.txt
The domain architecture predictions will be saved in $CLADE_DIR/results/archs.txt.
The CLADE program has been developed under the CeCILL licence (see LICENCE).
For questions, comments, or suggestions feel free to contact Alessandra Carbone or Juliana S. Bernardes.
If you use CLADE, please cite:
Last Update Jan. 2015