MyCLADE: an accurate multi-source domain annotation server designed for a fast exploration of genomic and metagenomic sets of sequences
The understanding of the ever-increasing number of genomic and metagenomic sequences accumulating in our databases demands for approaches that rapidly "explore" the content of sets of (fragmented) protein sequences with respect to specific domain targets,
avoiding full domain annotation and full assembly. MyCLADE performs a multi-source domain annotation strategy based on a library of probabilistic domain models associated to each domain. It works in two modes:
- It explores large datasets of a few thousands amino-acid sequences and extracts those sequences containing few targeted domains.
- It annotates small datasets of a few hundreds amino-acid sequences with the full set of Pfam domains.
If sequences are sufficiently long, DAMA can be used to accurately resolve protein domain architectures.
MyCLADE annotates protein sequences with a library of more than 2.5 million probabilistic models for the whole Pfam32 database (17,929 domains). For this reason, keep in mind that a domain annotation of a single protein sequence could take the same time as for 50-100 sequences.