DAMA

A multi-objective approach that accurately resolves protein domain architectures

Overview:

Given a protein sequence and a number of potential domains matching it, what are the domain content and the most likely domain architecture for the sequence? This problem is of fundamental importance in protein annotation, constituting one of the main steps of all predictive annotation strategies. However, it can easily become difficult to solve when several potential domains conflict because of overlapping domain boundaries. An accurate prediction of the domain architecture of a multi-domain protein provides important information for function prediction, comparative genomics, and molecular evolution.

We developed DAMA (Domain Annotation by a Multi-objective Approach), a novel approach that identifies architectures through a multi-objective optimisation algorithm combining scores of domain matches, previously observed multi-domain co-occurrence, and domain overlapping. DAMA has been validated on a known benchmark data set based on CATH structural domain assignments and on the Plasmodium falciparum proteome. When compared to existing tools on both data sets, it outperforms all of them.

Download

The DAMA package can be downloaded here.

You can unpack the archive through the command

tar xzf DAMA.tar.gz

System requirements:

DAMA was developed on a Unix operating system.
The bash shell must be installed.
The Perl executable must be accessible from the PATH environment variable.
Your libstdc++ library must support C++11.

How to compile the DAMA C++ source code:

To compile the C++ sources, type the following commands (see also the INSTALL file for more information):

	cd [DAMA repository]/Release/src/

	g++ -std=gnu++11 -o "DAMA" -O3 DAMA.cpp

How to execute DAMA:

./DAMA [-Options] -domainsHitFile <f> -knownArchFile <a> -outputFile <o>

Input files <f> and <a>, output file <o> and options are discussed below. To see a brief help and usage, you can type the following command:

./DAMA -h
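
When DAMA is called from a pipeline, the command line above can be assembled programmatically. The following is a minimal Python sketch, not part of the DAMA package: the helper name build_dama_command and the output file name out.arch are hypothetical, and the paths assume the script is run from a directory where the example files are reachable. Only flags documented in this README are used.

```python
def build_dama_command(hits_file, known_arch_file, output_file,
                       options=None, executable="./DAMA"):
    """Return the DAMA command line as a list of arguments.

    `options` maps a flag to its value; use None for flags that take
    no value (for example "--review").
    """
    cmd = [executable]
    for flag, value in (options or {}).items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    # The three required arguments described in this README.
    cmd += ["-domainsHitFile", hits_file,
            "-knownArchFile", known_arch_file,
            "-outputFile", output_file]
    return cmd


cmd = build_dama_command(
    "dataset/p.f/domains.hits",
    "database/pfam27/pfam.knownArch",
    "out.arch",  # hypothetical output file name
    options={"-evalueCutOff": 1e-3, "--review": None},
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.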

Input files:

-domainsHitFile <f> [Required]

File <f> contains domain hits for a set of proteins to be annotated (see an example at dataset/p.f/domains.hits). Each line reports the sequence identifier, the domain identifier, the start and end positions of the domain match on the probabilistic model used for annotation, the start and end positions of the domain along the sequence, and the E-value associated with the prediction. <f> can be produced from hmmscan output; a conversion script is provided at scripts/convertHmmscanOutput.pl
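
As an illustration of the line layout described above, here is a minimal Python sketch that reads such a file into records. It is not part of DAMA. It assumes the seven fields are separated by whitespace and appear in the order listed above; check dataset/p.f/domains.hits for the actual separator. The identifiers seq1 and domA in the usage line are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DomainHit:
    seq_id: str        # sequence identifier
    domain_id: str     # domain identifier
    model_start: int   # start of the match on the model
    model_end: int     # end of the match on the model
    seq_start: int     # start of the domain on the sequence
    seq_end: int       # end of the domain on the sequence
    evalue: float      # E-value of the prediction


def parse_hits(lines: Iterable[str]) -> List[DomainHit]:
    """Parse domain hit lines; blank or short lines are skipped."""
    hits = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) < 7:
            continue
        m_start, m_end, s_start, s_end = (int(v) for v in fields[2:6])
        hits.append(DomainHit(fields[0], fields[1],
                              m_start, m_end, s_start, s_end,
                              float(fields[6])))
    return hits


hits = parse_hits(["seq1 domA 1 120 15 140 1e-25"])
print(hits[0].domain_id, hits[0].seq_start, hits[0].seq_end)
```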

-knownArchFile <a> [Required]

File <a> contains the list of known domain architectures. In our analyses, we used the lists produced by CATH and Pfam, but any other source can be used. See the file format at database/pfam27/pfam.knownArch

Output File:

-outputFile <o> [Required]

After DAMA has run, the output file containing the domain architectures for the set of proteins is saved at <o>. The file has the same format as <f>, the domain hit file described above.

Options:

Filtering domain hits

-evalueCutOff <x> [default value 1e-3]

E-value threshold used to filter out weak predictions.

-domainCov <x> [default value 40]

Domain matches must cover at least <x>% of the average domain size, which is provided through the option -domainsInfoFile (see below). This filter can be disabled by setting <x> to 0.

-domainsInfoFile <d>

File <d> contains additional information about domains such as average size and clans (available only for Pfam domains). An example of <d> format can be found at database/pfam27/pfam.domains
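
The two filters above can be sketched as follows in Python. This is an illustration under stated assumptions, not the DAMA implementation: it assumes that a hit is discarded when its E-value is above -evalueCutOff, and that coverage is measured as the length of the match on the model (inclusive coordinates) relative to the average domain size. The function name passes_filters is hypothetical.

```python
def passes_filters(evalue, model_start, model_end, avg_domain_size,
                   evalue_cutoff=1e-3, domain_cov=40.0):
    """Return True if a hit survives the E-value and coverage filters."""
    # Assumption: hits with an E-value above the cutoff are weak.
    if evalue > evalue_cutoff:
        return False
    # Setting the coverage threshold to 0 disables the coverage filter.
    if domain_cov <= 0:
        return True
    # Assumption: coverage is computed on model coordinates, inclusive.
    matched = model_end - model_start + 1
    coverage = 100.0 * matched / avg_domain_size
    return coverage >= domain_cov


print(passes_filters(1e-5, 1, 50, 100))   # covers 50% of the domain
print(passes_filters(1e-5, 1, 30, 100))   # covers 30% of the domain
```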

Controlling overlapping

-overlappingDomainFile <p>

File <p> contains the list of allowed domain overlaps. See an example at database/pfam27/pfam.overlapping

-overlappingAA <x> [default value 30]

Maximum number of amino acids allowed in a domain overlap. No overlap is allowed if <x> is 0.

-overlappingMaxDomain <x> [default value 50]

A domain overlap may comprise at most <x>% of the match. No overlap is allowed if <x> is 0.
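
The two thresholds can be illustrated with the Python sketch below. It rests on assumptions that this README does not confirm: both conditions must hold at the same time, the percentage is taken relative to the shorter of the two matches, and coordinates are inclusive sequence positions. The list of allowed overlaps given by -overlappingDomainFile is not modelled. The function names are hypothetical.

```python
def overlap_length(start_a, end_a, start_b, end_b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def overlap_allowed(a, b, max_aa=30, max_pct=50.0):
    """a and b are (seq_start, seq_end) pairs of two domain matches."""
    ov = overlap_length(a[0], a[1], b[0], b[1])
    if ov == 0:
        return True           # the matches do not overlap at all
    if max_aa == 0 or max_pct == 0:
        return False          # a value of 0 forbids any overlap
    # Assumption: the percentage refers to the shorter match.
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return ov <= max_aa and 100.0 * ov / shorter <= max_pct


print(overlap_allowed((1, 100), (91, 200)))   # 10 shared residues
print(overlap_allowed((1, 100), (51, 200)))   # 50 shared residues
```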

Enriching architectures with new domains

--review

New domains are added to the architecture if their E-value is below the threshold set by -evalueCutOffConf (see below).

-evalueCutOffConf <x> [default value 1e-10]

Confidence threshold used to add new domains into the architecture.

Tolerance value (δ) for each objective function

-df1 [default value 10]

Tolerance value for function F1 (δ1). [We used 10 for the Pfam database and 40 for CATH. For other databases, the user may need to run new experiments to set δ1 properly.]

-df2 [default value 0]

Tolerance value for function F2 (δ2).

-df3 [default value 0]

Tolerance value for function F3 (δ3).

-df4 [default value 0]

Tolerance value for function F4 (δ4).

-df5 [default value 0]

Tolerance value for function F5 (δ5). Note that -df5 has no effect, because F5 is the last optimisation function. However, this parameter makes it easy to introduce new functions in future implementations.
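
One common way to use per-objective tolerances is a lexicographic selection: candidates within δ of the best value of one function are kept, and the next function is optimised among them. The Python sketch below illustrates that general idea only. It is not taken from the DAMA source, and it assumes that every function is maximised and that δ is an absolute difference; how DAMA actually applies the tolerances may differ. It is consistent with the note above in that the tolerance of the last function is never used.

```python
def select_architecture(candidates, objectives, tolerances):
    """Lexicographic selection with tolerances (illustrative sketch).

    candidates: non-empty list of candidate architectures
    objectives: list of functions, most important first
    tolerances: one tolerance per objective
    """
    survivors = list(candidates)
    # Every objective except the last one narrows the candidate set.
    for objective, delta in zip(objectives[:-1], tolerances):
        best = max(objective(c) for c in survivors)
        survivors = [c for c in survivors if objective(c) >= best - delta]
    # The last objective only picks the winner; its tolerance is unused.
    return max(survivors, key=objectives[-1])


# Toy candidates described by two made-up scores (f1, f2).
candidates = [(100, 1), (95, 3), (80, 9)]
objectives = [lambda c: c[0], lambda c: c[1]]
print(select_architecture(candidates, objectives, [10, 0]))
print(select_architecture(candidates, objectives, [0, 0]))
```

With a tolerance of 10 on the first score, the candidate (95, 3) stays in play and wins on the second score; with a tolerance of 0 only the top candidate on the first score remains.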

Setting objective functions

-normProb [on or off], default off

If -normProb is "on", function F3 replaces the count of the domain pair A and B in a sequence with the ratio P(AB)/(P(A)P(B)), which measures how likely the pair is to be observed together in a sequence. Probability values are computed from the list of known domain architectures.
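
To make the ratio concrete, the Python sketch below estimates it from a toy list of architectures. The estimation scheme is an assumption made for illustration: P(A) is taken as the fraction of known architectures containing A, and P(AB) as the fraction containing both A and B. DAMA may estimate these probabilities differently. The domain names A, B and C are placeholders.

```python
def pair_ratio(architectures, a, b):
    """Estimate P(AB) / (P(A) * P(B)) from a list of architectures.

    Each architecture is a list of domain identifiers.
    """
    n = len(architectures)
    domain_sets = [set(arch) for arch in architectures]
    p_a = sum(a in s for s in domain_sets) / n
    p_b = sum(b in s for s in domain_sets) / n
    p_ab = sum(a in s and b in s for s in domain_sets) / n
    if p_a == 0 or p_b == 0:
        return 0.0  # a domain never observed gives no evidence
    return p_ab / (p_a * p_b)


known = [["A", "B"], ["A", "B"], ["A"], ["C"]]
print(pair_ratio(known, "A", "B"))  # above 1: A and B co-occur often
print(pair_ratio(known, "A", "C"))  # 0: A and C never co-occur
```

A ratio above 1 means the pair is seen together more often than expected if the two domains occurred independently.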

-archSize [domain or clan], default domain

This option implements function F4 by counting either different domains or different clans. The clan option can be used only if domains are grouped into clans in the reference database.

Output

-showAll [on or off], default off

When set to "on", this option shows all architectures for each protein and not only the best one.

Licence:

The DAMA program has been developed under the CeCILL licence (see LICENCE).

Contacts:

For questions, comments, or suggestions feel free to contact Alessandra Carbone or Juliana S. Bernardes.

Reference:

If you use DAMA, please cite:

J.S. Bernardes, F.R.J. Vieira, G. Zaverucha and A. Carbone. (2015) A multi-objective approach accurately resolves protein domain architectures. Bioinformatics

Last Update Jan. 2015