MaxSubTree (or MST for short)

MaxSubTree: A combinatorial approach to detect coevolved amino
acid networks in protein families of variable divergence.

Overview:

Fine-grained analysis of protein sequence families reveals networks of coevolved amino acids. These networks are clusters of residues that are often in physical contact with one another, and they can also relate residues located far apart in the three-dimensional structure. Coevolved residues often play a major biological role in the protein, and their interactions can be of several kinds, including binding specificity, allosteric regulation and conformational change. By carefully tracing how residues evolved within the phylogenetic tree of the sequences of a protein family, the Maximal SubTree method captures the transition, over evolutionary time, of a position from conserved to coevolved, and provides a numerical evaluation of the degree of coevolution of pairs of residues in a protein. This combinatorial approach drops the requirement of high sequence divergence that limits the applicability of previously proposed statistical approaches, and it can be applied with high accuracy to protein families of variable divergence.

Download

The MaxSubTree package can be downloaded here.

You can unpack the archive with the command

tar xzf MaxSubTree.tgz

System requirements:

Linux or Mac OS X
Java

How to execute MaxSubTree:

A Makefile is given to compile the program. Type

make comp

To display the help file, type

make help

To run a demo, type

make demo1

for the globin analysis. Note that the outputs of the globin, serine protease and dehydrogenase analyses are provided in the demo directory. A description of the output files is given below.

The program is compiled by typing

make start

Once the program is compiled, to get some help:

java MaxSubTree -help

and to run an analysis:

java MaxSubTree [arg1] [arg2] [arg3] [arg4] '-d [op]'

Input Parameters:

[arg1] : the first argument is the number of aligned sequences in the sequence file. It cannot be 0. If the given number of sequences does not correspond to the number of sequences in the sequence file then an error message is displayed and the run stops.

[arg2] : the second argument is the number of aligned positions in the sequence file. It cannot be 0. If the given number of positions does not correspond to the number of positions in the sequence file then an error message is displayed and the run stops.

[arg3] : the third argument is the sequence file. Sequences have to be in FASTA format. !! The order of the sequences has to be the same as in the tree file !!

[arg4] : the fourth argument is the tree file. The tree has to be a binary tree in parenthesized format.

[op] : the path to an output directory can be specified with the -d option.
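Since the run stops when [arg1] or [arg2] does not match the sequence file, it can help to count both values before launching the analysis. The following is a minimal sketch in Python, not part of the MaxSubTree package; the function name alignment_dimensions and the file names family.fasta and family.tree are hypothetical, and the sketch assumes a standard FASTA layout (header lines starting with '>', sequences possibly wrapped over several lines).

```python
def alignment_dimensions(fasta_text):
    """Return (number of sequences, number of aligned positions)."""
    sequences = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new sequence record.
            sequences.append([])
        elif sequences:
            # Sequence data may be wrapped over several lines.
            sequences[-1].append(line)
    lengths = {len("".join(parts)) for parts in sequences}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned: lengths {sorted(lengths)}")
    return len(sequences), lengths.pop()


# Tiny made-up alignment, used only to illustrate the counting.
example = """>seq1
MK-LV
AG
>seq2
MKTLV
-G
"""

n_seq, n_pos = alignment_dimensions(example)

# Hypothetical file names; replace them with your own sequence and tree files.
command = ["java", "MaxSubTree", str(n_seq), str(n_pos), "family.fasta", "family.tree"]
print(" ".join(command))
```

For a real run, read the text from the sequence file instead of the inline example and pass the printed values as [arg1] and [arg2].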

Output files:

Six output files are created during the analysis: three give the details of the calculations performed during the analysis (calculation outputs), and the remaining three give the result of the analysis (coevolution analysis output). Residue positions are referred to in the output files according to their position in the alignment of the sequence file.

Calculation outputs

1. 'seed.calc' file: it gives the list of the alignment positions with their persistency score (Ps). Positions selected as seed positions (Ps>0) are indicated.

Example :

                  1  Ps=-83.0
                  2  Ps=-55.0
                  3  Ps=-34.0
                  4  Ps=43.0   <- seed
                  5  Ps=-27.0
                  ...
                  159  Ps=-5.0
                  160  Ps=63.0   <- seed
                  161  Ps=-33.0
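The seed positions can be extracted from this file with a few lines of code. The sketch below is only an illustration and assumes that every line follows the layout of the example above (position, then 'Ps=' and the score); the function name read_seeds is hypothetical. Seeds are recomputed from the Ps>0 rule rather than from the '<- seed' marker.

```python
import re

# One line of seed.calc as shown in the example: "<position>  Ps=<score>".
LINE = re.compile(r"^\s*(\d+)\s+Ps=(-?\d+(?:\.\d+)?)")


def read_seeds(text):
    """Return ({position: persistency score}, sorted list of seed positions)."""
    scores = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            scores[int(match.group(1))] = float(match.group(2))
    # A position is a seed when its persistency score is strictly positive.
    seeds = sorted(pos for pos, ps in scores.items() if ps > 0)
    return scores, seeds


sample = """\
    1  Ps=-83.0
    4  Ps=43.0   <- seed
    5  Ps=-27.0
    160  Ps=63.0   <- seed
"""

scores, seeds = read_seeds(sample)
print(seeds)
```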

2. 'correspMatrices.calc' file: it gives the details of the calculation that is performed during the coevolution score analysis for each pair of seed positions.

For each pair of seed positions, the following data are given:
- the two positions which are compared in the form : 'Maximal SubTrees Analysis for positions <i> and <j>'
- the correspondence matrix: with residues of position <i> as vertical indexes of the matrix and residues of <j> as horizontal indexes of the matrix

- the intermediate calculation for each residue, with the 3 factors, in the form:

'* [res][occ]: [max] * [spe] * [int] * [occ] = [result]'

where
[res] : residue identity
[occ] : the occurrence of the residue at this position
[max] : the maximal correspondence factor
[spe] : the specificity factor 
[int] : the interference factor

- the intermediate coevolution subscores for each of the two positions deduced from the above intermediate calculations in the form:

'SubScore for residues of position <i> compared to position <j>=[coEj(i)]'
'SubScore for residues of position <j> compared to position <i>=[coEi(j)]'

- the final coevolution score for the pair of positions in the form:

'FINAL COEVOLUTION SCORE: CoE(<i>,<j>)=[result]'

Example : 



Maximal SubTrees Analysis for positions 108 and 112:


 |   P  |   V  |   I  |   T  |   C  |   R  |   K  |   D  |   S  |   N  |
L| 0.95 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
H| 0.00 | 0.39 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
F| 0.00 | 0.00 | 0.00 | 0.00 | 0.11 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 |
I| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
M| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
N| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
K| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
T| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |



                  * L776: 0.95 * 0.99 * 0.99 * 776.0 = 736.0
                  * H84: 0.39 * 0.56 * 0.96 * 84.0 = 18.2
                  * F10: 0.11 * 0.49 * 0.99 * 10.0 = 0.54
                  * I3: 0.00 * 1.00 * 1.00 * 3.0 = 0.00
                  * M2: 0.00 * 0.00 * 1.00 * 2.0 = 0.00
                  * N2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
                  * K2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
                  * T1: 0.00 * 0.00 * 1.00 * 1.0 = 0.00
SubScore for residues of position 108 compared to position 112=0.8630297324089018

                

                  * P781: 0.95 * 0.99 * 0.99 * 781.0 = 739.0
                  * V52: 0.39 * 1.00 * 1.00 * 52.0 = 20.6
                  * I33: 0.30 * 1.00 * 1.00 * 33.0 = 9.93
                  * T4: 0.00 * 1.00 * 1.00 * 4.0 = 0.00
                  * C2: 0.11 * 1.00 * 1.00 * 2.0 = 0.22
                  * R2: 0.11 * 1.00 * 1.00 * 2.0 = 0.22
                  * K2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
                  * D2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
                  * S1: 0.00 * 0.00 * 1.00 * 1.0 = 0.00
                  * N1: 0.00 * 0.00 * 1.00 * 1.0 = 0.00
SubScore for residues of position 112 compared to position 108=0.879712343711068

               
FINAL COEVOLUTION SCORE: CoE(108,112)=1.74
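The numbers in this example can be cross-checked with a short script. The sketch below rests on two observations drawn from this example only, not on a documented formula: the final score coincides with the sum of the two subscores, and each subscore is close to the sum of the per-residue results divided by the total number of occurrences (880 here). The printed values are rounded, so the second check is only approximate. The names TERM and parse_terms are hypothetical.

```python
import re

# Matches lines such as "* L776: 0.95 * 0.99 * 0.99 * 776.0 = 736.0".
TERM = re.compile(
    r"\*\s*([A-Z])(\d+):\s*([\d.]+) \* ([\d.]+) \* ([\d.]+) \* ([\d.]+) = ([\d.]+)"
)


def parse_terms(text):
    """Return a list of (residue, occurrence, result) tuples."""
    return [(m.group(1), int(m.group(2)), float(m.group(7)))
            for m in TERM.finditer(text)]


# Intermediate calculations for position 108, copied from the example above.
block_108 = """
* L776: 0.95 * 0.99 * 0.99 * 776.0 = 736.0
* H84: 0.39 * 0.56 * 0.96 * 84.0 = 18.2
* F10: 0.11 * 0.49 * 0.99 * 10.0 = 0.54
* I3: 0.00 * 1.00 * 1.00 * 3.0 = 0.00
* M2: 0.00 * 0.00 * 1.00 * 2.0 = 0.00
* N2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
* K2: 1.00 * 1.00 * 1.00 * 2.0 = 2.00
* T1: 0.00 * 0.00 * 1.00 * 1.0 = 0.00
"""

terms = parse_terms(block_108)
total_occurrences = sum(occ for _, occ, _ in terms)
# Assumption inferred from the example: subscore ~ sum(results) / sum(occurrences).
approx_subscore = sum(result for _, _, result in terms) / total_occurrences

# Subscores as printed in the example output.
sub_108 = 0.8630297324089018
sub_112 = 0.879712343711068
final_score = sub_108 + sub_112

print(total_occurrences, round(approx_subscore, 3), round(final_score, 2))
```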

3. 'relAverBehavDiff.calc' file: it gives details of the calculation that is performed during clustering. More precisely, it provides the Relative Average Behaviour between neighbouring seed positions in the clusterized relative coevolution score matrix.

For each pair of neighbouring seed positions in the clusterized relative coevolution score matrix, the following data are given:

- the two neighbouring positions in the form:

'Neighbouring seed positions <i>-><j>:  RAB difference=[result]'

- the relative coevolution score of <j> with each of the 5 positions of its maximal set <Ej> in the form:

'<j>-<Ej1>=[coE(<j>,<Ej1>)]*[coh]=[result]' 

                  
where
[coE(<j>,<Ej1>)] : the coevolution score between <j> and a position in <Ej>
[coh] : coherence of relative position of the coevolution score in the
        domain of variation of <j>

- the relative coevolution score of <j> with each one of the 5 positions of the neighbouring set associated to position <i>, denoted <Ei>, in the form:

'<j>-<Ei1>=[coE(<j>,<Ei1>)]*[coh]=[result]'

Example :


Neighbouring seed positions 112->108:  RAB difference=0.0


108-112=1.7427420761199697*0.9607359470736643=1.674314959006243
108-108=1.9977272727272728*1.0=1.9977272727272728
108-110=1.7177099801703488*0.9806937946208927=1.6845475185114378
108-115=1.8139657222447463*0.9826848978075667=1.7825567203905075
108-99=1.7883074299558124*0.9826848978075667=1.7573427040546399
RAB=1.7792978349380202


108-108=1.9977272727272728*1.0=1.9977272727272728
108-115=1.8139657222447463*0.9826848978075667=1.7825567203905075
108-99=1.7883074299558124*0.9826848978075667=1.7573427040546399
108-110=1.7177099801703488*0.9806937946208927=1.6845475185114378
108-112=1.7427420761199697*0.9607359470736643=1.674314959006243

RAB=1.7792978349380202
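In this example each RAB value coincides with the arithmetic mean of the five products listed above it, and because the two sets contain the same five positions, the two RAB values are equal and their difference is 0.0. The sketch below recomputes the value from the printed numbers; reading RAB as a plain mean is an inference from this example, not a statement of the program's internal formula.

```python
# (pair, coevolution score, coherence) as printed in the example above.
terms = [
    ("108-112", 1.7427420761199697, 0.9607359470736643),
    ("108-108", 1.9977272727272728, 1.0),
    ("108-110", 1.7177099801703488, 0.9806937946208927),
    ("108-115", 1.8139657222447463, 0.9826848978075667),
    ("108-99", 1.7883074299558124, 0.9826848978075667),
]

# Relative coevolution score of each pair: score * coherence.
products = [score * coherence for _, score, coherence in terms]

# Assumption inferred from the example: RAB is the mean of the five products.
rab = sum(products) / len(products)

# Both sets hold the same positions here, so the difference is zero.
rab_difference = abs(rab - rab)

print(rab, rab_difference)
```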

Coevolution analysis output

Two files provide the clusterized relative coevolution score matrix, one in raw format and one in ViDaExpert format. A third file gives the alignment positions of the neighbouring seed positions in the matrix. They are described in detail below.

4. output.index: the ordered list of the seed positions after clustering; it gives the indexes of the relative coevolution score matrix. Residue positions are numbered according to their position in the alignment given in the sequence file. For example, a list beginning with 98, 43, 104, ... means that position 98 in the sequence alignment is clusterized next to alignment position 43, which is itself clusterized next to alignment position 104, and so on.

It corresponds to vertical indexes in the raw format matrix from left to right and to horizontal indexes in the raw format matrix from top to bottom.

It corresponds to vertical indexes in the vida format matrix from left to right and to horizontal indexes in the vida format matrix from bottom to top.

5. output.raw: the relative coevolution score matrix in raw format.

6. output.vida: the relative coevolution score matrix in ViDaExpert format. ViDaExpert is downloadable from http://www.ihes.fr/~materials.

ViDaExpert

The visualization tool ViDaExpert can be downloaded here. A document describing how to use it to visualize coevolved amino-acid networks is found here.

Licence:

The MaxSubTree program has been developed under the CeCILL licence.

Contacts:

For questions, comments, or suggestions feel free to contact Alessandra Carbone or Julie Baussand.

Reference:

If you are using MaxSubTree, please cite:

J. Baussand, A. Carbone. A combinatorial approach to detect coevolved amino-acid networks in protein families with variable divergence, PLoS Computational Biology 5(9) e1000488 (2009)

Last Update Sept. 2013