Predicting mutational effects through the analysis of protein sequences

Input


GEMME takes as input a multiple sequence alignment in FASTA format. The alignment should contain the ungapped query sequence as its first entry, followed by sequences homologous to the query (up to several hundred thousand). The alignment should represent the variability of the query protein's family. Here is an example of the beginning of an input alignment file:

>HIS7_YEAST/1-220
MTEQKALVKRITNETKIQIAISLKGGPLAIEHSIFPEKEAEAVAEQATQS
QVINVHTGIGFLDHMIHALAKHSGWSLIVECIGDLHIDDHHTTEDCGIAL
GQAFKEALGAVRGVKRFGSGFAPLDEALSRAVVDLSNRPYAVVELGLQRE
KVGDLSCEMIPHFLESFAEASRITLHVDCLRGKNDHHRSESAFKALAVAI
REATSPNGTNDVPSTKGVLM
>UniRef100_A0A1E4SLW9/2-218
..TRTATIKRDTNETKIQIAVSLDGGPLSVESSIFDKPKYDEHAAQSTSS
QVIQVHTGIGFLDHMLHALAKHSGWSLIVECIGDLHIDDHHTAEDVGITL
GLAFHQALGQVKGVKRFGTGFAPLDEALSRAVVDLSNRPYAVVELGLKRE
KIGDLSCEMIPHVLESFAQGAAITLHVDCLRGFNDHHRAESAFKALAVAI
KEAISSNGTNDVPSTKGVL.
>UniRef100_P06633/4-219
...QKALVKRITNETKIQIAISLKGGPLAIEHSIFPEKEAEAVAEQATQS
QVINVHTGIGFLDHMIHALAKHSGWSLIVECIGDLHIDDHHTTEDCGIAL
GQAFKEALGAVRGVKRFGSGFAPLDEALSRAVVDLSNRPYAVVELGLQRE
KVGDLSCEMIPHFLESFAEASRITLHVDCLRGKNDHHRSESAFKALAVAI
REATSPNGTNDVPSTKGVL.
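
Before submitting an alignment, it can help to check the two requirements stated above: the query comes first and contains no gaps. The following Python sketch is not part of GEMME. The function names read_fasta and check_alignment are hypothetical, and the script assumes that gaps are written as "." or "-" and that every aligned row has the same length as the query, as in the example above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records, gap_chars="-."):
    """Return a list of problems; an empty list means the checks passed.

    Assumptions (not taken from the GEMME documentation): gaps are
    '.' or '-', and all rows must be as long as the query.
    """
    if not records:
        return ["no sequences found"]
    problems = []
    query_header, query = records[0]
    if any(c in gap_chars for c in query):
        problems.append(f"query {query_header!r} contains gap characters")
    for header, seq in records[1:]:
        if len(seq) != len(query):
            problems.append(
                f"{header!r} has length {len(seq)}, expected {len(query)}"
            )
    return problems


if __name__ == "__main__":
    toy = ">query\nMTEQKA\n>homolog_1\n..EQKA\n"
    for problem in check_alignment(read_fasta(toy)) or ["alignment looks fine"]:
        print(problem)
```

An empty list from check_alignment only means these two basic checks passed; it says nothing about whether the alignment represents the variability of the family.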


This name is used as a label for the results directory.



Advanced options


By default, GEMME predicts the effects of all possible substitutions at all positions in the query sequence (full single-site mutational landscape). Alternatively, you may provide a list of mutations of interest, with one mutant per line written as a comma-separated list of substitutions. Here is an example of the beginning of a mutation file:

S136C,N137D,V143I,E144N,I160V,P161T,F163V
S136C,N137S,P139A,Y140F,A141S,V143I,E144D,I160V,P161T,F163V
S136F,N137D,P139A,Y140F,V143I,E144N,I160V,F163V
S136F,N137D,P139A,V142F,V143T,E144N,C157S,I160V,P161T,F163I
S136F,N137D,P139A,A141C,E144D,I160V,F163I
S136F,N137D,P139A,A141S,V143I,E144N,C157T,I160V,P161T,F163V
S136F,N137D,Y140F,V142F,E144N,P161T
S136F,N137D,Y140F,A141C,V142F,V143A,E144N,C157T,P161T,F163L
S136F,N137D,Y140F,A141G,V142F,V143I,E144N,F163I
S136F,N137D,Y140F,A141S,V142F,V143I,E144N,C157S
S136F,N137D,A141S,V142F,V143T,E144N,C157S
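
A mutation file in this format can be checked before submission. The Python sketch below is not part of GEMME: the names parse_mutant and mismatches are hypothetical, and it assumes that each substitution is written as wild-type residue, position, mutant residue, with positions counted from 1 along the query sequence.

```python
import re

# One substitution: wild-type letter, position, mutant letter (e.g. S136C).
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutant(line):
    """Split a line such as 'S136C,N137D' into (wild_type, position, mutant) tuples."""
    mutations = []
    for token in line.strip().split(","):
        match = MUTATION.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognised substitution: {token!r}")
        wild_type, position, mutant = match.groups()
        mutations.append((wild_type, int(position), mutant))
    return mutations


def mismatches(mutations, query):
    """List substitutions whose wild-type residue disagrees with the query.

    Assumption: positions are 1-based indices into the query sequence.
    """
    bad = []
    for wild_type, position, mutant in mutations:
        in_range = 1 <= position <= len(query)
        if not in_range or query[position - 1] != wild_type:
            bad.append(f"{wild_type}{position}{mutant}")
    return bad


if __name__ == "__main__":
    toy_query = "MTEQKA"  # toy sequence, not a real GEMME query
    mutant = parse_mutant("M1A,K2R,Q4E")
    print(mismatches(mutant, toy_query))  # K2R is flagged: position 2 is T
```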


GEMME uses the Joint Evolutionary Trees (JET) method to compute conservation levels for all positions in the query sequence. JET takes as input a set of sequences extracted from GEMME's input alignment. It can be run once or several times; repeated runs give more statistically significant results. For a good compromise between speed and accuracy, we recommend limiting the number of sequences to 20,000 and setting the number of iterations to 2. To get more precise predictions, the number of iterations can be increased up to 10, which only slightly increases the required computing time. The maximum number of sequences may be adjusted depending on the length of the query sequence and on the size and variability of the input alignment.


Output

GEMME predicts mutational outcomes by combining (1) evolutionary conservation, (2) evolutionary fit and (3) site-independent frequencies. The output file of interest is named normPred_evolCombi.txt. It contains either a 2D matrix or a 2-column data frame, depending on whether you requested the full single-site landscape or provided a list of mutations. The following is an example of the beginning of an output file for a list of mutations:

"x"
"S136C,N137D,V143I,E144N,I160V,P161T,F163V" -8.04350366871689
"S136C,N137S,P139A,Y140F,A141S,V143I,E144D,I160V,P161T,F163V" -9.49421208888969
"S136F,N137D,P139A,Y140F,V143I,E144N,I160V,F163V" -16.0328552040513
"S136F,N137D,P139A,V142F,V143T,E144N,C157S,I160V,P161T,F163I" -21.3525902691973
"S136F,N137D,P139A,A141C,E144D,I160V,F163I" -17.1405120088204
"S136F,N137D,P139A,A141S,V143I,E144N,C157T,I160V,P161T,F163V" -17.7765596286594
"S136F,N137D,Y140F,V142F,E144N,P161T" -13.9883538607786
"S136F,N137D,Y140F,A141C,V142F,V143A,E144N,C157T,P161T,F163L" -18.1820493685406
"S136F,N137D,Y140F,A141G,V142F,V143I,E144N,F163I" -19.8509113500531
"S136F,N137D,Y140F,A141S,V142F,V143I,E144N,C157S" -14.9401630224143
"S136F,N137D,A141S,V142F,V143T,E144N,C157S" -15.8585891565074
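
The 2-column output shown above can be loaded with a few lines of Python. This is a minimal sketch, not a tool shipped with GEMME: the function name read_mutant_scores is hypothetical, and the parser assumes the layout of the example, namely a header line containing only "x" followed by one quoted mutant label and one score per line. It does not handle the 2D matrix produced for the full single-site landscape.

```python
import shlex


def read_mutant_scores(text):
    """Map each mutant label to its predicted score.

    Assumes lines of the form: "LABEL" SCORE, preceded by a one-field
    header line, as in the example output for a list of mutations.
    """
    scores = {}
    for line in text.splitlines():
        fields = shlex.split(line)  # handles the double-quoted label
        if len(fields) != 2:
            continue  # header line ("x") or blank line
        label, value = fields
        scores[label] = float(value)
    return scores


if __name__ == "__main__":
    example = (
        '"x"\n'
        '"S136C,N137D,V143I,E144N,I160V,P161T,F163V" -8.04350366871689\n'
        '"S136F,N137D,Y140F,V142F,E144N,P161T" -13.9883538607786\n'
    )
    scores = read_mutant_scores(example)
    # List mutants from the most negative score upwards.
    for label, score in sorted(scores.items(), key=lambda item: item[1]):
        print(f"{score:10.4f}  {label}")
```

In practice the text would be read from normPred_evolCombi.txt; the same parser should apply to normPred_evolInd.txt and normPred_evolEpi.txt if they follow the same layout, which is assumed here rather than verified.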

We also provide results from an independent model combining (1) evolutionary conservation and (3) site-independent frequencies (normPred_evolInd.txt) and from an epistatic model combining (1) evolutionary conservation and (2) evolutionary fit (normPred_evolEpi.txt). For a highly conserved family, with very low diversity in the input alignment, it may be advantageous to rely on the predictions of the independent model.

Three images, representing the matrices predicted by the three models, are generated to ease visualisation of the results.

- Default combined model (normPred_evolCombi.jpg) in orange
- Independent model (normPred_evolInd.jpg) in green
- Epistatic model (normPred_evolEpi.jpg) in blue

The color scales range from white/light grey through shades of the model's color (orange, green or blue) to dark grey/black, corresponding to increasing predicted mutational effects.

If you need further assistance in using the web server, please contact elodie.laine-at-sorbonne-universite.fr