BIS2 is a computational method that finds coevolution signals among pairs of positions/fragments in a multiple sequence alignment (MSA) of a set of homologous protein sequences. It uses evolutionary information to score coevolution pairs and uses information coming from the phylogenetic tree built on the MSA (see [Dib et al., 2012]).
The algorithm implemented in BIS2 analyses the variability of amino acids in each column of the MSA. For a given column, the coevolution pattern is identified by partitioning the sequences into groups, each group showing the same amino acid in that column. With this definition, we can identify pairs of columns that show the same partitioning, irrespective of the amino acid composition.
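As a rough illustration of the partitioning idea only (not the BIS2 implementation, which also handles exceptions, scoring and the phylogenetic tree), the sketch below groups sequences by the residue they carry in a column and reports the column pairs that induce the same grouping. The function names and the toy alignment are hypothetical.

```python
from collections import defaultdict


def column_partition(msa, col):
    """Return the partition of sequence indices induced by column `col`."""
    groups = defaultdict(list)
    for idx, seq in enumerate(msa):
        groups[seq[col]].append(idx)
    # A frozenset of frozensets forgets which residue labels each group,
    # so two columns compare equal whenever they split the sequences alike.
    return frozenset(frozenset(g) for g in groups.values())


def same_partition_pairs(msa):
    """List column pairs (i, j), i < j, that induce the same partition."""
    n_cols = len(msa[0])
    parts = [column_partition(msa, c) for c in range(n_cols)]
    return [(i, j)
            for i in range(n_cols)
            for j in range(i + 1, n_cols)
            if parts[i] == parts[j]]


# Toy alignment (assumption): columns 0 and 1 split the sequences
# identically, although they contain different amino acids.
toy_msa = ["ARND", "ARNE", "GKND", "GKNE"]
print(same_partition_pairs(toy_msa))  # [(0, 1)]
```

In this toy case columns 0 and 1 are reported together because both separate sequences {0, 1} from {2, 3}, while column 3 separates {0, 2} from {1, 3}.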
The outcome of the coevolution analysis performed with BIS2 depends on the maximum number of exceptions allowed in a coevolution pattern. For a given position on the MSA, an exception is an amino acid that occurs only once in the corresponding column. The dimension is the number of exceptions in a column. Setting the maximum number of exceptions D to a given value causes BIS2 to focus only on positions whose dimension d satisfies d ≤ D.
BIS2 is run several times, once for each dimension d such that 0 ≤ d ≤ D. Each run has two steps: first, coevolving pairs are selected and scored; second, the coevolving pairs are clustered into sets. Clustering is performed with CLAG [Dib et al., 2012].
BIS2 is specifically designed to find coevolution signals in few or conserved sequences. As a general guideline, we recommend applying it either to tens of sequences, or to a few hundred sequences with relatively high API (Average Percentage of Identity).
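The definition of dimension can be made concrete with a minimal sketch: count the residues that occur exactly once in a column, then keep the columns with d ≤ D. The function names and the toy alignment are hypothetical, and how BIS2 treats gap characters when counting exceptions is not covered here.

```python
from collections import Counter


def column_dimension(msa, col):
    """Number of exceptions: residues occurring exactly once in the column."""
    counts = Counter(seq[col] for seq in msa)
    return sum(1 for n in counts.values() if n == 1)


def columns_within(msa, max_exceptions):
    """Indices of the columns whose dimension d satisfies d <= D."""
    n_cols = len(msa[0])
    return [c for c in range(n_cols)
            if column_dimension(msa, c) <= max_exceptions]


# Toy alignment (assumption): the columns have dimensions 1, 0, 1 and 2.
toy_msa = ["AKLT", "AKMS", "ARLV", "GRLT", "ARLT"]
print([column_dimension(toy_msa, c) for c in range(4)])  # [1, 0, 1, 2]
print(columns_within(toy_msa, 1))                        # [0, 1, 2]
```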
BIS2 should not be used for the analysis of thousands of sequences, nor on very divergent ones (API ≤ 40%). It can be successfully used on divergent sequences if the aim is to find coevolution signals in their conserved motifs, as for the ATPase analysis reported in [Dib et al., 2012]. To check whether the dataset is suitable for BIS2 analysis, the API can be computed with alistat.
Some examples of protein sequence families discussed in [Oteri et al., 2017]:
Please refer to [Dib et al., 2012] and [Champeimont et al., 2016] for details and more examples of BIS2 analyses.
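For a quick check without alistat, the average pairwise identity of an alignment can be approximated as sketched below. This is an assumption-laden estimate: identity is computed over the columns where neither sequence has a gap, which may differ from the exact definition used by alistat, so values close to the thresholds above should be confirmed with that tool. All names are hypothetical.

```python
from itertools import combinations

GAPS = "-."


def pairwise_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b)
                if x not in GAPS and y not in GAPS]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def average_percentage_identity(msa):
    """Mean of the pairwise identities over all sequence pairs."""
    pairs = list(combinations(msa, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (assumption): pairwise identities are 75%, 50% and 75%.
toy_msa = ["ACDE", "ACDF", "ACGF"]
print(round(average_percentage_identity(toy_msa), 1))  # 66.7
```

The number of pairs grows quadratically with the number of sequences, so this direct computation is only practical for the alignment sizes recommended above.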
To run a BIS2 analysis on a complex (P1,P2), the user should provide a MSA in which the sequences of protein P1 and those of protein P2 are concatenated. The two sequences concatenated in each row must come from the same species.
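The species-pairing requirement can be illustrated with a minimal sketch. It assumes that P1 and P2 have already been aligned separately, that each alignment holds one sequence per species, and that the species name has already been extracted as the dictionary key; the function names and the toy data are hypothetical.

```python
def concatenate_by_species(msa_p1, msa_p2):
    """Join aligned P1 and P2 sequences for species present in both MSAs.

    Each argument maps a species name to one aligned sequence.
    Species found in only one of the two alignments are dropped.
    """
    shared = [sp for sp in msa_p1 if sp in msa_p2]
    return {sp: msa_p1[sp] + msa_p2[sp] for sp in shared}


def to_fasta(records):
    """Serialize a {name: sequence} mapping as multi-FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Toy alignments (assumption): yeast has no P2 sequence and is dropped.
p1 = {"human": "AC-D", "mouse": "ACGD", "yeast": "A--D"}
p2 = {"mouse": "KL", "human": "KM"}
print(to_fasta(concatenate_by_species(p1, p2)))
```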
Such input MSA can be created through the following five steps:
Once the MSA is ready, it is good practice to check its characteristics before running BIS2, to see whether the criteria described above are met.
For large alignments (several hundreds of sequences) displaying very high API (≥ 80%), BIS2 can be applied with default parameters.
On very large sets of sequences, displaying moderate API (∼60%), two possibilities are envisaged: run BIS2 with the alphabet reduction option (see below), or extract smaller samples from the MSA.
A procedure for extracting subsets of sequences from a MSA is sketched below:
The simplest way to run BIS2 is to provide just a MSA from a set of sequences homologous to the protein(s) to be studied. Any further input is automatically pre-computed before running BIS2.
An input MSA in multi-FASTA format is required. It can either be pasted into the box or uploaded as a file. A sample input file can be loaded.
By default, the phylogenetic tree is computed with BioNJ, but the user can customize this behavior in two ways, as described below.
The user can choose between BioNJ and PhyML.
The default is BioNJ, where the distance matrix is computed with protdist using Jones-Taylor-Thornton distance model. PhyML is run with default parameters.
The tree must have been computed on the MSA, should be provided in Newick format, and should be rooted. A sample input file can be loaded.
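Before uploading a custom tree, a quick sanity check of the rooting can help. The sketch below relies on a common convention rather than on anything specific to BIS2: a rooted bifurcating tree has two children at the top level of the Newick string, whereas an unrooted tree is usually written with three. A rooted tree with a multifurcating root would be misjudged, and quoted labels containing commas or parentheses are not handled. The function names are hypothetical.

```python
def root_child_count(newick):
    """Count the children of the outermost node of a Newick string."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    depth = 0
    children = 1
    seen_open = False
    for ch in text:
        if ch == "(":
            depth += 1
            seen_open = True
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        elif ch == "," and depth == 1:
            # A comma at depth 1 separates two children of the root.
            children += 1
    if depth != 0 or not seen_open:
        raise ValueError("malformed Newick string")
    return children


def looks_rooted(newick):
    """Heuristic: exactly two children at the root suggests a rooted tree."""
    return root_child_count(newick) == 2


print(looks_rooted("((A:0.1,B:0.2):0.3,C:0.4);"))  # True
print(looks_rooted("(A:0.1,B:0.2,C:0.3);"))        # False
```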
The user can either paste the tree into the box or upload it as a file.
A custom D value can be provided by the user through the "Dimension" button.
BIS2 finds pairs of coevolving amino acids and, by default, extends each position by conservation on the neighboring positions of the MSA. This allows the creation of blocks around hits. Block creation can be disabled by checking the item "Only Hits":
For special cases, particularly when poorly conserved proteins are analysed, the alphabet of 20 letters can be reduced to 8 letters, corresponding to the physico-chemical classes [Gouy et al., 2010]:
Alphabet reduction by physico-chemical properties can be set by checking the item "pc". With the "pc" option, BIS2 is run on a MSA where amino acids belonging to the same physico-chemical class are replaced by the same character.
The user can provide a custom definition of amino acid classes, by typing a string containing the 20 amino acids, with classes separated by commas (for instance: KR,AFILMVW,NQST,HY,C,DE,P,G) in the dedicated box.
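A custom class string can be checked before submission with a small sketch like the one below, which verifies that each of the 20 amino acids appears in exactly one class and then shows the effect of the reduction on a sequence. Using the first letter of each class as the replacement character is an assumption made for illustration; the character actually used by BIS2 is not specified here. The function names are hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_classes(spec):
    """Map each amino acid to the first letter of its class."""
    mapping = {}
    for group in spec.split(","):
        group = group.strip().upper()
        if not group:
            raise ValueError("empty class in definition")
        for aa in group:
            if aa not in AMINO_ACIDS:
                raise ValueError(f"unknown amino acid: {aa!r}")
            if aa in mapping:
                raise ValueError(f"{aa!r} appears in more than one class")
            mapping[aa] = group[0]
    missing = AMINO_ACIDS - set(mapping)
    if missing:
        raise ValueError("missing amino acids: " + "".join(sorted(missing)))
    return mapping


def reduce_sequence(seq, mapping):
    """Replace each residue by its class letter; gaps are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in seq)


classes = parse_classes("KR,AFILMVW,NQST,HY,C,DE,P,G")
print(reduce_sequence("MKR-DE", classes))  # AKK-DD
```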
The submission form is cleared by clicking on the button "Reset"; once the input is set, the analysis is started by clicking on the button "Submit":
Optionally, the user can provide a job ID and an e-mail address. Otherwise, a job ID is generated automatically.
After submission, the user is redirected to a webpage containing the link to the results page.
The results page displays BIS2 results for each dimension d such that 0 ≤ d ≤ D. To see the details of each cluster, say, for d = 0, click on the corresponding item:
The window allows the user to visualize the coevolution clusters directly on the MSA provided as input:
This box contains the aligned sequences (MSA); amino-acids are colored according to the physico-chemical class [Gouy et al., 2010].
This box contains a histogram representation of the conservation rate, on each column.
This box contains a line for each cluster, where coevolving positions, or hits, are represented with an "H", and extensions by conservation are represented with an "E". The names of the clusters are reported on the left. Each cluster is associated with the dimension d, the ID, and the clustering scores (symmetric and environmental, see [Dib et al., 2012] for details).
Detailed information on each cluster can be retrieved from the "view table" link on top right of the window:
A new page shows the clusters found by BIS2 in tabular format:
The description of each column, from left to right, is as follows:
The information contained in the HTML table can be displayed in text format (click "text format" on the right).
To download all the results in a .tar archive, click on the button "Download Results" at the bottom of the results page.
Information on coevolution clusters can be visualized on a sequence of choice. This feature is particularly useful for two reasons. First, it gives the positions in the sequence of interest rather than in the MSA. Second, the clusters for all dimensions d can be visualized together, making it easy to see all the coevolution signals supporting a set of fragments in the protein sequence. To map clusters on a sequence, click on "Map on sequence" at the top of the results page.
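The conversion from MSA columns to positions in a sequence can be sketched for the simplest case, where the reference sequence is one of the rows of the MSA. The numbering conventions (0-based columns, 1-based residue positions) and the function names are assumptions; a sequence that is not part of the MSA would first have to be aligned to it, which is not shown.

```python
GAPS = "-."


def column_to_residue(aligned_seq):
    """Map 0-based MSA columns to 1-based positions in the ungapped sequence."""
    mapping = {}
    position = 0
    for col, ch in enumerate(aligned_seq):
        if ch not in GAPS:
            position += 1
            mapping[col] = position
    return mapping


def map_cluster(columns, aligned_seq):
    """Positions of the cluster columns that fall on a residue, not a gap."""
    lookup = column_to_residue(aligned_seq)
    return [lookup[c] for c in columns if c in lookup]


# Toy aligned row (assumption): columns 1 and 4 are gaps.
row = "M-KT.LV"
print(map_cluster([0, 1, 5], row))  # [1, 4]
```

Columns that align to a gap in the reference have no counterpart in the sequence and are dropped, so a cluster can appear with fewer residues on the sequence than on the MSA.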
The reference sequence can be either the MSA consensus or a sequence on the MSA, selected from the drop-down menu, or a different one, which can be uploaded as a file in FASTA format.
Multiple sequences can be loaded simultaneously.
Once the reference sequence is selected, it is displayed on the right with the corresponding ID. Each position is labelled with one or more colours, corresponding to the clusters containing that position. Clusters are shown on different lines, for each dimension d.
The list of coevolution clusters is reported, with the corresponding colour, below the sequence representation. By default, all clusters are shown. Visualization of a cluster can be enabled or disabled, by clicking on the button on the right of the cluster name.
To download all the results in a .tar archive, click on the button "Download Mapping Data" at the bottom of the sequence box.
Information on coevolution clusters can be visualized on a structure, if available. This makes it possible to appreciate existing coevolution links between structural fragments, which can either be evidence of spatial proximity or provide evidence for the correctness of a structural model. In particular, coevolution analysis with BIS/BIS2 was able to suggest functional links within a protein or among groups of proteins. To map clusters on a structure, click on "Map on structure" at the top of the results page.
The structure can be either provided through its pdb code, or uploaded as a file in pdb format. Click on "Add to panel 1" to visualize the structure. For the sample analysis on one protein, click on "Load Sample PDB Code" and select "Protein A (1BDD)".
By default, all chains are shown (and depicted with different colours). Each chain can be enabled or disabled with its checkbox at the bottom of the panel. To change the view of the structure, use the mouse controls below:
| Action | Mouse |
|---|---|
| Rotate | Left key |
| Move | Central key |
| Zoom | Wheel |
The list of coevolution clusters is shown near the structure, and the clusters can be displayed interactively. To enable a cluster, click on the button "Show" on the right of its name; to disable it, click on the button "Hide".
To download all the results in a .tar archive, click on the button "Download Mapping data" at the top of the structure box.
Information on coevolution clusters can be visualized on two proteins simultaneously, thanks to the two-panel visualization.
The structures can be either provided through their pdb codes, or uploaded as files in pdb format. The user should provide the structure for the first protein and click on "Add to panel 1", then provide the structure for the second protein and click on "Add to panel 2". To automatically upload the pdb structures for the sample analysis on two proteins, click on "Load Sample PDB Code" and select "HCV NS3 (1CU1) and NS5B (1GX6)".
Visualization of the two proteins is done independently on the two panels.
Clusters are enabled and disabled automatically on the two proteins, allowing the user to compare the relative positions of coevolving residues on the two structures.
To download all the results in a .tar archive, click on the button "Download Mapping Data" at the top of the structure box.
The BIS2Analyzer interface provides a simple way to load an example of analysis.
To run the analysis on a single protein, click on the button "Load Sample Data" and select "Protein A":
The MSA and the phylogenetic tree are loaded automatically. Then, click on "Submit" to start the analysis. See Jobs submission and monitoring for retrieving information on the launched job and to access the results.
The example dataset contains a MSA of 452 sequences for the B domain of protein A, and the corresponding phylogenetic tree (for more details, see [Dib et al., 2012]).
Results of the analysis can be visualized on the MSA, allowing the user to easily locate co-evolution clusters directly on the alignment.
On the Map to sequence page, clusters can be shown on a sequence of choice.
To display the clusters on a structure, the user can directly load the sample pdb code on the Map to structure page.
The display of residues belonging to clusters can be interactively enabled and disabled, both on a reference sequence and on a pdb structure. Enabled residues are shown with the same fixed color in both representations, which makes it easier to move from one to the other.
BIS2Analyzer supports analysis on two proteins simultaneously. Guidelines on how to prepare the MSA for BIS2 analysis are detailed here.
To run a sample analysis on two proteins, click on the button "Load Sample Data" and select "HCV NS3 and NS5B":
The MSA and the phylogenetic tree are loaded automatically. Then, click on "Submit" to start the analysis. See Jobs submission and monitoring for retrieving information on the launched job and to access the results.
The example dataset contains a MSA of 27 sequences for proteins NS3 and NS5B, concatenated, and the corresponding phylogenetic tree (for more details, see [Champeimont et al., 2016]).
Visualization on the MSA is particularly useful for the analysis of a protein complex, because it allows the user to easily pinpoint the clusters that span the two proteins by looking at their position. The position of a residue in the MSA is displayed when moving the mouse over it.
The Map to sequence page can be used to find cluster positions not only on sequences from the MSA, but also on a sequence provided by the user. Such a sequence need not be a concatenation of two sequences homologous to the proteins of the complex: it can be either one of the two, or just a part of it.
The two-panel feature on the Map to structure page allows simultaneous visualization of two protein structures. By loading the pdb files for the two-protein samples, the pdb structures and all BIS2 clusters are automatically shown. On a user-provided dataset, the second panel is optionally shown as long as the first one is enabled.
The Map on sequence page and the Map on structure page allow the user to select the significant co-evolution clusters. They are shown by decreasing p-value for each dimension d. Instructions for their selection and visualization on the structure are reported below (the same procedure can be used for visualization on a sequence).
The horizontal bar at the top right allows the user to select the p-value cut-off. It can be set either by scrolling or by selecting a value from the drop-down menu. All the clusters with a p-value lower than the selected cut-off are labelled in orange in the left panel. By default, the cut-off is set to 1, hence each cluster is labelled in orange.
Clusters satisfying the p-value cut-off are displayed in the structure by clicking on the "Show" button next to the p-value bar. In the example on the left, the two clusters with p-value ≤ 10e-7 are displayed.
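The selection performed by the cut-off bar amounts to a simple filter, which can be reproduced on downloaded results with a sketch like the one below. The cluster names and p-values are invented for illustration, the function name is hypothetical, and the comparison is taken as inclusive (p-value ≤ cut-off) so that the default cut-off of 1 keeps every cluster.

```python
def select_clusters(clusters, cutoff=1.0):
    """Return the names of the clusters whose p-value is <= cutoff.

    `clusters` maps a cluster name to its p-value; input order is kept.
    """
    return [name for name, p_value in clusters.items() if p_value <= cutoff]


# Hypothetical clusters and p-values, for illustration only.
clusters = {"c1": 3e-9, "c2": 0.02, "c3": 5e-8, "c4": 1.0}
print(select_clusters(clusters, cutoff=1e-7))  # ['c1', 'c3']
print(select_clusters(clusters))               # all four clusters
```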
Q1. At submission, the page displays "Some error has been detected in the following sequences: ..."
A1. The multiple sequence alignment must be provided in FASTA format. The allowed characters in the sequences are: "ABCDEFGHJKILMNOPQRSTUVWXYZ-.", where "." and "-" are interpreted as gaps. There is no restriction on the length and composition of sequence headers.
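To locate the offending sequences before submission, the alignment can be checked locally against the allowed character set. The sketch below is a minimal check under stated assumptions: the function names are hypothetical, only uppercase letters are accepted (whether the server also accepts lowercase is not stated here), and the server may apply further checks that are not reproduced.

```python
ALLOWED = set("ABCDEFGHJKILMNOPQRSTUVWXYZ-.")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_records(records):
    """Map each offending header to the sorted list of disallowed characters."""
    problems = {}
    for header, seq in records:
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems[header] = bad
    return problems


# Toy input (assumption): the second record contains '*' and '1'.
sample = ">s1\nMK-T.\n>s2\nMK*T1\n"
print(invalid_records(parse_fasta(sample)))  # {'s2': ['*', '1']}
```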
Q2. In the "Map on sequence" page, no cluster is displayed.
A2. There are three possible reasons why the clusters are not displayed on the sequence. The first is that BIS2 did not find any co-evolution clusters on the MSA provided. To check for this, go to the "Results page" and click on the link "view MSA" for a given D. The panel shows the MSA, the conservation histograms, and the list of BIS2 clusters. The second reason is that the sequence similarity between the MSA consensus sequence and the provided sequence is too low, hence it was not possible to align the two. The third reason is that the positions of the coevolving residues map outside the provided sequence.
Q3. In the "Map on structure" page, no cluster is displayed.
A3. See the answer A2. In this case, the sequence of each chain of the pdb is mapped separately on the MSA consensus.
Q4. It is not possible to visualize some clusters on the "Map on sequence"/"Map on structure" page, or some of them contain only one residue.
A4. The reason is that some coevolving residues map outside the provided sequence. If no residue is mapped for a given cluster, the corresponding "show" button is disabled.
Q5. In the results page, the HTML table displays the amino acid occurrences only for some clusters.
A5. For a given dimension d, the pattern of the clusters containing only conserved positions, up to d exceptions, is not displayed. Such clusters are not significant (p-value close to 1).
Q6. Clusters with different co-evolution patterns display the same p-value. How is it possible?
A6. The p-value is computed on the maximal perfect co-evolution pattern common to all positions of a cluster. Consider for example the third cluster on this table: a change on position 48 corresponds to a change on position 45. In this case, we talk about a perfect pattern. Otherwise, the pattern is not perfect and we identify a maximal subset of sequences in the MSA that display a perfect co-evolution pattern on the positions. The p-value is computed on this subset. For this reason, clusters displaying different patterns but having the same maximal perfect sub-pattern show the same p-value.