Input

The required inputs for COVTree are:

*To generate a codon alignment you can use one of the following servers: PAL2NAL, RevTrans or TranslatorX. These servers may use an extended nucleotide alphabet, so an additional step may be required to generate an alignment with standard nucleotides only (A, C, G, T/U).
**If you don't know the codon coordinates of your reference sequence, you can use the ORFfinder server.

We suggest using sequence identifiers shorter than 11 characters for best visualisation in the server. This is not a mandatory requirement.
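
Before submitting, it can help to check an alignment for extended-alphabet characters and long identifiers. The sketch below is not part of COVTree: the function name `check_alignment` and the dictionary-based input are assumptions made for illustration only.

```python
STANDARD = set("ACGTU-")  # standard nucleotides plus the gap character

def check_alignment(seqs, max_id_len=10):
    """Check a dict {identifier: aligned nucleotide sequence}.

    Returns (long_ids, nonstandard), where long_ids lists identifiers longer
    than max_id_len and nonstandard maps each identifier to the sorted list of
    characters outside A, C, G, T/U and '-'.
    """
    long_ids = [name for name in seqs if len(name) > max_id_len]
    nonstandard = {}
    for name, seq in seqs.items():
        extra = sorted(set(seq.upper()) - STANDARD)
        if extra:
            nonstandard[name] = extra
    return long_ids, nonstandard

# Hypothetical toy input: R and N are extended (IUPAC ambiguity) codes.
example = {
    "seq1": "ATGGCC---TAA",
    "a_very_long_identifier": "ATGRCCNNNTAA",
}
long_ids, nonstandard = check_alignment(example)
print(long_ids)
print(nonstandard)
```

Sequences flagged in `nonstandard` would need their ambiguous characters resolved before the alignment contains only standard nucleotides.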

Overlapping ORFs

The genomes of most viruses contain overlapping genes, in which two or more proteins are encoded by the same nucleotide sequence. ORFs may overlap in various ways, depending on the type of overlap, the direction of transcription and the phase of the ORFs (Fig 1).

Types of overlapping ORFs
Figure 1: Definitions on ORFs overlap.
An overlap between two ORFs can be complete (if the ORFs are nested) or partial (if only the 3' or 5' ends overlap). ORFs can overlap on the same strand or, in the case of a double-stranded genome, on the reverse complementary strand. In the case of partial overlap, three directions are possible: unidirectional, convergent (with an overlap of their 3' ends) and divergent (with an overlap of their 5' ends). In the case of complete overlap, two directions are possible: unidirectional and bidirectional. The reference ORF in a pair of overlapping ORFs is said to be in phase 0. Overlaps on the parallel strand can be in two phases, whereas antiparallel-strand overlaps can be in three phases.
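
The definitions above can be sketched in a few lines of code. This is an illustration only, not COVTree code: the `ORF` tuple, the coordinate convention (1-based, inclusive, on the forward strand) and the function name `classify_overlap` are assumptions. Phase is deliberately left out.

```python
from typing import NamedTuple

class ORF(NamedTuple):
    start: int   # leftmost coordinate on the forward strand (1-based, inclusive)
    end: int     # rightmost coordinate on the forward strand (inclusive)
    strand: str  # "+" or "-"

def classify_overlap(a: ORF, b: ORF):
    """Return (type, direction) for two ORFs, or None if they do not overlap."""
    if a.end < b.start or b.end < a.start:
        return None
    nested = (a.start <= b.start and b.end <= a.end) or \
             (b.start <= a.start and a.end <= b.end)
    same_strand = a.strand == b.strand
    if nested:
        return ("complete", "unidirectional" if same_strand else "bidirectional")
    if same_strand:
        return ("partial", "unidirectional")
    # Partial overlap on opposite strands: look at the ORF that starts first
    # on the forward strand. If it is on "+", its 3' end meets the 3' end of
    # the "-" ORF (convergent); otherwise the two 5' ends meet (divergent).
    left = a if a.start < b.start else b
    return ("partial", "convergent" if left.strand == "+" else "divergent")

print(classify_overlap(ORF(1, 100, "+"), ORF(90, 200, "-")))
```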

In the overlapping region, coevolution in an ORF may: 1. be mirrored by coevolution in the other ORF; 2. generate a non-synonymous substitution which in turn may be compensated by other mutations (inside or outside the overlapping region); 3. generate synonymous substitutions (Fig 2).
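
As a toy illustration of how one nucleotide change can have different effects in two overlapping reading frames, the sketch below translates a short sequence in two frames on the same strand and classifies a point mutation in each. The sequence, the helper names and the use of the standard genetic code are assumptions for illustration, not COVTree code.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, offset=0):
    """Translate seq from a 0-based offset, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3))

def mutation_effect(seq, pos, new_base, offset):
    """Classify a point mutation at 0-based pos for the frame starting at offset."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    same = translate(seq, offset) == translate(mutated, offset)
    return "synonymous" if same else "non-synonymous"

seq = "ATGGCTAAAGGT"  # hypothetical toy sequence
for frame in (0, 1):
    print(frame, translate(seq, frame), mutation_effect(seq, 5, "C", frame))
```

In this toy case the T-to-C change at index 5 leaves the frame-0 protein unchanged while altering the frame-1 protein, which corresponds to case 3 above seen from the frame-0 ORF.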

Coevolution patterns
Figure 2: Complexity of coevolution patterns in an overlapping region.
A mutation in the DNA/RNA sequence may cause two changes at the amino acid level, one in each of the two overlapping proteins (P1 and P2): A. P1 and P2 lie on the same strand; B. P1 and P2 lie on opposite strands and a frameshift is present; C. P1 and P2 lie on opposite strands and are in phase. D. Relative positioning and mutations of coevolving positions in the overlapping region of two proteins P1 and P2. A cluster of four coevolving positions in the alignment of P1 shows two sequences maintaining the wild-type residues (red circles) and three displaying mutations at all positions (orange circles). A mutation in P1 may be coupled in P2 with synonymous substitutions (position 1 in P1), the same non-synonymous substitution (position 2), two non-synonymous substitutions in adjacent positions (position 3) or a variety of non-synonymous substitutions (position 4). Clusters of coevolving positions may contain positions outside the overlapping region (see P2). E. Clusters of coevolving residues in P1 and P2 over an overlapping region. Note that some of the positions do not overlap.

Pipeline

The input is a multiple alignment of nucleotide sequences containing an overlapping region, together with the start and end positions of the two proteins. The server translates the alignment into two overlapping protein alignments and builds the two associated distance trees. The two alignments are then analyzed separately with the BIS2TreeAnalyzer method (Dib & Carbone 2012, Oteri et al 2017), which searches for coevolved positions in all subtrees of the distance trees. This iterative strategy allows BIS2 to be applied to a large number of conserved sequences. As part of the result, the clusters of coevolving positions detected for both proteins are provided. If coevolution is detected in the overlapping region for one of the proteins, the effect of the variation is analysed in the other protein. By analysing the subset of sequences in which the cluster is detected for the first protein, we identify whether the coevolving positions are accompanied by one or more synonymous/non-synonymous substitution(s) and whether these positions also show coevolution in the second protein.

Workflow
COVTree workflow.
A. BIS2TreeAnalyzer coevolution analysis: BIS2 is run on every subset of amino acid sequences corresponding to a subtree of at least 20 sequences of the initial phylogenetic tree. Coevolution matrices, one for each subtree analysis, are produced and clustered. A set of clusters of coevolving positions is given as output. B. The input of COVTree is a nucleotide alignment covering two ORFs of interest. Sequences are translated to generate the two protein multiple sequence alignments. Each protein alignment is used to build a phylogenetic tree. The protein alignment and the phylogenetic tree are analyzed with BIS2TreeAnalyzer (A). COVTree output includes the coevolution analysis of the two proteins, as well as the effect of mutations in one protein on the other, for coevolving positions lying in the overlapping region. Both proteins may show coevolution in “mirrored” positions (“mirrored” coevolution), or coevolution in one protein may be accompanied by synonymous/non-synonymous mutations in the other. To distinguish between these two situations, nucleotide information is provided.
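
The subtree enumeration in panel A can be pictured with the following sketch. It is not the BIS2TreeAnalyzer implementation: the nested-tuple tree representation and the function names are assumptions chosen to keep the example self-contained.

```python
def leaves(tree):
    """Return the list of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def subtrees_with_min_leaves(tree, min_leaves=20):
    """Yield the leaf list of every internal node with at least min_leaves leaves."""
    if isinstance(tree, str):
        return
    names = leaves(tree)
    if len(names) >= min_leaves:
        yield names
        # Children of a too-small node are also too small, so recursion
        # only continues below nodes that passed the threshold.
        for child in tree:
            yield from subtrees_with_min_leaves(child, min_leaves)

# Hypothetical five-leaf tree; a threshold of 3 is used instead of 20
# so that the toy example produces output.
toy_tree = ((("a", "b"), "c"), ("d", "e"))
for subset in subtrees_with_min_leaves(toy_tree, min_leaves=3):
    print(subset)
```

Each yielded subset stands for one set of sequences on which a separate coevolution analysis would be run.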

BIS2TreeAnalyzer

BIS2TreeAnalyzer (Teppa et al 2020) is designed to apply the coevolution analysis method BIS2, successfully used in the past on small sets of conserved sequences (Champeimont et al 2016; Douam et al 2018), to large sets of evolutionarily related sequences. It detects clusters of coevolving positions over large distance trees by analysing closely related subtrees of sequences with BIS2 (Dib & Carbone 2012; Oteri et al 2017) and combining the subtree results in a principled manner.

P-value

A Binomial test was computed to evaluate if the frequency of the observed pattern deviates significantly from that expected by chance following a binomial distribution:

P-value

where n is the total number of sequences in the subtree, k is the number of sequences with the mutational pattern, and p is the expected probability of observing that pattern by chance if the positions were independent (Teppa et al 2020). A subtree can contain as few as 20 sequences; however, the P-value is computed taking into account the total number of sequences in the alignment.
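
With the symbols n, k and p defined above, a binomial tail probability can be written as follows. This is a sketch that assumes a one-sided (upper-tail) test, which the text does not state explicitly.

```latex
P\text{-value} = P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i}\, p^{i}\, (1-p)^{n-i}
```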

Toy example of a P-value calculation

A binomial test has three parameters: the number of successes (x), the number of trials (n) and the hypothesized probability of success (Phyp).

Example of P-value calculation
For each subtree considered by BIS2TreeAnalyzer and for each pattern occurring in a cluster identified for that subtree, a binomial test is computed. The figure shows the P-value calculation for two amino acid patterns, "MSK" and "VLH". The number of trials is the number of sequences in the subtree. Note that the hypothesized probability of success is calculated considering all the sequences in the MSA. "Total n tests" is the total number of tests performed on the complete MSA, which is used for the Bonferroni correction; in the example, "total n tests" is 2.

For a complete tree/MSA, multiple P-values are calculated using the binomial test and then adjusted with the Bonferroni correction, i.e. each P-value is multiplied by the number of tests performed on the complete MSA. A cluster is retained if at least one of its patterns has a corrected P-value < 0.005. This cut-off guarantees that the family-wise error rate (i.e. the probability of making at least one Type I error) is at most 0.005.

Results

The results are presented on three pages. The first two show the coevolution clusters of each of the two proteins; the third is dedicated to the study of coevolution in the region of overlap between the two proteins. The main result is the set of coevolution clusters, shown in an interactive table. The user can filter the clusters by different criteria, for example to keep only clusters that contain a particular position. Clusters can also be filtered and sorted by P-value. Clicking on a cluster displays the subtree in which it was detected in a panel on the right.
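
The same kind of filtering and sorting can be reproduced offline on a list of clusters. The sketch below is only an illustration: the record layout and function names are assumptions, the first record reuses the positions, patterns and P-values of the example described for Figure 4, and the second record is hypothetical.

```python
clusters = [
    {"cluster": 1, "positions": [196, 234, 242, 253, 327],
     "patterns": {"SRSRQ": 3.81e-07, "PSIQK": 2.62e-15}},
    {"cluster": 2, "positions": [12, 45],           # hypothetical record
     "patterns": {"AG": 1.0e-04}},
]

def best_p(cluster):
    """Smallest P-value among the patterns of a cluster."""
    return min(cluster["patterns"].values())

def filter_by_position(clusters, position):
    """Keep only the clusters that contain the given position."""
    return [c for c in clusters if position in c["positions"]]

def sort_by_p_value(clusters):
    """Order clusters from the most to the least significant pattern."""
    return sorted(clusters, key=best_p)

print([c["cluster"] for c in filter_by_position(clusters, 242)])
print([c["cluster"] for c in sort_by_p_value(clusters)])
```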

Figure 4: Interactive table of coevolution clusters and subtree
Each row describes a BIS2TreeAnalyzer cluster. From left to right, the columns indicate the cluster identifier, the subtree identifier, the list of coevolving positions, the number of sequences in which the coevolution pattern was detected, the amino acid pattern found and the P-value corresponding to each pattern. For example, the first cluster consists of five positions (196, 234, 242, 253 and 327), found in 43 sequences (38 + 5). In 38 sequences the amino acids "SRSRQ" were found at these five positions, respectively, while in 5 sequences the amino acids "PSIQK" were found. The P-value for the pattern found in the 38 sequences is 3.81e-07, and the P-value for the pattern found in the 5 sequences is 2.62e-15. The subtree corresponding to the 43 sequences is shown to the right of the table. The subtree chart is updated dynamically when a cluster is selected.
Figure 5: Filtering clusters by position
It is possible to visualize only the clusters that contain a position of interest.
Figure 6: Distribution of occurrences of positions appearing in clusters
When a particular position is selected, it is highlighted in red and the cluster identifier where the position was found is indicated at the top of the bar.

Identification of coevolving positions in the overlapping region

COVTree cross-references the clusters of coevolving amino-acid positions found in the two protein alignments to evaluate coevolution signals in the overlapping region of the proteins. To analyse the effect of mutations at a coevolving position of Protein 1 (Protein 2) on coevolving positions in Protein 2 (Protein 1), COVTree maps the amino-acid sequence alignments back onto the nucleotide sequence alignment and displays the mutated codon of Protein 1 (Protein 2) together with its immediate nucleotide context, showing the mutational effect on one or two amino-acid positions in Protein 2 (Protein 1).

Figure 7: Impact of coevolution in the overlapping region
Each row describes a coevolved position in the overlapping region. This table corresponds to coevolving positions of Protein 2 and their effect on Protein 1. From left to right, the columns indicate the cluster identifier, the subtree identifier, the coevolving positions, the number of sequences in which the coevolution was detected, the amino acids found, the nucleotide change(s) responsible for the mutations at the coevolved sites, the amino acids of Protein 1 at the overlapping positions and the position number. In the "Nucleotides" column, the central codon corresponds to the amino acid of Protein 2, whereas the whole reported fragment encodes the overlapping residues of Protein 1.
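
The reason a codon of one protein affects one or two positions of the other can be checked with a small coordinate calculation. The sketch below is an illustration under stated assumptions, not COVTree code: both ORFs are taken to lie on the same strand, coordinates are 1-based and ungapped, and the function name and parameters are hypothetical.

```python
def overlapping_positions(pos2, start1, start2, length1):
    """Protein 1 positions (1-based) whose codons share nucleotides with
    codon pos2 of Protein 2.

    start1, start2: nucleotide coordinate of the first codon of each ORF.
    length1: number of codons in ORF 1.
    """
    first_nt = start2 + 3 * (pos2 - 1)   # first nucleotide of the Protein 2 codon
    last_nt = first_nt + 2               # last nucleotide of the same codon
    lo = (first_nt - start1) // 3 + 1    # Protein 1 codon containing first_nt
    hi = (last_nt - start1) // 3 + 1     # Protein 1 codon containing last_nt
    return [p for p in range(lo, hi + 1) if 1 <= p <= length1]

# ORF 2 shifted by one nucleotide: each codon spans two Protein 1 positions.
print(overlapping_positions(pos2=1, start1=1, start2=2, length1=50))
# ORF 2 in the same frame as ORF 1: each codon maps to one Protein 1 position.
print(overlapping_positions(pos2=1, start1=1, start2=4, length1=50))
```

Positions of Protein 2 whose codon lies entirely outside ORF 1 return an empty list, which corresponds to coevolving positions outside the overlapping region.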

Results availability

The data will be removed from the server storage space one month after the end of the job.