Functional annotation

Domain annotation in genomes and metagenomes

Multiple probabilistic models have demonstrated their power for discovering remote homologous sequences in the context of domain annotation [Bernardes et al. 2017, Ugarte et al 2018]. Combined with machine learning approaches (Support Vector Machines and Naïve Bayes are used to combine the annotations of the models and to estimate their classification parameters), these models proved highly accurate on full genomes and metagenomic/metatranscriptomic datasets, allowing for the discovery of new sequences enriching protein families [Fortunato et al 2016; Bernardes et al 2017; Amato et al 2017]. The same degree of accuracy was reached on very different environments such as soil and marine ecosystems, ancient metagenomes, and human tissues. Our multi-source annotation methods open new avenues of investigation to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age.

Two open-source tools, CLADE [Bernardes et al. 2017] (for genomes) and MetaCLADE [Ugarte et al 2018] (for metagenomics), are available for download. The webserver MyCLADE makes the MetaCLADE computational approach accessible at large [Vicedomini et al 2021]. PlasmoBase [Bernardes et al. 2017] is a platform that compares the protein domain architectures found in 11 Plasmodium fully sequenced genomes present in PlasmoDB, the reference repository for Plasmodium species.

We have been involved in MetaSUB, an international consortium comprised of experts across many fields, including genomics, data analysis, engineering, public health, and architecture [MetaSUB International Consortium, 2016]. Its ultimate goal is to improve city utilization and planning through the detection, measurement, and design of metagenomics within urban environments [Danko et al 2021; Wu et al 2022].


Representation spaces and functional classification

Sequence functional classification has become a fundamental bottleneck to the understanding of the ever-increasing genomic and metagenomic data. The large diversity of homologous sequences hides a variety of functional activities that cannot be anticipated. Their identification appears critical for the understanding of living organisms and for biotechnological applications. We designed ProfileView [Vicedomini et al. 2022], an innovative computational method aimed at functionally classifying homologous sequences. It relies on two main ideas: the use of multiple probabilistic models whose construction explores and extracts evolutionary information from the huge space of available sequences, and a new definition of a representation space in which sequences are viewed from the point of view of functional motifs. ProfileView is applicable at large scale to classify hundreds or thousands of proteins. It applies to protein families whose homologs might be very divergent and for which functions should be discovered or characterised more precisely. It proved to be a powerful approach to extract information on protein functional diversity, select sequences towards the design of accurate functional experiments and discover new biological functions. ProfileView was shown to distinguish functions on paralogous proteins [Le Moine et al 2022].

Deep learning opens new directions for scaling ProfileView to handle millions of sequences.


Fast functional profiling of metagenomic datasets

Understanding the ever-increasing number of metagenomic sequences accumulating in our databases demands approaches that rapidly "explore" the content of multiple and/or large metagenomic datasets with respect to specific domain targets, avoiding full domain annotation and full assembly. S3A [David et al. 2020] is a fast and accurate domain-targeted assembler designed for rapid functional profiling. S3A is a recommended choice when studying a few dozen functional domains, where it is faster than assemblers ignoring targeted domains by up to a factor of 10, thus enabling the analysis of a larger number of datasets in the same amount of time. S3A opens new avenues to the fast exploration of large numbers of metagenomic datasets of ever-increasing size.


Protein interactions

Predicting protein interfaces

The growing body of experimental and computational data describing how proteins interact with each other and with other molecules has emphasized the multiplicity of protein interactions and the complexity underlying protein surface usage and deformability. Over the years, we proposed new concepts and methods toward deciphering such complexity. We have developed Joint Evolutionary Trees (JET) for the prediction of protein interfaces [Engelen et al 2009]. JET relies on the assumption that interaction patches are composed of a central highly conserved core and multiple concentric layers of less conserved residues. Today, the strategy is implemented in the fully automated pipeline JET2 [Laine & Carbone 2015]. JET2 exploits both sequence and structure information and accounts for the geometry of the protein surfaces. JET2 was applied to the non-redundant set (at 40% sequence identity) of all protein chains for which a 3D structure is available in the Protein Data Bank [Ripoche et al 2016]. The knowledge base is freely available at JET2Viewer.
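The core-plus-layers assumption can be illustrated with a small sketch. This is not the JET or JET2 algorithm: the contact graph, the conservation scores, the cut-offs and the function name `grow_patch` are all hypothetical, chosen only to show how a patch might be grown outward from a conserved seed with progressively looser conservation thresholds.

```python
from collections import deque


def grow_patch(contacts, conservation, seed, thresholds):
    """Grow a surface patch from `seed` in successive layers (toy example).

    contacts:     dict mapping a residue to the set of neighbouring residues
    conservation: dict mapping a residue to a score in [0, 1]
    thresholds:   decreasing conservation cut-offs, one per layer
    Returns a dict mapping each selected residue to its layer (0 = core).
    """
    if conservation[seed] < thresholds[0]:
        return {}
    layer_of = {seed: 0}
    for layer, cutoff in enumerate(thresholds):
        # Expand from every residue already in the patch.
        queue = deque(layer_of)
        while queue:
            residue = queue.popleft()
            for neighbour in contacts.get(residue, ()):
                if neighbour not in layer_of and conservation[neighbour] >= cutoff:
                    layer_of[neighbour] = layer
                    queue.append(neighbour)
    return layer_of


# Hypothetical five-residue chain A-B-C-D-E with made-up conservation scores.
contacts = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
conservation = {"A": 0.9, "B": 0.85, "C": 0.6, "D": 0.3, "E": 0.7}
patch = grow_patch(contacts, conservation, seed="A", thresholds=[0.8, 0.5])
```

In this toy input, residue E is conserved enough for the outer layer but is never reached, because the poorly conserved residue D separates it from the patch.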

To account for the multiple usage of a protein's surface residues by several partners and for the variability of protein interfaces coming from molecular flexibility, we introduced the notion of interacting region. We crossed evolutionary, physicochemical and geometrical properties of the protein surface with information coming from complete cross‐docking (CC‐D) simulations to predict interacting regions [Dequeker et al 2019]. See dynJET2.

Protein-DNA and protein-RNA interfaces have recently been predicted with a novel version of JET2. A large portion of interacting residues are detected with good precision, even when they are 'hidden' by conformational changes. We uncover the alternative binding sites and relate their properties to their specific roles. See JET2DNA.

Our Deep Learning approach, DLA-Ranker, designed to identify near-native conformations from ensembles generated by molecular docking, showcases its usefulness to discover alternative interfaces [Mohseni et al 2022].

This work can help guide mutagenesis experiments and the development of new drugs specifically targeting one site while limiting possible side effects.


Soft disorder and protein interfaces

The importance of unstructured biology has grown quickly over the last decades, accompanying the explosion in the number of experimentally resolved structures. The idea that structural disorder might be a novel mechanism of protein interaction is widespread in the literature, although the number of statistically significant structural studies supporting this idea is surprisingly low. Through a large-scale analysis of all the crystallographic structures of the Protein Data Bank, averaged over clusters of homologous sequences, we show clear evidence that both the (experimentally verified) interaction interfaces (blue in the figure) and the soft disordered regions (made of residues that underwent a disorder-to-order transition and/or are flexible; orange in the figure) involve roughly the same amino acids of the protein [Seoane & Carbone 2021]. Moreover, disordered regions appear to carry information about the location of alternative interfaces when the protein lies within complexes, thus playing an important role in determining the order of assembly of protein complexes. On the right of the figure, see how soft disordered residues move in the structure and correspond to the next interfaces [Seoane & Carbone 2022]. The data to reproduce the analysis is here; the hierarchies of progressive interactions are here.

  • Seoane B, Carbone A. (2022) PLoS Computational Biology.
  • Seoane B, Carbone A. (2021) PLoS Computational Biology.

  • Inference of protein-protein interactions

    Protein-protein interactions (PPIs) are at the heart of the molecular processes that constitute life. Their interfaces are an increasingly important target for drug design. Given their functional importance, it is vital to characterize PPIs in order to (i) determine which interactions are likely to be stable enough to have functional relevance and (ii) assess weak and possibly non-functional interactions. Since 2006, we have been working to create a large-scale mapping of PPIs with information at the residue level. We integrated experimental data on protein binding with sequence- and structure-based computational methods to predict the conformation of interacting proteins, and also which proteins will interact and how strongly.

    This research theme was first developed in the context of the "Help Cure Muscular Dystrophy" (HCMD) project run on the World Community Grid (WCG) network. It was a community effort counting 300,000 participants around the world. Phase 1 consisted of the complete molecular cross-docking of 168 proteins from the Docking Benchmark 2.0 and ended in June 2007. By combining the docking results with sequence analysis of residue conservation we were able to discriminate true partners from non-interactors with reasonable accuracy [Lopes et al 2013], extending a preliminary study on a small set of proteins [Sacquin-Mora et al 2008]. Phase 2 of the project investigated PPIs for more than 2,200 human proteins whose structures are known, with particular focus on proteins involved in neuromuscular diseases. In the fall of 2013, the project finished running on WCG. Its data has been analysed in [Lopes et al 2013, Laine & Carbone 2017, Dequeker et al 2022]. A description of the project can be found here.

    Since 2013, we have developed a number of metrics and algorithms to improve partner identification. INTBuilder is a fast and easy-to-use program to efficiently screen millions of docking conformations to detect protein interfaces [Dequeker et al 2017]. CIPS is a new statistical pair potential to evaluate docking poses and identify near-native conformations [Nadalin et al 2017]. We have also shown that the knowledge of the global social behaviour of a protein, or its "sociability", is more important than shape complementarity for partner identification [Laine et al 2017]. DLA-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface, to identify near-native conformations from ensembles generated by molecular docking [Mohseni et al 2022]. LEVELNET is an interactive web interface designed for visualising, exploring and comparing PPI networks. It helps to break down the complexity of PPI networks and facilitates direct comparison toward biological interpretation [Mohseni et al 2023].

    We are currently designing a Deep Learning architecture that exploits protein language models to reconstruct PPIs with a high degree of accuracy, with the aim of moving towards interspecies interaction [Volzhenin et al 2023].


    Inference of co-evolution within and between proteins

    We have developed two combinatorial approaches, MST [Baussand et al 2009] and BIS [Dib et al 2012], to discover co-evolution signals between individual residues or blocks of residues in proteins. Contrary to previously proposed statistical approaches, our methods can be applied to treat sets of protein sequences of variable divergence, they require only a few sequences and they are particularly suited for the analysis of highly conserved regions. Using BIS, we were able to reconstruct the protein-protein interaction network of the Hepatitis C Virus (HCV) at the residue resolution [Champeimont et al 2016]. BIS is available through BIS2Analyzer, a webserver based on a very fast re-implementation of the method [Oteri et al 2017]. BIS2Analyzer provides the possibility to analyse conservation of the physical-chemical properties of the residues.
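To make the notion of a co-evolution signal between alignment positions concrete, here is a minimal sketch. It is not the MST or BIS method: it only flags pairs of alignment columns that split the sequences into exactly the same groups, a deliberately strict toy criterion. The function names and the example alignment are hypothetical.

```python
from itertools import combinations


def column_pattern(column):
    """Canonical partition of the sequences induced by one alignment column.

    Residues are relabelled in order of first appearance, so the columns
    "AAGG" and "DDHH" both map to (0, 0, 1, 1).
    """
    labels = {}
    return tuple(labels.setdefault(residue, len(labels)) for residue in column)


def covarying_pairs(alignment):
    """Return pairs of variable columns that induce the same partition."""
    columns = list(zip(*alignment))
    groups = {}
    for index, column in enumerate(columns):
        if len(set(column)) > 1:  # fully conserved columns carry no signal here
            groups.setdefault(column_pattern(column), []).append(index)
    return [pair for members in groups.values()
            for pair in combinations(members, 2)]


# Hypothetical four-sequence alignment: columns 0 and 2 change together,
# column 1 is conserved, column 3 varies independently.
alignment = ["ACDE", "ACDF", "GCHE", "GCHF"]
pairs = covarying_pairs(alignment)
```

A criterion of this kind needs only a handful of sequences to produce a signal, which is one reason combinatorial formulations can suit small or highly conserved families; the published methods handle blocks of residues and imperfect patterns, which this sketch does not.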

    Motivated by the idea that highly related sequences, which stem from a close common ancestor, may contain evolutionary information that is lost in a global multiple sequence alignment, where local signals might be highly diluted, we developed iBIS2Analyzer [Oteri et al 2022], a webserver dedicated to a phylogeny-driven coevolution analysis of protein families characterized by different evolutionary pressure. iBIS2Analyzer is designed for coevolution analysis of large sets of sequences (possibly thousands) organised in a distance tree, where each subtree is iteratively accessed and studied.

    The method has been used in combination with cellular models and humanized mice to describe a unique model of HCV fusion among viruses [Douam et al 2018]. Allosteric changes of the capsid proteins could be predicted by coevolution, which made it possible to propose new hypotheses on the fusion process. In Hepatitis B Virus (HBV), by combining the method with experiments, we unraveled the main determinants of the HBV membrane fusion process. The membrane fusion mechanism could be triggered by ERp57, allowing a thiol/disulfide exchange reaction to occur and regulate isomerization of a critical cross-strand disulfide bond in the HBV S glycoprotein, which ultimately leads to the exposure of the fusion peptide [Vargas et al 2021].

    To understand the mechanisms of cross-resistance to drugs and to help design effective therapeutic strategies based on several drugs, we demonstrated that sequence covariation identifies drug-resistance mutations in viral sequences [Teppa et al 2020a]. These mutations might be located far apart in the structure and are nevertheless identified as covarying, matching known observations.

    The webserver COVTree helps to analyse coevolution in overlapping viral genes [Teppa et al 2022b].


    Inference of protein binding affinity and affinity changes upon mutations

    We have proposed Local Interaction Signal Analysis (LISA), an empirical function designed to estimate protein-protein binding affinities. Its comprehensive model of protein interactions describes strength, favorable/unfavorable character, and geometric distribution of interatomic contacts. It also accounts for the contributions of the non-interacting surface and of secondary structures. It enables the identification of "hot-sites" at the interface. LISA applies to a large variety of complexes, with very stable behavior. In LISA, we simultaneously explore a wide range of Non Covalent Interaction types (van der Waals interactions, hydrogen bonds, dipole-dipole interactions, steric repulsions, and London dispersion) as isosurfaces. Such surfaces allow us to distinguish favorable from non-favorable contacts, and to take into account only specific regions in space that contribute to the PPI [Raucci et al 2018]. See also [Laplaza et al 2020, Boto et al 2020].

    Deep Local Analysis (DLA) is a novel and efficient deep learning framework that relies on a strikingly simple deconstruction of protein interfaces into small locally oriented residue-centered cubes and on 3D convolutions recognizing patterns within cubes. Merely based on the two cubes associated with the wild-type and the mutant residues, DLA-Mutation accurately estimates the binding affinity change for the associated complexes [Mohseni et al 2023].

  • Mohseni Behbahani Y, Laine E, Carbone A. (2023) Bioinformatics.
  • Laplaza R, et al. (2020) WIREs Computational Molecular Science.
  • Boto R, et al. (2020) Journal of Chemical Theory and Computation.
  • Raucci R, Laine E, Carbone A. (2018) Structure 26:905-915

  • Phenotypes and genetic mutations

    Reconstruction of mutational landscapes

    The effects of a disease-associated mutation generally depend on its location and on the substituting amino acid type. The systematic and accurate description of protein mutational landscapes is a question of utmost importance in biology, bioengineering and medicine. Recent progress has been achieved by leveraging the increasing wealth of genomic data and by modelling inter-site dependencies within biological sequences. However, state-of-the-art methods remain time consuming. We developed GEMME, an original and fast method that predicts mutational outcomes by explicitly modelling the evolutionary history of natural sequences. This allows accounting for all positions in a sequence when estimating the effect of a given mutation. GEMME uses only a few biologically meaningful and interpretable parameters. Assessed against 50 high- and low-throughput mutational experiments, it performs overall similarly to or better than existing methods. It accurately predicts the mutational landscapes of a wide range of protein families, including viral ones and, more generally, of very conserved families. Given an input alignment, it generates the full mutational landscape of a protein in a matter of minutes. It is freely available as a package and a webserver.

    We designed a novel approach, ESGEMME, integrating structural information in the GEMME model. This approach outperforms the new generation of deep learning methods designed to predict the effects of mutations and remains biologically interpretable. It uses very few parameters, all biologically significant. ESGEMME has been run at large scale on more than 3000 human proteins.


    Allosteric communication analysis, visualization and targeting

    We have developed a conformational dynamics approach [Karami 2016], COMMA, to detect allosteric communication within proteins and use this information to guide drug discovery and deleterious mutation neutralization. COMMA provides a solid basis for the systematic identification of key residues that mediate the dynamic changes by which proteins fold, associate with partner/ligand or switch from inactive to active states. Such residues are also expected to display high degrees of conservation and/or coevolution. We have developed a second approach, COMMA2, based on the new concept of "infostery" (from "info", information, and "steric", arrangement of residues in space), to predict mutational outcomes, to identify highly deleterious hotspots in protein structures and to provide a physical interpretation of their sensitivity to mutations [Karami 2018].

    Based on COMMA2, we proposed a computational framework to quantify the extent of disorder within a coiled-coil in solution and to help design substitutions modulating such disorder. We applied it to the phosphoprotein multimerisation domains (PMD) of Measles virus (MeV) and Nipah virus (NiV), both forming tetrameric left-handed coiled-coils.


    Genome organisation

    3D chromosomal structures in prokaryotes and eukaryotes

    We demonstrated a periodic distribution of genes with a highly biased codon composition in E. coli K12 [Mathelier et al 2010], suggesting an encoded 3D genomic organization helping translation, and possibly transcription. This extends to functional classes of genes that systematically organize in two independent positional gene networks, one driven by metabolic genes and the other by genes involved in cellular processing and signaling. We also studied the 3D structure of yeast chromosomes during meiosis, determinant for recombination events. So far, the recombination regions have been mainly determined by experiments, both expensive and time-consuming, emphasizing the strong need for predictive tools. We could produce a mathematical model, implemented in the form of the program SPoRE, that describes a precise mapping of double strand breaks and axis proteins along the genome during meiosis [Champeimont et al 2014]. We discovered an intriguing 180nt periodic pattern of sRNA distribution over DNA methylated sequences in Phaeodactylum tricornutum [Rogato et al 2014].

    In multicellular organisms, genome expression patterns are tightly modulated in response to developmental and external signals to define cellular specialization and adaptation. Recent studies have unveiled that light signaling pathways in plants converge onto chromatin regulatory mechanisms, a programmable platform that determines DNA accessibility and expression. In this context, light has a dramatic influence on higher order chromatin organization, from single genes to nuclear architecture. We develop data (Hi-C, repeat elements, epigenomic markers,...) analysis approaches to dissect spatio-temporal sequence of events and to decipher the underlying functional determinants in Arabidopsis thaliana [Teano et al 2023].

    Finally, we compared genome 3D organisations in representative eukaryotic species to explore the links between chromosomal sub-compartments and chromatin marks, genome replication timing, and genomic repeats in six model organisms, including vertebrates, plants and insects. We report that the 3D organisation of chromatin in organisms with different genome content and size can be described as layers characterised by distinct chromatin marks and activities. We propose a ”layer cake” model for the genome 3D organisation as a more refined view than the prevalent ”two compartments” model of chromatin organisation in multi-cellular organisms [Carron et al 2023].


    Reconstructing genome evolutionary history

    Chromosome rearrangements are a hallmark of genome evolution and essential for understanding the mechanisms of speciation and adaptation. Determining chromosome rearrangements over evolutionary time scales has been a difficult problem, primarily because of the lack of high-quality, chromosome-scale genome assemblies that are necessary for reliable reconstruction of ancestral genomes. In particular, for sequence-based genome-wide comparisons that require resolving large numbers of rearrangements of varying scale, determining ancestral chromosomal states is challenging both methodologically and computationally because of the complexity of genomic events that have led to extant genome organizations, including duplications, deletions, and reuse of evolutionary breakpoint regions flanking regions of homologous synteny. CHROnicle is a package dedicated to the reconstruction of the complete evolutionary history of genomes. It is based on the analysis of the marks accumulated over evolutionary time and left in the genomes by chromosomal rearrangements. CHROnicle is composed of four independent programs: SynChro, PhyChro, ReChro and AnChro.

    SynChro. Reconstruction and visualization of Synteny blocks along Chromosomes.
    PhyChro. Phylogenetic reconstruction based on Chromosomal rearrangement signal.
    ReChro. Reconstruction of the Rearrangements along and between Chromosomes.
    AnChro. Reconstruction of the Ancestral Chromosome gene order.
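As an illustration of what a synteny block is, the sketch below scans two gene orders for runs of genes that stay adjacent, in the same or in reversed order, in both genomes. It is not SynChro's procedure, which is not described here: it assumes a single chromosome per genome and unique one-to-one orthologs, and the function name, the `min_len` parameter and the gene identifiers are hypothetical.

```python
def synteny_blocks(order_a, order_b, min_len=2):
    """Find runs of genes that are collinear in both gene orders (toy example).

    order_a, order_b: lists of ortholog identifiers along one chromosome each,
                      each identifier occurring at most once per genome.
    min_len:          minimum number of genes for a run to count as a block.
    Returns a list of blocks, each a list of genes in the order of genome A.
    """
    genes_a = set(order_a)
    # Positions in B are counted over shared genes only, as is done for A.
    shared_b = [gene for gene in order_b if gene in genes_a]
    pos_b = {gene: index for index, gene in enumerate(shared_b)}
    shared_a = [gene for gene in order_a if gene in pos_b]

    blocks, current, step = [], [], 0
    for gene in shared_a:
        if current:
            delta = pos_b[gene] - pos_b[current[-1]]
            # Extend the run if the gene is adjacent in B and the direction
            # (+1 forward, -1 inverted) matches the run so far.
            if delta in (1, -1) and step in (0, delta):
                step = delta
                current.append(gene)
                continue
            if len(current) >= min_len:
                blocks.append(current)
        current, step = [gene], 0
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Hypothetical gene orders: g1-g3 are inverted in B, g4-g5 and g6-g7 are swapped.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g3", "g2", "g1", "g6", "g7", "g4", "g5"]
blocks = synteny_blocks(genome_a, genome_b)
```

The boundaries between consecutive blocks are the kind of rearrangement mark that can then feed phylogenetic or ancestral reconstruction; real genomes additionally require handling duplications, deletions and multiple chromosomes.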


    Copy Number Variations in ancient human genomes

    Genomic DNA copy number variations have been studied for more than 30 years. However, it was long assumed that imbalances were few in number and had a limited impact on the total content of human genetic variation. Recent technologies have made it possible to identify thousands of heritable genomic copy number variants within modern populations, and these new data generated considerable interest in the functional significance of gene copy number variants (CNV). CNV have been demonstrated to influence gene transcriptional levels and to confer an adaptive advantage, and some have been associated with differential susceptibility to complex diseases. Hence, learning about the landscape of CNV present in genomes and their functional roles is both challenging and fundamental for the understanding of specific phenotypes in populations and across species. This holds especially true for understanding the adaptation of the human species to its environment, since in this case only a handful of genomes will ever be available. We are comparing 15 archaic genomes from Neandertals, Denisova and early Homo sapiens with the reference modern human genome to compile a complete functional catalogue of gene copy number variations. We are revealing information on those functional groups or individual genes that likely influenced the survival or the extinction of the human species, their adaptation to different environmental conditions, and their cultural development.

    • Vicedomini R, Condemi S, Longo L, Carbone A. (2020)

    Transcriptome analysis and algorithms for NGS data

    microRNAs and their structural clusters

    MicroRNAs (miRNAs) are endogenous small RNAs derived from a precursor (pre-miRNA) and involved in post-transcriptional regulation. Experimental identification of novel miRNAs is difficult because of their condition- and cell type-specific transcription. Several computational methods were developed to detect new miRNAs starting from known ones or from deep sequencing data, and to validate their pre-miRNAs. We developed a genome-wide search algorithm, MIReNA, that looks for miRNA sequences by exploring a multidimensional space defined by only five (physical and combinatorial) parameters characterizing acceptable pre-miRNAs. MIReNA validates pre-miRNAs with high sensitivity and specificity, and detects new miRNAs by homology from known miRNAs or from deep sequencing data [Mathelier et al 2010].

    miRNAs can group together along the human genome to form stable secondary structures made of several hairpins. A large-scale computational analysis of human chromosomes, crossing sequence analysis and deep sequencing data, revealed the presence of >400 structural clusters of miRNAs in the human genome [Mathelier et al 2013]. A functional analysis of the positions of structural clusters along the chromosomes colocalizes them with genes involved in key cellular processes such as immune systems, sensory systems, signal transduction and development. Functional analysis of target genes strongly supports a regulatory role for most predicted miRNAs and, notably, a strong involvement of predicted miRNAs in the regulation of cancer pathways.

    Also, we have tackled a comprehensive analysis of the small RNA fraction of the diatom P. tricornutum in various growth conditions, where we uncovered a previously unexpected diversity of regulatory mechanisms within this species [Rogato et al 2014].


    The evolution of protein isoforms

    Alternative splicing (AS) greatly contributes to functional diversity in higher eukaryotes by generating multiple transcript isoforms from the same gene. Virtually all human protein-coding genes are subject to AS, whose deregulation leads to diseases like cancer. Although the mechanisms of AS have been well described at the genomic level, very little is known about its functional impact at the protein level. Elodie and Hugues have been engaged in an interdisciplinary project (MASSIV project) at the crossroads of genomics/transcriptomics and structural bioinformatics addressing that question. They have been exploiting the massive amounts of data generated by high-throughput sequencing and structure determination to assess the structural impact of AS in evolution. Their working hypothesis is that evolutionary conservation and structural stability are valid proxies for function. They have developed a number of tools, namely ThorAxe and PhyloSofs, to infer plausible evolutionary scenarios explaining a set of transcripts observed in a set of species and predict the 3D structures of the corresponding isoforms. As a proof of concept, they applied their framework to the c-Jun N-terminal kinase family: they dated an ancient AS event, identified key residues likely responsible for its functional outcome (substrate selectivity), and identified a new isoform displaying a large deletion, which could serve as a therapeutic target. They created a phylogenetic mapping of the energetic and conformational changes associated with alternative splicing events (ASEs). This seminal work continues with the ERC Consolidator Grant (PROMISE: Proteome diversification in evolution) that Elodie obtained for 2023-2028. Hugues is an active collaborator in PROMISE.


    Handling of high throughput sequencing data

    Understanding the ever-increasing number of genomic and metagenomic sequences accumulating in our databases demands refined approaches devoted to the primary analysis of high-throughput sequencing data. Hugues developed Fiona [Schulz 2014], a fully automated read error correction strategy for genome sequencing experiments. Fiona takes advantage of multicore architectures and was specifically developed with indel-prone sequencing technologies in mind (Ion Torrent and Pacific Biosciences, for instance); in this respect it is superior to previous methods. Furthermore, he developed a new statistical strategy for the split alignment of reads coming from high-throughput sequencing experiments [Shrestha 2017].


    Statistical tools for transcriptome analysis

    Hugues developed statistical methods for the analysis of transcriptome sequencing data (RNA-Seq). He demonstrated, among other things, the superiority of RNA-Seq over preexisting hybridization-based methods for the detection of lowly abundant transcripts [Sultan 2008]. Shortly after that, he proposed a method to detect and quantify alternative splicing events (ASEs), starting from an RNA-Seq experiment [Richard 2010]. To address the problem of inferring expression levels, he has developed a statistical approach, Parseq [Mirauta 2014]. Parseq starts from the RNA-Seq read counts and integrates new sources of variability arising along transcribed regions, improving its accuracy over preexisting methods that reconstruct transcript boundaries. He has also proposed PureCLIP, a hidden Markov model-based approach to capture protein-RNA interaction footprints from CLIP-seq data [Krakau 2017]. PureCLIP explicitly incorporates RNA abundances and, for the first time, non-specific sequence biases. On both simulated and real data, PureCLIP is more accurate in calling crosslink sites than other state-of-the-art methods and has a higher agreement across replicates.