Functional annotation

Domain annotation in genomes and metagenomes

We introduced the use of multiple probabilistic models in the context of domain annotation [Bernardes et al. 2017, Ugarte et al. 2018] (see CLADE and MetaCLADE), and demonstrated their power in the discovery of remote homologous sequences. These models, combined with machine learning approaches (Support Vector Machines and Naïve Bayes are used to combine the annotations of the models and to estimate their classification parameters), proved highly accurate on full genomes and metagenomic/metatranscriptomic datasets, allowing for the discovery of new sequences enriching protein families [Fortunato et al. 2016; Bernardes et al. 2017; Amato et al. 2017] (see PlasmoBase). The same degree of accuracy was reached in very different environments such as soil and marine ecosystems, ancient metagenomes, and human tissues. Our multi-source annotation methods open new avenues of investigation to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age.

We are actively involved in MetaSUB, an international consortium of experts across many fields, including genomics, data analysis, engineering, public health, and architecture [MetaSUB International Consortium, 2016]. Its ultimate goal is to improve city utilization and planning through the detection, measurement, and design of metagenomics within urban environments.


Representation spaces and functional classification

Sequence functional classification has become a fundamental bottleneck to the understanding of the ever-increasing genomic and metagenomic data. The large diversity of homologous sequences hides a variety of functional activities that cannot be anticipated. Their identification appears critical for the understanding of living organisms and for biotechnological applications. We designed ProfileView, an innovative computational method aimed at functionally classifying homologous sequences. It relies on two main ideas: the use of multiple probabilistic models whose construction explores and extracts evolutionary information from the huge space of available sequences, and a new definition of a representation space in which sequences are viewed from the point of view of functional motifs. ProfileView is applicable at large scale to classify hundreds to thousands of proteins. It applies to protein families whose homologs might be very divergent and for which functions should be discovered or characterised more precisely. It proved to be a powerful approach to extract information on protein functional diversity, select sequences for the design of accurate functional experiments, and discover new biological functions.
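The idea of a representation space can be illustrated with a minimal sketch. It assumes, for illustration only, that each sequence is described by a vector of scores against a few probabilistic models and that sequences are compared with a cosine distance. The scores, the sequence names and the choice of distance are hypothetical and do not reproduce the actual ProfileView construction.

```python
from math import sqrt

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an all-zero vector carries no signal
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbour(query, vectors):
    """Return the name of the sequence closest to `query`, excluding itself."""
    others = [name for name in vectors if name != query]
    return min(others, key=lambda name: cosine_distance(vectors[query], vectors[name]))

# Hypothetical scores of four sequences against three profile models.
scores = {
    "seqA": [9.0, 1.0, 0.5],
    "seqB": [8.5, 1.2, 0.4],
    "seqC": [0.7, 7.9, 6.1],
    "seqD": [0.9, 8.3, 5.8],
}

for name in scores:
    print(name, "->", nearest_neighbour(name, scores))
```

In this toy space, sequences that respond to the same models end up close to each other, which is the property a functional classification would then exploit.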

Deep learning opens new directions for scaling ProfileView to handle millions of sequences.


Fast functional profiling of metagenomic datasets

Understanding the ever-increasing number of metagenomic sequences accumulating in our databases demands approaches that rapidly "explore" the content of multiple and/or large metagenomic datasets with respect to specific domain targets, avoiding full domain annotation and full assembly. S3A is a fast and accurate domain-targeted assembler designed for rapid functional profiling. S3A is a recommended choice when studying a few dozen functional domains, where it is up to 10 times faster than assemblers that ignore targeted domains, thus enabling the analysis of a larger number of datasets in the same amount of time. S3A opens new avenues for the fast exploration of large numbers of metagenomic datasets of ever-increasing size.

  • David L, Vicedomini R, Richard H, Carbone A. (2020) Bioinformatics. In revision.

Protein interactions

Predicting protein interfaces

The growing body of experimental and computational data describing how proteins interact with each other and with other molecules has emphasized the multiplicity of protein interactions and the complexity underlying protein surface usage and deformability. Over the years, we have proposed new concepts and methods toward deciphering such complexity. We have developed Joint Evolutionary Trees (JET) for the prediction of protein interfaces [Engelen 2009]. JET relies on the assumption that interaction patches are composed of a central highly conserved core and multiple concentric layers of less conserved residues. Today, the strategy is implemented in the fully automated pipeline JET2 [Laine 2015]. JET2 exploits both sequence and structure information and accounts for the geometry of the protein surfaces. JET2 was applied to the non-redundant set (at 40% sequence identity) of all protein chains for which a 3D structure is available in the Protein Data Bank [Ripoche 2016]. The knowledge base is freely available at http://www.jet2viewer.upmc.fr/.
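The core-and-layers assumption can be sketched as a simple patch-growing procedure. This is a toy illustration, not the JET or JET2 algorithm: the conservation scores, the residue neighbourhood graph and the thresholds are all hypothetical.

```python
def grow_patch(conservation, neighbours, thresholds):
    """Grow a patch from a conserved core through successively less conserved layers.

    conservation: residue id -> conservation score (hypothetical values in [0, 1])
    neighbours:   residue id -> iterable of spatially adjacent residue ids
    thresholds:   decreasing cut-offs; the first defines the core, the others the layers
    Returns a list of sets: the core followed by one set per layer.
    """
    core_cut, *layer_cuts = thresholds
    patch = {res for res, score in conservation.items() if score >= core_cut}
    layers = [set(patch)]
    for cut in layer_cuts:
        # Candidate residues touch the current patch and pass the relaxed cut-off.
        frontier = {
            n
            for res in patch
            for n in neighbours.get(res, ())
            if n not in patch and conservation[n] >= cut
        }
        layers.append(frontier)
        patch |= frontier
    return layers

# Hypothetical surface: five residues arranged in a chain 1-2-3-4-5.
conservation = {1: 0.95, 2: 0.7, 3: 0.5, 4: 0.1, 5: 0.6}
neighbours = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(grow_patch(conservation, neighbours, [0.9, 0.6, 0.4]))
```

Residue 5 is moderately conserved but is never reached, because the poorly conserved residue 4 separates it from the growing patch; contiguity with the core matters as much as the score itself.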

To account for the multiple usage of a protein's surface residues by several partners and for the variability of protein interfaces coming from molecular flexibility, we introduced the notion of interacting region. We combined evolutionary, physicochemical and geometrical properties of the protein surface with information coming from complete cross-docking (CC-D) simulations to predict interacting regions [Dequeker 2019]. See http://www.lcqb.upmc.fr/dynJET2/.

Protein-DNA and protein-RNA interfaces have recently been predicted with a novel version of JET2. A large portion of interacting residues are detected with good precision, even when they are 'hidden' by conformational changes. We uncover the alternative binding sites and relate their properties to their specific roles. See http://www.lcqb.upmc.fr/JET2DNA/.

This work can help guide mutagenesis experiments and the development of new drugs specifically targeting one site while limiting possible side effects.


Protein disorder and protein interfaces

The importance of unstructured biology has grown quickly over the last decades, accompanying the explosion in the number of experimentally resolved structures. The idea that structural disorder might be a novel mechanism of protein interaction is widespread in the literature, although the number of statistically significant structural studies supporting this idea is surprisingly low. Through a large-scale analysis of all the crystallographic structures of the Protein Data Bank, averaged over clusters of homologous sequences, we show clear evidence that the (experimentally verified) interaction interfaces and the disordered regions involve roughly the same amino acids of the protein. Moreover, disordered regions appear to carry information about the location of alternative interfaces when the protein lies within complexes, thus playing an important role in determining the order of assembly of protein complexes.

Seoane B, Carbone A. (2019)


Inference of protein-protein interactions

Protein-protein interactions (PPIs) are at the heart of the molecular processes that constitute life. Their interfaces are also an increasingly important target for drug design. Given their functional importance, it is clearly vital to characterize PPIs in order to (i) determine which interactions are likely to be stable enough to have functional relevance and (ii) assess weak and possibly non-functional interactions. We are engaged in a collaborative effort to create a large scale mapping of PPIs with information at the molecular level (MAPPING project). We propose to integrate novel experimental data on protein binding with sequence- and structure-based bioinformatics methods to predict the conformation of interacting proteins, and also which proteins will interact and how strongly.

This research theme was first developed in the context of the "Help Cure Muscular Dystrophy" (HCMD) project run on World Community Grid (WCG). Phase 1 consisted of the molecular cross-docking of 168 proteins from the Docking Benchmark 2.0 and ended in June 2007. By combining the docking results with sequence analysis of residue conservation, we were able to discriminate true partners from non-interactors with high accuracy [Lopes 2013], extending a preliminary study of a small set of proteins [Sacquin-Mora 2008]. Phase 2 of the project investigates PPIs for more than 2,200 human proteins whose structures are known, with particular focus on proteins involved in neuromuscular diseases. In the fall of 2013, the project finished running on WCG and we are currently analyzing the results. A description of the project with an update on the current status can be found here.

Since 2013, we have developed a number of metrics and algorithms to improve partner identification. We have developed INTBuilder, a fast and easy-to-use program to efficiently screen millions of docking conformations to detect protein interfaces [Dequeker 2017]. We have shown that the knowledge of the global social behaviour of a protein, or its "sociability", is more important than shape complementarity for partner identification [Laine 2017]. We have developed CIPS, a new statistical pair potential to evaluate docking poses and identify near-native conformations [Nadalin 2017]. We have proposed LISA, a new empirical scoring function that relies on a fine quantum-mechanics-based description of the geometry of the interface to estimate binding affinities [Raucci 2018].


Inference of co-evolution within and between proteins

We have developed combinatorial approaches [Baussand 2009, MST, Dib 2012, BIS] to discover co-evolution signals between individual residues or blocks of residues in proteins. Contrary to previously proposed statistical approaches, our methods can be applied to sets of protein sequences of variable divergence, require only a few sequences, and are particularly suited to the analysis of highly conserved regions. Using BIS, we were able to reconstruct the protein-protein interaction network of the Hepatitis C Virus (HCV) at residue resolution [Champeimont 2016]. BIS is available through BIS2Analyzer, a webserver based on a very fast re-implementation of the method [Oteri 2017]. The method has been used in combination with cellular models and humanized mice to describe a model of HCV fusion that is unique among viruses [Douam 2018]. Allosteric changes of the capsid proteins could be predicted by coevolution, which made it possible to propose new hypotheses on the fusion process. A new analytical approach based on the conservation of the physical-chemical properties of the residues will also be developed and integrated into the analysis.
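A combinatorial view of co-evolution can be illustrated on a tiny alignment. The sketch below is a deliberate simplification and not the BIS algorithm: it only groups alignment positions whose columns split the sequences in exactly the same way, and the alignment itself is made up.

```python
from collections import defaultdict

def variation_pattern(column):
    """Relabel the residues of a column by order of first appearance.

    Two columns get the same pattern when they partition the sequences identically,
    whatever the actual residues are (e.g. 'KKRR' and 'AAGG' both give (0, 0, 1, 1)).
    """
    labels = {}
    return tuple(labels.setdefault(res, len(labels)) for res in column)

def covarying_groups(alignment):
    """Return groups of 0-based positions sharing the same non-trivial pattern."""
    groups = defaultdict(list)
    for pos, column in enumerate(zip(*alignment)):
        pattern = variation_pattern(column)
        if len(set(pattern)) > 1:  # skip fully conserved columns
            groups[pattern].append(pos)
    return [positions for positions in groups.values() if len(positions) > 1]

# Hypothetical alignment of four short sequences.
alignment = [
    "MKALV",
    "MKALI",
    "MRGLV",
    "MRGLI",
]

print(covarying_groups(alignment))  # [[1, 2]]
```

Positions 1 and 2 change together across the four sequences, whereas position 4 varies with a different pattern and the remaining positions are conserved. Exact pattern matching of this kind needs no large sample to estimate frequencies, which gives an intuition for why combinatorial approaches can work with few sequences.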

We used BIS2 to study the mutational landscapes of viral proteins and demonstrated that sequence covariation identifies drug-resistant mutations in viral sequences. A new algorithmic strategy, BIS2TreeAnalyzer, has been designed to apply the co-evolution analysis method BIS2 to large sets of evolutionarily related sequences. These studies are fundamental for the understanding of the mechanisms of cross-resistance to drugs and the design of effective therapeutic strategies based on several drugs.


Phenotypes and genetic mutations

Reconstruction of mutational landscapes

The effects of a disease-associated mutation generally depend on its location and on the substituting amino acid type. The systematic and accurate description of protein mutational landscapes is a question of utmost importance in biology, bioengineering and medicine. Recent progress has been achieved by leveraging the increasing wealth of genomic data and by modelling inter-site dependencies within biological sequences. However, state-of-the-art methods remain time-consuming. We developed GEMME, an original and fast method that predicts mutational outcomes by explicitly modelling the evolutionary history of natural sequences. This allows accounting for all positions in a sequence when estimating the effect of a given mutation. GEMME uses only a few biologically meaningful and interpretable parameters. Assessed against 50 high- and low-throughput mutational experiments, it performs overall similarly to or better than existing methods. It accurately predicts the mutational landscapes of a wide range of protein families, including viral ones and, more generally, very conserved families. Given an input alignment, it generates the full mutational landscape of a protein in a matter of minutes. It is freely available as a package and a webserver.

We plan to combine evolutionary information and structural dynamics to: (i) determine general trends between equivalent mutations shared by several proteins, (ii) discriminate between different sets of mutations that result in distinct phenotypes, (iii) establish a hierarchical classification of mutations based on their location and nature.


Allosteric communication analysis, visualization and targeting

We have developed conformational dynamics approaches [Karami 2016, COMMA, Laine 2010, Laine 2012] to detect allosteric communication within proteins and use this information to guide drug discovery and deleterious mutation neutralization. These approaches provide a solid basis for the systematic identification of key residues that mediate the dynamic changes by which proteins fold, associate with a partner or ligand, or switch from inactive to active states. Such residues are also expected to display high degrees of conservation and/or coevolution. We have developed an approach based on the new concept of "infostery", from "info" (information) and "steric" (arrangement of residues in space), to predict mutational outcomes, to identify highly deleterious hotspots in protein structures and to provide a physical interpretation of their sensitivity to mutations [Karami 2018].


Genome organisation

3D chromosomal structures in prokaryotes and eukaryotes

We demonstrated a periodic distribution of genes with a highly biased codon composition in E. coli K12 [Mathelier 2010], suggesting an encoded 3D genomic organization that helps translation, and possibly transcription. This extends to functional classes of genes that systematically organize in two independent positional gene networks, one driven by metabolic genes and the other by genes involved in cellular processing and signaling. We also studied the 3D structure of yeast chromosomes during meiosis, which is a determinant of recombination events. So far, the recombination regions have been mainly determined by experiments that are both expensive and time-consuming, emphasizing the strong need for predictive tools. We produced a mathematical model, implemented in the program SPoRE, that describes a precise mapping of double strand breaks and axis proteins along the genome during meiosis [Champeimont 2014]. We discovered an intriguing 180-nt periodic pattern of sRNA distribution over DNA methylated sequences in Phaeodactylum tricornutum [Rogato 2014].
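A periodic arrangement of positions along a sequence can be detected with a simple spectral score. The sketch below is a generic illustration under stated assumptions, not the analysis used in the cited studies: the positions, the jitter and the list of candidate periods are invented, and the score is the classical Fourier power of a set of points.

```python
import cmath
from math import pi

def spectral_power(positions, period):
    """Fourier power of a set of positions at a given period.

    The value approaches len(positions) when all positions share the same phase
    modulo `period`, and stays small when phases are spread evenly.
    """
    total = sum(cmath.exp(2j * pi * x / period) for x in positions)
    return abs(total) ** 2 / len(positions)

def best_period(positions, candidates):
    """Return the candidate period with the highest spectral power."""
    return max(candidates, key=lambda p: spectral_power(positions, p))

# Hypothetical positions: one element every 100 units, with a small fixed jitter.
jitter = [0, 3, -2, 1]
positions = [100 * k + jitter[k % 4] for k in range(20)]
candidates = [60, 80, 100, 120, 150]

print(best_period(positions, candidates))
```

Sub-multiples of the true period (here 50) would also score highly, so a real analysis would have to account for harmonics and assess significance against a suitable null model.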

In multicellular organisms, genome expression patterns are tightly modulated in response to developmental and external signals to define cellular specialization and adaptation. Recent studies have unveiled that light signaling pathways in plants converge onto chromatin regulatory mechanisms, a programmable platform that determines DNA accessibility and expression. In this context, light has a dramatic influence on higher order chromatin organization, from single genes to nuclear architecture. We develop approaches for the analysis of data (Hi-C, repeat elements, epigenomic markers, ...) to dissect the spatio-temporal sequence of events and to decipher the underlying functional determinants in Arabidopsis thaliana.


Reconstructing genome evolutionary history

Chromosome rearrangements are a hallmark of genome evolution and essential for understanding the mechanisms of speciation and adaptation. Determining chromosome rearrangements over evolutionary time scales has been a difficult problem, primarily because of the lack of high-quality, chromosome-scale genome assemblies that are necessary for reliable reconstruction of ancestral genomes. In particular, for sequence-based genome-wide comparisons that require resolving large numbers of rearrangements of varying scale, determining ancestral chromosomal states is challenging both methodologically and computationally because of the complexity of genomic events that have led to extant genome organizations, including duplications, deletions, and reuse of evolutionary breakpoint regions flanking regions of homologous synteny. CHROnicle is a package dedicated to the reconstruction of the complete evolutionary history of genomes. It is based on the analysis of the marks accumulated over evolutionary time and left in the genomes by chromosomal rearrangements. CHROnicle is composed of four independent programs: SynChro, PhyChro, ReChro and AnChro.

SynChro. Reconstruction and visualization of Synteny blocks along Chromosomes.
PhyChro. Phylogenetic reconstruction based on Chromosomal rearrangement signal.
ReChro. Reconstruction of the Rearrangements along and between Chromosomes.
AnChro. Reconstruction of the Ancestral Chromosome gene order.
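The notion of a synteny block can be illustrated with a toy example. The sketch below is not the SynChro algorithm: it assumes that orthologous genes already share the same identifier in both genomes and simply reports runs of genes that are adjacent, in the same or in reverse order, in both gene orders. Gene names and the `min_size` parameter are hypothetical.

```python
def synteny_blocks(order_a, order_b, min_size=2):
    """Return runs of genes from `order_a` that are also consecutive in `order_b`.

    A run may follow `order_b` forwards or backwards (an inverted block),
    but it cannot change direction midway.
    """
    index_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current, direction = [], [], 0
    for gene in order_a:
        if gene in index_b and current:
            step = index_b[gene] - index_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)  # the run continues in the same direction
                direction = step
                continue
        # The run is broken: keep it if it is long enough, then start again.
        if len(current) >= min_size:
            blocks.append(current)
        current, direction = ([gene], 0) if gene in index_b else ([], 0)
    if len(current) >= min_size:
        blocks.append(current)
    return blocks

# Hypothetical gene orders: genes d, e, f are inverted in the second genome.
genome_a = ["a", "b", "c", "d", "e", "f", "g"]
genome_b = ["a", "b", "c", "x", "f", "e", "d", "g"]

print(synteny_blocks(genome_a, genome_b))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The breakpoint between the two blocks is the kind of mark left by a rearrangement, here an inversion, that phylogenetic and ancestral reconstructions can then build on.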


Copy Number Variations in ancient human genomes

Genomic DNA copy number variations have been studied for more than 30 years. However, it was long assumed that imbalances were few in number and had a limited impact on the total content of human genetic variation. Recent technologies have made it possible to identify thousands of heritable genomic copy number variants within modern populations, and these new data have generated considerable interest in the functional significance of gene copy number variants (CNVs). CNVs have been demonstrated to influence gene transcriptional levels and to confer an adaptive advantage, and some have been associated with differential susceptibility to complex diseases. Hence, learning about the landscape of CNVs present in genomes and their functional roles is both challenging and fundamental for the understanding of specific phenotypes in populations and across species. This is particularly true for understanding the adaptation of the human species to its environment, since in this case only a handful of genomes will ever be available. We are comparing 16 archaic genomes from Neandertals, Denisovans and early Homo sapiens with the reference modern human genome to compile a complete functional catalogue of gene copy number variations. We are revealing information on those functional groups or individual genes that likely influenced the survival or the extinction of the human species, their adaptation to different environmental conditions, and their cultural development.

  • Vicedomini R, Condemi S, Longo L, Carbone A. (2020)

Transcriptome analysis and algorithms for NGS data

The evolution of protein isoforms

Alternative splicing (AS) greatly contributes to functional diversity in higher eukaryotes by generating multiple transcript isoforms from the same gene. Virtually all human protein-coding genes are subject to AS, whose deregulation leads to diseases like cancer. Although the mechanisms of AS have been well described at the genomic level, very little is known about its functional impact at the protein level. We are engaged in an interdisciplinary project (MASSIV project, ANR-17-CE12-0009) at the crossroads of genomics/transcriptomics and structural bioinformatics addressing that question. We have been exploiting the massive amounts of data generated by high-throughput sequencing and structure determination to assess the structural impact of AS in evolution. Our working hypothesis is that evolutionary conservation and structural stability are valid proxies for function. We have developed two tools, ThorAxe and PhyloSofS, to infer plausible evolutionary scenarios explaining a set of transcripts observed in a set of species and to predict the 3D structures of the corresponding isoforms. As a proof of concept, we applied our framework to the c-Jun N-terminal kinase family. We could date an ancient AS event and identify key residues likely responsible for its functional outcome (substrate selectivity). We also identified a new isoform displaying a large deletion, which could serve as a therapeutic target. We are now scaling up to a few tens of genes for which several isoforms have been biochemically characterized. We intend to create a phylogenetic mapping of the energetic and conformational changes associated with alternative splicing events (ASEs).


microRNAs and their structural clusters

MicroRNAs (miRNAs) are small endogenous RNAs derived from a precursor (pre-miRNA) and involved in post-transcriptional regulation. Experimental identification of novel miRNAs is difficult because of their condition- and cell type-specific transcription. Several computational methods have been developed to detect new miRNAs starting from known ones or from deep sequencing data, and to validate their pre-miRNAs. We developed a genome-wide search algorithm, MIReNA, that looks for miRNA sequences by exploring a multidimensional space defined by only five (physical and combinatorial) parameters characterizing acceptable pre-miRNAs. MIReNA validates pre-miRNAs with high sensitivity and specificity, and detects new miRNAs by homology from known miRNAs or from deep sequencing data [Mathelier 2010].

miRNAs can group together along the human genome to form stable secondary structures made of several hairpins. A large-scale computational analysis of human chromosomes, combining sequence analysis and deep sequencing data, revealed the presence of more than 400 structural clusters of miRNAs in the human genome [Mathelier 2013]. A functional analysis of the positions of structural clusters along the chromosomes colocalizes them with genes involved in key cellular processes such as immune systems, sensory systems, signal transduction and development. Functional analysis of the target genes strongly supports a regulatory role for most predicted miRNAs and, notably, a strong involvement of predicted miRNAs in the regulation of cancer pathways.


Statistical tools for transcriptome analysis

We develop statistical methods for the analysis of transcriptome sequencing data (RNA-Seq). Among other results, we demonstrated the superiority of RNA-Seq over preexisting hybridization-based methods for the detection of lowly abundant transcripts [Sultan 2008]. Shortly after that, we proposed a method to detect and quantify alternative splicing events (ASEs), starting from an RNA-Seq experiment [Richard 2010]. To address the problem of inferring expression levels, we have developed a statistical approach, Parseq [Mirauta 2014]. Parseq starts from the RNA-Seq read counts and integrates new sources of variability arising along transcribed regions, improving its accuracy over preexisting methods that reconstruct transcript boundaries. We have also proposed PureCLIP, a hidden Markov model-based approach to capture protein-RNA interaction footprints from CLIP-seq data [Krakau 2017]. PureCLIP explicitly incorporates RNA abundances and, for the first time, non-specific sequence biases. On both simulated and real data, PureCLIP is more accurate in calling crosslink sites than other state-of-the-art methods and has a higher agreement across replicates. On the application side, we have carried out a comprehensive analysis of the small RNA fraction of the diatom P. tricornutum in various growth conditions, where we uncovered a previously unexpected diversity of regulatory mechanisms within this species.


Handling of high throughput sequencing data

Understanding the ever-increasing number of genomic and metagenomic sequences accumulating in our databases demands refined approaches devoted to the primary analysis of high-throughput sequencing data. We developed Fiona [Schulz 2014], a fully automated read error correction strategy for genome sequencing experiments. Fiona takes advantage of multicore architectures and was specifically developed with indel-prone sequencing technologies in mind (Ion Torrent and Pacific Biosciences, for instance); in this respect it is superior to previous methods. Furthermore, we have developed a new statistical strategy for the split alignment of reads coming from high-throughput sequencing experiments [Shrestha 2017].