Development of new methods for domain annotation

We have developed CLADE, a new approach to remote homology detection [Bernardes 2016]. Our strategy relies on learning and inductive logic programming [Bernardes 2011] and on the exploration of homology signals within species along the tree of life: (1) probabilistic local models are constructed from a large and diverse panel of homologous sequences, (2) a decision-making protocol combines the models' multiple outcomes, (3) a multi-criteria optimization algorithm finds the most likely protein architecture [Bernardes 2015, DAMA].
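As a rough illustration of step (3), the sketch below picks a set of non-overlapping domain hits with maximal total score. It is not the DAMA algorithm: it swaps in classic weighted interval scheduling with a single criterion, and the `Hit` structure, the coordinates and the scores are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Hit(NamedTuple):
    """A candidate domain hit on a protein (hypothetical structure)."""
    domain: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float


def best_architecture(hits):
    """Return the non-overlapping subset of hits with maximal total score."""
    hits = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in hits]
    # best[i] = (total score, chosen hits) using only the first i sorted hits
    best = [(0.0, [])]
    for i, h in enumerate(hits):
        # j = number of earlier hits that end strictly before this one starts
        j = bisect_right(ends, h.start - 1, 0, i)
        take = (best[j][0] + h.score, best[j][1] + [h])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


hits = [
    Hit("A", 1, 50, 10.0),
    Hit("B", 40, 100, 12.0),
    Hit("C", 60, 120, 9.0),
    Hit("D", 101, 150, 5.0),
]
architecture = best_architecture(hits)
```

A real multi-criteria method would weigh several signals at once (for example domain co-occurrence and permitted overlaps) rather than a single additive score.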

Based on this new strategy, we can construct highly probable domain architectures, which can be used to reannotate, with high accuracy, the Plasmodium falciparum genome, known to be very difficult to annotate. We successfully predicted domains for 67% of P. falciparum proteins against 58% achieved previously, corresponding to a 22% improvement over Pfam domain predictions, with a 0.23% False Discovery Rate over new predictions. P. falciparum genome annotations are available at http://genome.lcqb.upmc.fr/plasmobase/ [Bernardes 2017]. The method, applicable to any genome, opens new avenues of investigation to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age.


Novel methods for domain annotation of metagenomic sequences

Learning about the functional activity of environmental microbial communities is a crucial step towards understanding microbial interactions and large-scale environmental impact. MetaCLADE [Ugarte 2018] has been explicitly designed for metagenomic and metatranscriptomic data, and allows for the discovery of patterns in divergent sequences thanks to its multi-source strategy. MetaCLADE substantially improves on current domain annotation methods and reaches a fine degree of accuracy in the annotation of very different environments such as soil and marine ecosystems [Amato 2017], ancient metagenomes, and human tissues.

We are actively involved in MetaSUB, an international consortium composed of experts across many fields, including genomics, data analysis, engineering, public health, and architecture [MetaSUB International Consortium, 2016]. The ultimate goal of the MetaSUB Consortium is to improve city utilization and planning through the detection, measurement, and design of metagenomics within urban environments. To learn more, see here.


Periodicity in genomes and the 3D structure of chromosomes in unicellular organisms

Spectral analysis has been successfully applied to genome sequences for finding periodicity of DNA motifs along chromosomes. New challenges and perspectives for this methodology arise from the mapping of high throughput data along genomes. Tools for local spectral analysis are needed to study whole genomes as well as small chromosomal chunks.
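The idea of scanning for periodicity can be sketched as follows. This is a minimal, hypothetical example and not the group's tool: it assumes a feature (for instance a motif) is given as a list of chromosomal positions, and it scores each candidate period by the squared magnitude of the corresponding Fourier sum. The function names and the toy positions are illustrative only.

```python
import cmath
import math


def periodicity_power(positions, period):
    """Normalized squared magnitude of the Fourier sum at one period."""
    total = sum(cmath.exp(2j * math.pi * p / period) for p in positions)
    return abs(total) ** 2 / len(positions)


def scan_periods(positions, periods):
    """Return (period, power) pairs, strongest first."""
    scored = [(T, periodicity_power(positions, T)) for T in periods]
    return sorted(scored, key=lambda tp: tp[1], reverse=True)


# Toy data: a feature occurring every 20 positions.
positions = [20 * i for i in range(50)]
ranking = scan_periods(positions, range(12, 41))
```

Exact sub-multiples of the true period (here 10 or 5) would score equally high for point-like features, so the range of candidate periods has to be chosen with that in mind. A local analysis would apply the same scan inside sliding windows along the chromosome.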

We demonstrated a periodic distribution of genes with a highly biased codon composition in E. coli K12 [Mathelier 2010], suggesting an encoded 3D genomic organization that helps translation, and possibly transcription. This extends to functional classes of genes that systematically organize in two independent positional gene networks, one driven by metabolic genes and the other by genes involved in cellular processing and signaling. We also studied the 3D structure of yeast chromosomes during meiosis, which is a determinant of recombination events. So far, recombination regions have mainly been determined by experiments that are both expensive and time-consuming, which emphasizes the strong need for predictive tools. We produced a mathematical model, implemented in the program SPoRE, that describes a precise mapping of double-strand breaks and axis proteins along the genome during meiosis [Champeimont 2014]. We also discovered an intriguing 180-nt periodic pattern of sRNA distribution over DNA methylated sequences in Phaeodactylum tricornutum [Rogato 2014].


Development of high throughput sequencing methods

We are participating in worldwide collaborative efforts to develop tools for the primary analysis of high-throughput sequencing data. We first developed Fiona [Schulz 2014], a fully automated read error correction strategy for genome sequencing experiments. Fiona takes advantage of multicore architectures. It was specifically developed with indel-prone sequencing technologies in mind (Ion Torrent and Pacific Biosciences, for instance), and is in this respect superior to previous methods [with D. Weese, M. Holtgrewe, K. Reinert, FU Berlin, and M. H. Schulz, MPI Informatics, Saarbrücken]. Furthermore, we have developed a new statistical strategy for the split alignment of reads coming from high-throughput sequencing experiments [Shrestha 2017].

  • Schulz M.H.*, Weese D.*, Holtgrewe M.*..., Richard H. (2014) Bioinformatics 30:i356-63
  • Shrestha AMS, Asai K, Frith M, Richard H. (2017) Nucleic Acids Research. In press

microRNAs and their structural clusters: predictions and functional analysis

MicroRNAs (miRNAs) are endogenous small RNAs derived from a precursor (pre-miRNA) and involved in post-transcriptional regulation. Experimental identification of novel miRNAs is difficult because of their condition- and cell type-specific transcription. Several computational methods have been developed to detect new miRNAs starting from known ones or from deep sequencing data, and to validate their pre-miRNAs. We developed a genome-wide search algorithm, MIReNA, that looks for miRNA sequences by exploring a multidimensional space defined by only five (physical and combinatorial) parameters characterizing acceptable pre-miRNAs. MIReNA validates pre-miRNAs with high sensitivity and specificity, and detects new miRNAs by homology from known miRNAs or from deep sequencing data [Mathelier 2010].

miRNAs can group together along the human genome to form stable secondary structures made of several hairpins. A large-scale computational analysis of human chromosomes, combining sequence analysis and deep sequencing data, revealed the presence of >400 structural clusters of miRNAs in the human genome [Mathelier 2013]. A functional analysis of the positions of structural clusters along the chromosomes colocalizes them with genes involved in key cellular processes such as immune systems, sensory systems, signal transduction and development. Functional analysis of target genes strongly supports a regulatory role for most predicted miRNAs and, notably, a strong involvement of predicted miRNAs in the regulation of cancer pathways.

  • Rogato A*, Richard H*, Sarazin A, Voss B et al. (2014), BMC Genomics 15:698
  • Mathelier A. and Carbone A. (2013), Nucl. Acids. Res. 41: 4392-4408
  • Mathelier A. and Carbone A. (2010), Bioinformatics. 26: 2226-2234

Transcriptome analysis

We are developing statistical methods for the analysis of transcriptome sequencing data (RNA-Seq). We were among the first to demonstrate the superiority of RNA-Seq over preexisting hybridization-based methods for the detection of low-abundance transcripts [Sultan 2008]. Shortly after that, we proposed a method to detect and quantify alternative splicing events (ASEs) starting from an RNA-Seq experiment [Richard 2010]. To address the problem of inferring expression levels, we have developed a statistical approach, Parseq (joint work with P. Nicolas, MIG-INRA, Jouy-en-Josas) [Mirauta 2014]. Parseq starts from the RNA-Seq read counts and integrates new sources of variability arising along transcribed regions, improving its accuracy over preexisting methods that reconstruct transcript boundaries. We have also recently proposed PureCLIP, a hidden Markov model-based approach to capture protein-RNA interaction footprints from CLIP-seq data [Krakau 2017]. PureCLIP explicitly incorporates RNA abundances and, for the first time, non-specific sequence biases. On both simulated and real data, PureCLIP is more accurate in calling crosslink sites than other state-of-the-art methods and has a higher agreement across replicates. On the application side, we have carried out a comprehensive analysis of the small RNA fraction of the diatom P. tricornutum in various growth conditions, where we uncovered a previously unexpected diversity of regulatory mechanisms within this species.

  • Mirauta B, Nicolas P*, Richard H* (2014) Bioinformatics. 30:1409-16
  • Sultan M*, Schulz MH*, Richard H* et al. (2008) Science. 321: 956–960
  • Krakau S, Richard H*, Marsico A* (2017) Genome Biology. In press

Inferring quantitative models for describing protein fold space reshaping by alternative splicing events

Alternative splicing events (ASEs) greatly contribute to functional diversity in higher eukaryotes by generating multiple transcript isoforms from the same gene. Virtually all human protein-coding genes are subject to alternative splicing, whose deregulation leads to diseases such as cancer. Given their functional importance, it is essential to better characterize the impact of ASEs at the molecular level on whole protein folds. Building on our dual expertise in sequence analysis/transcriptomics and structure prediction/conformational dynamics, we have embarked on a large-scale study of the link between the occurrence of ASEs in a gene and their influence on the stability and conformational preferences of the corresponding proteins. We intend to (i) characterize alternative folds of alternatively spliced proteins and (ii) create a phylogenetic mapping of the energetic and conformational changes associated with ASEs.


Predicting protein interfaces using evolutionary and structural signals

We have developed Joint Evolutionary Trees (JET) for the prediction of protein interfaces [Engelen 2009]. JET relies on the assumption that interaction patches are composed of a central highly conserved core and multiple concentric layers of less conserved residues. JET was recently improved to specifically predict protein-protein interfaces and discriminate them from small-molecule binding pockets. The strategy is implemented in the fully automated pipeline JET2 [Laine 2015]. JET2 exploits both sequence and structure information and accounts for the geometry of the protein surfaces. Beyond its predictive power, it makes it possible to dissect protein interfaces and unravel their complexity. JET2 was applied to the non-redundant set (at 40% sequence identity) of all protein chains for which a 3D structure is available in the Protein Data Bank [Ripoche 2016]. The knowledge base is freely available at http://www.jet2viewer.upmc.fr/.
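The "conserved core plus concentric layers" assumption can be illustrated with a small sketch. This is not the JET or JET2 implementation: the conservation scores, the residue neighbour map, the cutoffs and the number of layers are all hypothetical, and the layer growth is a plain breadth-first expansion over surface neighbours.

```python
def grow_patch(conservation, neighbors, core_cutoff=0.8,
               layer_cutoff=0.5, n_layers=2):
    """Seed a patch with highly conserved residues, then add layers.

    conservation: dict residue -> score in [0, 1] (hypothetical values)
    neighbors:    dict residue -> iterable of spatially adjacent residues
    Returns (core, patch) as two sets of residues.
    """
    core = {r for r, s in conservation.items() if s >= core_cutoff}
    patch = set(core)
    frontier = set(core)
    for _ in range(n_layers):
        # next layer: unvisited neighbours that pass the looser cutoff
        frontier = {
            n
            for r in frontier
            for n in neighbors.get(r, ())
            if n not in patch and conservation.get(n, 0.0) >= layer_cutoff
        }
        if not frontier:
            break
        patch |= frontier
    return core, patch


conservation = {1: 0.9, 2: 0.6, 3: 0.55, 4: 0.2, 5: 0.7}
neighbors = {1: [2, 4], 2: [1, 3], 3: [2, 5], 4: [1], 5: [3]}
core, patch = grow_patch(conservation, neighbors)
```

In this toy case residue 5 is fairly conserved but lies beyond the two allowed layers, so it stays outside the patch.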


Development of combinatorial methods for detecting co-evolution signals

We have developed combinatorial approaches [Baussand 2009, MST, Dib 2012, BIS] to discover co-evolution signals between individual residues or blocks of residues in proteins. Contrary to previously proposed statistical approaches, our methods can be applied to sets of protein sequences of variable divergence, they require only a few sequences, and they are particularly suited for the analysis of highly conserved regions. Using BIS, we were able to reconstruct the protein-protein interaction network of the Hepatitis C Virus (HCV) at residue resolution [Champeimont 2016]. BIS is available through BIS2Analyzer, a webserver based on a very fast re-implementation of the method [Oteri 2017]. Based on the hypothesis that the interfaces of protein partners in a cell should evolve together, we will further extend and refine these methodologies to specifically detect and analyze co-evolution signals between pairs of potentially interacting sites. A new analytical approach based on the conservation of the physico-chemical properties of the residues will also be developed and integrated into the analysis.
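To give a flavour of what a combinatorial (rather than statistical) co-variation test can look like, the sketch below checks whether two alignment columns split the sequences into exactly the same groups. It is a deliberately simplified stand-in, not the MST or BIS method, and the toy alignment is hypothetical.

```python
def column_partition(column):
    """Group sequence indices by the residue they carry in one column."""
    groups = {}
    for idx, residue in enumerate(column):
        groups.setdefault(residue, set()).add(idx)
    return {frozenset(g) for g in groups.values()}


def covary(alignment, i, j):
    """True if columns i and j both vary and split sequences identically."""
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    part_i = column_partition(col_i)
    # a fully conserved column (one group) carries no co-variation signal
    return len(part_i) > 1 and part_i == column_partition(col_j)


alignment = ["ACDA", "ACDA", "GCEA", "GCET"]
```

Here `covary(alignment, 0, 2)` holds because columns 0 and 2 change together (A with D, G with E), while column 1 is fully conserved and column 3 splits the sequences differently. A test of this kind needs no frequency estimates, which is why it remains usable with very few sequences.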


Protein-protein interactions prediction

Protein-protein interactions (PPIs) are at the heart of the molecular processes that constitute life. Their interfaces are also an increasingly important target for drug design. Given their functional importance, it is clearly vital to characterize PPIs in order to (i) determine which interactions are likely to be stable enough to have functional relevance and (ii) assess weak and possibly non-functional interactions. We are engaged in a collaborative effort to create a large scale mapping of PPIs with information at the molecular level (MAPPING project). We propose to integrate novel experimental data on protein binding with sequence- and structure-based bioinformatics methods to predict the conformation of interacting proteins, and also which proteins will interact and how strongly.

This research theme was first developed in the context of the "Help Cure Muscular Dystrophy" (HCMD) project run on World Community Grid (WCG). Phase 1 consisted of molecular cross-docking of 168 proteins from the Docking Benchmark 2.0 and ended in June 2007. By combining the docking results with sequence analysis of residue conservation, we were able to discriminate true partners from non-interactors with high accuracy [Lopes 2013], extending a preliminary study of a small set of proteins [Sacquin-Mora 2008]. Phase 2 of the project investigates PPIs for more than 2,200 human proteins whose structures are known, with particular focus on proteins involved in neuromuscular diseases. In fall 2013, the project finished running on WCG and we are currently analyzing the results. A description of the project with an update on the current status can be found here.

Since 2013, we have developed a number of metrics and algorithms to improve partner identification. We have developed INTBuilder, a fast and easy-to-use program to efficiently screen millions of docking conformations to detect protein interfaces [Dequeker 2017]. We have shown that the knowledge of the global social behaviour of a protein, or its "sociability", is more important than shape complementarity for partner identification [Laine 2017]. We have developed CIPS, a new statistical pair potential to evaluate docking poses and identify near-native conformations [Nadalin 2017]. We have proposed LISA, a new empirical scoring function that relies on a fine quantum-mechanics-based description of the geometry of the interface to estimate binding affinities [Raucci 2018].


Allosteric communication analysis, visualization and targeting

We have developed conformational dynamics approaches [Karami 2016, COMMA, Laine 2010, Laine 2012] to detect allosteric communication within proteins and use this information to guide drug discovery and the neutralization of deleterious mutations. These approaches provide a solid basis for the systematic identification of key residues that mediate the dynamic changes by which proteins fold, associate with a partner or ligand, or switch from inactive to active states. Such residues are also expected to display high degrees of conservation and/or co-evolution. We have developed an approach based on the new concept of "infostery", from "info" (information) and "steric" (arrangement of residues in space), to predict mutational outcomes, to identify highly deleterious hotspots in protein structures and to provide a physical interpretation of their sensitivity to mutations [Karami 2018].


Genetic mutation combinatorics

The effects of a disease-associated mutation generally depend on its location and on the substituting amino acid type. Recent work from our team has revealed that mutational hotspot residues in p53 and other proteins involved in genetic diseases display co-evolution patterns that are predictable by co-evolution and conservation analysis. We have also contributed to establishing correlations between the phenotypic outcome of disease-related mutations and the conformational dynamics of the corresponding proteins [Chauvot de Beauchêne 2014, Gardie 2014]. A large-scale map of mutational sites would be of major importance in genetics and we intend to reconstruct it. We plan to combine evolutionary information and structural dynamics to: (i) determine general trends between equivalent mutations shared by several proteins, (ii) discriminate between different sets of mutations that result in distinct phenotypes, (iii) establish a hierarchical classification of mutations based on their location and nature.

  • Couve S., Ladroue C., Laine E., Mahtouk K. et al. (2014) Cancer Research 74:6554-64
  • Chauvot de Beauchêne I., Allain A., Panel N., Laine E. et al. (2014) PLoS Comput. Biol. 10:e1003749