
EXMOPRODE - Explainable Models for Protein Design

Duration: 24 months (2025)
Principal investigator(s):
Project type: Nationally funded research - PRIN
Funding body: MINISTERO (Ministero Università e Ricerca (MUR))
Project identification number: 2022TE5B7X
PoliTo role: Coordinator

Abstract

How to design novel protein sequences with desired biochemical characteristics using machine learning has become a focal point for both academic research and industry. Recent advances in artificial intelligence, in particular, have led to massive investments in this field, which promises to revolutionize tasks such as drug discovery, antibody design, and enzyme engineering. However, developing models that capture important biological constraints while remaining versatile and robust enough for design tasks is still an open problem. One promising approach, besides neural network-based architectures, is the use of Potts models, which originate from statistical physics and are trained on homologous protein sequences. These models constituted a major breakthrough in protein sequence modeling and are now widely used for protein structure prediction, the assessment of mutational effects, and the creation of novel protein sequences.
The aim of the project is to improve Potts models for protein sequences and to adapt them for high-potential applications. Potts models are often preferable to black-box neural network architectures because they combine simplicity with state-of-the-art performance on several tasks. Neural networks, on the other hand, can in principle model more complex patterns in the data and scale easily to very large datasets, albeit at the cost of interpretability in terms of the underlying biology. In this project, we will close the gap between the two approaches by i) combining Potts models with neural networks to develop models that are both expressive and interpretable and ii) applying the tools of sensitivity analysis and explainable AI to neural networks trained for protein design and other tasks, in order to understand what biological information these models capture from the data.
This connects to the topic of physics-informed machine learning, which integrates physical models and machine learning methods. We will leverage it in the project to develop novel methods for predictive modeling in the context of panning experiments for protein design, where better computational methods promise a large reduction in experimental costs and novel insights into the explored fitness landscapes. An additional important aspect underlying the models and methods in this project is the sequence alignment step during preprocessing, which heavily influences the results by shaping the training data and is often a source of bias and distortion. Together with the design of the models, we will therefore develop methods for training directly on unaligned data, with the goal of improving performance in application domains where alignment is difficult and error-prone.
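
For readers unfamiliar with Potts models, the following minimal sketch illustrates the standard form of the statistical energy such a model assigns to an aligned protein sequence: a sum of per-site fields and pairwise couplings, with lower energy corresponding to higher model probability. It is an illustration only and not the project's implementation. The alphabet ordering, the function names `encode` and `potts_energy`, and the randomly drawn parameters `h` and `J` are all assumptions made here; in practice the parameters would be inferred from a multiple sequence alignment of homologous proteins.

```python
import random

# Assumed alphabet: gap symbol plus the 20 standard amino acids.
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
Q = len(ALPHABET)


def encode(seq):
    """Map an aligned sequence (string) to a list of integer states."""
    return [ALPHABET.index(c) for c in seq]


def potts_energy(states, h, J):
    """Statistical energy E(a) = -sum_i h[i][a_i] - sum_{i<j} J[i][j][a_i][a_j].

    h[i][a] is the field for state a at site i; J[i][j][a][b] is the
    coupling between state a at site i and state b at site j (i < j).
    """
    length = len(states)
    field_term = sum(h[i][states[i]] for i in range(length))
    coupling_term = sum(
        J[i][j][states[i]][states[j]]
        for i in range(length)
        for j in range(i + 1, length)
    )
    return -(field_term + coupling_term)


if __name__ == "__main__":
    # Hypothetical random parameters, for illustration only.
    rng = random.Random(0)
    length = 5
    h = [[rng.gauss(0.0, 1.0) for _ in range(Q)] for _ in range(length)]
    J = [
        [
            [[rng.gauss(0.0, 0.1) for _ in range(Q)] for _ in range(Q)]
            for _ in range(length)
        ]
        for _ in range(length)
    ]

    wild_type = encode("ACD-E")
    mutant = encode("ACDKE")

    # Energy difference between mutant and wild type, a quantity commonly
    # used to score mutational effects with Potts models.
    delta_e = potts_energy(mutant, h, J) - potts_energy(wild_type, h, J)
    print(f"Energy difference (mutant - wild type): {delta_e:.3f}")
```

Every term in this energy can be attributed to a specific site or pair of sites, which is the sense in which such models are considered interpretable compared with black-box architectures.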

Structures

Partners

  • POLITECNICO DI TORINO - AMMINISTRAZIONE CENTRALE - Coordinator
  • UNIVERSITA' COMMERCIALE LUIGI BOCCONI

Keywords

ERC sectors

PE3_15 - Statistical physics: phase transitions, noise and fluctuations, models of complex systems, etc.
PE6_13 - Bioinformatics, biocomputing, and DNA and molecular computation systems, cyber-physical systems

Budget

Total cost: € 195,678.00
Total contribution: € 189,531.00
PoliTo total cost: € 104,478.00
PoliTo contribution: € 98,331.00