The chemical, pharmaceutical, and materials industries increasingly rely on sustainable and efficient strategies for molecular synthesis, which has made biocatalysis increasingly important in the modern life sciences. Enzymes are highly selective and efficient natural catalysts and offer significant advantages for environmentally benign, carbon-neutral processes. However, their practical application is often limited by narrow substrate scope, insufficient stability, and an incomplete understanding of structure–function relationships. This PhD research addresses these challenges by developing and applying computational approaches to enzyme engineering that integrate molecular modeling and machine learning, in collaboration with experimental groups, to improve enzyme design, predict biocatalytic behavior, and expand catalytic capabilities.

A main focus of this work is the prediction of cytochrome P450 (CYP450)-mediated metabolism and inhibition, which play critical roles in drug efficacy and toxicity. A systematic benchmarking study of machine learning models for CYP450 substrate and inhibitor prediction was conducted, followed by the development of two predictive frameworks. First, CypEGAT integrates protein embeddings with graph-based molecular representations to predict interactions across five major human CYP450 isoforms. Building on this, CypST was developed as a multimodal deep learning framework that combines ESM-3 protein embeddings and MolBERT-derived chemical features through attention-based fusion strategies. CypST demonstrated improved predictive performance over existing methods across nine CYP450 isoforms. Additionally, a toxicity prediction module trained on the Tox21 dataset enabled the identification of hepatotoxicity-related mechanisms, although prediction of non-hepatic toxicities remains challenging because of limited endpoint coverage.
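As a rough illustration of what attention-based fusion of a protein embedding with molecular features can look like, the sketch below uses a protein vector as the query over a set of per-token molecule vectors, pools them with softmax weights, and concatenates the result with the protein vector. This is not the CypST architecture, which the abstract does not specify: the function names `softmax` and `attention_fuse` are hypothetical, and the short lists of numbers are toy stand-ins for ESM-3 and MolBERT outputs, assumed here to share one dimensionality.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    protein_vec: Sequence[float],
    mol_token_vecs: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Hypothetical fusion step: scaled dot-product attention with the
    protein vector as query and molecule token vectors as keys and values.

    Returns the fused vector (protein vector concatenated with the
    attention-pooled molecule vector) and the attention weights.
    """
    dim = len(protein_vec)
    scale = math.sqrt(dim)
    # One score per molecule token: scaled dot product with the protein vector.
    scores = [
        sum(p * t for p, t in zip(protein_vec, token)) / scale
        for token in mol_token_vecs
    ]
    weights = softmax(scores)
    # Weighted sum of token vectors, computed one dimension at a time.
    pooled = [
        sum(w * token[j] for w, token in zip(weights, mol_token_vecs))
        for j in range(dim)
    ]
    return list(protein_vec) + pooled, weights


# Toy inputs (assumed values, not real embeddings).
protein = [1.0, 0.0]
molecule_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = attention_fuse(protein, molecule_tokens)
```

In a trained model the query, key, and value projections would be learned and the fused vector would feed a classifier head; here the token most aligned with the protein vector simply receives the larger weight.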
To complement data-driven approaches, this work also investigates enzyme structure–function relationships through classical molecular modeling. The bacterial monooxygenase CYP153A6 was studied using homology modeling, ensemble docking, molecular dynamics simulations, and QM/MM calculations to elucidate substrate recognition and reactivity toward toluene derivatives. The results reveal key determinants of substrate orientation and reactivity, highlighting the importance of active-site geometry and electronic effects in catalytic selectivity.

In the context of protein design, two machine learning frameworks were developed to address critical challenges in mutation effect prediction and sequence prioritization. The ThermoFusion model combines structural and evolutionary representations to predict mutation-induced stability changes (ΔΔG), achieving improved generalization across diverse proteins. In parallel, a supervised consensus ranking framework based on learning-to-rank methods integrates outputs from multiple design models to enhance the selection of functional protein variants, demonstrating improved performance in large-scale benchmarks and real-world enzyme design applications.

Computational enzyme design strategies were further applied to the engineering of the stilbene cleavage oxygenase NOV1 for the biotransformation of ferulic acid to vanillin. Despite the use of multiple approaches, including deep learning–based design, physics-based modeling, and free energy calculations, experiments revealed a loss of enzymatic activity in the designed variants, underscoring the complexity of enzyme redesign and the limitations of current computational methods.

Finally, the catalytic mechanism and substrate specificity of lysine cyclodeaminase (LCD) were investigated using extensive molecular dynamics simulations and experimental validation. Although substrate access was not limiting, well-tempered metadynamics simulations revealed that catalytic inefficiency toward bulkier substrates arises from unfavorable conformational dynamics and misaligned reactive geometries. These findings emphasize the critical role of enzyme dynamics in determining catalytic outcomes and provide a mechanistic basis for rational enzyme engineering.

Overall, this thesis investigates how integrated computational approaches can accelerate enzyme engineering, improve predictive modeling of drug metabolism and toxicity, and provide fundamental insights into enzymatic catalysis. The methodologies developed herein contribute to advancing next-generation biocatalyst design and support the broader goal of sustainable biochemistry.
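The idea of combining the outputs of several design models into one ranking, as in the consensus ranking framework described above, can be sketched with a much simpler stand-in. The thesis uses a supervised learning-to-rank approach whose details are not given in the abstract; the example below instead uses unsupervised mean-rank aggregation, and the function names, variant labels, and scores are all hypothetical. It assumes every model scores the same set of variants and that a higher score is better.

```python
from typing import Dict, List


def ranks_from_scores(scores: Dict[str, float]) -> Dict[str, int]:
    """Convert one model's scores to ranks (1 = best, higher score is better)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {variant: position for position, variant in enumerate(ordered, start=1)}


def consensus_ranking(model_scores: List[Dict[str, float]]) -> List[str]:
    """Order variants by their mean rank across models (lowest mean rank first).

    Ties in mean rank are broken alphabetically by variant name.
    """
    variants = list(model_scores[0])
    per_model_ranks = [ranks_from_scores(scores) for scores in model_scores]
    mean_rank = {
        variant: sum(ranks[variant] for ranks in per_model_ranks) / len(per_model_ranks)
        for variant in variants
    }
    return sorted(variants, key=lambda variant: (mean_rank[variant], variant))


# Toy scores from three hypothetical design models.
model_a = {"V1": 0.9, "V2": 0.5, "V3": 0.1}
model_b = {"V1": 0.2, "V2": 0.8, "V3": 0.1}
model_c = {"V1": 0.7, "V2": 0.6, "V3": 0.9}

consensus = consensus_ranking([model_a, model_b, model_c])
print(consensus)  # ['V1', 'V2', 'V3']
```

A supervised variant would replace the fixed mean with a model trained on experimentally labeled variants, so that more reliable design models carry more weight in the final ordering.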

COMPUTATIONAL ENZYME ENGINEERING: FROM MOLECULAR MODELING TO MACHINE LEARNING / Y. Wei ; tutor: I. Eberini, F. E. Molinari ; coordinator: G. D. Norata. Dipartimento di Scienze Farmacologiche e Biomolecolari Rodolfo Paoletti, 2026. 38th cycle, Academic Year 2025/2026.

COMPUTATIONAL ENZYME ENGINEERING: FROM MOLECULAR MODELING TO MACHINE LEARNING

Y. Wei
2026

14 May 2026
Sector BIOS-07/A - Biochemistry
EBERINI, IVANO
MOLINARI, FRANCESCO ENZO
NORATA, GIUSEPPE DANILO
Doctoral Thesis
Files in this record:
phd_unimi_R14186.pdf (Adobe PDF, 33.21 MB), under embargo until 14 May 2027
Description: Doctoral thesis
Type: Post-print, accepted manuscript, etc. (version accepted by the publisher)
License: Creative Commons
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1244869