Fine-tuning of Conditional Transformers Improves the Generalization of Functionally Characterized Proteins / M. Nicolini, D. Malchiodi, A. Cabri, E. Cavalleri, M. Mesiti, A. Paccanaro, P. N. Robinson, J. Reese, E. Casiraghi, G. Valentini - In: Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 1: BIOINFORMATICS / edited by M. P. Guarino, K. Hotta, M. Yousef, H. Liu, G. Saggio, A. Fred, H. Gamboa. - [s.l.]: SCITEPress, 2024. - ISBN 978-989-758-688-0. - pp. 561-568. Paper presented at the 17th International Joint Conference on Biomedical Engineering Systems and Technologies, held in Rome in 2024 [DOI: 10.5220/0012567900003657].
Fine-tuning of Conditional Transformers Improves the Generalization of Functionally Characterized Proteins
D. Malchiodi; A. Cabri; E. Cavalleri; M. Mesiti; E. Casiraghi; G. Valentini
2024
Abstract
Conditional transformers improve the generative capabilities of large language models (LLMs) by processing specific control tags that drive the generation of texts characterized by specific features. Recently, a similar approach has been applied to the generation of functionally characterized proteins by adding specific tags to the protein sequence to qualify their functions (e.g., Gene Ontology terms) or other characteristics (e.g., their family or the species they belong to). In this work, we show that fine-tuning conditional transformers, pre-trained on large corpora of proteins, on specific protein families can significantly enhance the prediction accuracy of the pre-trained models and can also generate new, potentially functional proteins that could enlarge the protein space explored by natural evolution. We obtained encouraging results on the phage lysozyme family of proteins, achieving statistically significantly better prediction results than the original pre-trained model. The comparative analysis of the primary and tertiary structures of the synthetic proteins generated by our model against the natural ones shows that the resulting fine-tuned model is able to generate biologically plausible proteins. Our results suggest that fine-tuned conditional transformers can be applied to other functionally characterized proteins for possible industrial and pharmacological applications.

File | Type | Size | Format
---|---|---|---
125679.pdf (open access) | Publisher's version/PDF | 1.24 MB | Adobe PDF
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.