Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools

Reese, J.T.; Chimirri, L.; Bridges, Y.; Danis, D.; Caufield, J.H.; Gargano, M.A.; Kroll, C.; Schmeder, A.; Liu, F.; Wissink, K.; McMurry, J.A.; Graefe, A.S.L.; Niyonkuru, E.; Korn, D.R.; Casiraghi, E.; Valentini, G.; Jacobsen, J.O.B.; Haendel, M.; Smedley, D.; Mungall, C.J.; Robinson, P.N.

doi:10.1038/s41431-026-02054-5

Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses, and their accuracy compared to existing diagnostic tools is not well characterized. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5213 previously published case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to seven LLMs, including four generalist models and three LLMs specialized for medical applications. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
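The headline figures in the abstract are top-1 rates: the fraction of cases in which a tool ranks the correct diagnosis first. As a minimal sketch of that metric (not the authors' pipeline), the following assumes each tool's output has already been mapped to disease identifiers; the function name `top_k_accuracy`, the case IDs, and the placeholder disease IDs are all hypothetical.

```python
from typing import Dict, List


def top_k_accuracy(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int = 1) -> float:
    """Fraction of cases whose correct disease ID is among the first k ranked candidates."""
    if not truth:
        raise ValueError("no cases to score")
    hits = sum(
        1
        for case_id, correct in truth.items()
        # A case with no returned candidates counts as a miss.
        if correct in ranked.get(case_id, [])[:k]
    )
    return hits / len(truth)


# Hypothetical toy data: placeholder IDs, not real cases or real Mondo terms.
truth = {"case1": "MONDO:A", "case2": "MONDO:B", "case3": "MONDO:C", "case4": "MONDO:D"}
ranked = {
    "case1": ["MONDO:A", "MONDO:X"],  # correct at rank 1
    "case2": ["MONDO:X", "MONDO:B"],  # correct at rank 2
    "case3": ["MONDO:Y"],             # correct diagnosis absent
    # case4 is missing: the tool returned no candidates
}

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
print(f"top-1: {top1:.1%}, top-2: {top2:.1%}")
```

Scoring every tool against the same set of cases with the same function is what makes rates such as the reported 23.6% and 35.5% directly comparable.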

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools / J.T. Reese, L. Chimirri, Y. Bridges, D. Danis, J.H. Caufield, M.A. Gargano, C. Kroll, A. Schmeder, F. Liu, K. Wissink, J.A. McMurry, A.S.L. Graefe, E. Niyonkuru, D.R. Korn, E. Casiraghi, G. Valentini, J.O.B. Jacobsen, M. Haendel, D. Smedley, C.J. Mungall, P.N. Robinson. - In: EUROPEAN JOURNAL OF HUMAN GENETICS. - ISSN 1018-4813. - (2026), pp. 1-7. [Epub ahead of print] [10.1038/s41431-026-02054-5]

Reese, Justin T.; Chimirri, Leonardo; Bridges, Yasemin; Danis, Daniel; Caufield, J. Harry; Gargano, Michael A.; Kroll, Carlo; Schmeder, Andrew; Liu, Fengchen; Wissink, Kyran; McMurry, Julie A.; Graefe, Adam S. L.; Niyonkuru, Enock; Korn, Daniel R.; Casiraghi, E.; Valentini, G.; Jacobsen, Julius O. B.; Haendel, Melissa; Smedley, Damian; Mungall, Christopher J.; Robinson, Peter N.

2026

Scientific-disciplinary sectors of the article (valid from 9 May 2024): Sector INFO-01/A - Computer Science; Sector MEDS-01/A - Medical Genetics
Publication date: 2026
Ahead-of-print or print date: 24 Feb 2026
Journal in ANCE: EUROPEAN JOURNAL OF HUMAN GENETICS
DOI: https://dx.doi.org/10.1038/s41431-026-02054-5
Type: Article (author)
Appears in types: 01 - Journal article

Files in this record:

File	Access	Type	License	Size	Format
unpaywall-bitstream-1705228961.pdf	Open access	Publisher's version/PDF	Creative Commons	1.55 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1231540

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca