Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses, and their accuracy compared to existing diagnostic tools is not well characterized. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5213 previously published case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to seven LLMs, including four generalist models and three LLMs specialized for medical applications. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
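The headline figures above (correct diagnosis ranked first in 23.6% versus 35.5% of cases) are top-1 accuracies. The following is a minimal sketch of how such a metric could be computed, assuming each tool's output has already been mapped to an ordered list of disease identifiers. The function names, the data layout, and the placeholder identifiers are hypothetical and are not taken from the published benchmark code.

```python
from typing import Dict, List, Optional


def rank_of_correct(ranked_ids: List[str], correct_id: str) -> Optional[int]:
    """Return the 1-based rank of the correct diagnosis, or None if absent."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == correct_id:
            return position
    return None


def top_k_accuracy(cases: List[Dict], k: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the top k."""
    if not cases:
        raise ValueError("no cases to score")
    hits = 0
    for case in cases:
        rank = rank_of_correct(case["ranked"], case["correct"])
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(cases)


# Hypothetical toy data: the identifiers are placeholders, not real results.
toy_cases = [
    {"correct": "MONDO:A", "ranked": ["MONDO:A", "MONDO:B", "MONDO:C"]},
    {"correct": "MONDO:B", "ranked": ["MONDO:C", "MONDO:B"]},
    {"correct": "MONDO:C", "ranked": ["MONDO:A", "MONDO:B"]},
    {"correct": "MONDO:D", "ranked": ["MONDO:D"]},
]

top1 = top_k_accuracy(toy_cases, k=1)
top3 = top_k_accuracy(toy_cases, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Scoring every tool with the same function on the same cases keeps the comparison like for like. Raising k shows how often the correct diagnosis appears anywhere in a shortlist rather than only in first place.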

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools / J.T. Reese, L. Chimirri, Y. Bridges, D. Danis, J.H. Caufield, M.A. Gargano, C. Kroll, A. Schmeder, F. Liu, K. Wissink, J.A. McMurry, A.S.L. Graefe, E. Niyonkuru, D.R. Korn, E. Casiraghi, G. Valentini, J.O.B. Jacobsen, M. Haendel, D. Smedley, C.J. Mungall, P.N. Robinson. - In: EUROPEAN JOURNAL OF HUMAN GENETICS. - ISSN 1018-4813. - (2026), pp. 1-7. [Epub ahead of print] [10.1038/s41431-026-02054-5]

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools

E. Casiraghi; G. Valentini
2026

Sector INFO-01/A - Computer Science
Sector MEDS-01/A - Medical Genetics
2026
24 Feb 2026
Article (author)
Files in this item:
unpaywall-bitstream-1705228961.pdf
Access: open access
Type: Publisher's version/PDF
License: Creative Commons
Size: 1.55 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1231540