Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models / F. Borgonovo, T. Matsuo, F. Petri, S.M. Amin Alavi, L.C. Mazudie Ndjonko, A. Gori, E.F. Berbari. - In: MAYO CLINIC PROCEEDINGS. DIGITAL HEALTH. - ISSN 2949-7612. - 3:3(2025 Sep), pp. 100230.1-100230.10. [10.1016/j.mcpdig.2025.100230]
Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models
F. Borgonovo (first author); F. Petri; A. Gori (penultimate author)
2025
Abstract
Objective: To evaluate the ability of 15 different large language models (LLMs) to solve clinical cases involving osteoarticular infections according to published guidelines. Materials and methods: The study evaluated 15 LLMs across 5 categories of osteoarticular infections: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infection, and septic arthritis. Models were selected systematically, including general-purpose and medical-specific systems, ensuring robust English support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered individually, with responses classified as correct or incorrect based on the guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results, presented as percentages of correct answers and aggregated scores, highlight performance trends. Mixed-effects logistic regression with a random question effect was used to quantify how each LLM compared in answering the study questions. Results: The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.9%). When used as the reference model, OpenEvidence was noninferior to every comparator and superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models. Conclusion: OpenEvidence and Microsoft Copilot achieved the highest accuracy among the evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications, requiring further refinement to reliably support clinical decision making.
| File | Size | Format | |
|---|---|---|---|
| Battle of the Bots_ solving Clinical Cases in Osteoarticular Infections.pdf (open access; Publisher's version/PDF; Creative Commons license) | 481.11 kB | Adobe PDF | View/Open |
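The comparison described in the abstract relies on mixed-effects logistic regression with a random question effect, which accounts for per-question difficulty when contrasting models. As a rough, self-contained analogue that likewise conditions on question identity, an exact McNemar test on the discordant questions for a single pair of models can be sketched; the per-question scores below are hypothetical, not the study's data:

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test for two models graded on the
    same questions (1 = correct, 0 = incorrect).

    Only discordant questions (exactly one model correct) carry
    information; under the null hypothesis of equal accuracy, each
    discordant direction is equally likely, so the smaller discordant
    count follows Binomial(n, 0.5).
    """
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the models agree on every question
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return min(1.0, 2 * tail / 2**n)

# Hypothetical per-question scores for two models on 10 questions
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(mcnemar_exact(model_a, model_b))  # → 0.25
```

The same idea extends to the study's 126 questions for any pair of models, but a pairwise exact test does not replace the mixed-effects regression actually used, which estimates noninferiority and superiority across all 15 models simultaneously while sharing the question-level random effect.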