
Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models / F. Borgonovo, T. Matsuo, F. Petri, S.M. Amin Alavi, L.C. Mazudie Ndjonko, A. Gori, E.F. Berbari. - In: MAYO CLINIC PROCEEDINGS. DIGITAL HEALTH. - ISSN 2949-7612. - 3:3(2025 Sep), pp. 100230.1-100230.10. [10.1016/j.mcpdig.2025.100230]

Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models

F. Borgonovo (first author); F. Petri; A. Gori (penultimate author); 2025

Abstract

Objective: To evaluate the ability of 15 large language models (LLMs) to solve clinical cases of osteoarticular infection in accordance with published guidelines. Materials and methods: The study evaluated 15 LLMs across 5 categories of osteoarticular infection: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infection, and septic arthritis. Models were selected systematically and included both general-purpose and medical-specific systems with robust English support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered individually, and responses were classified as correct or incorrect according to the guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results are presented as percentages of correct answers and aggregated scores. Mixed-effects logistic regression with a random question effect was used to quantify how the LLMs compared in answering the study questions. Results: The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.8%). When used as the reference, OpenEvidence was not inferior to any comparator and was superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models. Conclusion: OpenEvidence and Microsoft Copilot achieved the highest accuracy among the evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications and require further refinement to support clinical decision making reliably.
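
The reported percentages follow from the number of correct answers out of 126 questions. The following is a minimal sketch of that arithmetic only; the function name `accuracy_pct` and the `scores` dictionary are illustrative assumptions, not part of the study's analysis code, and only the four scores stated in the abstract are included.

```python
# Illustrative sketch: reproduce the abstract's percentages from raw counts.
# Names here are hypothetical and not taken from the study.
TOTAL_QUESTIONS = 126  # number of text-based questions stated in the abstract


def accuracy_pct(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Return the percentage of correct answers."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return 100.0 * correct / total


# Correct-answer counts reported in the abstract.
scores = {
    "OpenEvidence": 119,
    "Microsoft Copilot": 119,
    "ChatGPT-4o": 117,
    "Gemini 2.5 Pro": 117,
}

for model, correct in scores.items():
    print(f"{model}: {correct}/{TOTAL_QUESTIONS} = {accuracy_pct(correct):.2f}%")
```

With these counts, 119/126 is about 94.44% and 117/126 is about 92.86%, which the abstract reports as 94.4% and 92.8%, respectively.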
Sector MEDS-10/B - Infectious Diseases
   NATIONAL INSTITUTE OF HEALTH (NIH), NATIONAL CENTER FOR ADVANCING TRANSLATIONAL SCIENCES (NCATS), BESPOKE GENE THERAPY CONSORTIUM - COORDINATION CENTER
   National Institutes of Health
   NATIONAL CENTER FOR ADVANCING TRANSLATIONAL SCIENCES
   75N98022D00019-0-759802300006-1
Sep 2025
23 May 2025
Article (author)
Files in this item:
File: Battle of the Bots_ solving Clinical Cases in Osteoarticular Infections.pdf
Access: open access
Type: Publisher's version/PDF
License: Creative Commons
Size: 481.11 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1238375
Citations
  • Scopus: 3