Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models / F. Borgonovo, T. Matsuo, F. Petri, S.M. Amin Alavi, L.C. Mazudie Ndjonko, A. Gori, E.F. Berbari. - In: MAYO CLINIC PROCEEDINGS. DIGITAL HEALTH. - ISSN 2949-7612. - 3:3(2025 Sep), pp. 100230.1-100230.10. [10.1016/j.mcpdig.2025.100230]
Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models
F. Borgonovo (first author); F. Petri; A. Gori (penultimate author)
2025
Abstract
Objective: To evaluate the ability of 15 different large language models (LLMs) to solve clinical cases involving osteoarticular infections according to published guidelines. Materials and methods: The study evaluated 15 LLMs across 5 categories of osteoarticular infections: periprosthetic joint infection, diabetic foot infection, native vertebral osteomyelitis, fracture-related infection, and septic arthritis. Models were selected systematically, including general-purpose and medical-specific systems, ensuring robust English support. In total, 126 text-based questions, developed by the authors from published guidelines and validated by experts, assessed diagnostic, management, and treatment strategies. Each model answered individually, with responses classified as correct or incorrect based on the guidelines. All tests were conducted between April 17, 2025, and April 28, 2025. Results, presented as percentages of correct answers and aggregated scores, highlight performance trends. Mixed-effects logistic regression with a random question effect was used to quantify how each LLM compared in answering the study questions. Results: The performance of 15 LLMs was evaluated, with the percentage of correct answers reported. OpenEvidence and Microsoft Copilot achieved the highest score (119/126 [94.4%]), excelling in multiple categories. ChatGPT-4o and Gemini 2.5 Pro each scored 117 of 126 (92.9%). When used as the reference model, OpenEvidence was noninferior to every comparator and superior to 5 LLMs. Performance varied across categories, highlighting the strengths and limitations of individual models. Conclusion: OpenEvidence and Microsoft Copilot achieved the highest accuracy among the evaluated LLMs, highlighting their potential for precisely addressing complex clinical cases. This study emphasizes the need for specialized, validated artificial intelligence tools in medical practice. Although promising, current models face limitations in real-world applications, requiring further refinement to reliably support clinical decision making.
| File | Size | Format | |
|---|---|---|---|
| Battle of the Bots_ solving Clinical Cases in Osteoarticular Infections.pdf (open access; Publisher's version/PDF; Creative Commons license) | 481.11 kB | Adobe PDF | View/Open |
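The comparison described in the abstract relies on mixed-effects logistic regression with a random question effect, which accounts for per-question difficulty when contrasting models. As a rough, self-contained analogue that likewise conditions on question identity, an exact McNemar test on the discordant questions for a single pair of models can be sketched; the per-question scores below are hypothetical, not the study's data:

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test for two models graded on the
    same questions (1 = correct, 0 = incorrect).

    Only discordant questions (exactly one model correct) carry
    information; under the null hypothesis of equal accuracy, each
    discordant direction is equally likely, so the smaller discordant
    count follows Binomial(n, 0.5).
    """
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the models agree on every question
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return min(1.0, 2 * tail / 2**n)

# Hypothetical per-question scores for two models on 10 questions
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(mcnemar_exact(model_a, model_b))  # → 0.25
```

The same idea extends to the study's 126 questions for any pair of models, but a pairwise exact test does not replace the mixed-effects regression actually used, which estimates noninferiority and superiority across all 15 models simultaneously while sharing the question-level random effect.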