Evaluating Three Large Language Models in Ophthalmology Education: A Comparative Study of Accuracy and Readability Across 442 Questions


ÇAKMAK S., Karakiraz A., ULUIŞIK I. E., Kara A. E., Bayraktar S., Altinkurt E.

OPHTHALMIC EPIDEMIOLOGY, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1080/09286586.2026.2651197
  • Journal Name: OPHTHALMIC EPIDEMIOLOGY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE, MEDLINE
  • İstanbul University Affiliated: Yes

Abstract

Purpose: To compare the accuracy and readability of responses generated by ChatGPT-4.0 Mini, Gemini 1.5 Flash, and Microsoft Copilot to fifth-year medical school ophthalmology exam questions.

Methods: A total of 442 multiple-choice questions were submitted individually to each chatbot. Responses were marked correct or incorrect against an answer key. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), and Simple Measure of Gobbledygook (SMOG) indices.

Results: Overall accuracy rates differed significantly among the chatbots (p < 0.001). Copilot had the highest accuracy rate (89.4%), followed by ChatGPT (84.2%) and Gemini (76.7%). Readability indices also differed significantly (p < 0.001 for all). Copilot demonstrated the lowest linguistic complexity, with the lowest FKGL and SMOG scores and the highest FRE values, whereas ChatGPT generated responses with the highest linguistic complexity across all evaluated metrics. Intraclass correlation coefficients for the readability metrics ranged from 0.68 to 0.83, with the highest agreement between ChatGPT and Gemini. Despite the statistically significant differences in readability, all responses generally required high school to early college-level reading ability.

Conclusion: Large language model-based chatbots demonstrated variable performance in answering medical school-level ophthalmology questions. Among the models evaluated, Microsoft Copilot performed best in both accuracy and readability. These findings suggest that model choice may influence the usefulness of AI-generated content in educational settings.
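The three readability indices used in the study are standard formulas over word, sentence, and syllable counts. The paper does not describe its scoring implementation; the following is a minimal illustrative sketch of the published formulas, taking pre-computed counts as inputs (real use would also need a tokenizer and a syllable counter):

```python
import math

def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: higher score = harder text (U.S. grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fre(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher score = easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def smog(polysyllables: int, sentences: int) -> float:
    """SMOG grade: based on count of words with 3+ syllables."""
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * (30 / sentences))

# Hypothetical chatbot response: 100 words in 5 sentences,
# 150 syllables total, 10 polysyllabic words.
print(fkgl(100, 5, 150))  # ≈ 9.91, roughly high-school level
print(fre(100, 5, 150))   # ≈ 59.6, "fairly difficult" band
print(smog(10, 5))        # ≈ 11.2
```

A score like FKGL ≈ 10-13 corresponds to the "high school to early college" reading level reported in the Results.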