Evaluating Three Large Language Models in Ophthalmology Education: A Comparative Study of Accuracy and Readability Across 442 Questions


ÇAKMAK S., Karakiraz A., ULUIŞIK I. E., Kara A. E., Bayraktar S., Altinkurt E.

OPHTHALMIC EPIDEMIOLOGY, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1080/09286586.2026.2651197
  • Journal Name: OPHTHALMIC EPIDEMIOLOGY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE, MEDLINE
  • İstanbul University Affiliated: Yes

Abstract

Purpose: To compare the accuracy and readability of responses generated by ChatGPT-4.0 Mini, Gemini 1.5 Flash, and Microsoft Copilot to fifth-year medical school ophthalmology exam questions.

Methods: A total of 442 multiple-choice questions were submitted individually to each chatbot. Responses were marked correct or incorrect against an answer key. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), and Simple Measure of Gobbledygook (SMOG) indices.

Results: Overall accuracy rates differed significantly among the chatbots (p < 0.001). Copilot had the highest accuracy rate (89.4%), followed by ChatGPT (84.2%) and Gemini (76.7%). Readability indices also differed significantly (p < 0.001 for all). Copilot demonstrated the lowest linguistic complexity, with the lowest FKGL and SMOG scores and the highest FRE values, whereas ChatGPT generated responses with the highest linguistic complexity across all evaluated metrics. Intraclass correlation coefficients for the readability metrics ranged from 0.68 to 0.83, with the highest agreement between ChatGPT and Gemini. Despite the statistically significant differences in readability, all responses generally required high school to early college-level reading ability.

Conclusion: Large language model-based chatbots demonstrated variable performance in answering medical school-level ophthalmology questions. Among the models evaluated, Microsoft Copilot performed best in both accuracy and readability. These findings suggest that model choice may influence the usefulness of AI-generated content in educational settings.
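The three readability indices used in the study are standard formulas over word, sentence, and syllable counts. The paper does not describe its scoring implementation; the following is a minimal illustrative sketch of the published formulas, taking pre-computed counts as inputs (real use would also need a tokenizer and a syllable counter):

```python
import math

def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: higher score = harder text (U.S. grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fre(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher score = easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def smog(polysyllables: int, sentences: int) -> float:
    """SMOG grade: based on count of words with 3+ syllables."""
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * (30 / sentences))

# Hypothetical chatbot response: 100 words in 5 sentences,
# 150 syllables total, 10 polysyllabic words.
print(fkgl(100, 5, 150))  # ≈ 9.91, roughly high-school level
print(fre(100, 5, 150))   # ≈ 59.6, "fairly difficult" band
print(smog(10, 5))        # ≈ 11.2
```

A score like FKGL ≈ 10-13 corresponds to the "high school to early college" reading level reported in the Results.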