Scientific Reports, vol. 15, no. 1, 2025 (SCI-Expanded, Scopus)
Gestational diabetes mellitus (GDM) is a prevalent condition requiring accurate patient education, yet the reliability and readability of large language models (LLMs) in this context remain uncertain. This study evaluated the performance of four LLMs (ChatGPT-4o, Gemini 2.5 Pro, Grok 3.0, and DeepSeek-R1) on 25 patient-oriented questions derived from clinical scenarios. Seven endocrinologists independently rated the responses using the modified DISCERN (mDISCERN) instrument and the Global Quality Score (GQS). Readability was analyzed with the Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman–Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG), and lexical diversity was assessed with the type–token ratio (TTR). Grok and Gemini obtained the highest mDISCERN and GQS scores, whereas ChatGPT scored significantly lower (p < 0.05). DeepSeek generated the most readable outputs, while Grok produced the longest and most complex responses. All models scored below the FRES threshold of 60 recommended for lay audiences. Response length correlated strongly and positively with mDISCERN and GQS, whereas TTR was inversely related to quality but positively associated with readability. These findings highlight variability among LLMs in GDM education and underscore the need for model-specific improvements to ensure reliable patient-facing health information.
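For readers unfamiliar with these metrics, the sketch below shows how FRES, FKGL, and TTR can be computed from raw text using their standard published formulas. This is a minimal illustration, not the study's actual pipeline: the paper does not name its tooling, and the vowel-group syllable counter is a simplified stand-in for the dictionary-based counting that dedicated readability tools use.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    # Real readability tools use pronunciation dictionaries
    # or more elaborate rules.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_metrics(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    return {
        # Flesch Reading Ease Score: higher = easier;
        # >= 60 is the usual target for lay audiences.
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade.
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        # Type-token ratio: unique word forms / total words.
        "TTR": len({w.lower() for w in words}) / len(words),
    }

print(readability_metrics(
    "Gestational diabetes is high blood sugar that starts in pregnancy. "
    "It usually goes away after the baby is born."
))
```

Because the syllable counter is heuristic, the FRES and FKGL values from this sketch may differ slightly from those produced by dedicated tools, though the relative rankings across texts are typically preserved.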