Scientific Reports, vol. 15, no. 1, 2025 (SCI-Expanded, Scopus)
Gestational diabetes mellitus (GDM) is a prevalent condition requiring accurate patient education, yet the reliability and readability of large language models (LLMs) in this context remain uncertain. This study evaluated the performance of four LLMs (ChatGPT-4o, Gemini 2.5 Pro, Grok 3.0, and DeepSeek-R1) on 25 patient-oriented questions derived from clinical scenarios. Seven endocrinologists independently rated the responses using the modified DISCERN (mDISCERN) instrument and the Global Quality Score (GQS). Readability was analyzed with the Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman–Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG), and lexical diversity was assessed with the type–token ratio (TTR). Grok and Gemini obtained the highest mDISCERN and GQS scores, whereas ChatGPT scored significantly lower (p < 0.05). DeepSeek generated the most readable outputs, while Grok produced the longest and most complex responses. All models scored below the FRES threshold of 60 recommended for lay audiences. Response length correlated strongly and positively with mDISCERN and GQS, whereas TTR was inversely related to quality but positively associated with readability. These findings highlight variability among LLMs in GDM education and underscore the need for model-specific improvements to ensure reliable patient-facing health information.
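For readers unfamiliar with these metrics, the sketch below shows how FRES, FKGL, and TTR can be computed from raw text using their standard published formulas. This is a minimal illustration, not the study's actual pipeline: the paper does not name its tooling, and the vowel-group syllable counter is a simplified stand-in for the dictionary-based counting that dedicated readability tools use.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    # Real readability tools use pronunciation dictionaries
    # or more elaborate rules.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_metrics(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    return {
        # Flesch Reading Ease Score: higher = easier;
        # >= 60 is the usual target for lay audiences.
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade.
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        # Type-token ratio: unique word forms / total words.
        "TTR": len({w.lower() for w in words}) / len(words),
    }

print(readability_metrics(
    "Gestational diabetes is high blood sugar that starts in pregnancy. "
    "It usually goes away after the baby is born."
))
```

Because the syllable counter is heuristic, the FRES and FKGL values from this sketch may differ slightly from those produced by dedicated tools, though the relative rankings across texts are typically preserved.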