Can AI chatbots guide patients and physicians about neck pain? A reliability and readability comparison of ChatGPT-4 and Gemini

Ozdas Sevgin, DİCLE; Tarihci Cakmak, ELİF; Yildirim Ogras, GİZEM; Diracoglu, DEMİRHAN

doi:10.1177/10538127261433272

Can AI chatbots guide patients and physicians about neck pain? A reliability and readability comparison of ChatGPT-4 and Gemini

Ozdas Sevgin D. R., Tarihci Cakmak E., Yildirim Ogras G., Diracoglu D.

Journal of Back and Musculoskeletal Rehabilitation, 2026 (SCI-Expanded, Scopus)

Yayın Türü: Makale / Tam Makale
Basım Tarihi: 2026
Doi Numarası: 10.1177/10538127261433272
Dergi Adı: Journal of Back and Musculoskeletal Rehabilitation
Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE
Anahtar Kelimeler: artificial intelligence < A, Neck < N, pain < P
İstanbul Üniversitesi Adresli: Evet

Özet

Background: Artificial intelligence (AI)-based chatbots are increasingly used as sources of medical information. Given the high prevalence of neck pain as a musculoskeletal symptom, patients may commonly consult such tools for health-related guidance. Objective: To evaluate and compare the performance of ChatGPT 4.0 and Google Gemini in addressing commonly asked patient questions and clinical case scenarios related to neck pain, focusing on their accuracy, quality, understandability, readability, reliability, and usability. Methods: Twenty-four patient-oriented questions and four clinical case scenarios regarding neck pain were submitted to ChatGPT 4.0 and Google Gemini. Responses were evaluated using validated tools: modified DISCERN (mDISCERN) for reliability, Global Quality Scale (GQS) for quality, PEMAT-P for understandability and actionability, and Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) for readability. Case-based responses were assessed for accuracy, safety, and usability on a 7-point Likert scale by two experienced physicians. Results: Gemini demonstrated significantly higher reliability (mDISCERN, p < 0.001), whereas ChatGPT 4.0 had slightly higher, though statistically insignificant, GQS and PEMAT-P scores. Readability metrics were similar: ChatGPT's FRE was 48.78 and FKGL 9.08; Gemini's FRE was 47.12 and FKGL 9.11. Both models’ outputs were considered difficult to read. In clinical scenarios, both chatbots showed comparable accuracy, safety, and usability, with minor omissions noted. Conclusion: ChatGPT 4.0 and Google Gemini provided similar performance in addressing neck pain-related queries. While both may support patient.