BMC Musculoskeletal Disorders, vol. 26, no. 1, 2025 (SCI-Expanded, Scopus)
Background: Patients increasingly seek online medical information, with artificial intelligence (AI) chatbots like ChatGPT emerging as potential resources for adolescent idiopathic scoliosis (AIS); however, their accuracy and reliability need assessment. This study aimed to evaluate the accuracy and reliability of ChatGPT in answering questions related to AIS.

Methods: Sixty-four questions across four categories (general information, diagnosis and screening, treatment and follow-up, and quality of life [QoL]) were adapted from FAQs on professional association websites, the SOSORT consensus article, and QoL questionnaires. Two reviewers rated ChatGPT’s responses on a scale from 1 (correct and comprehensive) to 4 (completely incorrect). Descriptive statistics summarized the percentage of responses per score and the distribution of scores across categories. Each question was entered twice to assess reliability, and the percentage of responses that differed between the two entries was calculated. Cohen’s kappa statistic was used to assess agreement between the two reviewers.

Results: Of all the responses, 53.1% were rated as “correct and comprehensive,” while 34.4% were rated as “correct but not comprehensive.” ChatGPT performed best in the QoL category, with 13 out of 15 (86.7%) responses rated as correct. The second-best performance was in the diagnosis and screening category, with 7 out of 13 (53.8%) correct responses, followed by the general information category, with 9 out of 17 (52.9%) correct responses. The lowest performance was in the treatment and follow-up category, with 5 out of 19 (26.3%) correct responses. Consistency in ChatGPT’s responses when questions were entered twice was 76.6%. Agreement between the reviewers’ scores was excellent (Cohen’s kappa: 0.82, 95% CI: 0.59 to 1.04; p = 0.0001).
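For readers unfamiliar with the agreement statistic named in the Methods, the following is a minimal sketch of how Cohen's kappa is computed for two raters: observed agreement corrected for the agreement expected by chance from each rater's marginal score frequencies. The ratings below are hypothetical values on the study's 1–4 scale, chosen purely for illustration; they are not the study's data, and the published analysis may have used different software.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal score frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-4 ratings for two reviewers (not the study's data).
a = [1, 1, 2, 2, 1, 3, 1, 2, 1, 4]
b = [1, 1, 2, 1, 1, 3, 1, 2, 2, 4]
print(round(cohens_kappa(a, b), 2))  # → 0.69
```

A kappa of 0.82, as reported here, falls in the "almost perfect" band of the commonly cited Landis and Koch benchmarks, consistent with the abstract's description of excellent agreement.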
Conclusions: ChatGPT demonstrated strong accuracy in addressing questions related to QoL in AIS, but its accuracy in treatment-related areas remains insufficient. Therefore, patients and parents are advised to consult medical professionals rather than rely solely on AI-generated information for AIS treatment and management.