LLM-based data augmentation for text classification on imbalanced datasets: A case study on fake news detection

Arık, Ahmet; Parlayandemir, Gizem; Çelik, Serra

doi:10.1016/j.eij.2026.100886

Arık A. O., Parlayandemir G., Çelik S.

Egyptian Informatics Journal, vol. 33, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 33
Publication Date: 2026
DOI: 10.1016/j.eij.2026.100886
Journal Name: Egyptian Informatics Journal
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Directory of Open Access Journals
Keywords: Fake News, Generative Artificial Intelligence, Imbalanced Data, Large Language Models, Text Classification
Affiliated with Istanbul University: Yes

Abstract

Political fake news fuels a significant epistemic crisis, yet detection in low-resource languages like Turkish is constrained by data scarcity and class imbalance. This study addresses these challenges by constructing the Turkish Political Fake News Dataset (TPFND) and employing a Turkish LLaMA-3 model to generate synthetic samples for data augmentation. The augmented dataset was used to train an XGBoost classifier, which was compared against a baseline and a Random Oversampling approach. Results demonstrate that LLM-based augmentation significantly enhances sensitivity to fake news. While overall accuracy remained high (89–90.5%), the fake news detection rate increased from 91.12% to 97.62%, effectively minimizing false negatives despite a slight precision trade-off. These findings confirm that the methodology provides a robust “safety net” for the Turkish digital ecosystem and a scalable framework for other low-resource languages.
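
The Random Oversampling comparison and the recall/precision trade-off mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_oversample` and `recall_precision`, the toy texts, and the label strings are all hypothetical, and reading "fake news detection rate" as recall on the fake class is an assumption. The sketch uses only the Python standard library and does not involve XGBoost or an LLM.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class has as many samples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        # Candidates for duplication: original samples of this class only.
        pool = [t for t, y in zip(texts, labels) if y == cls]
        for _ in range(target - n):
            out_texts.append(rng.choice(pool))
            out_labels.append(cls)
    return out_texts, out_labels


def recall_precision(y_true, y_pred, positive="fake"):
    """Return (recall, precision) for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy, made-up data: one class is under-represented (illustrative only).
texts = ["a", "b", "c", "d", "e"]
labels = ["real", "real", "real", "real", "fake"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))

# Toy predictions: one missed fake item (false negative), one false alarm.
rec, prec = recall_precision(
    ["fake", "fake", "real", "real"],
    ["fake", "real", "fake", "real"],
)
print(rec, prec)
```

Random oversampling only repeats existing minority samples, whereas LLM-based augmentation as described in the abstract generates new synthetic texts. In either case, fewer false negatives raise recall, while any added false positives lower precision.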