LLM-based data augmentation for text classification on imbalanced datasets: A case study on fake news detection


Arık A. O., Parlayandemir G., ÇELİK S.

Egyptian Informatics Journal, vol. 33, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume Number: 33
  • Publication Date: 2026
  • DOI Number: 10.1016/j.eij.2026.100886
  • Journal Name: Egyptian Informatics Journal
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Directory of Open Access Journals
  • Keywords: Fake News, Generative Artificial Intelligence, Imbalanced Data, Large Language Models, Text Classification
  • Istanbul University Affiliated: Yes

Abstract

Political fake news fuels a significant epistemic crisis, yet detection in low-resource languages like Turkish is constrained by data scarcity and class imbalance. This study addresses these challenges by constructing the Turkish Political Fake News Dataset (TPFND) and employing a Turkish LLaMA-3 model to generate synthetic samples for data augmentation. The augmented dataset was used to train an XGBoost classifier, which was compared against a baseline and a Random Oversampling method. Results demonstrate that LLM-based augmentation significantly enhances sensitivity to fake news. While overall accuracy remained high (89–90.5%), the fake news detection rate increased from 91.12% to 97.62%, effectively minimizing false negatives despite a slight precision trade-off. These findings confirm that the methodology provides a robust “safety net” for the Turkish digital ecosystem and a scalable framework for other low-resource languages.
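
As a rough illustration of the Random Oversampling comparison mentioned in the abstract, the sketch below balances a toy label set by duplicating minority-class samples and computes recall for the fake class. It is not the authors' code: the function names `random_oversample` and `recall`, the toy texts, and the reading of "fake news detection rate" as recall on the fake class are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        for _ in range(target - n):
            out_texts.append(rng.choice(pool))
            out_labels.append(cls)
    return out_texts, out_labels


def recall(y_true, y_pred, positive):
    """Recall for one class: TP / (TP + FN). Assumed here to correspond
    to the 'detection rate' reported in the abstract."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy placeholder data, not drawn from TPFND.
texts = ["real a", "real b", "real c", "real d", "fake a"]
labels = ["real", "real", "real", "real", "fake"]
bal_texts, bal_labels = random_oversample(texts, labels)
```

Random oversampling only repeats existing minority texts, whereas the LLM-based approach described above generates new synthetic samples; the balancing step is otherwise analogous, with generated texts taking the place of the duplicates.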