Information (Switzerland), vol. 16, no. 11, 2025 (ESCI, Scopus)
This paper presents a systematic evaluation of large language models (LLMs) and retrieval-augmented generation (RAG) approaches for question answering (QA) in the low-resource Kazakh language. We assess the performance of existing proprietary (GPT-4o, Gemini 2.5-flash) and open-source Kazakh-oriented models (KazLLM-8B, Sherkala-8B, Irbis-7B) across closed-book and RAG settings. Within a three-stage evaluation framework, we benchmark retriever quality, examine LLM abilities such as knowledge-gap detection, external-truth integration, and context grounding, and measure the gains from realistic end-to-end RAG pipelines. Our results show a clear pattern: proprietary models lead in closed-book QA, but RAG narrows the gap substantially. Under the Ideal RAG setting, KazLLM-8B improves from its closed-book baseline of 0.427 to an answer correctness of 0.867, closely matching GPT-4o’s score of 0.869. In the end-to-end RAG setup, KazLLM-8B paired with the Snowflake retriever achieves an answer correctness of up to 0.754, surpassing GPT-4o’s best score of 0.632. Despite these improvements, RAG outcomes reveal an inconsistency: high retrieval metrics do not guarantee high end-to-end QA accuracy. These findings highlight the importance of retriever choice and context-grounding strategies in enabling open-source Kazakh models to deliver competitive QA performance in a low-resource setting.
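To make the end-to-end RAG setup concrete, the minimal retrieve-then-generate sketch below pairs a dense retriever with an open-source LLM. This is not the paper's code: it assumes the "Snowflake retriever" refers to a Snowflake Arctic embedding checkpoint, and the generator model id, prompt template, and top-k value are illustrative placeholders.

```python
# Minimal RAG sketch (illustrative, not the paper's implementation).
# Assumptions: embedding checkpoint, KazLLM model id, prompt template,
# and top_k are placeholders chosen for demonstration only.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Assumed: "Snowflake retriever" = a Snowflake Arctic embedding model;
# the multilingual v2.0 checkpoint is a plausible choice for Kazakh.
retriever = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
# Placeholder id for KazLLM-8B; substitute the actual checkpoint.
generator = pipeline("text-generation", model="ISSAI/KazLLM-8B")

# Toy passage store; in the paper's setting this would be a Kazakh corpus.
passages = [
    "Astana is the capital of Kazakhstan.",
    "Kazakh belongs to the Kipchak branch of the Turkic language family.",
]
passage_emb = retriever.encode(passages, convert_to_tensor=True)

def rag_answer(question: str, top_k: int = 3) -> str:
    """Retrieve top_k passages and ground the LLM's answer in them."""
    # Arctic-embed models define a "query" prompt for asymmetric retrieval.
    q_emb = retriever.encode(question, convert_to_tensor=True, prompt_name="query")
    hits = util.semantic_search(q_emb, passage_emb, top_k=min(top_k, len(passages)))[0]
    context = "\n".join(passages[h["corpus_id"]] for h in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline echoes the prompt; strip it to keep only the answer.
    return out[0]["generated_text"][len(prompt):].strip()

print(rag_answer("What is the capital of Kazakhstan?"))
```

Note that both stages matter independently: as the abstract's results indicate, strong retrieval metrics alone do not guarantee accurate answers unless the generator actually grounds its output in the retrieved context.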