Development of a hybrid span-QA model with ontology integration for semantic enrichment of answers


Yergesh M., Yergesh B., Sharipbay A., ŞEKER Ş. E., Maxutova K.

IEEE Access, vol.13, pp.165927-165940, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 13
  • Publication Date: 2025
  • DOI: 10.1109/access.2025.3608820
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Pages: pp.165927-165940
  • Keywords: Extractive QA, Kazakh language, low-resource language, ontology, retrieval-augmentation, sentence-BERT
  • Affiliated with Istanbul University: Yes

Abstract

In the expanding frontier of natural language processing, the challenge of building accurate and semantically rich question answering (QA) systems for low-resource languages remains largely unresolved. This study presents a hybrid extractive QA model tailored for Kazakh, a morphologically complex and digitally underrepresented language, by integrating dense retrieval mechanisms with an ontology-based semantic prefixing mechanism. Unlike conventional approaches that rely solely on retrieval and reading comprehension, our architecture injects dynamically constructed semantic definitions for domain-specific terms into the answer context, enabling deeper understanding and improved accuracy. Leveraging a novel dataset of Kazakh QA pairs generated through GPT-4 with expert validation, we introduce a dual-stream hybrid model (Hybrid B) that combines ontology-driven enrichment with fine-tuned transformer models for the retriever and reader components. The proposed system achieves a significant leap in performance, with 88% F1 and 76% exact match scores, substantially outperforming established baselines, including KazQAD. By anchoring the QA pipeline in a custom-built educational ontology relevant to the target domain, this research demonstrates how semantic structure can compensate for linguistic and data scarcity, paving the way for scalable QA systems in other low-resource languages such as Turkish. Beyond its empirical results, the study contributes a reproducible and modular framework for retrieval-augmented, ontology-aware QA, with implications for educational platforms, intelligent tutoring systems, and domain-specific information access. Through this work, we argue that the future of multilingual QA lies in hybrid architectures that combine symbolic structure with neural adaptability.
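
As an illustration only, the idea of ontology-based semantic prefixing described in the abstract might be sketched as follows. This is not the authors' implementation: the ontology entries, passages, and all function names are hypothetical, the examples are in English rather than Kazakh for readability, and a simple bag-of-words cosine similarity stands in for the dense (Sentence-BERT style) retriever. The fine-tuned reader model is omitted; the sketch stops at the enriched context it would receive.

```python
import math
from collections import Counter

# Hypothetical toy ontology (term -> definition); not taken from the paper.
ONTOLOGY = {
    "algorithm": "a finite sequence of steps for solving a problem",
    "variable": "a named storage location holding a value",
}

# Hypothetical passage collection standing in for the retrieval corpus.
PASSAGES = [
    "An algorithm must terminate after a finite number of steps.",
    "A variable can be reassigned during program execution.",
]


def tokenize(text):
    """Lowercase whitespace tokenization with basic punctuation stripping."""
    return [w.strip(".,?!").lower() for w in text.split()]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages):
    """Return the passage most similar to the question.

    Assumption: bag-of-words cosine is a stand-in for a dense retriever.
    """
    q = Counter(tokenize(question))
    return max(passages, key=lambda p: cosine(q, Counter(tokenize(p))))


def semantic_prefix(question, ontology):
    """Build a definition prefix for ontology terms found in the question."""
    terms = [t for t in tokenize(question) if t in ontology]
    return " ".join(f"{t}: {ontology[t]}." for t in terms)


def build_reader_input(question, passages, ontology):
    """Prepend ontology definitions to the retrieved passage."""
    passage = retrieve(question, passages)
    prefix = semantic_prefix(question, ontology)
    return f"{prefix} {passage}".strip()


if __name__ == "__main__":
    print(build_reader_input("What is an algorithm?", PASSAGES, ONTOLOGY))
```

In this sketch the definitions are placed in front of the retrieved passage, so an extractive reader could select its answer span from either the passage or the injected definition. Whether the published system orders or formats the enriched context this way is an assumption.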