Hallucination Detection in Large Language Models via Multi-Granular Uncertainty Quantification

ÖNDEN, ABDULLAH

doi:10.59543/comdem.v3i.17665

Computer and Decision Making, vol. 3, pp. 805-819, 2026 (Scopus)

Publication Type: Article / Full Article
Volume: 3
Publication Date: 2026
DOI: 10.59543/comdem.v3i.17665
Journal Name: Computer and Decision Making
Journal Indexes: Scopus
Pages: pp. 805-819
Keywords: calibration, hallucination detection, large language models, temporal entropy dynamics, uncertainty quantification, XGBoost
Istanbul University Affiliated: Yes

Abstract

Hallucination, in which large language models (LLMs) produce plausible but factually incorrect output, is a major challenge in high-stakes applications such as medicine, law, and education. Current detection methods trade accuracy against efficiency: multi-generation methods (e.g., semantic entropy) are effective but increase latency by 5-10x, while single-pass methods are faster but attain only 63-68% AUROC. To balance this trade-off, we propose a framework that aggregates 12 uncertainty features across token-level, sequence-level, temporal, and distributional granularities from a single autoregressive generation. The framework operates in Full Mode (12 features, for open-source models with attention access) or API Mode (10 features, for any model exposing token log-probabilities). The most novel component is F9, temporal entropy dynamics, which measures how the entropy of generated segments changes across the four quarters of the generation process. On Llama-3-8B, the framework attains 89.27% AUROC on HaluEval, surpassing semantic entropy by 2.15 percentage points (pp) while reducing latency by 8.2x. Across four open-source model families and five benchmarks, Full Mode consistently improves over semantic entropy by 1.71 to 2.47 pp. On GPT-3.5-Turbo, API Mode achieves 88.63% AUROC, below the 90.81% that semantic entropy reaches on this model. These results demonstrate that a suitably chosen combination of single-pass uncertainty features can approach the discrimination offered by more computationally intensive multi-generation methods.
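
The abstract describes F9 only in words, so the following is a minimal sketch of one plausible reading of it, not the paper's definition. It assumes per-token Shannon entropies are available from a single generation, splits them into four contiguous quarters, and reports the mean entropy per quarter together with the change between quarters. The function names (`token_entropy`, `quarter_means`, `temporal_entropy_dynamics`) and the returned summary fields are hypothetical. The exact statistics the paper uses may differ.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quarter_means(entropies: Sequence[float]) -> List[float]:
    """Mean entropy in each of four contiguous quarters of the generation."""
    n = len(entropies)
    if n < 4:
        raise ValueError("need at least 4 tokens to form four quarters")
    # Integer boundaries; each quarter holds at least one token when n >= 4.
    bounds = [i * n // 4 for i in range(5)]
    return [
        sum(entropies[bounds[i]:bounds[i + 1]]) / (bounds[i + 1] - bounds[i])
        for i in range(4)
    ]


def temporal_entropy_dynamics(entropies: Sequence[float]) -> Dict[str, object]:
    """Summarize how entropy moves across the four quarters (assumed form)."""
    q = quarter_means(entropies)
    deltas = [q[i + 1] - q[i] for i in range(3)]  # quarter-to-quarter change
    return {
        "quarter_means": q,
        "deltas": deltas,
        "net_change": q[3] - q[0],  # last quarter minus first quarter
    }


if __name__ == "__main__":
    # Toy per-token entropies that rise steadily over the generation.
    toy = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
    print(temporal_entropy_dynamics(toy))
```

In a setting where only the top few token log-probabilities are exposed, `token_entropy` would have to work from a truncated distribution, which is a further approximation this sketch does not address.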