Abstract

L. A. Erlygin, A. A. Zaytsev

Uncertainty Estimation for the Open-Set Text Classification systems

Accurate uncertainty estimation is essential for building robust and trustworthy recognition systems. In this paper, we consider the open-set text classification (OSTC) task and uncertainty estimation for it. In OSTC, a text sample must either be classified as one of the known classes or rejected as unknown. To account for the different types of uncertainty encountered in OSTC, we adapt the Holistic Uncertainty Estimation (HolUE) method to the text domain. Our approach addresses two major causes of prediction errors in text recognition systems: text uncertainty, which stems from ill-formulated queries, and gallery uncertainty, which is related to the ambiguity of the data distribution. Capturing both sources makes it possible to predict when the system will make a recognition error. We propose a new OSTC benchmark and conduct extensive experiments on a wide range of data, using authorship attribution, intent classification, and topic classification datasets. HolUE achieves a 40–360% relative improvement in Prediction Rejection Ratio (PRR) over the quality-based SCF baseline across datasets: 360% on Yahoo Answers (0.78 vs. 0.17 at FPIR 0.1), 353% on DBPedia (0.86 vs. 0.19), 240% on PAN authorship attribution (0.51 vs. 0.15 at FPIR 0.5), and 40% on CLINC150 intent classification (0.73 vs. 0.52). It also consistently outperforms modern post-hoc logit-based uncertainty baselines (temperature-calibrated MSP and entropy, and the top-1/top-2 margin) at less than 5% inference-time overhead. Our code and protocols are publicly available at https://github.com/Leonid-Erlygin/text_uncertainty.git
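
As a quick sanity check on the relative improvements quoted above, the following minimal sketch recomputes them from the rounded PRR pairs given in the abstract. The helper name `relative_improvement` and the dictionary layout are illustrative assumptions, not part of the released code. Because the inputs are rounded to two decimals, the recomputed percentages may differ from the reported ones by about one point.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# (HolUE PRR, SCF PRR) pairs as quoted in the abstract, rounded to two decimals.
prr = {
    "Yahoo Answers": (0.78, 0.17),
    "DBPedia": (0.86, 0.19),
    "PAN": (0.51, 0.15),
    "CLINC150": (0.73, 0.52),
}

# Hypothetical container for the recomputed percentages.
improvements = {
    name: relative_improvement(holue, scf) for name, (holue, scf) in prr.items()
}

for name, value in improvements.items():
    print(f"{name}: {value:.1f}% relative PRR improvement")
```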

KEYWORDS: machine learning, uncertainty estimation, natural language processing, multimodal data, probabilistic representations.