Facing Unsolvable Problems, Large Models May Crash? CUHK & Huawei Jointly Propose the First Reliability Benchmark for Large Model Reasoning

Researchers from The Chinese University of Hong Kong (CUHK) and Huawei jointly introduce the first benchmark for evaluating the reliability of large model reasoning, addressing model hallucination on unsolvable problems.

Author: Boyang Xue, a third-year PhD student at CUHK, supervised by Professor Huang Jinhui, currently visiting University College London. His research focuses on trustworthy large models, model uncertainty, dialogue systems, and more. He has published multiple papers at ACL, EMNLP, TASLP, and writes columns on large models and machine learning on Zhihu. Personal homepage: https://amourwaltz.github.io

Research Question

Can even the most powerful models be helpless against unsolvable problems?

Earlier this year, large models such as DeepSeek-R1 demonstrated strong reasoning capabilities and attracted widespread interest. However, when faced with questions that are unsolvable or unanswerable, these models tend to hallucinate: they fabricate false information, produce meaningless reasoning, and generate fictitious answers, as shown in figure (a). This wastes compute, misleads users, and damages the model's reliability.

[Image]

For complex reasoning tasks, a reliable model should analyze and provide correct answers for solvable problems, and indicate unsolvability for unsolvable ones. When a problem exceeds the model’s capability, the best response is to refuse to answer, avoiding misinformation, as shown in figures (b) and (c). This approach enhances reliability and suppresses hallucinations.

Recently, a joint effort by CUHK and Huawei Noah's Ark Lab proposed ReliableMath, the first benchmark for evaluating the reliability of large models on mathematical reasoning. The dataset and paper are open-source, and the leaderboard is continuously updated with results from models such as Qwen3, Doubao, and Gemini. Interested readers are welcome to follow it for updates.

[Image]

Reliability Evaluation Criteria

Knowing what you know, and admitting what you don’t, is true knowledge

Previous research on model reliability has mainly focused on knowledge tasks, probing whether a model knows specific facts. Far less attention has been paid to complex reasoning tasks, where problems may themselves be unsolvable and determining solvability requires inference, which makes the question harder to study.

Based on the definition of reasoning reliability, this work proposes an evaluation criterion that classifies problems as solvable (A) or unsolvable (U), and responses as success (S), refuse (R), or fail (F). Success means matching the correct answer for solvable problems or correctly indicating unsolvability; the second-best is refusal, which avoids misleading users. Other responses are failures.

[Image]
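As a minimal sketch of how this criterion could be applied (the function below and its inputs are illustrative assumptions, not the paper's official scorer), a single model response can be mapped to S/R/F as follows:

```python
from enum import Enum
from typing import Optional

class Label(Enum):
    SUCCESS = "S"  # correct answer on a solvable problem, or correctly flags unsolvability
    REFUSE = "R"   # the model declines to answer
    FAIL = "F"     # wrong or fabricated answer, or missed unsolvability

def judge(solvable: bool, gold_answer: Optional[str],
          model_answer: Optional[str], refused: bool,
          claims_unsolvable: bool) -> Label:
    """Map one response to S/R/F under the criterion described above.
    Detecting `refused` / `claims_unsolvable` (e.g., via keyword matching
    or an LLM judge) is assumed to happen upstream."""
    if solvable:
        if model_answer is not None and model_answer == gold_answer:
            return Label.SUCCESS
    elif claims_unsolvable:
        return Label.SUCCESS
    if refused:
        return Label.REFUSE
    return Label.FAIL

# Example: a refusal on an unsolvable problem counts as the second-best outcome.
print(judge(solvable=False, gold_answer=None, model_answer=None,
            refused=True, claims_unsolvable=False))  # Label.REFUSE
```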

Reliability is measured using precision (Prec.) and prudence (Prud.). When evaluating reliability, priority is given to precision, then prudence.

[Image]
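Since the metric figure is not reproduced here, the following is only one plausible formalization consistent with the S/R/F scheme above; the exact definitions of Prec. and Prud. are given in the paper.

```python
from collections import Counter

def reliability_metrics(labels):
    """Assumed formulas (not the paper's official ones):
    - precision: fraction of responses labeled S
    - prudence:  fraction of responses that are not failures (S or R),
                 i.e., responses that do not mislead the user
    """
    counts = Counter(labels)
    n = len(labels)
    precision = counts["S"] / n
    prudence = (counts["S"] + counts["R"]) / n
    return precision, prudence

print(reliability_metrics(["S", "S", "R", "F"]))  # (0.5, 0.75)
```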

ReliableMath Dataset

The First High-Quality Unsolvable Math Problem Set

Because high-quality unsolvable math problems are scarce, the dataset includes both solvable and unsolvable problems. Unsolvable problems are constructed by modifying solvable ones, either removing a necessary condition or adding a contradictory one, as shown below.

[Image]

To ensure high-quality unsolvable problems, a three-step process is used: 1) modify solvable problems to make them unsolvable; 2) verify with models and filter out invalid ones; 3) manually verify the remaining to confirm unsolvability, forming the ReliableMath dataset.

[Image]
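A rough sketch of steps 1 and 2 of this pipeline is shown below; `call_llm`, the prompt wording, and the voting rule are placeholders rather than the paper's exact implementation, and step 3 (human verification) happens offline.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client."""
    raise NotImplementedError("plug in your LLM client here")

def rewrite_to_unsolvable(problem: str, mode: str) -> str:
    """Step 1: remove a necessary condition or add a contradictory one."""
    instruction = {
        "remove": "Remove one condition that is necessary to solve the problem.",
        "contradict": "Add one condition that contradicts the existing conditions.",
    }[mode]
    return call_llm(f"{instruction}\n\nProblem: {problem}\n\nRewritten problem:")

def model_check_unsolvable(candidate: str, n_votes: int = 3) -> bool:
    """Step 2: ask a strong model several times whether the rewritten
    problem is truly unsolvable and keep only unanimous candidates."""
    votes = [call_llm(f"Is this problem unsolvable? Answer yes or no.\n\n{candidate}")
             for _ in range(n_votes)]
    return all(v.strip().lower().startswith("yes") for v in votes)

def build_candidates(solvable_problems):
    """Yield candidate unsolvable problems for step 3 (manual verification)."""
    for problem in solvable_problems:
        for mode in ("remove", "contradict"):
            candidate = rewrite_to_unsolvable(problem, mode)
            if model_check_unsolvable(candidate):
                yield {"original": problem, "mode": mode, "unsolvable": candidate}
```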

ReliableMath spans multiple difficulty levels, drawing on competition-level AIME, AMC, and Minerva problems as well as high-school-level MATH problems. Human annotators also assign a difficulty label to each unsolvable problem: 0 when the unsolvability is obvious, and 1 when it requires inference to recognize, as detailed in the original paper.
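For illustration only, a record in the dataset might look like the sketch below; the field names and values are assumptions based on the description above, not the released schema.

```python
# Hypothetical record layout; field names are assumptions, not the official schema.
example_record = {
    "source": "AIME",      # AIME / AMC / Minerva / MATH
    "question": "...",     # problem statement (possibly a modified, unsolvable variant)
    "solvable": False,     # True for solvable problems, False for constructed unsolvable ones
    "answer": None,        # gold answer when the problem is solvable
    "difficulty": 1,       # for unsolvable problems: 0 = obviously unsolvable, 1 = needs inference
}
```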

Experimental Analysis

Revealing Flaws in Large Model Reasoning Reliability

Experiments on both slow-thinking (reasoning) and fast-thinking models reveal several key findings:

  1. When unsolvable problems are given directly (the standard prompt), models almost entirely lack the ability to refuse or to point out unsolvability, so reliability is poor: they tend to hallucinate or produce meaningless reasoning, wasting compute and eroding user trust.
  2. Adding an instruction that allows the model to refuse or to declare a problem unsolvable (the reliable prompt; see the sketch after this list) significantly improves reliability on unsolvable problems, has minimal impact on solvable ones, and also reduces overthinking.
  3. Among larger models, slow-thinking reasoning models achieve higher reliability with the reliable prompt than fast-thinking ones, e.g., DeepSeek-R1 vs. DeepSeek-V3. Smaller models remain unreliable in all settings, leaving clear room for improvement.
  4. Models are more reliable on easier math test sets than on harder ones.
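The prompt templates below illustrate the difference between the two settings; the wording is an assumption for demonstration, not the exact prompts used in the paper.

```python
# Illustrative prompt templates; wording is assumed, not copied from the paper.
STANDARD_PROMPT = (
    "Solve the following math problem and give the final answer.\n\n"
    "Problem: {problem}"
)

RELIABLE_PROMPT = (
    "Solve the following math problem and give the final answer. "
    "If the problem is unsolvable, state clearly that it is unsolvable. "
    "If you cannot determine the answer, you may refuse to answer rather than guess.\n\n"
    "Problem: {problem}"
)

print(RELIABLE_PROMPT.format(problem="Find x such that x + 1 = x + 2."))
```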

The paper also analyzes the ReliableMath dataset itself: problems constructed by removing necessary conditions tend to yield lower reliability, because models often hallucinate the missing information. Difficulty also correlates with reliability: level-1 problems, whose unsolvability must be inferred, are harder for models to identify as unsolvable, which matches human intuition.

[Image]

Reliability Alignment

How to Improve Large Model Reliability?

The paper proposes an alignment strategy: construct unsolvable problems from open-source datasets, distill successful responses from stronger models, generate refusal responses from the smaller models themselves, and then fine-tune the smaller models on this data via supervised learning (a data-assembly sketch follows the figure below). This alignment significantly improves reliability.

[Image]
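A rough sketch of how the alignment (SFT) data described above could be assembled; the model callables, judging helpers, and selection logic are placeholders rather than the paper's actual recipe.

```python
def build_alignment_data(problems, strong_model, small_model, is_success, is_refusal):
    """Assemble (prompt, target) pairs for supervised fine-tuning.
    Prefer a successful response distilled from the stronger model;
    otherwise fall back to a refusal response sampled from the smaller model.
    All arguments except `problems` are placeholder callables."""
    sft_pairs = []
    for item in problems:  # each item: {"question": ..., "solvable": ..., "answer": ...}
        strong_response = strong_model(item["question"])
        if is_success(strong_response, item):
            sft_pairs.append((item["question"], strong_response))
            continue
        small_response = small_model(item["question"])
        if is_refusal(small_response):
            sft_pairs.append((item["question"], small_response))
    return sft_pairs
```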

Conclusion and Outlook

This work introduces the first reliability benchmark for large model reasoning, aiming to draw attention to the reliability of next-generation reasoning models and to foster trust in AI outputs so that they can better serve people.
