Are Your AI Housekeepers 'Breaking the House'? Latest Research Reveals Safety Flaws in Household Embodied AI
New research uncovers significant safety vulnerabilities in household embodied AI agents, highlighting risks like accidents and security breaches in AI-driven home automation systems.


This article is a joint effort by Shanghai AI Lab and Beihang University. The co-first authors are doctoral students Lu Xiaoya (Shanghai AI Lab and Shanghai Jiao Tong University), Chen Zeren (Beihang University), and Hu Xuhao (Shanghai AI Lab and Fudan University). The corresponding authors are Liu Dongrui (Shanghai AI Lab), Professor Sheng Lu (Beihang University), and Shao Jing (Shanghai AI Lab).
From Meta’s Habitat 3.0 perfectly recreating home environments, to Google’s SayCan enabling robots to understand complex household commands, and Tesla’s Optimus showcasing clothing-folding videos that went viral—today’s vision-language model (VLM)-based household assistants seem to have turned into "omnipotent housekeepers," capable of tidying kitchens, organizing clothes, and caring for pets!
But hold on—have you ever considered that letting these "smart housekeepers" operate freely might be as dangerous as letting a three-year-old play with a lighter?
To address this, Shanghai AI Lab and Beihang University jointly launched the first-ever benchmark focusing on safety evaluation of embodied intelligent agents in home environments—IS-Bench! This benchmark innovatively designs over 150 scenarios with hidden "safety hazards" (from stained dishes to dust-covered stoves), coupled with a dynamic evaluation framework that comprehensively tests the AI housekeeper’s safety awareness.

Image 1: Main structure of PRMBench, showing data organization process on the left; example evaluation scenarios and model performance on the right.
PRMBench: A comprehensive health check for process-level reward models
PRMBench is not merely a larger dataset but a meticulously constructed "comprehensive health check," designed to evaluate the core capabilities of PRMs systematically and across multiple dimensions.
Unique advantages of PRMBench
- Unprecedented scale and detailed annotations: Contains 6,216 carefully designed questions with 83,456 fine-grained step-level labels, ensuring both depth and breadth of coverage for complex reasoning scenarios.
- Innovative multi-dimensional evaluation system: Three top-level dimensions, Simplicity, Soundness, and Sensitivity, are subdivided into nine categories: Non-Redundancy, Non-Circular Logic, Empirical Soundness, Step Consistency, Domain Consistency, Confidence Invariance, Prerequisite Sensitivity, Deception Resistance, and Multi-Solution Consistency. This framework aims to capture the full range of error types PRMs may face (a minimal data-schema sketch follows this list).
- First systematic reveal of deep flaws in existing PRMs: Extensive experiments on 25 representative models, including open-source PRMs and top language models prompted as critics (like GPT-4o, o1-mini, Gemini-2-Thinking), reveal significant weaknesses in fine-grained error detection.
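To make the nine-category, step-level labeling concrete, here is a minimal sketch of how a single benchmark item might be represented in code. The schema (ReasoningStep, BenchmarkItem, error_category) is purely illustrative and is not PRMBench's actual release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReasoningStep:
    """One step of a solution, with its ground-truth label and, if flawed,
    the injected error category (one of the nine sub-dimensions)."""
    text: str
    is_correct: bool
    error_category: Optional[str] = None  # e.g. "Non-Redundancy", "Deception Resistance"

@dataclass
class BenchmarkItem:
    """A question paired with its step-by-step solution and step-level labels."""
    question: str
    steps: List[ReasoningStep] = field(default_factory=list)

item = BenchmarkItem(
    question="Show that f(x) = x^3 - 3x has a critical point in (0, 2).",
    steps=[
        ReasoningStep("Compute f'(x) = 3x^2 - 3.", is_correct=True),
        ReasoningStep("Solving f'(x) = 0 gives x = ±1; x = 1 lies in (0, 2).", is_correct=True),
        ReasoningStep("Since f'(1) = 0, f attains its maximum at x = 1.",
                      is_correct=False, error_category="Empirical Soundness"),
    ],
)
```

Each flawed step carries both a binary correctness label and the sub-dimension of the injected error, which is what enables the fine-grained, nine-way scoring described above.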
Main findings
- Overall performance far below expectations: Even the best model, Gemini-2-Thinking, achieves a PRMScore of only 68.8, well below the human level (83.8) and not far above the random-guess baseline (50.0), leaving substantial room for improvement.
- Open-source PRMs generally lag behind: Compared to prompting powerful general language models as critics, current open-source PRMs show lower performance, raising concerns about reliability and training biases.
- "Simplicity" is the biggest challenge: Even ReasonEval-34B, with relatively better performance, scores only 51.5 in the simplicity dimension, highlighting PRMs’ weakness in identifying redundant or unnecessary steps.
- Significant "positive bias" phenomenon: Some models like ReasonEval-7B and RLHFlow-DeepSeek-8B show a strong bias towards labeling steps as correct (>95%), but only 17% accuracy in identifying errors, severely affecting reliability.
- Impact of error position: Deeper analysis shows PRMs’ performance improves gradually as errors occur later in the reasoning chain, revealing challenges in early-stage error detection.
- Serious "false positives": PRMs often fail to detect superficial or non-existent errors, making them vulnerable to "reward hacking" and "exploits".
Root causes: blind spots of current PRMs
In complex proofs that require constructing counterexamples, a powerful model such as o1 may notice a problem during its own reasoning, yet many PRMs fail to flag the same fine-grained error. This raises the question: can current PRMs truly detect subtle errors in reasoning?
Image 2: Errors in reasoning process when asking o1 and PRM about the Lagrange Mean Value Theorem.
However, existing benchmarks for PRM evaluation mainly focus on macro correctness of steps, neglecting detailed error types. This creates an urgent need for a comprehensive benchmark capable of evaluating PRMs’ performance in detecting fine-grained errors. This motivation led to the creation of PRMBench, aiming to break current evaluation limitations and identify models capable of detecting subtle errors, providing precise diagnostics for future development.
Image 3: Comparison of PRMBench with existing benchmarks.
PRMBench construction: achieving comprehensive and rigorous evaluation
Image 4: PRMBench includes three main evaluation themes: "Simplicity," "Soundness," and "Sensitivity."
Data sources and construction:
- Metadata extraction: Based on the PRM800K dataset, selecting fully correct questions, answers, and reasoning steps as metadata for the benchmark.
- Fine-grained error injection: Advanced LLMs (primarily GPT-4o) are used to strategically inject various error types into the correct reasoning steps for most evaluation themes; for Multi-Solution Consistency, multiple correct solutions are generated instead to test model consistency (a conceptual sketch of the injection step follows this list).
- Manual validation: All injected errors are carefully reviewed to ensure quality and relevance, guaranteeing dataset authenticity.
- Large-scale data set: The final benchmark contains 6,216 questions with 83,456 step-level labels, providing a robust evaluation resource.
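Conceptually, the error-injection stage can be pictured as below. This is a hedged sketch: `call_llm` stands in for whatever GPT-4o client the team used, the prompt wording is invented for illustration, and the real pipeline additionally relies on the manual validation pass described above.

```python
import json

ERROR_TYPES = [
    "Non-Redundancy", "Non-Circular Logic", "Empirical Soundness",
    "Step Consistency", "Domain Consistency", "Confidence Invariance",
    "Prerequisite Sensitivity", "Deception Resistance",
]

def inject_error(question, correct_steps, error_type, call_llm):
    """Ask an LLM to rewrite a correct solution so that it contains one error
    of the requested type, and to report which steps were altered.
    `call_llm` is a placeholder for an actual GPT-4o API client."""
    prompt = (
        f"Question: {question}\n"
        "Correct solution steps:\n"
        + "\n".join(correct_steps)
        + f"\n\nRewrite the solution so that it contains an error of type "
        f"'{error_type}'. Return JSON with keys 'steps' (the modified step list) "
        "and 'flawed_step_indices' (indices of the steps you changed)."
    )
    return json.loads(call_llm(prompt))
```

Each generated case would still pass through the manual review described above before entering the benchmark.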
Evaluation dimensions and metrics:
The evaluation system covers three main areas:
- Simplicity: Ability to identify and eliminate redundant steps, including "Non-Redundancy" and "Non-Circular Logic."
- Soundness: Accuracy of reward signals and error recognition, including "Empirical Soundness," "Step Consistency," "Domain Consistency," and "Confidence Invariance."
- Sensitivity: Robustness against subtle changes or misleading info, including "Prerequisite Sensitivity," "Deception Resistance," and "Multi-Solution Consistency."
Experiments and key findings
Extensive testing on 25 models, including open-source PRMs (e.g., Skywork-PRM, Llemma-PRM, MATHMinos-Mistral, MathShepherd-Mistral, RLHFlow-PRM) and top language models prompted as critics (e.g., GPT-4o, o1-mini, Gemini-2-Thinking), reveals significant weaknesses in fine-grained error detection.
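For the "language model prompted as critic" setting, the evaluation reduces to asking the model to judge each step in context. The sketch below shows one plausible way to do this; the prompt text and the `call_llm` client are illustrative placeholders, not the protocol used in the paper.

```python
def judge_steps(question, steps, call_llm):
    """Use a general-purpose LLM as a step-level critic: for each step, ask
    whether it is correct given the preceding steps. `call_llm` is a placeholder
    for a real API client (e.g. for GPT-4o or o1-mini)."""
    verdicts = []
    for i in range(1, len(steps) + 1):
        prompt = (
            f"Question: {question}\n"
            "Solution so far:\n"
            + "\n".join(f"Step {j + 1}: {s}" for j, s in enumerate(steps[:i]))
            + f"\n\nIs Step {i} correct given the steps before it? Answer 'yes' or 'no'."
        )
        verdicts.append(call_llm(prompt).strip().lower().startswith("yes"))
    return verdicts  # one boolean verdict per reasoning step
```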
Evaluation metrics:
- Negative F1 Score: Measures error detection accuracy.
- PRMScore: A comprehensive score combining the positive and negative F1 scores, providing an overall reliability measure (a minimal sketch of both metrics follows this list).
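As a rough illustration of how these two metrics relate, the snippet below computes an F1 score over the correct steps and over the erroneous steps, then averages them. It assumes an unweighted average on a 0-100 scale; the benchmark's exact weighting may differ.

```python
def f1(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def prm_score(labels, preds):
    """labels[i]: whether step i is actually correct; preds[i]: the PRM's verdict.
    Combines F1 over correct steps (positive class) and F1 over erroneous steps
    (negative class), shown here as an unweighted average on a 0-100 scale."""
    pos_f1 = f1(
        tp=sum(l and p for l, p in zip(labels, preds)),
        fp=sum(not l and p for l, p in zip(labels, preds)),
        fn=sum(l and not p for l, p in zip(labels, preds)),
    )
    neg_f1 = f1(
        tp=sum(not l and not p for l, p in zip(labels, preds)),
        fp=sum(l and not p for l, p in zip(labels, preds)),
        fn=sum(not l and p for l, p in zip(labels, preds)),
    )
    return 100 * (pos_f1 + neg_f1) / 2
```

Under this reading, a model that labels almost every step "correct" can still score near 50 despite a high positive F1, because its negative F1 collapses; this is the failure mode of the strongly biased models discussed below.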
Key findings:
- Current PRMs perform poorly: Most models score only slightly above random, indicating large room for improvement.
- Open-source PRMs lag behind: They show lower reliability compared to models prompted as critics, raising concerns about practical deployment.
- "Simplicity" is the toughest challenge: Identifying redundant steps remains a major difficulty for PRMs.
In-depth analysis: revealing biases and influencing factors in PRMs
Strong preference for "correct" labels: Many PRMs are biased toward judging steps as correct, which makes errors in negative samples hard to detect and skews the overall evaluation (a simple diagnostic sketch follows Table 3).
Table 3: Scores and similarities of models on positive and negative samples in PRMBench.
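A quick diagnostic for this bias, given step-level verdicts, is to compare how often a model predicts "correct" with how many genuinely erroneous steps it catches. The helper below is an illustrative check written for this article, not the paper's analysis code.

```python
def bias_report(labels, preds):
    """labels[i]: whether step i is actually correct; preds[i]: the model's verdict.
    Reports how often the model answers 'correct' and how many true errors it catches."""
    n = len(labels)
    error_verdicts = [p for l, p in zip(labels, preds) if not l]  # verdicts on flawed steps
    return {
        # > 0.95 here would signal the strong positive bias described above
        "predicted_correct_rate": sum(preds) / n,
        # ~0.17 in the worst cases reported above (ReasonEval-7B, RLHFlow-DeepSeek-8B)
        "error_recall": sum(not p for p in error_verdicts) / len(error_verdicts)
                        if error_verdicts else float("nan"),
    }
```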
Impact of error position: Errors occurring later in the reasoning chain are detected more accurately, indicating challenges in early-stage error recognition.
Limited effect of few-shot ICL: Varying the number of in-context examples has minimal impact on the performance of closed-source models prompted as critics, suggesting that deeper innovations in model design or training paradigms are needed.
Table 4: Effect of different few-shot numbers on critic model performance.
PRMs are vulnerable to "false positives" and "reward hacking": PRMs often reward reasoning that merely appears correct instead of catching its flaws, making them susceptible to exploitation. Compared with other evaluation methods, PRMBench is more sensitive to these failures and exposes weaknesses in current models that other benchmarks miss.
Image 5: Comparison of PRMBench and other evaluation methods in detecting false positives.
Conclusion and future outlook
The release of PRMBench not only sets a new high standard for evaluation but also serves as a wake-up call, emphasizing the need to reassess the capabilities of current PRMs and accelerate their development in fine-grained error detection in complex reasoning scenarios.
Research significance and future directions:
- Driving paradigm shift in PRM evaluation: PRMBench offers an unprecedented comprehensive and fine-grained assessment tool, helping to identify potential flaws and blind spots in PRMs, and fostering fundamental improvements.
- Guiding future PRM development: By revealing strengths and weaknesses across models, PRMBench provides critical insights for designing, training, and optimizing more robust and generalizable PRMs.
- Building more reliable AI systems: Only with more accurate and trustworthy PRMs can LLMs perform better in complex reasoning tasks, ultimately leading to more dependable AI systems that approach human reasoning levels.
"We believe PRMBench will become a cornerstone for advancing process-level reward model evaluation and development, contributing to the creation of highly reliable next-generation AI systems!" the research team states.
Join us in exploring PRMBench and embracing the challenges ahead!