By Insights Team in AI — 27 Jun 2025

AgentAuditor: Achieving Human-Level Accuracy in AI Agent Safety Evaluation}

AgentAuditor, developed by researchers from multiple universities, enables AI evaluators to reach human-level precision in assessing the safety and security of intelligent agents.

LLM agents are evolving from simple text generators to autonomous decision-makers capable of complex actions, tool use, and real-time environment interaction, pushing towards artificial general intelligence (AGI). However, this autonomy raises safety concerns.

Researchers have proposed various benchmarks to evaluate agent safety, but these face a common challenge: lack of effective, precise evaluators. Traditional safety assessments based on content generation excel at simple tasks but struggle with complex environment interactions and decision-making processes. Existing methods, rule-based or large-model-based, often fail to accurately detect subtle risks, overlook small issues, or are confused by fuzzy rules, leading to unreliable safety evaluations.

To address this, researchers from NYU, NTU, UIUC, KTH, Sydney University, and NUS introduced AgentAuditor—a universal, training-free framework with memory-augmented reasoning that enables LLM evaluators to match human expert accuracy in identifying safety risks.

Paper Title: AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Project & Code: https://github.com/Astarojth/AgentAuditor-ASSEBench

How AgentAuditor Solves the Accuracy Challenge

AgentAuditor combines structured memory with Retrieval-Augmented Generation (RAG), endowing the evaluator with human-like learning and understanding of complex interaction records. It operates in three key stages:

1. Feature Memory Construction: Converts raw, messy agent interaction logs into a structured, vectorized experience database containing semantic information like scenarios, risk types, and behavior patterns.

2. Reasoning Memory Construction: Selects representative cases from feature memory and generates high-quality reasoning chains (CoT) using the same internal LLM, akin to expert judgment, guiding subsequent evaluations.

3. Memory-Augmented Reasoning: For new interaction cases, dynamically retrieves relevant reasoning cases from memory to assist the LLM evaluator in making precise, robust judgments.

Dataset: Building ASSEBench

To comprehensively validate AgentAuditor and fill the gap in safety and security benchmarks, the team created ASSEBench, which includes:

Large scale: 4 subsets with 2293 annotated real interaction logs.
Wide coverage: 15 risk types, 528 environments, 29 scenarios, 26 behavior patterns.
Fine annotations: Human-machine collaborative labeling with strict and lenient standards for ambiguous risks.
Dual focus: Evaluates both safety (avoiding errors) and security (resisting attacks).

Experimental Results: Human-Level Evaluation Accuracy

Extensive experiments on ASSEBench and R-Judge benchmarks show:

Significant improvements: AgentAuditor boosts performance across datasets, e.g., Gemini-2-Flash-Thinking’s F1 score on ASSEBench-Safety increased by 48.2%.
Near human performance: Gemini-2-Flash-Thinking achieves state-of-the-art results, with evaluation accuracy (e.g., 96.1% on R-Judge) approaching or surpassing human annotators.
Strong adaptability: Effectively adjusts reasoning strategies across different standards (strict/lenient), reducing performance gaps among models.

Comparison of AgentAuditor with Existing Methods and Human Evaluation

The images above compare AgentAuditor’s accuracy with existing methods and human evaluators on R-Judge benchmarks, showing superior performance and reliability.

Core Contributions of AgentAuditor

Systematic Analysis: Deeply analyzes core challenges in automated safety evaluation of agents.
Innovative Framework: Uses adaptive representative sampling, structured memory, RAG, and automatic CoT generation to significantly enhance evaluation capabilities.
First Dedicated Benchmark: ASSEBench fills a critical gap, providing a new paradigm for human-AI collaborative annotation.
Human-Level Performance: Demonstrated through experiments that its evaluation accuracy and reliability match professional human standards.

Conclusion

The introduction of AgentAuditor and ASSEBench provides powerful tools and a solid foundation for building more trustworthy and safe LLM agents. This work advances evaluation methods and guides future development of secure, reliable AI systems.

Subscribe to QQ Insights