Can Agents Score Without Doing Anything? Major Issues in AI Benchmarking
Current AI agent benchmarks are flawed: agents can score well without genuinely solving the tasks, underscoring the need for more rigorous and reliable evaluation standards.

Everyone is busy studying the examinees, but the exam papers themselves are flawed.
Benchmark tests play a fundamental role in evaluating the strengths and limitations of AI systems, guiding research and industry development.
As AI agents move from prototypes to critical real-world applications, researchers and practitioners are developing benchmarks to assess capabilities and limitations.
These benchmarks differ significantly from traditional model evaluations: because agent tasks run in realistic environments and often lack a single standard answer, designing and scoring them is far more complex than for conventional AI tests.
Clearly, existing agent benchmarks are not yet reliable.
For example:
- In WebArena, a benchmark used by OpenAI and others to evaluate web interaction abilities, an agent answered “45 + 8 minutes” for a path duration task, while the correct answer was “63 minutes,” yet WebArena marked it as correct.
- τ-bench, which assesses agent reliability in simulated real-world environments, awarded a “do-nothing” agent a 38% pass rate on its airline tasks, despite the agent having no understanding of ticket policies, thus overestimating its ability.
Banghua Zhu, a research scientist at Nvidia, recently commented on this phenomenon, calling the fact that a do-nothing agent can score 38% “very interesting.”

Moreover, of 10 commonly used agent benchmarks examined (including SWE-bench, OSWorld, and KernelBench), 8 were found to have serious issues, in some cases misestimating agent capability by up to 100%.
This clearly indicates that current benchmarks are problematic. To truly understand AI abilities, more rigorous benchmark construction is necessary.
A recent collaborative work by researchers from UIUC, Stanford, Berkeley, Yale, Princeton, MIT, Transluce, MLCommons, Amazon, and UK AISI systematically analyzed common failure modes of current benchmarks and proposed a checklist to minimize score “gaming” and ensure that benchmarks measure genuine capability.

- Paper Title: Establishing Best Practices for Building Rigorous Agentic Benchmarks
- Blog Link: https://ddkang.substack.com/p/ai-agent-benchmarks-are-broken
- Paper Link: https://arxiv.org/abs/2507.02825
- Project Link: https://uiuc-kang-lab.github.io/agentic-benchmarks/
- GitHub: https://github.com/uiuc-kang-lab/agentic-benchmarks/
Problems with current AI agent benchmarks:
They require agents to complete complex end-to-end tasks like fixing code repositories or planning trips, which pose two main challenges:
- Fragile simulation environments: Tasks run in virtual or containerized environments that can have vulnerabilities or outdated systems, allowing agents to exploit shortcuts or fail due to system issues.
- Lack of clear “standard answers”: Tasks involve code, API calls, or lengthy plans, making evaluation subjective and difficult to standardize.
To address these, two criteria are proposed:
- Task Validity: Can the task be solved if, and only if, the agent possesses the target capability?
- Result Validity: Does the evaluation outcome truly reflect whether the task was completed? (A simple way to probe this is sketched below.)
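One cheap probe of result validity, shown here as an illustrative sketch rather than a procedure from the paper (the `NoOpAgent` class and the `load_tasks`/`run_task` hooks are hypothetical stand-ins for a benchmark's real harness), is to score a baseline agent that does nothing: a rigorous benchmark should give it roughly 0%, whereas τ-bench's airline split gives it about 38%.

```python
# Sketch: probe result validity by scoring a do-nothing baseline.
# `load_tasks` and `run_task` are hypothetical stand-ins for a benchmark's
# real harness; the point is the null-baseline comparison itself.

class NoOpAgent:
    """An agent that never acts and never answers."""

    def act(self, observation):
        return None  # take no action at every step


def null_baseline_pass_rate(load_tasks, run_task) -> float:
    """Fraction of tasks a do-nothing agent is credited with solving.

    Anything well above zero means the outcome checks credit results the
    agent never produced, i.e. the benchmark fails result validity.
    """
    tasks = load_tasks()
    passed = sum(bool(run_task(task, NoOpAgent())) for task in tasks)
    return passed / len(tasks)
```

The same idea generalizes to other trivial baselines (random actions, constant answers); if their scores sit close to those of strong agents, the evaluator, not the agents, is doing the work.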

How agent evaluation works: task validity and result validity together determine whether benchmark results reflect an agent’s true capabilities.
Research: AI Agent Benchmark Checklist (ABC)
This paper introduces the AI Agent Benchmark Checklist (ABC), containing 43 items derived from 17 mainstream AI benchmarks.
ABC consists of three parts: Result Validity Checks, Task Validity Checks, and a Benchmark Report Guide for cases where ideal validity is hard to achieve.
The complete, printable checklist is publicly available on the project page and GitHub repository linked above.
Applying ABC: Findings
The checklist was applied to 10 leading AI agent benchmarks, including SWE-bench Verified, WebArena, and OSWorld.

Results of applying ABC to 10 widely used agent benchmarks
Findings include:
- 7/10 contain shortcuts or tasks that cannot be completed by AI agents;
- 7/10 fail the result validity criterion, meaning the evaluation does not accurately reflect task completion;
- 8/10 lack transparency, not revealing known issues.
Examples of issues in current benchmarks:
SWE-bench and SWE-bench Verified use manually written unit tests to verify code patches generated by AI, but these tests may miss errors.
Expanding these tests changed the leaderboard rankings: 41% of agents in SWE-bench Lite and 24% in SWE-bench Verified experienced ranking shifts.

IBM SWE-1.0 generated an incorrect solution that was not detected due to incomplete unit test coverage.
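A miniature version of this failure mode, with a made-up function and tests rather than anything from SWE-bench itself: a patch that mishandles an edge case sails through an under-specified test suite, and a single added test exposes it.

```python
# Sketch of how thin test coverage lets an incorrect patch through.
# `normalize_path` and both tests are invented for illustration.

def normalize_path(path: str) -> str:
    """Agent-written 'fix': collapses duplicate slashes, but silently drops
    a trailing slash that the (hypothetical) spec says must be preserved."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts)


def test_collapses_duplicate_slashes():
    # The only test shipped with the task: the buggy patch passes it.
    assert normalize_path("/a//b///c") == "/a/b/c"


def test_preserves_trailing_slash():
    # A test added to expand coverage: it fails, revealing that the
    # previously "passing" patch was never actually correct.
    assert normalize_path("/a/b/") == "/a/b/"
```

Expanding coverage in this spirit is what reshuffled 41% of rankings on SWE-bench Lite and 24% on SWE-bench Verified.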
KernelBench evaluates the correctness of agent-generated CUDA kernels by comparing outputs on randomly generated tensors, which can miss certain classes of errors, such as illegal memory accesses.
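A toy NumPy illustration of the random-input blind spot (the kernels below are invented, not KernelBench code): a reduction that silently drops the tail of its input passes every check whose tensor sizes happen to be block-aligned, and a memory-safety bug can hide in the same way as long as the compared outputs come out equal.

```python
# Sketch: output comparison on random tensors can miss real kernel bugs.
# `block_sum_buggy` stands in for an agent-written kernel that ignores any
# tail elements when the length is not a multiple of its block size.
import numpy as np

BLOCK = 128


def block_sum_buggy(x: np.ndarray) -> float:
    n_full = (len(x) // BLOCK) * BLOCK  # drops the tail: a real bug
    return float(x[:n_full].sum())


def block_sum_reference(x: np.ndarray) -> float:
    return float(x.sum())


rng = np.random.default_rng(0)

# "Random tensor" checks that only ever use block-aligned sizes: all pass.
for n in (128, 256, 1024):
    x = rng.standard_normal(n).astype(np.float32)
    assert np.isclose(block_sum_buggy(x), block_sum_reference(x), atol=1e-3)

# A single non-aligned size exposes the bug immediately.
x = rng.standard_normal(1000).astype(np.float32)
print(np.isclose(block_sum_buggy(x), block_sum_reference(x), atol=1e-3))  # False
```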
τ-bench checks results via substring matching and database-state comparison, which lets a “no-op” agent that takes no actions pass 38% of its airline tasks, as illustrated below.

Example task in τ-bench where an agent that does nothing can pass.
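The mechanism is easy to reproduce in miniature. In the hedged sketch below (made-up task data, not τ-bench's actual harness), the correct resolution of the task is to change nothing, e.g. because policy requires refusing the request, so a final-state comparison of the database credits an agent that never acted at all.

```python
# Sketch: database-state matching that a do-nothing agent satisfies.
# The task and schema are invented; the pattern mirrors the failure mode.
import copy

initial_db = {"reservation": {"id": "R-1", "cabin": "basic_economy"}}

# Correct behaviour is to refuse the (policy-violating) change, so the
# database must end up unchanged - which is also what doing nothing yields.
expected_final_db = copy.deepcopy(initial_db)


def evaluate(final_db: dict, agent_reply: str) -> bool:
    db_ok = final_db == expected_final_db
    # Substring check on the reply; with no required phrases defined for
    # this task, even an empty reply passes vacuously.
    required_phrases = []
    text_ok = all(p in agent_reply for p in required_phrases)
    return db_ok and text_ok


# A no-op agent touches nothing and says nothing - and still "passes".
print(evaluate(copy.deepcopy(initial_db), ""))  # True
```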
WebArena uses strict string matching and an outdated LLM judge, leading to performance overestimation of 1.6% to 5.2%.
OSWorld’s evaluator relies on outdated website markup, causing roughly a 28% underestimation of performance: for example, it still looks for a CSS class that the live site has since removed, and therefore misjudges correct agent actions.

OSWorld’s evaluator still searches for outdated class names, leading to incorrect assessments.
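A small sketch of that pattern, with an invented page and class name rather than OSWorld's actual checker: the evaluator greps the final page for a CSS class the live site has since renamed, so a correct agent action is scored as a failure.

```python
# Sketch: an evaluator keyed to stale page markup marks correct work wrong.
# The class name and HTML below are invented for illustration.

EXPECTED_CLASS = "search-result-item"  # what the evaluator was written against

live_page_html = """
<div class="results">
  <div class="result-card">Correct search result produced by the agent</div>
</div>
"""  # the site later renamed its result class to "result-card"


def evaluator_says_success(html: str) -> bool:
    # Outcome check: did an element with the expected class appear?
    return EXPECTED_CLASS in html


print(evaluator_says_success(live_page_html))  # False: right behaviour,
                                               # wrong verdict -> underestimation
```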
SWE-Lancer fails to securely store test files, allowing agents to overwrite tests and pass all evaluations.
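A common mitigation, shown as a hedged sketch rather than SWE-Lancer's actual fix: fingerprint the test files before the agent runs and refuse to score any run in which they changed (mounting the tests read-only, or re-copying pristine tests in before scoring, achieves the same end).

```python
# Sketch: detect whether an agent tampered with the evaluation tests.
# The "tests/" path is illustrative; a read-only mount is stronger still.
import hashlib
from pathlib import Path


def fingerprint(test_dir: str) -> dict:
    """Map every test file under test_dir to its SHA-256 digest."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(test_dir).rglob("*.py"))
    }


before = fingerprint("tests/")  # recorded before the agent is allowed to act
# ... the agent works on the repository here ...
after = fingerprint("tests/")   # recorded again just before scoring

if before != after:
    raise RuntimeError("Test files changed during the run; the score is invalid.")
```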
Future Directions for ABC
This work establishes ABC as a practical framework to:
- Help benchmark developers identify potential issues and demonstrate rigor;
- Assist AI/model developers in understanding evaluation benchmarks beyond just performance metrics.