Open-Source Agent Benchmark: Tongyi WebSailor Dominates Multiple Leaderboards and Challenges OpenAI’s High-Difficulty BrowseComp Benchmark

Tongyi WebSailor achieves top results across multiple benchmarks, including OpenAI’s high-difficulty BrowseComp, where it far surpasses all prior open-source agents, setting a new standard for open-source web agents.



Background: The Challenges and Breakthroughs of Open-Source Web Agents in Complex Tasks

In the era of information explosion, traditional search engines struggle to meet users’ needs for deep, multi-step information retrieval. From medical research to technological innovation, solving complex problems requires extensive data mining and reasoning. Human capacity is limited: performing such detailed search and reasoning manually, within bounded time and energy, is impractical. Researchers therefore aim to develop autonomous agents capable of independent thinking and decision-making to tackle these challenges.

Current open-source web agents perform poorly on extremely complex tasks. Proprietary systems like OpenAI’s DeepResearch have demonstrated “superhuman” performance on benchmarks such as BrowseComp. In contrast, open-source models often fail outright: accuracy on complex benchmarks like BrowseComp-en is nearly zero. This indicates that existing training paradigms do not equip open-source models with the reasoning patterns needed for high-uncertainty tasks. In short, open-source agents are held back by their inability to effectively reduce extreme uncertainty.

How difficult is BrowseComp? Consider this example:

A TV series aired from 2018 to 2022. In the seventh episode of its first season, the opening song belongs to a music genre that emerged in Africa in 2012. A 2022 article states that one creator, A, dropped out in 11th grade, while another creator, B, played football in high school and also worked as a DJ. Who is creator A?

The difficulty lies not in retrieving a single fact but in multi-step reasoning and filtering over dispersed, indirect clues to pin down one specific answer. It tests reasoning, planning, and information integration, making it a gold standard for evaluating agent cognition and autonomy.

To address this, the RAG team at Alibaba’s Tongyi Lab introduced WebSailor, a comprehensive post-training solution that bridges this gap and enables open-source models to excel at ultra-complex information-retrieval tasks. Through innovative data construction and training methods, WebSailor endows open-source web agents with superhuman reasoning abilities, making significant progress on longstanding challenges like BrowseComp and greatly narrowing the gap with top proprietary systems.


Technical Innovation: From High-Uncertainty Tasks to Efficient Training Paradigms

Data Construction and Reasoning Trajectory Acquisition

WebSailor’s success stems from systematic innovation on two fronts: creating sufficiently challenging training tasks (“digging the well to get the water”) and designing efficient training strategies (“teaching the model to fish”). Concretely, this means building SailorFog-QA, a high-uncertainty, high-complexity dataset; reconstructing reasoning trajectories to improve supervision signals; and combining a cold-start RFT stage with efficient reinforcement learning (DUPO) into a powerful post-training pipeline.

Open-source models struggle to master tasks like BrowseComp because their training tasks carry either too little uncertainty or uncertainty that is too easily reduced. WebSailor classifies information-retrieval tasks into three levels:

  • Level-1: Low uncertainty and easily resolved, such as questions answerable with internal knowledge or a single web search.
  • Level-2: High initial uncertainty but with clear solution paths, like multi-hop QA, where entities are logically connected, allowing structured reasoning to reduce uncertainty.
  • Level-3: High uncertainty and high difficulty to resolve, involving complex, emergent couplings without predefined reasoning paths. These require creative exploration and novel reasoning paradigms.

Most open-source datasets involve only low-uncertainty or structured multi-hop questions (Level 1 or 2). Models have never faced Level-3 challenges—complex, uncertain problems without straightforward solutions. To address this, WebSailor constructs SailorFog-QA, significantly enhancing models’ ability to handle high-uncertainty tasks.

SailorFog-QA is built by simulating random walks over the real web to construct knowledge graphs (a minimal code sketch follows the list):

  • Starting point: Select entities with sparse or ambiguous information from Wikidata, guaranteeing difficulty from the outset.
  • Random expansion: Crawl the web for related information, extract entities and relations, and expand the graph at random.
  • Structural features: The result is a highly nonlinear knowledge network that, unlike traditional linear multi-hop chains, demands exploration and flexible reasoning.
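
As a rough illustration of this random-walk construction, here is a minimal Python sketch. The `fetch_related` stub stands in for real web crawling and relation extraction (which the article does not detail), and `networkx` is used only for graph bookkeeping; both are assumptions of the sketch, not the team’s actual pipeline.

```python
import random
import networkx as nx

def fetch_related(entity: str) -> list[tuple[str, str, str]]:
    """Stand-in for web crawling plus relation extraction. The real pipeline
    would fetch pages about `entity` and extract (head, relation, tail)
    triples; here we fabricate dummy triples so the sketch runs end to end."""
    return [(entity, f"rel_{i}", f"{entity}/nbr{i}") for i in range(3)]

def random_walk_expand(seed: str, steps: int = 10) -> nx.MultiDiGraph:
    """Grow a nonlinear knowledge graph by repeatedly expanding a randomly
    chosen frontier node, rather than walking a single linear chain."""
    graph = nx.MultiDiGraph()
    graph.add_node(seed)
    frontier = [seed]
    for _ in range(steps):
        node = random.choice(frontier)        # random choice, not BFS: this
        for h, r, t in fetch_related(node):   # is what tangles the topology
            graph.add_edge(h, t, relation=r)
            frontier.append(t)
    return graph

g = random_walk_expand("obscure_wikidata_entity", steps=8)
print(g.number_of_nodes(), g.number_of_edges())
```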

Questions are generated by sampling subgraphs and posing questions over their entities and relations, often spanning multiple intersecting entities. To increase difficulty, key information is deliberately obfuscated (e.g., “late 20th century” instead of “1997,” or vague location descriptions). This raises initial uncertainty, demanding deep reasoning and information integration.
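
A minimal sketch of the obfuscation idea, under stated assumptions: the article only gives examples such as replacing “1997” with “late 20th century,” so the concrete rewriting rules and triple format below are invented for illustration.

```python
import random

def obfuscate_year(year: int) -> str:
    """Blur an exact year into a vaguer phrase to raise initial uncertainty,
    e.g. 1997 -> 'the late 20th century'. (Ordinal edge cases such as the
    11th century are ignored in this sketch.)"""
    century = year // 100 + 1
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(century % 10, "th")
    part = ["early", "mid", "late"][min((year % 100) // 34, 2)]
    return f"the {part} {century}{suffix} century"

def sample_subgraph(triples: list[tuple[str, str, str]], k: int = 3):
    """Pick k interlinked facts to serve as the skeleton of one question."""
    return random.sample(triples, k=min(k, len(triples)))

triples = [
    ("SeriesX", "aired_from", "2018"),
    ("SeriesX", "opening_genre_s1e7", "GenreY"),
    ("GenreY", "emerged_in_africa", "2012"),
    ("CreatorA", "co_created", "SeriesX"),
]
print(sample_subgraph(triples))
print(obfuscate_year(1997))  # -> 'the late 20th century'
```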

Advantages of SailorFog-QA include:

  • Based on real internet data, ensuring training environments match real-world scenarios.
  • Supports diverse reasoning patterns due to complex graph topologies.
  • Highly scalable, as sampling costs grow linearly with graph size, enabling large-scale data generation.

For high-uncertainty Level-3 QA, high-quality trajectories are needed for the RFT cold start. Open-source models perform poorly on these complex problems, but with rejection sampling, sufficient cold-start data can still be collected. Powerful open-source reasoning models such as QwQ and DeepSeek-R1 can generate solution trajectories, but imitating them directly has drawbacks: their reasoning is stylistically idiosyncratic and verbose, which limits the trained model’s own exploration. WebSailor therefore proposes an innovative reasoning-reconstruction method: retain only the “action-observation” sequences (the objective facts of what was done and what was seen) and use another LLM to generate concise, goal-oriented reasoning for each step. This removes style bias and redundancy, yielding clean, efficient supervision signals.
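
A minimal sketch of this reconstruction step, assuming a generic chat-completion helper; `summarize_llm` here is a placeholder, not a real API:

```python
def summarize_llm(prompt: str) -> str:
    """Placeholder for any LLM call that writes one concise reasoning step;
    swap in a real chat-completion client here."""
    return "Short, goal-directed rationale for the next action."

def reconstruct(trajectory: list[dict], question: str) -> list[dict]:
    """Drop the expert's verbose, style-heavy thoughts and regenerate them,
    keeping the objective action/observation pairs verbatim."""
    rebuilt = []
    for step in trajectory:
        prompt = (
            f"Question: {question}\n"
            f"Action taken: {step['action']}\n"
            f"Observation: {step['observation']}\n"
            "Write one short thought explaining why this action helps."
        )
        rebuilt.append({
            "thought": summarize_llm(prompt),     # fresh, concise reasoning
            "action": step["action"],             # what was done (kept)
            "observation": step["observation"],   # what was seen (kept)
        })
    return rebuilt
```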

Two-Stage Training: Cold Start and Reinforcement Learning

WebSailor’s training involves two stages:

First: RFT cold start. Complex tasks requiring dozens of steps are hard for a non-reasoning model to learn from scratch, due to sparse rewards and weak instruction adherence. Thousands of the high-quality trajectories generated above are used for rejection-sampling fine-tuning (RFT), giving the model a solid start and teaching basic tool use and reasoning patterns under the ReAct framework.
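
A hedged sketch of the rejection-sampling loop, with a dummy `rollout` standing in for actually running an expert agent (the article does not publish this code):

```python
import random

def rollout(question: str) -> tuple[list[dict], str]:
    """Stand-in for one expert-agent run under ReAct; returns the trajectory
    and the final answer. The outcome is random so the sketch runs."""
    trajectory = [{"action": "search(...)", "observation": "..."}]
    return trajectory, random.choice(["right", "wrong"])

def collect_rft_data(tasks: list[dict], tries: int = 4) -> list[dict]:
    """Rejection sampling: keep only rollouts whose final answer is correct."""
    kept = []
    for task in tasks:
        for _ in range(tries):
            traj, answer = rollout(task["question"])
            if answer == task["gold"]:          # reject wrong trajectories
                kept.append({"question": task["question"], "trajectory": traj})
                break                           # one good trajectory suffices
    return kept

print(len(collect_rft_data([{"question": "Who is creator A?", "gold": "right"}])))
```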

Second: DUPO reinforcement learning. Once basic capabilities are established, DUPO (Duplicating Sampling Policy Optimization) further improves generalization and sampling efficiency. Because RL involves extensive environment interaction, DUPO accelerates training with a dual dynamic sampling strategy (sketched after the list):

  • Pre-filtering: Remove tasks the model can already solve perfectly, focusing resources on challenging areas.
  • In-training duplication: Randomly duplicate rollout groups whose rewards vary within a batch, maintaining parallelism and boosting efficiency.
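
The following sketch shows both moves under stated assumptions: per-task rollout groups are represented as plain reward lists, and groups with zero reward variance carry no learning signal, so informative groups are duplicated to refill the batch. The real DUPO operates inside an RL trainer, which is omitted here.

```python
import random
import statistics

def prefilter(tasks: list[str], solve_rate: dict[str, float]) -> list[str]:
    """Before training: drop tasks the current policy already always solves."""
    return [t for t in tasks if solve_rate[t] < 1.0]

def fill_batch(groups: list[list[float]], batch_size: int) -> list[list[float]]:
    """During training: keep rollout groups whose rewards vary, then duplicate
    random informative groups to restore full batch width (no re-rollouts)."""
    informative = [g for g in groups if statistics.pvariance(g) > 0]
    if not informative:
        return []
    batch = list(informative)
    while len(batch) < batch_size:
        batch.append(random.choice(informative))  # duplicate, don't resample
    return batch[:batch_size]

print(prefilter(["t1", "t2"], {"t1": 1.0, "t2": 0.5}))
print(fill_batch([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.1, 1.0]], batch_size=6))
```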

Compared to DAPO, DUPO speeds up training by a factor of two to three. A strict reward mechanism evaluates both format and accuracy, rewarding models that follow the ReAct paradigm and produce correct answers; this discourages reward hacking and encourages complete, valid reasoning chains.
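
As a rough sketch, that reward might be structured as below; the specific checks and weights are assumptions for illustration, since the article only states that both format and accuracy are scored.

```python
def reward(trajectory: str, answer: str, gold: str) -> float:
    """Strict format-plus-accuracy reward: malformed rollouts earn nothing,
    which removes the incentive to hack the reward with degenerate outputs."""
    follows_react = all(
        tag in trajectory for tag in ("Thought:", "Action:", "Observation:")
    )
    if not follows_react:
        return 0.0                              # format gate comes first
    correct = answer.strip().lower() == gold.strip().lower()
    return 1.0 if correct else 0.1              # assumed weights, illustrative

print(reward("Thought: ...\nAction: ...\nObservation: ...", "Paris", "paris"))
```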

Experimental Results: Surpassing DeepSeek-R1, Grok-3, and GPT-4.1 on Both Complex and Simple Tasks

WebSailor outperforms all open-source models and agents on four high-difficulty benchmarks: BrowseComp-en, BrowseComp-zh, DeepSearch, and GAIA. Its advantage is especially evident on the challenging BrowseComp-en and BrowseComp-zh tests, confirming that training on complex, uncertain data enables the agent to develop robust, general reasoning strategies. WebSailor-3B and WebSailor-7B demonstrate that the training paradigm matters more than model size: despite its small scale, WebSailor-7B achieves 6.7% accuracy on BrowseComp-en, surpassing many far larger models and highlighting the importance of data synthesis and targeted reinforcement learning.

Compatibility with simple tasks:

Although trained mainly on high-complexity tasks, WebSailor also performs well on simpler benchmarks such as SimpleQA, outperforming comparable methods and showing that its capabilities transfer efficiently to straightforward scenarios.


SailorFog-QA Complexity Validation

The authors compare the tool-call distributions of SailorFog-QA, earlier open-source agent training data, and BrowseComp. SailorFog-QA exhibits a long-tail distribution: many samples require more than five tool calls, and some more than twenty, closely matching the complexity profile of the BrowseComp-en benchmark. This targeted data construction ensures the model trains on complex, representative reasoning tasks, laying the foundation for strong multi-step reasoning.


Conclusion and Future Outlook

WebSailor aims to bridge the gap between open-source and top proprietary systems in complex information retrieval. The core issue is the lack of training data with “high and hard-to-reduce” uncertainty. The authors propose an innovative methodology: generating complex, topologically rich problems via SailorFog-QA, reconstructing reasoning chains to eliminate style bias, and combining cold-start RFT with efficient reinforcement learning (DUPO). This two-stage training process is both effective and stable.

WebSailor’s success demonstrates that advancing open-source agents depends not only on larger models but also on innovative training paradigms. The “complex task synthesis → logical supervision → efficient RL” blueprint offers valuable insights for developing advanced agents in other domains. It encourages the community to shift from mimicking human-solvable problems to actively constructing extreme challenges that inspire emergent strategies, pushing AI capabilities forward.

Despite progress, limitations remain: current training is constrained by a 32k context length, limiting long-chain task handling. Future work includes migrating to asynchronous RL frameworks for higher efficiency and exploring multi-modal, tool-using, and strategic reasoning tasks across knowledge domains, aiming to create agents capable of reasoning, discovery, and creativity—powerful partners for human intelligence.
