Agent RL and Self-Evolution of Intelligent Agents: TaskCraft for Automated Complex Task Generation

TaskCraft enables automatic generation of complex agent tasks, advancing reinforcement learning and self-evolving AI agents with scalable, multi-tool, verifiable task instances.

In recent years, agent-based reinforcement learning (Agent + RL) and agent optimization have attracted widespread attention in academia. However, a major bottleneck in training end-to-end agents with tool-invocation capabilities is the scarcity of high-quality task data. Mainstream datasets such as GAIA and BrowseComp rely heavily on manual annotation, which limits their scale and task complexity: BrowseComp covers only about 1,300 search tasks, and GAIA provides around 500 multi-tool collaboration samples. Compared with the instruction datasets of tens of thousands of examples used to train foundation models, the gap is significant.

Although self-supervised methods like self-instruct have successfully built large-scale instruction data leveraging LLMs, these static instructions are insufficient for agent scenarios that require continuous dynamic interaction with environments, multi-tool collaboration, and multi-step reasoning. Traditional instruction data lack this interactivity and operability, resulting in poor transferability and limited applicability in training agents.

To address these challenges, researchers at OPPO Research Institute proposed TaskCraft, an automated framework for generating agentic tasks, designed to efficiently create scalable, multi-tool, verifiable task instances. TaskCraft uses a unified, systematic construction process that eliminates reliance on manual annotation; it can produce complex scenarios drawing on sources such as URLs, PDFs, HTML pages, and images, and it automatically verifies task goals to ensure data quality and the integrity of the execution loop. Based on this framework, the team built and open-sourced a synthetic dataset of approximately 41,000 agentic tasks, significantly expanding available resources for agent training and evaluation.


Data Generation Process

The process has two stages: first, generating simple, verifiable atomic tasks; second, expanding and deepening those tasks to increase complexity step by step.

Atomic Task Generation


      Atomic structure diagram

Atomic task generation extracts core questions from raw data, ensuring each question requires a specific tool to answer. The process has four key steps:

  1. Information Collection: Extract data from sources like web pages, PDFs, images, e.g., corporate reports, charts, or news articles.
  2. Key Content Identification: Use LLMs to extract candidate conclusions, e.g., “Apple’s total revenue in 2025 is $38.33 billion.”
  3. Question Generation: Convert conclusions into questions for tools, e.g., “What is Apple’s total revenue in 2025 according to the 2025 Annual Report?” (Answer: $38.33 billion)
  4. Task Verification: Each atomic task must meet two conditions: (1) rely on tools for answers, (2) be verified by agents to ensure executability.
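The four steps above can be sketched in code. This is a minimal illustration, not TaskCraft's implementation: `AtomicTask`, `make_atomic_task`, and `verify` are hypothetical names, and a string template stands in for the LLM calls the paper uses for content identification and question generation.

```python
from dataclasses import dataclass

@dataclass
class AtomicTask:
    question: str
    answer: str
    source: str          # URL, PDF path, or image reference
    required_tool: str   # e.g. "web_search" or "pdf_reader"

def make_atomic_task(conclusion: dict, source: str, tool: str) -> AtomicTask:
    # In TaskCraft the rewrite from conclusion to question is done by an
    # LLM; a fixed template stands in here so the flow is runnable.
    question = f"What is {conclusion['subject']} according to {source}?"
    return AtomicTask(question, conclusion["value"], source, tool)

def verify(task: AtomicTask, agent_answer: str) -> bool:
    # Step 4: keep a task only if a tool-equipped agent reproduces the
    # gold answer, ensuring the task is actually executable.
    return agent_answer.strip().lower() == task.answer.strip().lower()

task = make_atomic_task(
    {"subject": "Apple's total revenue in 2025", "value": "$38.33 billion"},
    source="the 2025 Annual Report",
    tool="pdf_reader",
)
```

In this sketch a task that the verification agent cannot answer correctly would simply be discarded, mirroring the paper's executability check.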

Task Expansion

Task expansion evolves simple tasks into hierarchical, more challenging ones that require multiple steps to solve. Two expansion methods are used: deep expansion and width expansion.


      Deep expansion diagram

Deep expansion constructs decomposable, dependency-based task chains, following four steps:

  1. Identify the main task and an expansion key: pick a specific identifier in the task as context, e.g., for “Who directed the movie ‘Interstellar’?” the key is “Interstellar.”
  2. Agent search for an auxiliary atomic task: search using the key, e.g., “Which American sci-fi movie was released on Nov 7, 2014?” (Answer: “Interstellar”).
  3. Merge the auxiliary task with the main task: formulate a combined question, e.g., “Who directed the sci-fi movie released on Nov 7, 2014?” (Answer: Christopher Nolan).
  4. Semantic validation: Use rules like superset, relation, information leakage, and substitution checks to verify task coherence.
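The merge in step 3 can be sketched as a substitution over the main question. This is an illustrative stand-in, not the paper's method: `depth_expand` is a hypothetical helper, and TaskCraft performs both the merge and the semantic checks (superset, relation, leakage, substitution) with an LLM, of which only two appear here as assertions.

```python
def depth_expand(main_q: str, main_a: str, key: str,
                 key_span: str, aux_desc: str) -> tuple[str, str]:
    # Replace the span containing the expansion key with a description
    # that must itself be resolved first, producing a two-hop task.
    assert key in key_span and key_span in main_q   # relation check
    assert key.lower() not in aux_desc.lower()      # information-leakage check
    return main_q.replace(key_span, aux_desc), main_a

q, a = depth_expand(
    "Who directed the movie 'Interstellar'?",
    "Christopher Nolan",
    key="Interstellar",
    key_span="the movie 'Interstellar'",
    aux_desc="the sci-fi movie released on Nov 7, 2014",
)
# q == "Who directed the sci-fi movie released on Nov 7, 2014?"
```

Solving the expanded task now forces the agent to first resolve the auxiliary task (identify the movie) before answering the original question.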

Width expansion involves selecting two or more compatible atomic tasks from the same source (e.g., a webpage or PDF) with no causal dependency, then merging their semantics into a natural, coherent new task using LLMs.


      Width expansion diagram
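Width expansion can likewise be sketched as merging two independent atomic tasks that share a source. This is a toy illustration with a hypothetical `width_expand` helper and placeholder answers ("A", "B"); TaskCraft uses an LLM to phrase the merged question naturally rather than a conjunction template.

```python
def width_expand(task_a: dict, task_b: dict) -> dict:
    # Two causally independent atomic tasks sharing a source are merged
    # into one compound question with a conjunction template.
    assert task_a["source"] == task_b["source"], "tasks must share a source"
    qa, qb = task_a["question"].rstrip("?"), task_b["question"]
    merged = f"{qa}, and {qb[0].lower()}{qb[1:]}"
    return {"question": merged,
            "answers": [task_a["answer"], task_b["answer"]],
            "source": task_a["source"]}

merged = width_expand(
    {"question": "What was the company's 2025 revenue?",
     "answer": "A", "source": "report.pdf"},
    {"question": "Who is the company's CEO?",
     "answer": "B", "source": "report.pdf"},
)
# merged["question"] ==
#   "What was the company's 2025 revenue, and who is the company's CEO?"
```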

Enhancing Task Generation Efficiency with Prompt Learning

In TaskCraft’s process, prompt design is crucial. The team uses a bootstrap few-shot prompt optimization mechanism, iteratively refining prompts based on generated data. As shown in Table 1, the atomic task generation success rate increased from 54.9% to 68.1%, with nearly 20% reduction in average generation time. Deep expansion success rates improved from 41% to 51.2%, demonstrating the effectiveness of data-driven prompt optimization.


      Table 1: Prompt Learning Experimental Results
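The bootstrap loop described above can be sketched as follows, assuming (hypothetically) that verified generations are folded back into the prompt as few-shot demonstrations each round; `generate` and `is_valid` stand in for the LLM call and the agentic verification step, and toy lambdas replace both in the usage example.

```python
def bootstrap_fewshot(base_prompt, inputs, generate, is_valid,
                      rounds=3, max_shots=4):
    # Each round: run the current prompt over the seed inputs, keep
    # only generations that pass verification, and add the survivors
    # to the prompt as demonstrations for the next round.
    demos = []
    for _ in range(rounds):
        prompt = base_prompt + "".join(f"\nExample: {d}" for d in demos)
        outputs = [generate(prompt, x) for x in inputs]
        demos = sorted({*demos, *(o for o in outputs if is_valid(o))})[:max_shots]
    return base_prompt + "".join(f"\nExample: {d}" for d in demos)

optimized = bootstrap_fewshot(
    "Rewrite the conclusion as a tool-dependent question.",
    ["conclusion A", "conclusion B"],
    generate=lambda prompt, x: x.upper(),    # toy stand-in for the LLM
    is_valid=lambda out: out.endswith("A"),  # toy stand-in for the verifier
)
```

The design point is that the optimization signal comes from execution (did the generated task verify?) rather than from human labels.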

Supervised Fine-tuning (SFT) of Base Models

The team further evaluated the practical impact of TaskCraft-generated data on model capabilities. Using the Qwen2.5-3B series, they fine-tuned models on about 32,000 multi-hop tasks and trajectories derived from datasets such as HotpotQA, MuSiQue, and Bamboogle. As Table 2 shows, fine-tuning improved the base model's performance by 14% and the instruct model's by 6%, indicating significant gains in reasoning and tool use. Combining the data with reinforcement learning (Search-R1) improved performance further, demonstrating its value for both supervised and reinforcement learning.


      Table 2: Fine-tuning Results
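Fine-tuning on "tasks and trajectories" implies serializing each agent run into a training sample. A minimal sketch, assuming a chat-style `messages` schema with `<think>`/`<tool>` tags; the paper's exact serialization may differ, and `trajectory_to_sft` is a hypothetical helper.

```python
def trajectory_to_sft(task: dict, trajectory: list[dict]) -> dict:
    # Flatten an agent trajectory (thought / tool call / observation
    # steps) into one chat-style SFT sample.
    messages = [{"role": "user", "content": task["question"]}]
    for step in trajectory:
        messages.append({"role": "assistant",
                         "content": f"<think>{step['thought']}</think>"
                                    f"<tool>{step['tool_call']}</tool>"})
        messages.append({"role": "tool", "content": step["observation"]})
    messages.append({"role": "assistant", "content": task["answer"]})
    return {"messages": messages}

sample = trajectory_to_sft(
    {"question": "Who directed the sci-fi movie released on Nov 7, 2014?",
     "answer": "Christopher Nolan"},
    [{"thought": "Find the movie first.",
      "tool_call": "web_search('sci-fi movie released Nov 7 2014')",
      "observation": "Interstellar (2014)"}],
)
```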

Additionally, the team tested whether a search agent is necessary for task construction, comparing two approaches: generating tasks with GPT-4.1 directly versus using a GPT-4.1-based search agent. TaskCraft's agent-based approach outperformed on multiple metrics, as shown in Table 3.


      Table 3: Effectiveness of Task Construction Paradigm

Compared to direct GPT-4.1 task generation, TaskCraft’s method achieved higher success rates, shorter verification times, and more stable tool usage, demonstrating its efficiency in maintaining atomic task integrity.


      Table 4: Performance in Safety Tasks

Further evaluations on safety tasks show that TaskCraft-generated tasks have higher success rates and lower rejection rates, indicating improved reliability and controllability of the agent.


      Table 5: Privacy Task Performance

In privacy-related tasks, agents trained with TaskCraft data demonstrated lower rejection rates, indicating better trustworthiness and safety in sensitive scenarios.
