Breaking News! Taotian and Aicheng Open-Source the Reinforcement Learning Framework ROLL, Supporting Large Models from Billions to Hundreds of Billions of Parameters
Taotian and Aicheng have jointly open-sourced ROLL, a scalable, efficient reinforcement learning framework that supports models from billions to hundreds of billions of parameters, accelerating large model training.


Over the past few years, with the rise of Reinforcement Learning from Human Feedback (RLHF), reinforcement learning (RL) has become a key technology in the post-training phase of large language models (LLMs). RL not only significantly improves model alignment but also extends LLMs' reach in reasoning enhancement and agentic interaction. A wave of optimization strategies and algorithm variants, such as Agentic RL, RLAIF, GRPO, and REINFORCE++, has emerged, and all of them require efficient scheduling across multiple modules (Actor, Critic, Reward, Ref) and multiple stages (generation, inference, training). A truly high-performance, scalable, and user-friendly RL training system has therefore become essential.
Recently, Taotian Group and Aicheng Technology jointly open-sourced the new-generation reinforcement learning training framework ROLL (Reinforcement Learning Optimization for Large-scale Learning). Designed with user experience at its core, ROLL focuses on efficiency, scalability, and ease of use, enabling RL training from small models up to models with more than 600B parameters.

- Open-source project: https://github.com/alibaba/ROLL
- Paper title: Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
- Paper link: https://arxiv.org/pdf/2506.06122
Key features of ROLL include:
- Multi-task RL: Supports diverse RL tasks covering math, coding, reasoning, open QA, and instruction following, with flexible sampling and data weighting (see the configuration sketch after this list).
- Agentic RL: Native support for multi-environment, multi-agent interactions like games and multi-turn dialogues, with flexible parallelization and management.
- Algorithm-friendly: Offers flexible RL strategies, including variance-reduction baselines, reward normalization, and data masking, and supports PPO, GRPO, REINFORCE++, and more.
- Rich inference/training engines: Compatible with vLLM, SGLang, Megatron-Core, DeepSpeed, with parameter passthrough and seamless adaptation.
- Elastic resource scheduling & distributed parallelism: Based on Ray, supports heterogeneous hardware, from single machines to clusters, with efficient utilization of DP/TP/PP/CP/EP parallelism.
- Highly user-friendly & modular: Modules like Rollout Scheduler and AutoDeviceMapping simplify pipeline development and debugging, with flexible engine switching.
- Sample-level scheduling & dynamic sampling: Supports asynchronous reward calculation, dynamic sampling, sample pruning, and early stopping, boosting efficiency and resource use.
- Observability: Integrates wandb, SwanLab, and TensorBoard for real-time tracking of performance across tasks, strategies, and rewards.
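To make the data weighting and engine selection above concrete, here is a minimal Python sketch of what such a run configuration could look like. The names (`DomainSpec`, `RunConfig`, `reward_fn`, and so on) are hypothetical illustrations, not ROLL's actual API; consult the project documentation for the real configuration schema.

```python
# Hypothetical sketch of a multi-task RL run configuration.
# All class and field names are illustrative, not ROLL's real API.
import random
from dataclasses import dataclass, field

@dataclass
class DomainSpec:
    name: str
    weight: float   # sampling weight for this task domain
    reward_fn: str  # e.g. a verifier for math, a sandbox for code

@dataclass
class RunConfig:
    actor_engine: str = "vllm"           # generation backend (vLLM/SGLang)
    train_engine: str = "megatron-core"  # training backend (Megatron-Core/DeepSpeed)
    algorithm: str = "grpo"              # ppo | grpo | reinforce++
    domains: list[DomainSpec] = field(default_factory=list)

    def sample_domain(self) -> DomainSpec:
        """Weighted sampling across task domains (multi-task data mixing)."""
        weights = [d.weight for d in self.domains]
        return random.choices(self.domains, weights=weights, k=1)[0]

cfg = RunConfig(domains=[
    DomainSpec("math", 0.4, "math_verifier"),
    DomainSpec("code", 0.3, "code_sandbox"),
    DomainSpec("open_qa", 0.3, "llm_judge"),
])
print(cfg.sample_domain().name)  # e.g. "math"
```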
The project has already gained over 1000 stars on GitHub.

User-friendly ROLL Design
ROLL is designed for three user groups: tech pioneers, product developers, and algorithm researchers. For tech pioneers, it offers elastic scaling on large GPU clusters; for product developers, flexible reward and environment configuration; and for algorithm researchers, efficient experimentation on limited resources.

Technical Details of ROLL
Innovations include:
- Builds on the HybridFlow framework with a modular controller architecture, simplifying the development of RL experiments.
- Enhanced parallel strategies and data transfer modules for resource-limited environments.
- Rollout schedulers that manage the sample lifecycle: response generation, environment interaction, and reward calculation (a minimal scheduling sketch follows this list).
- Environment and reward workers for scalable agent-environment interaction and reward computation.
- Resource pools and AutoDeviceMapping for efficient resource allocation and scheduling.
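As an illustration of the sample-level scheduling idea (asynchronous reward calculation plus early stopping, listed among the key features above), here is a toy asyncio sketch. It is not ROLL's rollout scheduler: `generate` and `score` are stand-ins for the real generation and reward workers.

```python
# Toy model of sample-level scheduling: launch many samples, score them
# asynchronously, and stop early once enough have been collected.
import asyncio
import random

async def generate(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.01, 0.05))  # stand-in for LLM generation
    return f"response_to({prompt})"

async def score(response: str) -> float:
    await asyncio.sleep(random.uniform(0.01, 0.05))  # stand-in for a reward worker
    return random.random()

async def rollout(prompts: list[str], needed: int) -> list[tuple[str, float]]:
    async def one(p: str) -> tuple[str, float]:
        resp = await generate(p)
        return resp, await score(resp)  # reward computed as soon as the sample is ready

    tasks = [asyncio.create_task(one(p)) for p in prompts]
    collected: list[tuple[str, float]] = []
    for fut in asyncio.as_completed(tasks):
        collected.append(await fut)
        if len(collected) >= needed:  # early stopping
            for t in tasks:
                t.cancel()            # prune in-flight samples we no longer need
            break
    return collected

batch = asyncio.run(rollout([f"p{i}" for i in range(16)], needed=8))
print(len(batch), "samples collected")
```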
The architecture comprises a user input layer, distributed executors, schedulers, AutoDeviceMapping, and resource pools. At runtime, resources are allocated from shared GPU/CPU pools, and each training iteration coordinates the generation, inference, and training phases, with real-time visualization via wandb, TensorBoard, and similar tools (a toy placement sketch follows).
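The toy function below conveys the flavor of this placement step, carving a shared GPU pool into per-role device groups. It is a deliberate simplification and the mapping logic is hypothetical; the real AutoDeviceMapping also supports colocated placement, CPU workers, and the DP/TP/PP/CP/EP parallelism strategies mentioned earlier.

```python
# Simplified sketch of role-to-device placement from a shared GPU pool.
# Illustrative only; real placement also handles colocation and CPU workers.
def map_roles_to_devices(num_gpus: int, role_shares: dict[str, int]) -> dict[str, list[int]]:
    """Split GPU ids among roles proportionally to their requested shares."""
    total = sum(role_shares.values())
    assert num_gpus % total == 0, "pool must divide evenly in this toy version"
    per_share = num_gpus // total
    mapping: dict[str, list[int]] = {}
    next_id = 0
    for role, share in role_shares.items():
        count = share * per_share
        mapping[role] = list(range(next_id, next_id + count))
        next_id += count
    return mapping

# On an 8-GPU node: give training half the pool, generation a quarter, etc.
print(map_roles_to_devices(8, {"actor_train": 4, "actor_infer": 2, "critic": 1, "reward": 1}))
# {'actor_train': [0, 1, 2, 3], 'actor_infer': [4, 5], 'critic': [6], 'reward': [7]}
```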

During training, samples are generated via the rollout scheduler, with environment interaction and reward calculation supporting multi-turn and multi-modal RL tasks. During inference, the Critic, Reward, and Ref models run forward passes, with responses distributed across parallel workers. The training phase then updates the model parameters, and the ModelUpdateGroup synchronizes the updated weights to the inference workers to keep them consistent. A structural sketch of one iteration follows; further details are in the original paper.
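The skeleton below strings these phases together for one iteration: rollout, forward scoring, a variance-reduced update, and weight synchronization. Every function body is a stub and the names are illustrative; `ModelUpdateGroup` is the article's own term, echoed here only in a comment.

```python
# Structural skeleton of one RL iteration: generate -> score -> train -> sync.
# Stubs only; a reading aid, not ROLL's implementation.
from dataclasses import dataclass, field

@dataclass
class Batch:
    prompts: list[str]
    responses: list[str] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)
    advantages: list[float] = field(default_factory=list)

def rollout(prompts: list[str]) -> list[str]:
    return [f"resp({p})" for p in prompts]  # via the rollout scheduler

def forward_scores(responses: list[str]) -> list[float]:
    return [1.0 for _ in responses]         # Critic / Reward / Ref forward passes

def train_step(batch: Batch) -> dict:
    return {"policy_loss": 0.0}             # PPO/GRPO-style parameter update

def sync_weights() -> None:
    pass  # broadcast updated weights to inference workers (the ModelUpdateGroup role)

def iteration(prompts: list[str]) -> dict:
    batch = Batch(prompts)
    batch.responses = rollout(batch.prompts)
    batch.rewards = forward_scores(batch.responses)
    mean_r = sum(batch.rewards) / len(batch.rewards)
    batch.advantages = [r - mean_r for r in batch.rewards]  # baseline subtraction
    metrics = train_step(batch)
    sync_weights()  # keep generation weights consistent with the trained policy
    return metrics

print(iteration(["p0", "p1"]))
```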
Experimental Results of ROLL
On models like Qwen2.5-7B and Qwen3-30B, ROLL significantly improves multi-task performance across domains. For example, Qwen2.5-7B accuracy increased from 0.18 to 0.52, and Qwen3-30B from 0.27 to 0.62, demonstrating robustness and stability.

Performance trends across agentic environments such as SimpleSokoban, FrozenLake, and WebShop show marked improvements in success rates and efficiency:
- Sokoban: success rate improved from 16.8% to 26.0%, with better decision-making that also transferred to related environments such as FrozenLake.
- FrozenLake: success rate improved from 16.8% to 26.0%, with the share of effective actions rising from 69.1% to 88.8%.
- WebShop: success rate improved from 37% to over 85% with fewer actions per task, indicating better instruction understanding and operational efficiency.


ROLL continues to evolve. Planned updates include support for Qwen2.5-VL Agentic RL, asynchronous pipelines, FSDP2, DeepSeek-V3, and more, and the team invites community participation.
The open-source spirit drives technological progress. The ROLL team welcomes talented contributors to join it in tackling hard technical problems and shaping the future of RL.
Separately, Taotian Group's fourth tech festival, Hardcore Youth Tech Festival 4.0, will be held from June 30 to July 4, 2025, in Beijing and Hangzhou, featuring tech markets, open days, AI games, and hackathons. Key sessions will be live-streamed, including the AI special "BoJianShe" with experts such as Liu Jing, Zuo Wangmeng, Wang Limin, Liu Zhiyuan, and Li Chongxuan. Join us online!