Breaking News! Fusion of World Model and Action Model with Fully Autoregressive WorldVLA

Alibaba's new WorldVLA model integrates world and action models into a unified autoregressive framework, advancing multimodal understanding and generation capabilities.

Jun Cen, a researcher at Alibaba DAMO Academy specializing in large models for embodied intelligence, holds a PhD from the Hong Kong University of Science and Technology. His research focuses on embodied intelligence and world models.

Alibaba DAMO Academy has introduced WorldVLA, the first model to fuse a world model and an action model (VLA model) into a single, fully autoregressive model that supports multimodal understanding and generation of text, images, and actions.

Research Overview

Recent developments in Vision-Language-Action (VLA) models focus on robot action modeling built on large-scale multimodal pretraining. These models excel at perception and decision-making but lack a deep understanding of actions themselves, which are typically treated as mere outputs rather than as inputs to be analyzed and modeled.

World models, capable of predicting future visual states based on current observations and actions, offer a way to understand both visual information and behavior dynamics. However, they cannot directly generate actions, creating a gap in applications requiring explicit action planning.

To address these limitations, Alibaba proposes WorldVLA, a unified autoregressive model that combines the world model and the action model. It uses three independent encoders for images, text, and actions whose tokens share a common vocabulary, enabling cross-modal understanding and generation within a single large language model architecture.
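To make the shared-vocabulary design concrete, here is a minimal sketch of how the three modalities could be mapped into disjoint ID ranges of one vocabulary, so that a single autoregressive LLM can read and emit any mixture of them. The codebook sizes, offsets, and function names below are illustrative assumptions, not WorldVLA's actual configuration.

```python
# Minimal sketch of a shared multimodal vocabulary. All sizes, offsets, and
# names here are assumptions for illustration, not WorldVLA's actual config.

TEXT_VOCAB_SIZE = 32_000      # e.g. a BPE text tokenizer (assumed)
IMAGE_CODEBOOK_SIZE = 8_192   # e.g. a VQ image tokenizer codebook (assumed)
ACTION_BINS = 256             # discretized action values per dimension (assumed)

# Each modality gets a disjoint ID range within one vocabulary, so a single
# autoregressive LLM head can read and emit tokens of any modality.
IMAGE_OFFSET = TEXT_VOCAB_SIZE
ACTION_OFFSET = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
VOCAB_SIZE = ACTION_OFFSET + ACTION_BINS

def image_token(code: int) -> int:
    """Map a VQ codebook index to its ID in the shared vocabulary."""
    return IMAGE_OFFSET + code

def action_token(bin_idx: int) -> int:
    """Map a discretized action bin to its ID in the shared vocabulary."""
    return ACTION_OFFSET + bin_idx
```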

The world model component generates visual representations from actions, learning the physical dynamics of the environment, which is crucial for action decision-making. Conversely, the action model component enhances visual understanding, and the two reinforce each other in a bidirectional system.
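To illustrate this bidirectional setup, the two objectives can be viewed as two orderings of the same token streams: the action model predicts action tokens from an instruction and an observation, while the world model predicts next-frame tokens from an observation and an action. The sequence layout below is an assumed sketch; WorldVLA's actual ordering and special tokens may differ.

```python
# Assumed sequence layouts for the two objectives; the token-ID lists would
# come from tokenizers like those sketched above.

def action_model_sequence(text_ids, image_ids, action_ids):
    """Action model: instruction + observation -> actions.
    Training loss would be applied to the action_ids span only."""
    return text_ids + image_ids + action_ids

def world_model_sequence(image_ids, action_ids, next_image_ids):
    """World model: observation + action -> next frame.
    Training loss would be applied to the next_image_ids span only."""
    return image_ids + action_ids + next_image_ids
```

Mixing both sequence types in the same training batches is what would let one set of weights serve as both world model and action model.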

Research shows that action chunking and parallel decoding significantly affect performance. However, generating multiple actions sequentially in an autoregressive model can lead to error accumulation: such models see comparatively little action data during training, so a mistake in an early action propagates to later ones. To mitigate this, an attention mask strategy is introduced that lets the model generate each action conditioned only on preceding images, not on previously generated actions, which reduces error accumulation.
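A minimal sketch of such a mask, assuming a hypothetical labeled-segment scheme rather than WorldVLA's actual implementation: starting from a standard causal mask, attention from each action's tokens to tokens of earlier actions in the chunk is blocked, so every action conditions only on images and text.

```python
import numpy as np

def chunk_attention_mask(segments):
    """Build an attention mask for a sequence of labeled token segments.

    `segments` is a list of (label, length) pairs, where labels are e.g.
    "img", "txt", or "act_k" for the k-th action in a chunk (hypothetical
    labeling scheme). Starts from a standard causal mask, then blocks
    attention from any action token to tokens of *earlier* actions, so each
    action conditions only on images/text (and its own partial tokens).
    """
    labels = []
    for label, length in segments:
        labels.extend([label] * length)
    n = len(labels)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal: attend to past only
    for i in range(n):
        if labels[i].startswith("act"):
            for j in range(i):
                if labels[j].startswith("act") and labels[j] != labels[i]:
                    mask[i, j] = False  # hide earlier actions from this action
    return mask

# Example: one observation image (4 tokens) followed by a chunk of 3 actions
# (2 tokens each). Action 2 can see the image but not actions 0 and 1.
mask = chunk_attention_mask([("img", 4), ("act_0", 2), ("act_1", 2), ("act_2", 2)])
```

With such a mask, a wrong early action cannot contaminate later actions in the chunk, since they never attend to it.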

Experimental results on the LIBERO benchmark show that, without pretraining, WorldVLA surpasses traditional autoregressive models such as OpenVLA by 4% in success rate. It also outperforms traditional world models in video generation quality, reducing FVD (Fréchet Video Distance) by 10%. The attention mask strategy further improves success rates by 4% to 23% in action-chunking tasks.

Visualization of WorldVLA

Below, the system demonstrates its ability to generate actions based on instructions:

[GIF demonstrations]

And its ability to generate subsequent frames based on actions and images:

[GIF demonstrations]
