Breakthrough in Embodied World Models: Horizon & Jiji Present Geometrically Consistent Video World Models to Enhance Robot Policy Learning
Horizon and Jiji introduce a novel geometrically consistent video world model, significantly advancing embodied policy learning and bridging the gap between simulation and real-world robotics.


In recent years, as AI evolves from perception to decision-making, world models have become a key research focus in robotics. These models aim to enable agents to understand and predict environment dynamics for more efficient planning and decision-making.
Meanwhile, embodied data has attracted explosive attention. Current embodied algorithms rely heavily on large-scale real-robot demonstration data, which is costly and time-consuming to collect, limiting scalability and generalization. Although simulation platforms offer a low-cost way to generate data, the significant visual and dynamic differences between simulation and reality (the sim-to-real gap) prevent policies trained in simulation from transferring directly to real robots, restricting practical deployment. Efficiently acquiring, generating, and utilizing high-quality embodied data therefore remains a core challenge in robot learning.
Recently, Horizon, Jiji Technology, and the Institute of Automation, Chinese Academy of Sciences proposed RoboTransfer, a diffusion-based video generation framework that augments training data for robot policy models. Thanks to the diversity of the synthetic data, downstream policy models show a 251% performance boost in new scenarios, laying a solid foundation for the universality and generalization of embodied intelligence.

- Paper Title: RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
- Project Homepage: https://horizonrobotics.github.io/robot_lab/robotransfer/
Imitation learning has become a vital method in robot manipulation: by mimicking expert demonstrations, robots can quickly acquire effective policies. However, this approach typically depends on large amounts of high-quality real-robot data, which is costly and slow to collect.
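As a rough illustration of the supervised setup the article refers to, the sketch below shows a minimal behavior-cloning step: a visual policy is trained to regress expert actions from camera observations. The network sizes, observation shape, and 7-dimensional action space are illustrative assumptions, not RoboTransfer's actual policy architecture.

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning sketch: a toy visual policy regresses expert actions.
# All shapes here are assumptions for illustration only.
policy = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
    nn.Linear(256, 7),            # e.g. a 7-DoF end-effector action
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(obs, expert_action):
    # obs: (B, 3, 64, 64) RGB observation; expert_action: (B, 7)
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One illustrative update on random tensors.
print(bc_step(torch.randn(8, 3, 64, 64), torch.randn(8, 7)))
```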
To address this, RoboTransfer introduces a diffusion model-based video synthesis framework that explicitly incorporates multi-view geometric information and scene component control (like background and object attributes). It uses cross-view feature interaction and global depth and normal maps as conditions to ensure geometric consistency across multiple views. The framework supports fine-grained editing, such as changing backgrounds or target objects, enabling the generation of diverse, structurally coherent visual data.
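To make that conditioning scheme concrete, here is a minimal, hypothetical PyTorch sketch of a single denoiser block that injects depth and normal maps as geometric structure and draws appearance from reference-image tokens via cross-attention. The module names and tensor layouts are assumptions for illustration and do not reproduce RoboTransfer's actual video diffusion architecture.

```python
import torch
import torch.nn as nn

class GeometryConditionedDenoiser(nn.Module):
    """Hypothetical sketch of a geometry-conditioned denoiser block.

    Depth and surface normals are encoded and added to the noisy latent as a
    spatial condition, while tokens from a reference image supply appearance
    through cross-attention. This only illustrates the conditioning pattern
    described in the text, not the paper's exact design.
    """

    def __init__(self, latent_dim=64, num_heads=4):
        super().__init__()
        # Depth (1 channel) + surface normals (3 channels) -> latent_dim
        self.geometry_encoder = nn.Conv2d(1 + 3, latent_dim, kernel_size=3, padding=1)
        # Appearance tokens from a reference-image encoder attend into the latent
        self.appearance_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.denoise = nn.Sequential(
            nn.Conv2d(latent_dim, latent_dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(latent_dim, latent_dim, 3, padding=1),
        )

    def forward(self, noisy_latent, depth, normals, ref_tokens):
        # noisy_latent: (B, C, H, W); depth: (B, 1, H, W); normals: (B, 3, H, W)
        # ref_tokens: (B, N, C) appearance tokens from a reference image
        geom = self.geometry_encoder(torch.cat([depth, normals], dim=1))
        x = noisy_latent + geom                      # inject geometric structure
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens, _ = self.appearance_attn(tokens, ref_tokens, ref_tokens)
        x = x + tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.denoise(x)                       # predicted noise

if __name__ == "__main__":
    block = GeometryConditionedDenoiser()
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 1, 32, 32),
                torch.randn(2, 3, 32, 32), torch.randn(2, 77, 64))
    print(out.shape)                                 # torch.Size([2, 64, 32, 32])
```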
Experimental results show RoboTransfer can generate multi-view videos with high geometric and visual quality. Using this synthetic data to train robot visual policy models results in significant performance improvements: a 33.3% success rate increase when changing foreground objects, and a 251% boost in more challenging scenarios with both foreground and background changes.

The overall framework of RoboTransfer emphasizes multi-view consistency, enabling the model to reason across different perspectives for more spatially coherent results. It decouples control signals into geometric (depth, normal maps) and appearance (reference images), ensuring detailed and consistent generation from multiple viewpoints.
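One common way to realize this kind of cross-view reasoning is to let tokens from all camera views attend to one another in a shared attention layer. The sketch below is a simplified, hypothetical version of that idea; the paper's actual cross-view interaction mechanism may differ in detail.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Hypothetical sketch of cross-view feature interaction.

    Tokens from all camera views are flattened into one sequence so that
    self-attention can exchange information between views, a standard way to
    encourage multi-view consistency.
    """

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens):
        # view_tokens: (B, V, N, C) -- V camera views, N tokens per view
        b, v, n, c = view_tokens.shape
        x = view_tokens.reshape(b, v * n, c)      # joint sequence over all views
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + out                               # residual update shared across views
        return x.reshape(b, v, n, c)

if __name__ == "__main__":
    layer = CrossViewAttention()
    tokens = torch.randn(2, 3, 256, 64)           # 3 views, 256 tokens each
    print(layer(tokens).shape)                    # torch.Size([2, 3, 256, 64])
```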
In experiments, RoboTransfer demonstrates effective data augmentation in both real-to-real and sim-to-real settings, significantly reducing the sim-to-real gap and offering a new paradigm for training general-purpose robots. In the real-to-real setting, it can swap the background or modify foreground objects in real demonstrations while preserving scene structure.
It also supports sim-to-real augmentation, re-rendering simulated data into realistic images by controlling scene structure and appearance, which narrows the domain gap and improves training effectiveness.
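At the data-pipeline level, this augmentation can be pictured as re-rendering each demonstration under new appearance conditions while keeping its geometry, and therefore its action labels, fixed. The sketch below uses hypothetical names (`Demo`, `augment_demo`, `generator`) purely to illustrate that workflow; it is not RoboTransfer's released API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Demo:
    frames: list          # per-view RGB frames
    depth: list           # per-view depth maps
    normals: list         # per-view normal maps
    actions: list         # robot actions aligned with the frames

def augment_demo(demo: Demo, new_background, new_foreground, generator) -> Demo:
    """Re-render a demonstration with edited appearance but unchanged geometry.

    `generator` stands in for a geometry-consistent video diffusion model: it
    keeps the depth/normal structure of the source demo and redraws appearance
    from the new reference images, so the original action labels stay valid
    for the synthetic video.
    """
    new_frames = generator(
        depth=demo.depth,
        normals=demo.normals,
        background_ref=new_background,
        foreground_ref=new_foreground,
    )
    return Demo(frames=new_frames, depth=demo.depth,
                normals=demo.normals, actions=demo.actions)

def build_training_set(real_demos: List[Demo], backgrounds, foregrounds, generator):
    """Mix original demos with appearance-randomized copies for policy training."""
    dataset = list(real_demos)
    for demo in real_demos:
        for bg in backgrounds:
            for fg in foregrounds:
                dataset.append(augment_demo(demo, bg, fg, generator))
    return dataset
```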
Quantitative experiments confirm that background and foreground augmentation significantly improve success rates in new scenarios, with a 251% performance increase in cases with complete background changes.
Overall, RoboTransfer constructs a data processing pipeline capable of generating structured, multi-view consistent data with geometric and appearance controls. The generated data effectively trains robot policies, greatly enhancing their generalization capabilities and practical deployment potential.