LeCun Takes Action: Building a Video World Model to Challenge NVIDIA's COSMOS
LeCun's new video world model aims to rival NVIDIA's COSMOS, leveraging advanced neural architectures to predict future video frames and enhance AI understanding of physical environments.

In 2018, Jürgen Schmidhuber, co-inventor of the LSTM, together with David Ha introduced the modern formulation of the world model in the paper Recurrent World Models Facilitate Policy Evolution: a neural network that predicts future environment states from an agent's past observations and actions.
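In its simplest form, such a model is a learned transition function over histories of observations and actions. The sketch below only illustrates that interface; the recurrent architecture, dimensions, and names are placeholders rather than any specific paper's design.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Illustrative world-model interface: predict the next environment state
    from past observations and actions (architecture and names are hypothetical)."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + action_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, obs_dim)

    def forward(self, observations: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # observations: [B, T, obs_dim], actions: [B, T, action_dim]
        x = torch.cat([observations, actions], dim=-1)
        h, _ = self.rnn(x)
        # The output at step t is a prediction of the observation at step t+1.
        return self.head(h)
```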
Recently, the concept of world models has gained increasing attention, including from Yann LeCun, one of the giants in deep learning. He views world models as a key pathway toward human-like intelligence.
However, developing effective world models remains challenging.
First, there is a data problem: acquiring large-scale, high-quality video datasets is costly, especially when action annotations are needed. Currently, successful applications are limited to specific domains such as autonomous driving and video games.
Second, the task itself is complex: modeling physical laws and behaviors accurately in unconstrained, partially observable environments is still unresolved, even over short time scales. State-of-the-art pixel-based generative models, such as NVIDIA's COSMOS, require enormous computational resources, with training times up to 22 million GPU hours.
Moreover, much of this massive computation may be wasted on irrelevant details. In autonomous driving, for example, it is unnecessary to model the precise movement of every leaf in the wind. Modeling environments at an appropriate level of abstraction is therefore crucial for improving the efficiency and effectiveness of world models.
Finally, evaluating pre-trained video world models is also a significant challenge.
To address these issues, researchers from Meta have proposed DINO-world, a powerful general-purpose video world model for future frame prediction. DINO-world is pre-trained in the latent space of a frozen visual encoder such as DINOv2, and then fine-tuned with action data for planning and control.
- Paper link: https://arxiv.org/pdf/2507.19468v1
- Title: Back to the Features: DINO as a Foundation for Video World Models
This approach offers several advantages:
- Decouples video pre-training from action-conditioned fine-tuning, enabling learning from large amounts of unlabeled videos, significantly reducing annotation needs;
- Models the environment in latent space, avoiding pixel-level details that are hard to predict and often unnecessary for downstream tasks;
- Uses a frozen encoder (like DINO) to provide strong semantic and geometric understanding, accelerating learning and simplifying training.
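To make this decoupled recipe concrete, here is a high-level sketch of the two training stages. The `encoder`, `predictor`, and `action_module` objects and the latent regression loss are stand-ins chosen for illustration, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def pretrain_step(encoder, predictor, frames, optimizer):
    """Stage 1: unsupervised next-frame prediction in the frozen encoder's latent space."""
    with torch.no_grad():                      # the visual encoder stays frozen
        z = encoder(frames)                    # [B, T, N_patches, D] latent features
    pred = predictor(z[:, :-1])                # predict latents of frames 2..T from 1..T-1
    loss = F.smooth_l1_loss(pred, z[:, 1:])    # regression loss in latent space
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def finetune_step(encoder, predictor, action_module, frames, actions, optimizer):
    """Stage 2: action-conditioned fine-tuning on a small labeled dataset."""
    with torch.no_grad():
        z = encoder(frames)
    # The predictor's queries are conditioned on the observed actions.
    pred = predictor(z[:, :-1], action_module(actions[:, :-1]))
    loss = F.smooth_l1_loss(pred, z[:, 1:])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```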
Additionally, the study introduces a more efficient world model architecture that reduces resource consumption during training and inference compared to current state-of-the-art models.
Trained on a large dataset of about 60 million unfiltered online videos, the predictor learns general features that transfer across domains. On the VSPW segmentation forecasting task, predicting 0.5 seconds into the future, it improves mIoU by 6.3% over competing models. Experiments with action data further validate the benefits of large-scale unsupervised pretraining.
Method Overview
Figure 1 illustrates the main components of DINO-world, including the frame encoder and future predictor.
Frame Encoder
Many world models now avoid pixel-level modeling and instead operate on latent representations of video patches. V-JEPA, for example, jointly optimizes an encoder and a predictor. In contrast, this work encodes video frames with a frozen, self-supervised pre-trained encoder, DINOv2, which significantly reduces the computational cost of training the predictor: effective training is achieved with fewer than 10 million parameters, compared to COSMOS's 1.2 billion.
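As an illustration of how a frozen encoder turns a clip into patch-level latents, the sketch below loads a public DINOv2 checkpoint via torch.hub; the hub entry point and the `forward_features` output key follow the DINOv2 repository but should be treated as assumptions, and `encode_clip` is a hypothetical helper.

```python
import torch

# Load a frozen, self-supervised image encoder (DINOv2 ViT-S/14 via torch.hub).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval().requires_grad_(False)

@torch.no_grad()
def encode_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: [B, T, 3, H, W] with H and W divisible by the patch size (14).
    Returns patch tokens of shape [B, T, N_patches, D]."""
    B, T, C, H, W = frames.shape
    out = encoder.forward_features(frames.flatten(0, 1))   # merge batch and time
    patches = out["x_norm_patchtokens"]                    # [B*T, N_patches, D]
    return patches.unflatten(0, (B, T))

# Example: a 2-second clip at 4 fps and 224x224 resolution.
clip = torch.randn(1, 8, 3, 224, 224)
latents = encode_clip(clip)
print(latents.shape)  # e.g. torch.Size([1, 8, 256, 384]) for ViT-S/14
```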
Predictor Architecture and Training
The predictor is a decoder composed of N residual cross-attention blocks. Future states are predicted from the embeddings of past frames, starting from learnable query tokens, and training uses a teacher-forcing strategy.
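A minimal sketch of such a decoder is shown below, assuming residual cross-attention blocks in which learnable queries attend to past frame tokens; the dimensions, depth, and class names are illustrative, and causal masking across time is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One residual cross-attention block: queries attend to past frame tokens."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, queries, context, attn_mask=None):
        attn_out, _ = self.attn(self.norm_q(queries), self.norm_kv(context),
                                self.norm_kv(context), attn_mask=attn_mask)
        queries = queries + attn_out          # residual connection
        return queries + self.mlp(queries)

class Predictor(nn.Module):
    """Stack of N blocks; learnable query tokens are refined into next-frame features."""

    def __init__(self, dim: int = 384, depth: int = 4, n_queries: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(CrossAttentionBlock(dim) for _ in range(depth))

    def forward(self, past_tokens):            # past_tokens: [B, T*N_patches, D]
        q = self.queries.expand(past_tokens.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, past_tokens)
        return q                               # predicted features of the next frame
```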
Positional information is injected via rotary position embeddings (RoPE) over spatial and temporal coordinates, allowing the model to capture spatiotemporal relationships across varying resolutions and frame rates.
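The following is a simplified sketch of an axial rotary embedding that rotates separate feature slices according to time, row, and column coordinates; the paper's exact parameterization (frequency base, axis splits, coordinate normalization) may differ.

```python
import torch

def axial_rope(x: torch.Tensor, coords: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """Rotary position encoding over (t, row, col) coordinates.

    x:      [B, N, D] token features, with D divisible by 6
    coords: [B, N, 3] per-token (time, row, col) coordinates
    """
    B, N, D = x.shape
    d_axis = D // 3                                   # feature slice per axis
    out = []
    for a in range(3):
        xa = x[..., a * d_axis:(a + 1) * d_axis]
        half = d_axis // 2
        freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
        angles = coords[..., a:a + 1] * freqs         # [B, N, half]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = xa[..., :half], xa[..., half:]
        # Rotate each (x1, x2) pair by the coordinate-dependent angle.
        out.append(torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)
```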
Training uses a next-frame prediction objective: each frame is predicted from the preceding frames, which allows efficient parallel training over the whole clip. The loss is computed on only a subset of tokens, reducing the computational load.
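A sketch of such a loss is given below, assuming the predictor has already produced teacher-forced predictions for frames 2..T from the ground-truth latents of frames 1..T-1; the supervised-token fraction and the regression loss are placeholders.

```python
import torch
import torch.nn.functional as F

def next_frame_loss(pred: torch.Tensor, target: torch.Tensor,
                    token_fraction: float = 0.25) -> torch.Tensor:
    """pred, target: [B, T, N, D] predicted vs. ground-truth latent frames
    (target is the clip shifted by one frame). Only a random subset of patch
    tokens per frame is supervised, reducing the cost of the loss."""
    B, T, N, D = target.shape
    k = max(1, int(token_fraction * N))
    idx = torch.randperm(N, device=target.device)[:k]   # shared random token subset
    return F.smooth_l1_loss(pred[:, :, idx], target[:, :, idx])
```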
Variable frame rate sampling ensures the model learns to handle different temporal dynamics, improving its ability to generalize to longer videos.
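One simple way to implement this, as a sketch: sample clips with a random temporal stride and feed the resulting timestamps to the temporal axis of RoPE (the stride set below is illustrative, and the helper assumes the video has more frames than the clip length).

```python
import random

def sample_clip_indices(n_frames: int, clip_len: int, strides=(1, 2, 4, 8)) -> list[int]:
    """Pick `clip_len` frame indices with a random temporal stride so the model
    sees the same scenes at several effective frame rates.
    Assumes n_frames > clip_len; the real timestamps (index / fps) would be
    passed to the temporal rotary embedding."""
    valid = [s for s in strides if (clip_len - 1) * s < n_frames]
    stride = random.choice(valid)
    start = random.randrange(n_frames - (clip_len - 1) * stride)
    return [start + i * stride for i in range(clip_len)]
```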
Action-Conditioned Fine-Tuning
The model can be adapted with a simple action module that updates the query tokens based on observed actions. This module is trained on small action-labeled datasets while the core model stays frozen, enabling efficient transfer to downstream tasks with limited data.
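A minimal sketch of such an action module, assuming a small MLP that shifts the query tokens by a projection of the action vector; the module name, dimensions, and the freezing setup shown in the comments are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ActionModule(nn.Module):
    """Small MLP that injects the observed action into the predictor's query tokens."""

    def __init__(self, action_dim: int, dim: int = 384):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(action_dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, queries: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # queries: [B, N, D], action: [B, action_dim]; broadcast over query tokens.
        return queries + self.proj(action).unsqueeze(1)

# Fine-tuning setup (sketch): train only the action module, keep the rest frozen.
# for p in encoder.parameters():   p.requires_grad_(False)
# for p in predictor.parameters(): p.requires_grad_(False)
# optimizer = torch.optim.AdamW(action_module.parameters(), lr=1e-4)
```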
Experimental Results
Dense Prediction Tasks
Evaluations on Cityscapes, VSPW, and KITTI datasets show that DINO-world outperforms models like V-JEPA and COSMOS in predicting future frames, with higher accuracy and robustness across benchmarks.
Results indicate the effectiveness of latent space modeling and large-scale pretraining for physical understanding and dynamic scene prediction.
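As an illustration of how such dense-forecasting evaluations typically work (the paper's exact protocol may differ), the sketch below predicts the latent of a future frame, decodes it with a segmentation head assumed to be trained on encoder features, and scores the result against the ground-truth labels of that future frame; `seg_head` and all shapes are placeholders.

```python
import torch

@torch.no_grad()
def forecast_miou(encoder, predictor, seg_head, frames, future_labels, num_classes: int):
    """Hypothetical evaluation: segmentation forecasting scored with mean IoU.

    frames:        [B, T, 3, H, W] past frames
    future_labels: [B, H, W] ground-truth class labels of the future frame
    """
    z = encoder(frames)                          # latents of the observed frames
    z_future = predictor(z)                      # predicted latent of the future frame
    logits = seg_head(z_future)                  # [B, num_classes, H, W]
    pred = logits.argmax(dim=1)                  # [B, H, W] predicted labels
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (future_labels == c)).sum().float()
        union = ((pred == c) | (future_labels == c)).sum().float()
        if union > 0:                            # skip classes absent from both maps
            ious.append(inter / union)
    return torch.stack(ious).mean()
```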
Intuitive Physics
Using benchmarks like IntPhys, GRASP, and InfLevel, the models demonstrate physical reasoning capabilities, with DINO-world showing comparable performance to larger models like V-JEPA ViT-H. It excels in understanding physical interactions in videos.
These results highlight the importance of large-scale pretraining and latent space modeling for robust physical reasoning in AI.
Action-Conditioned Fine-Tuning and Planning
Fine-tuning on offline trajectories with limited epochs significantly improves success rates in various environments. Large-scale pretraining boosts performance, especially in complex scenarios close to real-world distributions.
More experimental details and ablation studies are available in the original paper.