On Robot Data: Sergey Levine, a Reinforcement Learning Expert, Just Wrote a Great Article
Sergey Levine discusses the heavy data demands of training large models, alternatives such as simulation and proxy data, and the limitations of current approaches in robotics.


We know that training large models is inherently challenging. As models grow in size and expand into new application areas, the difficulty grows with them, above all the need for massive amounts of data.
Large language models (LLMs) mainly rely on vast text corpora, while vision-language models (VLMs) need both text and images. In robotics, vision-language-action (VLA) models demand large amounts of real-world robot task data.
Currently, agents are a crucial step toward Artificial General Intelligence (AGI). Training agents requires real interaction data with action labels, which is far more costly than collecting text and images from the web.
Researchers have been seeking alternative solutions that reduce data costs while keeping the benefits of large-scale training. Sergey Levine, a leading reinforcement learning researcher at UC Berkeley and co-founder of Physical Intelligence, wrote an insightful article analyzing data composition for large models. He argues that getting both low cost and high performance at once is very difficult, likening such compromises to a "spork": a fork-spoon hybrid that does neither job well.

Levine's article, titled "Sporks of AGI", examines the idea of substitute data. Large-scale real data drove the successes in vision and NLP, but for agents, and especially robotic agents, researchers look for "alternative data": cheaper proxy data that can stand in for costly real interaction data while still supporting generalization.
Simulation is a classic approach. Training robots in virtual environments or high-fidelity video games can avoid reliance on real-world data. These methods, though innovative, essentially construct a mapping between cheap proxy domains and real robots, using low-cost data to substitute for expensive real data.
Common methods include:
- Simulation: Relying on human-designed training environments with physical modeling and visual assets. Effective simulation often introduces environmental variability to improve robustness (see the sketch after this list), which means the designer specifies not just "what" the task is but also "how" to do it.
- Human Videos: Training robots on videos of humans performing tasks by establishing correspondences such as hand pose or grasping actions, though this requires bridging differences in dynamics and appearance between humans and robots.
- Hand-held Grippers: Having humans collect demonstrations with physical gripper devices that stand in for the robot's end-effector, which bakes in assumptions about the robot's kinematics and capabilities, such as six degrees of freedom and a similar motion structure.
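To make the simulation bullet concrete, here is a minimal sketch of domain randomization, the kind of environmental variability mentioned above. The `sim` interface, attribute names, and ranges are hypothetical placeholders, not taken from Levine's article or from any particular simulator.

```python
import random

def randomize_episode(sim):
    """Perturb physical and visual parameters before each simulated episode.

    Every attribute and range here is an illustrative placeholder; real
    simulators expose their own APIs for these knobs.
    """
    sim.friction = random.uniform(0.5, 1.5)            # contact physics
    sim.object_mass = random.uniform(0.05, 0.5)        # kilograms
    sim.light_intensity = random.uniform(0.3, 1.0)     # visual appearance
    sim.camera_pose_noise = random.uniform(0.0, 0.05)  # viewpoint jitter
    return sim
```

Each randomized attribute is a human guess about which proxy-to-real differences matter, so the trained policy only becomes robust to the variation the designer thought to include.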
While these approaches have yielded many successes, they are fundamentally compromises—reducing data costs at the expense of potentially weakening the model’s generalization ability.
Crossing the Gap
In data collection, human judgment is unavoidable: task goals are set by us, even in blank-slate, from-scratch learning. Attempts to avoid real data often rely on information hiding (reduced observation spaces, domain-invariant losses, limited camera views), which ultimately weakens the model's ability to integrate complex information and to pick up patterns too subtle for humans to notice.
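As one illustration of such information hiding, below is a minimal sketch of a domain-invariant feature loss using gradient reversal, in the style of domain-adversarial training. This is not from Levine's article; it assumes a PyTorch setup, and the module names and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient trains the encoder to *confuse* the domain head,
        # i.e. to hide whether a feature came from proxy or real data.
        return -ctx.lam * grad_output, None


class DomainInvariantEncoder(nn.Module):
    """Encoder whose features are penalized for revealing the data's domain
    (simulation / human video vs. real robot). Sizes are arbitrary."""
    def __init__(self, obs_dim=64, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.domain_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, obs, lam=1.0):
        feat = self.encoder(obs)
        domain_logits = self.domain_head(GradReverse.apply(feat, lam))
        return feat, domain_logits
```

The same mechanism that strips out domain-specific cues also discards signal the policy could have exploited, which is exactly the trade-off described above.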
As models improve, the ability to distinguish between substitute and real data domains increases, shrinking the intersection of effective behaviors. To counter this, researchers may hide information, but this diminishes the core advantage of models—integrating diverse data sources and recognizing subtle patterns.
In essence, stronger models tend to shrink the intersection of behaviors that work in both the proxy and real domains, and any attempt to block this trend weakens the model's capabilities. The size of that intersection depends heavily on how the substitute data is designed; poor design shrinks the space of effective strategies.
Practically, efforts focus on carefully designing substitute data for specific applications to minimize differences from real robots, ensuring behaviors are as aligned as possible within those scenarios. However, outside these scenarios, this consistency is not guaranteed.
When training robots with human data, the model tends to predict "how humans would solve this" rather than "how robots can efficiently complete this." This conflicts with the core advantage of general models—broad applicability and strong generalization, enabling transfer to new domains.
Every new domain requires more manual effort to improve the proxy-real correspondence, turning the model’s generalization into a burden that amplifies the gap, making adaptation to new scenarios more difficult.

When aiming to optimize robot behavior—such as via reinforcement learning—these issues are further exacerbated.
Real-World Data
Trying to avoid real-world data is essentially seeking a "spork"—a low-cost proxy that can deliver the benefits of large-scale real data. Ultimately, this results in a "spork" that, while useful in some scenarios, is often just a poorly functioning spoon or a dull fork.
In machine learning, the most effective approach is to make training data as close to the test environment as possible. This allows the model to learn the true underlying mechanisms of the real world, enabling it to generalize and solve complex problems—patterns that are often subtle and hard for humans to perceive but can be inductively inferred by models.
Using substitute data is a suboptimal choice—only effective under specific conditions. It’s like trying to become a tennis expert solely by watching videos or hitting against a wall; real experience in the physical world is irreplaceable.
The key insight: if we want robots that generalize broadly in the real world, real-world data is indispensable, just as LLMs and VLMs depend on real text and images rather than on purely synthetic or virtual ones.
Adding diverse data sources—human demonstrations, simulations—beyond broad, representative real-world experience can help. Substitute data should be viewed as supplementary knowledge, not a replacement for real practice.
Instead of designing substitute data to resemble real robots physically (e.g., holding tools or mimicking actions), we should treat it as a knowledge source about "what might happen in the real world," similar to pretraining data for LLMs—informing the model about possible real-world scenarios rather than dictating specific actions.
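A minimal sketch of that framing, under the assumption that each data source is simply a list of training examples: proxy data is folded into training as auxiliary knowledge, while real robot experience anchors the distribution. The function name, sources, and mixing weight are hypothetical.

```python
import random

def mixed_batches(real_robot, proxy_sources, real_weight=0.7, batch_size=32):
    """Yield batches drawn mostly from real robot experience, with proxy data
    (human video, simulation) folded in as auxiliary knowledge rather than as
    a substitute for real interaction. Weights and sources are illustrative.
    """
    proxy = [ex for source in proxy_sources for ex in source]
    while True:
        batch = [
            random.choice(real_robot) if random.random() < real_weight
            else random.choice(proxy)
            for _ in range(batch_size)
        ]
        yield batch
```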
The "Spork" Dilemma
This article explores the "spork"—a metaphor for substitute data—an attempt to balance low-cost data collection with large-scale training benefits. But in AI research, substitute data is not the only "spork."
Other "sporks" include hybrid systems combining manual design and learning, methods using human-imposed constraints to prevent undesirable behaviors, and neural network architectures embedding intuitive problem-solving approaches.
These methods aim to get the best of both worlds: the benefits of large-scale learning without high data costs or complex goal design. At their core is a manually designed inductive bias meant to compensate for incomplete training data.
However, they share a fundamental flaw: they require us to encode our own thinking into the system. Any manually designed component becomes a bottleneck, limiting the system’s scalability and adaptability.
"Sporks" are attractive because they seem to promise overcoming major AI challenges by solving problems in our way. But in reality, they often make the system less scalable—precisely what we initially aimed to improve.
For more details, see the original blog.