Toyota Research Institute and Nanyang Technological University Unveil the Most Rigorous VLA Validation Experiment for Robot 'GPT Moment'

Toyota Research Institute and NTU conducted a highly rigorous VLA validation experiment, demonstrating advanced autonomous robot capabilities in complex tasks, signaling a potential 'GPT moment' for robotics.


When it comes to robotic arms, the first word that comes to mind is "grasping." Even advanced robotic arms are typically shown performing simple tasks such as scooping ice cream or making coffee.

However, if a robot is to autonomously complete demanding, complex tasks, such as setting a dining table or assembling a bicycle, the difficulty increases exponentially. Such tasks require tight coordination of perception, understanding, and motion control.

In recent years, with the rapid development of vision-language-action (VLA) models, robots have gradually gained the ability to integrate multimodal information (images, instructions, scene semantics) and perform complex tasks, moving toward greater intelligence and versatility.

But current VLA research has yet to produce a decisive milestone, and embodied intelligence's "GPT moment" still seems distant.

Until I saw these two videos:

In the real world, robots can now perform dual-arm writing, complete complex composite tasks, and even correct their own errors mid-operation. This marks a significant improvement over previous VLA results.

Digging deeper into this research, the authors advance the VLA approach by building a Large Behavior Model (LBM) on top of diffusion policies. After pretraining and fine-tuning, the model can autonomously execute complex robotic manipulation, with striking results.


Google researcher Ted Xiao said:

"If you work in robotics and AI, you must watch the recent Stanford video on multi-task robot manipulation: no hype, no marketing, just solid science backed by evidence. A rare treasure these days!"


Former NVIDIA researcher Jiafei Duan wrote:

“I am a loyal fan of TRI’s work. Rigorous evaluation is the true catalyst for progress in robotics.”

This work comes from the Toyota Research Institute (TRI) Large Behavior Model team. One of the authors is MIT professor and TRI’s Vice President of Robotics Research, Russ Tedrake.

Russ is said to be a low-profile but extremely rigorous scholar; he introduced the paper only briefly on Twitter and LinkedIn. The limited publicity does not diminish the work's value; the paper speaks for itself.

The paper systematically evaluates a class of multi-task robotic manipulation policies called Large Behavior Models, scaling diffusion policies across simulation and real-robot datasets. It establishes a rigorous evaluation protocol that analyzes the models' capabilities with statistical confidence, comparing multi-task policies against single-task baselines in both simulated and real environments through blind, randomized trials.


Large Behavior Models (LBMs) are a class of visuomotor policies trained on diverse simulation and real-world manipulation data.

Robots driven by traditional single-task policies generalize poorly when tasks or environments drift outside the training distribution. To overcome this fragility, the field is increasingly adopting LBMs, visuomotor foundation models trained on large multi-task datasets of action-level demonstrations. Despite rapid progress, it remains unclear how much of the observed success actually comes from multi-task pretraining.

To rigorously measure the impact of multi-task pretraining, the paper trains a series of diffusion-based LBMs on nearly 1,700 hours of robot data and analyzes their capabilities across 1,800 real-world evaluation rollouts and more than 47,000 simulated rollouts.

The findings include:

  • Compared to policies trained from scratch, LBMs deliver consistent performance improvements.
  • In challenging settings, LBMs learn new tasks with 3–5× less data and are robust to a range of environmental perturbations.
  • Performance improves steadily as pretraining data grows.

Even with only hundreds of hours of diverse data and a few hundred demonstrations per behavior, performance improves significantly. Pretraining delivers steady, consistent gains at a scale far below "internet-scale" robot data. This suggests a virtuous cycle in which data acquisition and performance can bootstrap each other.

This optimistic conclusion hints at the arrival of a general-purpose foundation model for embodied intelligence. So how far away is the robots' "GPT moment"?

LBM Architecture


The paper adopts a Diffusion Transformer architecture, conditioned on language, vision, and proprioception inputs, that outputs a sequence of 20-dimensional actions covering the next 16 time steps.

During deployment, the policy runs at 10 Hz: it predicts a 16-step action chunk, executes the first 8 steps, then replans.
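To make this concrete, here is a minimal sketch of that receding-horizon loop; the `policy` and `robot` interfaces are hypothetical stand-ins, not TRI's actual API:

```python
import time

CONTROL_HZ = 10    # control rate reported in the paper
CHUNK_LEN = 16     # actions predicted per inference call
EXECUTE_LEN = 8    # actions executed before replanning

def run_episode(policy, robot, prompt, max_steps=600):
    """Receding-horizon execution: predict 16 steps, execute 8, replan."""
    for _ in range(max_steps // EXECUTE_LEN):
        obs = robot.get_observation()        # camera images + proprioception
        chunk = policy.predict(obs, prompt)  # assumed to return a (16, 20) array
        for action in chunk[:EXECUTE_LEN]:   # only the first 8 steps are executed
            robot.apply_action(action)
            time.sleep(1.0 / CONTROL_HZ)     # hold the 10 Hz control rate
```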

These LBMs are scaled multi-task diffusion policies built on multimodal ViT (vision transformer) encoders, with AdaLN-conditioned transformer denoising heads for conditional modeling. They consume wrist- and scene-camera images, robot proprioception, and language prompts, and predict continuous 16-step action chunks (1.6 seconds at 10 Hz).
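The model code is not reproduced here, but the conditioning pattern described, an AdaLN-modulated transformer denoising head predicting noise over a (16, 20) action chunk, can be sketched in PyTorch roughly as follows. All sizes, depths, and module names are illustrative assumptions, not TRI's implementation:

```python
import torch
import torch.nn as nn

ACTION_DIM, CHUNK_LEN, D_MODEL = 20, 16, 256  # illustrative sizes

class AdaLNBlock(nn.Module):
    """Transformer block whose LayerNorm scale/shift come from the condition."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ada = nn.Linear(d_model, 4 * d_model)  # condition -> scales/shifts

    def forward(self, x, cond):
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

class DenoisingHead(nn.Module):
    """Predicts the noise on a noisy action chunk, given fused observations."""
    def __init__(self, n_blocks=4):
        super().__init__()
        self.in_proj = nn.Linear(ACTION_DIM, D_MODEL)
        self.time_emb = nn.Sequential(nn.Linear(1, D_MODEL), nn.SiLU(),
                                      nn.Linear(D_MODEL, D_MODEL))
        self.blocks = nn.ModuleList(AdaLNBlock(D_MODEL) for _ in range(n_blocks))
        self.out_proj = nn.Linear(D_MODEL, ACTION_DIM)

    def forward(self, noisy_actions, t, obs_emb):
        # noisy_actions: (B, 16, 20); t: (B, 1) diffusion time; obs_emb: (B, 256),
        # e.g. pooled ViT image tokens + proprioception + language embedding.
        cond = obs_emb + self.time_emb(t)
        x = self.in_proj(noisy_actions)
        for block in self.blocks:
            x = block(x, cond)
        return self.out_proj(x)  # predicted noise, shape (B, 16, 20)

head = DenoisingHead()
eps = head(torch.randn(2, CHUNK_LEN, ACTION_DIM), torch.rand(2, 1),
           torch.randn(2, D_MODEL))
print(eps.shape)  # torch.Size([2, 16, 20])
```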

The models are trained on a mixed dataset including:

  • 468 hours of internal dual-arm teleoperation data,
  • 45 hours of simulated teleoperation data,
  • 32 hours of Universal Manipulation Interface (UMI) data,
  • and about 1,150 hours from the Open X-Embodiment dataset.

Although simulation data makes up a small share of the mixture, its inclusion means the same LBM checkpoint can be evaluated in both simulated and real environments.
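As a toy illustration, one way to build such a mixture is to sample training sources in proportion to their hours, as sketched below; the actual mixture weights TRI used are not stated here, so proportional weighting is an assumption:

```python
import random
from collections import Counter

# Hours per source, as listed above (~1,695 hours total).
HOURS = {
    "internal_bimanual_teleop": 468,
    "simulated_teleop": 45,
    "umi": 32,
    "open_x_embodiment": 1150,
}

def sample_source(rng=random):
    """Pick a data source with probability proportional to its hours."""
    names, weights = zip(*HOURS.items())
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: draw frequencies should roughly match hours / total.
print(Counter(sample_source() for _ in range(10_000)))
```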

Experimental Details

Test Platform

The LBM was evaluated on a dual-arm workstation built around two Franka FR3 arms and up to six cameras: two wrist cameras per arm and two static scene cameras.


Evaluation of the LBM on various tasks and environments in simulation and reality.

Evaluation Metrics

The main metrics include Success Rate and Task Completion.

Success Rate is the standard measure in robotics and reflects overall task outcome, but it does not capture finer-grained performance differences. To address this, the paper introduces Task Completion, which quantifies progress through intermediate milestones using human-annotated rubrics and automated predicates.
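A minimal sketch of the Task Completion idea, scoring a rollout by the fraction of milestones its predicates confirm, might look like the following; the milestone names, predicates, and sequential-ordering rule are all invented for illustration, not the paper's actual rubrics:

```python
def task_completion(rollout, milestones):
    """Fraction of milestones achieved, checked in order (an assumption here)."""
    achieved = 0
    for name, predicate in milestones:
        if not predicate(rollout):
            break          # treat later milestones as depending on earlier ones
        achieved += 1
    return achieved / len(milestones)

# Hypothetical rubric for a table-setting task:
milestones = [
    ("grasped_plate",  lambda r: r["plate_in_gripper"]),
    ("plate_on_table", lambda r: r["plate_at_goal"]),
    ("fork_placed",    lambda r: r["fork_at_goal"]),
]
rollout = {"plate_in_gripper": True, "plate_at_goal": True, "fork_at_goal": False}
print(task_completion(rollout, milestones))  # 0.666... (2 of 3 milestones)
```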

While absolute success rates are reported, the authors stress that relative success rates are more meaningful for comparing methods, since absolute numbers depend heavily on task design and initial conditions.

Tasks were deliberately designed to be challenging, targeting success rates around 50%, which maximizes the experiments' power to distinguish between methods, though realized rates naturally vary.
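This is also why statistical confidence matters: with tens of rollouts per policy, raw success rates are noisy. A standard Wilson score interval (one common choice, not necessarily the paper's exact procedure) makes that uncertainty explicit:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Two policies at 26/50 vs. 18/50 successes: the intervals overlap heavily,
# so this gap alone would not separate the methods with confidence.
print(wilson_interval(26, 50))  # ~(0.385, 0.652)
print(wilson_interval(18, 50))  # ~(0.241, 0.499)
```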

Experimental Results

LBM Performance on “Seen” Tasks


In comparisons on real and simulated "seen" tasks, fine-tuned LBMs outperform single-task baselines and are more robust to distribution shifts. Even without fine-tuning, LBMs achieve non-zero success rates approaching those of the baselines.

LBM Performance on “Unseen” Tasks

For complex "unseen" tasks, the study compares fine-tuned LBMs with single-task baselines. Fine-tuned LBMs outperform the baselines while requiring less task-specific data: with under 30% of the task-specific data, they match the performance of baselines trained on the full set, indicating efficient data utilization.


Evaluation of LBMs in simulated and real environments under normal and distribution-shift conditions, showing significant performance gains even with limited data.

Overall, the research demonstrates that large diffusion-based LBMs trained on diverse datasets generalize well and improve steadily with more data, suggesting that a new era of embodied AI, and perhaps its "GPT moment," is drawing closer.
