Simulating Brain Functional Differentiation! Peking University and CUHK Release Fast-in-Slow VLA for Unified Collaboration of 'Fast Action' and 'Slow Reasoning'

Researchers from Peking University and CUHK introduce Fast-in-Slow VLA, a dual-system model that unifies rapid actions and deep reasoning, advancing robot control and AI cognition.

Author Introduction: This work is a joint effort by faculty and students from Peking University, CUHK, the Beijing Academy of Artificial Intelligence, and Zhihui Pingfang. The authors include PhD students Chen Hao, Liu Jiaming, Gu Chenyang, and Liu Zhuoyang, with Zhang Shanghang of Peking University's HMI Lab as corresponding author; the lab focuses on embodied intelligence and multimodal learning.

Fast Action & Slow Thinking:

Achieving high-frequency response and complex reasoning at the same time has long been a major challenge in robotics. Recently, a team from Peking University and CUHK released Fast-in-Slow (FiS-VLA), a dual-system vision-language-action model. Unlike previous fast-slow designs that require separately initialized modules, FiS-VLA embeds the fast execution module directly into a pre-trained vision-language model (VLM), integrating fast and slow processing in one network. Through heterogeneous modal inputs and an asynchronous operation strategy, it supports both high-speed action generation and deep reasoning, and it performs strongly on simulation benchmarks and real robot platforms. Notably, FiS-VLA-7B reaches control frequencies of up to 117.7 Hz, far surpassing existing solutions and demonstrating broad practical potential.

Research Background & Challenges:

Robotic control aims to generate precise control signals based on sensor inputs and language commands. While large-scale vision-language models (VLMs) have shown strong pretraining capabilities, their large size and slow inference limit real-time high-frequency control. Inspired by Kahneman’s “dual-system theory,” some approaches build dual systems: System 1 for fast, intuitive decisions, and System 2 for slow, deep reasoning. Existing methods often treat these systems separately, with limited sharing of System 2’s knowledge, leading to inefficiencies and weak semantic integration.

Fast-in-Slow VLA (How to Differentiate an Execution Module Within the Slow System):

FiS-VLA proposes an innovative architecture that repurposes the final transformer layers of the VLM as System 1's execution module, embedding it inside System 2 for unified reasoning and control. System 2 processes visual and language inputs at low frequency and generates guiding features; System 1 responds at high frequency, handling real-time perception and action generation. The model is trained with a dual-modal perception strategy: diffusion modeling strengthens System 1's action generation while System 2's semantic reasoning is retained. Pretrained on over 860,000 robot trajectories and fine-tuned for a range of tasks, it significantly improves both success rates and control frequency.
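To make the fast/slow division concrete, here is a minimal sketch of the asynchronous dual-frequency control loop described above. All names (SlowSystem2, FastSystem1, Observation, control_loop, and the default refresh ratio) are illustrative placeholders for this sketch, not the released FiS-VLA code.

```python
"""Minimal sketch of an asynchronous dual-frequency fast/slow loop.
SlowSystem2, FastSystem1, Observation, and control_loop are illustrative
placeholders, not the released FiS-VLA implementation."""
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Observation:
    image: Any          # current camera frame(s), shared by both systems
    state: Any          # robot proprioceptive state (System 1 input)
    point_cloud: Any    # 3D point cloud (System 1 input)

class SlowSystem2:
    """Stand-in for the full VLM pass (the LLaMA2-7B backbone) over image + instruction."""
    def reason(self, image: Any, instruction: str) -> Any:
        return {"features": (image, instruction)}   # placeholder for intermediate-layer features

class FastSystem1:
    """Stand-in for the reused final transformer layers plus the diffusion action head."""
    def act(self, obs: Observation, guidance: Any) -> List[float]:
        return [0.0] * 7                            # placeholder for a 7-DoF action

def control_loop(sys2: SlowSystem2, sys1: FastSystem1, observations: List[Observation],
                 instruction: str, slow_every: int = 4) -> List[List[float]]:
    """System 2 refreshes its guidance once every `slow_every` System 1 steps
    (the reported best collaboration ratio is 1:4); System 1 acts at every step
    on the current perception, conditioned on the cached guidance."""
    guidance = None
    actions = []
    for t, obs in enumerate(observations):
        if t % slow_every == 0:
            guidance = sys2.reason(obs.image, instruction)   # low-frequency reasoning
        actions.append(sys1.act(obs, guidance))              # high-frequency execution
    return actions
```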

1. Architecture Design: FiS-VLA builds on the Prismatic VLM architecture, comprising visual encoders (SigLIP and DINOv2), a lightweight 3D tokenizer for point clouds, a large language model (LLaMA2-7B, whose final transformer layers are reused as System 1), and several MLP modules for modality fusion and diffusion modeling. System 1 inherits the pretrained knowledge and executes at high frequency within System 2's semantic space, forming a "fast within slow, slow within fast" collaborative structure.

2. Dual-System Collaboration: Following Kahneman's theory, FiS-VLA comprises a slow System 2 and a fast System 1. System 2 processes visual and language inputs and distills them into high-dimensional features at the LLM's intermediate layers, which serve as conditioning signals for System 1's real-time action generation. System 1 operates at every timestep, taking in current perceptions while periodically refreshing its semantic guidance from System 2. The two systems operate asynchronously, with System 2 running at a lower frequency, while the shared conditioning keeps action generation temporally consistent. System 1's heterogeneous modal inputs include visual data, robot states, and 3D point clouds, which are fused with System 2's features to produce action outputs.

3. Dual-System Training: System 1 uses diffusion modeling with noise-injected actions for continuous control, while System 2 retains autoregressive prediction to preserve its reasoning ability. Joint training on over 860,000 trajectories, with task-specific language augmentation, improves task adaptability; a sketch of such a combined objective follows below.
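To make the co-training recipe above more concrete, here is a minimal sketch of such a combined objective: a DDPM-style noise-prediction loss for System 1's continuous action chunks, conditioned on System 2 features, robot state, and point clouds, plus a standard autoregressive cross-entropy loss for System 2. The function names, batch keys, noise schedule, and loss weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def add_noise(actions, noise, t, num_steps=100):
    """Forward diffusion q(a_t | a_0) with a linear beta schedule (the schedule is an assumption)."""
    betas = torch.linspace(1e-4, 0.02, num_steps, device=actions.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1)
    return alpha_bar.sqrt() * actions + (1.0 - alpha_bar).sqrt() * noise

def joint_training_loss(system1, system2, batch, num_steps=100, lambda_ar=1.0):
    """Combined objective sketch: DDPM-style noise prediction for System 1's
    action chunks plus autoregressive cross-entropy for System 2.
    `system1`, `system2`, the batch keys, and `lambda_ar` are illustrative assumptions."""
    # --- System 1: diffusion over noise-injected action chunks ---
    actions = batch["actions"]                                          # (B, chunk, action_dim)
    t = torch.randint(0, num_steps, (actions.shape[0],), device=actions.device)
    noise = torch.randn_like(actions)
    noisy_actions = add_noise(actions, noise, t, num_steps)
    pred_noise = system1(noisy_actions, t,
                         cond=batch["system2_features"],                # cached System 2 guidance
                         state=batch["robot_state"],                    # heterogeneous System 1 inputs
                         points=batch["point_cloud"])
    diffusion_loss = F.mse_loss(pred_noise, noise)

    # --- System 2: autoregressive token prediction to preserve reasoning ---
    logits = system2(batch["input_tokens"])                             # (B, seq, vocab)
    ar_loss = F.cross_entropy(logits.flatten(0, 1), batch["target_tokens"].flatten())

    return diffusion_loss + lambda_ar * ar_loss
```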

Accuracy, Speed, & Generalization!

1. Simulation Tests: FiS-VLA achieves an average success rate of 69% on RLBench tasks, outperforming CogACT (61%) and π0 (55%) and leading in 8 out of 10 tasks. Control frequency reaches 21.9 Hz with an action chunk size of 1, more than double that of CogACT.

2. Real Robot Tests: On physical platforms (Agilex and AlphaBot), FiS-VLA achieves success rates of 68% and 74% across 8 tasks, significantly surpassing the π0 baseline. It excels in precision-demanding tasks such as pouring, blackboard wiping, and towel folding.

3. Generalization Ability: Under unseen objects, cluttered backgrounds, and lighting variations, FiS-VLA's accuracy drop is much smaller than π0's, demonstrating the dual-system architecture's robustness to visual disturbances.

4. Ablation Studies: Performance peaks when only 2 transformer layers are shared, with the optimal input modality combination being robot states, images, and point clouds. The best system collaboration ratio is 1:4, with control frequency scaling accordingly; when predicting 8 actions per step, the control frequency reaches up to 117.7 Hz (a rough back-of-envelope on this relation follows this list). Variants of FiS-VLA with different input configurations were also tested.

5. Real-World Visualization:
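Returning to the chunk-size numbers in the ablation above: the reported figures are consistent with a simple relation in which the effective control frequency is the per-inference rate multiplied by the number of actions emitted per pass. The snippet below is only a back-of-envelope under that assumption; the implied ~14.7 Hz per-pass rate at chunk size 8 is an inference from the reported 117.7 Hz, not a figure from the paper.

```python
def effective_control_hz(per_pass_hz: float, chunk_size: int) -> float:
    """Back-of-envelope: each model pass emits `chunk_size` actions, so the
    effective control frequency is chunk_size x the per-pass inference rate.
    This relation is an assumption used for illustration."""
    return per_pass_hz * chunk_size

# Implied per-pass rate at chunk size 8 (an inference, not a reported number):
print(117.7 / 8)                        # ~14.7 Hz per inference pass
print(effective_control_hz(14.7, 8))    # ~117.6 Hz, matching the reported 117.7 Hz
print(effective_control_hz(21.9, 1))    # 21.9 Hz at chunk size 1, as reported
```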

Summary & Outlook:

FiS-VLA’s innovative integration of reasoning and control within a unified VLM achieves high-frequency, high-precision, and robust robot manipulation. Future work may include dynamic adjustment of shared structures and collaboration frequencies to further enhance adaptability and robustness in real-world tasks.
