Google DeepMind's New Architecture MoR Achieves Double Inference Speed, Potentially a Transformer Killer
DeepMind's MoR architecture, built on recursive Transformers, delivers up to twice the inference throughput and surpasses previous SOTA models such as DeepSeek 671B, marking a major advance in efficient large language models.


Large language models (LLMs) have demonstrated remarkable capabilities, but their deployment faces significant challenges due to high computational and memory costs. As model parameters grow into hundreds of billions, training and inference become prohibitively expensive, hindering practical application.
This has driven researchers to explore two main directions for efficiency: parameter sharing, which improves parameter efficiency by reusing weights, and adaptive computation, which dynamically allocates resources based on input complexity.
All of this is built on the Transformer architecture. Recently, Google has once again evolved this foundation, releasing a new LLM architecture called Mixture-of-Recursions (MoR). Some scholars believe this new architecture “has the potential to be a Transformer killer”.

MoR introduces a unified approach that combines parameter sharing and adaptive computation. Unlike previous methods that treat these separately, MoR integrates dynamic token-level routing into a parameter-efficient recursive Transformer, creating a collaborative architecture that achieves “model quality without the massive cost.”

- Paper title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
- Link to paper: https://alphaxiv.org/abs/2507.10524
This research proposes Mixture-of-Recursions (MoR), a unified framework designed to fully exploit the potential of recursive Transformers. MoR trains lightweight routers end-to-end, assigning each token a recursion depth based on how much "thinking" it requires, so computational resources are allocated precisely where they are needed.
This token-based dynamic recursion naturally supports recursive key-value (KV) caching, selectively storing and retrieving KV pairs according to each token’s assigned depth. This targeted caching significantly reduces memory bandwidth pressure, boosting inference throughput without post-processing.
MoR simultaneously achieves three optimizations within a unified architecture:
1. Parameter sharing: Significantly reduces parameter count via weight tying;
2. Computation routing: Avoids redundant calculations by routing tokens through different paths;
3. Recursive KV caching: On-demand storage of KV pairs to lower memory access costs and improve throughput.
This makes MoR a crucial evolution of recursive Transformer architectures, balancing performance, efficiency, and resource utilization.

MoR Overview
The left diagram shows each recursive step composed of a fixed stack of layers and a router. The router decides whether each token continues through the recursive block or exits early. The gray boxes represent the recursive modules.
The middle diagram illustrates the full model structure. For each token, the model applies the shared recursive module up to Nr times based on routing decisions, enabling dynamic recursion depth.
The right diagram shows an example of token-level routing, with color-coded cells indicating the number of recursive steps each token undergoes. Darker colors mean more recursion, with different colors representing 1, 2, or 3 recursions.
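To make the recursion loop in the middle diagram concrete, here is a minimal, hypothetical PyTorch-style sketch (not the authors' implementation): a single shared block is applied up to Nr times, and a lightweight router decides after each step which tokens keep recursing. Names such as `MoRSketch` and the 0.5 continuation threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch: one parameter-shared block reused up to n_rec times,
    with a lightweight per-token router deciding whether to keep recursing."""

    def __init__(self, d_model: int, n_heads: int, n_rec: int):
        super().__init__()
        self.n_rec = n_rec
        # A single shared block stands in for the recursive module (gray box).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Router: one scalar "keep recursing" score per token.
        self.router = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); every token starts out active.
        active = torch.ones(h.shape[:2], dtype=torch.bool, device=h.device)
        for _ in range(self.n_rec):
            out = self.shared_block(h)
            # Tokens that exited earlier keep their last hidden state.
            h = torch.where(active.unsqueeze(-1), out, h)
            # Router decides which tokens continue to the next recursion.
            keep = torch.sigmoid(self.router(h)).squeeze(-1) > 0.5
            active = active & keep
            if not active.any():
                break
        return h

# Example: up to three recursions over a toy batch.
model = MoRSketch(d_model=256, n_heads=4, n_rec=3)
out = model(torch.randn(2, 16, 256))   # -> (2, 16, 256)
```

In the actual model the router is trained end-to-end together with the shared block, so the continuation decision is learned rather than fixed by a hand-set threshold.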
Core Architecture and Methods
Recursive Transformer foundation: MoR is based on recursive Transformers, sharing weights across layers instead of maintaining unique parameters per layer.
Parameter sharing modes: Includes loop (reusing recursive blocks repeatedly), sequence (reusing the same layer sequentially), and hybrid (combining shared middle layers with unique first and last layers). The hybrid approach balances efficiency and expressiveness.
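As a rough illustration of these tying schemes (the indexing below is an assumption, not the paper's code), the following helper maps each of the model's layer positions to a shared parameter block:

```python
def layer_to_block(n_layers: int, n_rec: int, mode: str) -> list[int]:
    """Map each of n_layers layer positions to a shared parameter-block index.

    "loop"     : the shared stack is applied in order, then repeated n_rec times.
    "sequence" : each shared block is applied n_rec times in a row.
    "hybrid"   : unique first and last layers, loop-style sharing in the middle.
    """
    if mode == "loop":
        n_unique = n_layers // n_rec
        return [i % n_unique for i in range(n_layers)]
    if mode == "sequence":
        return [i // n_rec for i in range(n_layers)]
    if mode == "hybrid":
        n_unique_mid = (n_layers - 2) // n_rec
        middle = [1 + (i % n_unique_mid) for i in range(n_layers - 2)]
        return [0] + middle + [1 + n_unique_mid]
    raise ValueError(f"unknown sharing mode: {mode}")

# 6 layer positions, 2 recursions:
# loop     -> [0, 1, 2, 0, 1, 2]
# sequence -> [0, 0, 1, 1, 2, 2]
# hybrid   -> [0, 1, 2, 1, 2, 3]   (untied first/last, shared middle cycled twice)
```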
Dynamic routing mechanism
The key innovation is a dynamic routing system that allocates different recursion depths to tokens according to their computational needs. Two main routing strategies are used, sketched in code after the list:
1. Expert-choice routing: At each recursive step r, a router G evaluates each token's hidden state and selects the top-k tokens for further recursion. This forms a gradually narrowing computation funnel in which complex tokens receive deeper processing while simpler ones exit early. To address causality issues during training, techniques such as an auxiliary router and an auxiliary loss are employed.
2. Token-choice routing: Each token makes a one-time routing decision at the start, determining its full recursive depth. This satisfies causality but requires load balancing techniques like balancing loss and router bias.
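A minimal sketch of the two decision rules, with assumed function names and thresholds rather than the paper's code: expert-choice keeps only the top-k scoring tokens at every recursion step, while token-choice commits each token to a full recursion depth up front.

```python
import torch

def expert_choice_step(scores: torch.Tensor, active: torch.Tensor, k: int) -> torch.Tensor:
    """One recursion step of expert-choice routing: among the currently active
    tokens, keep only the top-k router scores (a gradually narrowing funnel)."""
    masked = scores.masked_fill(~active, float("-inf"))   # scores, active: (batch, seq)
    topk = masked.topk(k, dim=-1).indices
    keep = torch.zeros_like(active)
    keep.scatter_(-1, topk, True)
    return active & keep                                   # updated active mask

def token_choice_depths(logits: torch.Tensor) -> torch.Tensor:
    """Token-choice routing: one router call over n_rec 'depth experts' per token;
    each token commits to its arg-max depth before recursion begins."""
    # logits: (batch, seq, n_rec) raw router scores
    return logits.argmax(dim=-1) + 1                       # depths in 1..n_rec
```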
KV cache optimization
MoR introduces two specialized KV cache strategies for dynamic-depth models:
Recursive KV cache: Caches KV pairs separately at each recursion depth, and only for the tokens still active at that depth; attention is restricted to these locally cached entries.
Shared recursive KV: All tokens traverse the first recursive block, whose KV pairs are cached and reused in subsequent recursion steps.
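The contrast between the two caching schemes can be sketched roughly as follows; the projections and dictionary layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def build_recursive_cache(h: torch.Tensor,
                          active_per_depth: list[torch.Tensor],
                          k_proj: nn.Linear, v_proj: nn.Linear) -> dict:
    """Recursion-wise KV cache: at each depth r, store K/V only for the tokens
    still active there, so deeper levels hold ever-smaller caches."""
    cache = {}
    for r, active in enumerate(active_per_depth):   # active: (batch, seq) bool mask
        h_r = h[active]                             # (num_active_at_r, d_model)
        cache[r] = (k_proj(h_r), v_proj(h_r))
    return cache

def build_shared_cache(h: torch.Tensor, n_rec: int,
                       k_proj: nn.Linear, v_proj: nn.Linear) -> dict:
    """Shared recursive KV cache: all tokens pass through the first recursion,
    whose K/V entries are reused at every later depth without recomputation."""
    k, v = k_proj(h), v_proj(h)                     # computed once, for all tokens
    return {r: (k, v) for r in range(n_rec)}
```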

Experimental Results
MoR outperforms baseline models with fewer parameters at the same compute budget
At an equivalent training cost (~16.5×10¹⁸ FLOPs), MoR with expert-choice routing and a recursion depth of two (Nr = 2) achieved lower validation loss and higher few-shot accuracy (43.1% vs. 42.3%). Despite nearly halving the parameter count, MoR outperformed the baseline thanks to higher computational efficiency, which lets it process more training tokens within the same FLOPs budget.
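A back-of-the-envelope check of the "nearly halving parameters" figure, under the simplifying (and assumed) model that non-embedding parameters scale with the number of unique layers:

```python
# Rough parameter arithmetic: tying 24 applied layers into 12 unique blocks
# reused twice (Nr = 2) roughly halves the non-embedding parameter count,
# while per-token FLOPs (which track the number of *applied* layers) stay similar.
def non_embedding_params(n_unique_layers: int, d_model: int) -> float:
    per_layer = 12 * d_model ** 2   # common rough estimate for attention + MLP
    return n_unique_layers * per_layer

vanilla = non_embedding_params(n_unique_layers=24, d_model=1024)
mor_nr2 = non_embedding_params(n_unique_layers=12, d_model=1024)
print(f"vanilla: {vanilla/1e6:.0f}M  MoR (Nr=2): {mor_nr2/1e6:.0f}M  ratio: {mor_nr2/vanilla:.2f}")
# vanilla: 302M  MoR (Nr=2): 151M  ratio: 0.50
```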

Comparison of MoR, the recursive Transformer, and the vanilla Transformer under a fixed FLOPs budget (16.5×10¹⁸) and a fixed token count (20B).
MoR achieves better performance with less computation
In a separate evaluation with fixed token count (20B), MoR (Nr=2) uses 25% fewer FLOPs than the baseline while maintaining lower validation loss and higher accuracy.
This efficiency translates into 19% shorter training time and 25% less peak memory usage, thanks to hierarchical filtering and recursive attention strategies that shorten sequence length and improve compute-accuracy trade-offs.
Impact of routing and caching strategies
Variants of MoR with different routing and caching strategies were tested. The results show that token-choice routing yields lower accuracy (~40%) than expert-choice routing (~42.6%), indicating that the routing strategy significantly affects performance.
Shared KV caching slightly reduces performance but greatly improves memory efficiency, making it suitable for deployment scenarios with strict resource constraints.

Scalability across model sizes
Performance comparisons across model sizes (135M, 360M, 730M, and 1.7B parameters) under various compute budgets show that MoR matches or outperforms the baseline models at all but the smallest scale, demonstrating its scalability and efficiency.
Despite a slight performance gap at the smallest scale, MoR surpasses the vanilla Transformer as model size increases, proving it to be a scalable and parameter-efficient alternative suitable for both pretraining and deployment.
Inference throughput gains via continuous depth-wise batching
As shown in Figure 4a, all MoR variants outperform the vanilla Transformer baseline in inference throughput. Deeper recursion allows more tokens to exit early, reducing KV cache usage and significantly boosting speed; MoR-4, for example, achieves up to a 2.06× throughput increase in large-batch settings.
While this introduces slight performance trade-offs, the substantial speed gains make it a valuable approach for real-world deployment scenarios.
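The continuous depth-wise batching idea behind these throughput numbers can be sketched as a toy scheduler (names and structure are assumptions, not the paper's serving code): slots freed by tokens that exit early are immediately refilled with new work instead of idling until the deepest tokens in the batch finish.

```python
from collections import deque

def continuous_depth_batching(requests, batch_size, step_fn, depth_of):
    """Toy scheduler: keep the batch full by refilling slots as soon as an item
    finishes its assigned recursion depth, rather than waiting for the whole batch."""
    queue = deque(requests)      # pending items
    in_flight = {}               # slot -> (item, remaining recursion steps)
    finished = []
    while queue or in_flight:
        # Refill any free slots from the queue.
        for slot in range(batch_size):
            if slot not in in_flight and queue:
                item = queue.popleft()
                in_flight[slot] = (item, depth_of(item))
        # One shared-block step advances every in-flight item by one recursion.
        step_fn([item for item, _ in in_flight.values()])
        for slot in list(in_flight):
            item, remaining = in_flight[slot]
            if remaining <= 1:
                finished.append(item)            # early exit frees the slot immediately
                del in_flight[slot]
            else:
                in_flight[slot] = (item, remaining - 1)
    return finished

# Example with router-assigned depths 1-3 and a no-op compute step.
done = continuous_depth_batching(list("abcdef"), batch_size=2,
                                 step_fn=lambda batch: None,
                                 depth_of=lambda item: 1 + (ord(item) % 3))
```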

Figure 4a: Pareto frontier comparison of inference throughput and log-likelihood for MoR and baseline models under fixed batch size and maximum batch size.
Figure 4b: Impact of parameter sharing strategies on recursive Transformer performance, evaluated on 100 billion tokens. Red dashed line: original Transformer; black dashed line: baseline with one-third parameters.
Figure 4c: Performance comparison of different architectures with and without KV cache sharing. Green: no recursive KV cache; blue: recursive KV sharing; MoR-E: expert routing; MoR-T: token routing. Ablation results are detailed in the original paper.
Summary and Outlook
MoR demonstrates that combining parameter efficiency with adaptive computation offers a new paradigm for scalable, high-performance LLMs. It maintains model quality while significantly reducing compute needs, addressing fundamental scalability challenges in language modeling.
Most importantly, MoR provides a foundation for developing more cognitively inspired AI systems. Its ability to adaptively allocate “thinking depth” per token aligns with emerging research on reasoning and internal thought processes in language models. This suggests MoR could serve as a platform for exploring how models learn to think more deeply while maintaining efficiency in routine tasks.
While MoR clearly boosts inference speed and reduces KV cache, whether it can truly be called a “Transformer killer” remains subject to debate.


For more details, please refer to the original paper.