ICML 2025 | Breaking Residual Connection Bottlenecks: Caiyun Tech & Beijing University of Posts and Telecommunications Present MUDDFormer for Transformer Evolution
Caiyun Tech and Beijing University of Posts and Telecommunications introduce MUDDFormer, a novel architecture that enhances Transformer performance by overcoming residual connection limitations, showcased at ICML 2025.


The first author is Xiao Da, an Associate Professor at Beijing University of Posts and Telecommunications and Chief Scientist at Caiyun Tech. The other authors are algorithm researchers Meng Qingye and Li Shengping, and CEO Yuan Xingyuan.
Residual connections, introduced by He et al. with ResNet in 2015, have become a cornerstone of deep learning and of Transformer-based LLMs. In today's deep Transformers, however, the residual stream itself limits how efficiently information can flow across layers.
Caiyun Tech and Beijing University of Posts and Telecommunications recently proposed a simple yet effective alternative: Multiway Dynamic Dense (MUDD) connection, significantly improving cross-layer information transfer in Transformers.

- Paper Title: MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
- Code Repository: https://github.com/Caiyun-AI/MUDDFormer
- Models:
Large-scale pretraining experiments show that, with only 0.23% more parameters and 0.4% more computation, the 2.8B-parameter MUDDPythia matches the performance of 6.9B and 12B models in 0-shot and 5-shot evaluations, respectively, demonstrating marked improvements in core capabilities such as in-context learning.
This work follows the DCFormer project from ICML 2024 and has been accepted at ICML 2025, with open access to the paper, code, and model weights.
Background
In a Transformer, the residual stream aggregates information across layers and feeds multiple downstream information pathways: the query, key, and value inputs of the attention modules and the residual input of the FFN. While residual connections make it possible to train very deep networks, they also have limitations:
- Representation Collapse: Deep models often suffer from similar hidden states across layers, reducing the effectiveness of scaling parameters and computation.
- Residual Overload: In very deep models, residual streams can become overloaded, hindering the formation of complex reasoning circuits necessary for advanced in-context learning.
To address these issues, MUDD dynamically constructs cross-layer connections based on current hidden states, alleviating representation collapse and reducing interference among multiple information streams, thus greatly expanding information bandwidth while maintaining high efficiency.
Core Architecture
As shown in Figure 1a, DenseNet densely connects each block to all previous blocks. Pagliardini et al. recently applied this idea to Transformers with DenseFormer (NeurIPS 2024), shown in Figure 1b. DenseFormer combines previous layer outputs with static learned weights, but because those weights are shared across all tokens, its expressiveness is limited.
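For intuition, here is a minimal PyTorch sketch of such static depth-weighted averaging; the class and parameter names (`StaticDepthAverage`, `alpha`) are illustrative assumptions, not DenseFormer's released code.

```python
import torch
import torch.nn as nn

class StaticDepthAverage(nn.Module):
    """Combine the outputs of all previous blocks with learned scalar weights."""
    def __init__(self, depth_so_far: int):
        super().__init__()
        # One learnable scalar per past state (embedding + block outputs), shared by all tokens.
        alpha = torch.zeros(depth_so_far + 1)
        alpha[-1] = 1.0  # start out equivalent to "pass through only the latest output"
        self.alpha = nn.Parameter(alpha)

    def forward(self, past_states):
        # past_states: list of [batch, seq, dim] tensors, one per earlier state
        stacked = torch.stack(past_states, dim=0)   # [depth+1, batch, seq, dim]
        weights = self.alpha.view(-1, 1, 1, 1)      # static: identical for every token
        return (weights * stacked).sum(dim=0)       # [batch, seq, dim]
```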

Researchers proposed DynamicDenseFormer (Figure 1c), which introduces a Depth-wise Aggregate module for each token, where the connection weights are dynamically generated based on the current hidden state, enabling adaptive, token-specific cross-layer information extraction. This resembles a form of depth-wise attention, akin to token-wise attention in the original Transformer.
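A minimal sketch of this token-wise dynamic aggregation is shown below, assuming a simple linear projection generates the per-token weights; names such as `DynamicDepthAggregate` and `to_weights` are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class DynamicDepthAggregate(nn.Module):
    """Mix previous block outputs with per-token weights derived from the current state."""
    def __init__(self, dim: int, depth_so_far: int):
        super().__init__()
        # Each token's current hidden state predicts one weight per past state.
        self.to_weights = nn.Linear(dim, depth_so_far + 1)

    def forward(self, current, past_states):
        # current: [batch, seq, dim]; past_states: list of [batch, seq, dim]
        stacked = torch.stack(past_states, dim=0)    # [depth+1, batch, seq, dim]
        w = self.to_weights(current)                 # [batch, seq, depth+1]
        w = w.permute(2, 0, 1).unsqueeze(-1)         # [depth+1, batch, seq, 1]
        return (w * stacked).sum(dim=0)              # token-specific cross-layer mixture
```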

Building on DynamicDenseFormer, the MUDDFormer (Figure 1d) decouples the multiway connections into four separate dynamic pathways for Q, K, V, and residual stream R, each with its own cross-layer aggregation module. This allows each information stream to adaptively extract relevant features, effectively implementing a deep, multi-head attention mechanism across layers.
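The decoupling can be sketched as one aggregation module per stream, reusing the `DynamicDepthAggregate` class from the previous sketch; again, this is an illustrative reconstruction rather than the released MUDDFormer code.

```python
import torch.nn as nn

class MultiwayDynamicDense(nn.Module):
    """One dynamic cross-layer aggregation per stream: Q, K, V, and the residual R."""
    def __init__(self, dim: int, depth_so_far: int):
        super().__init__()
        # Reuses DynamicDepthAggregate from the previous sketch.
        self.agg = nn.ModuleDict({
            stream: DynamicDepthAggregate(dim, depth_so_far)
            for stream in ("q", "k", "v", "r")
        })

    def forward(self, current, past_states):
        # Each stream adaptively reads its own mixture of all earlier layer outputs;
        # the enclosing block would feed "q", "k", "v" to attention and "r" to the residual path.
        return {name: agg(current, past_states) for name, agg in self.agg.items()}
```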

Experimental Evaluation
- Scaling Law

On the Pile dataset, MUDDFormer outperforms the baseline models at every scale, with the gains growing as model size increases. Notably, the 834M MUDDFormer matches a Transformer++ baseline roughly 1.89x its size, a substantial efficiency gain.
Further experiments with deeper models (up to 42 layers) show that deep MUDDFormer maintains performance gains, significantly alleviating the depth-related bottleneck observed in standard models.
- Downstream Tasks
When trained on 300B tokens and compared against models ranging from 1.4B to 12B parameters, MUDDPythia achieves higher accuracy in both zero-shot and few-shot settings, and its advantage grows with model size, indicating stronger in-context reasoning.
Figures 4 and 5 illustrate these improvements, with MUDDFormer models showing superior accuracy and efficiency over traditional Transformer baselines.

Attention analysis (Figure 6) reveals that, unlike standard models where many attention heads become inactive in deeper layers, MUDDFormer maintains high activation rates across layers, indicating more effective utilization of attention mechanisms.
Conclusion
MUDDFormer introduces a simple yet powerful modification to residual connections, establishing independent dynamic cross-layer pathways for Q, K, V, and residual streams. This significantly enhances inter-layer communication and contextual reasoning, with minimal overhead, promising to be a foundational architecture for future large models.
References
[1] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. CVPR 2016.
[2] Xiao, D., et al. Improving transformers with dynamically composable multi-head attention. ICML 2024.
[3] Liu, L., et al. Understanding the difficulty of training transformers. EMNLP 2020.
[4] Gromov, A., et al. The unreasonable ineffectiveness of the deeper layers. arXiv 2024.
[5] Huang, G., et al. Densely connected convolutional networks. CVPR 2017.
[6] Pagliardini, M., et al. DenseFormer: Enhancing information flow in transformers via depth weighted averaging. NeurIPS 2024.
[7] Vaswani, A., et al. Attention is all you need. NeurIPS 2017.
[8] Zhu, D., et al. Hyper-connections. ICLR 2025.
[9] Xiao, G., et al. Efficient streaming language models with attention sinks. ICLR 2024.