ICML 2025 | CoTo: Making LoRA Training Progressively Better with Model Fusion and Pruning
CoTo, a simple progressive training strategy accepted at ICML 2025, enhances LoRA's robustness in model fusion, pruning, and multi-task learning, advancing parameter-efficient fine-tuning.


Researchers from City University of Hong Kong and Southern University of Science and Technology propose CoTo, a simple progressive training strategy for LoRA that addresses training instability, poor model fusion, and performance drops after pruning. The work was accepted at ICML 2025.

- Paper Title: Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation
- Link: https://openreview.net/forum?id=Zha2m39ZoM
- Code Repository: https://github.com/zwebzone/coto
- Poster: https://icml.cc/virtual/2025/poster/44836
Limitations of Standard LoRA Training
Parameter-efficient fine-tuning techniques like LoRA face issues such as lazy training, layer imbalance, and difficulty in model fusion and pruning, limiting their effectiveness in multi-task scenarios.
CoTo Strategy: Gradually Improving LoRA
CoTo introduces a progressive activation approach: early in training, adapters are randomly deactivated with high probability, which encourages balanced gradient flow across layers; the activation probability then increases linearly until all adapters participate for the remainder of training.
This "gradual activation" enhances layer balance, improves linear mode connectivity (LMC), and stabilizes dropout, facilitating model fusion and pruning.
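The schedule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a probability that ramps linearly from 0 to 1 over the first three quarters of training (matching the t < 3T/4 ramp shown in Figure 1) and an independent Bernoulli keep/drop decision per layer; the function names are ours.

```python
import random

def activation_prob(t, T, ramp_frac=0.75):
    """Linear ramp: p(t) rises from 0 to 1 over the first ramp_frac of
    training steps, then stays at 1 (all adapters active)."""
    ramp_end = ramp_frac * T
    return min(1.0, t / ramp_end)

def sample_adapter_mask(num_layers, p, rng=random):
    """Independently keep each layer's LoRA adapter with probability p."""
    return [rng.random() < p for _ in range(num_layers)]
```

Because adapters that are dropped early cannot free-ride on other layers' updates, each layer's adapter must learn to contribute on its own, which is what counteracts the layer imbalance described above.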

Figure 1: CoTo progressive activation schematic. Early training (t < 3T/4) involves random deactivation of adapters (gray), with activation probability p(t) increasing linearly; later, all adapters are active.
Experimental Results
CoTo significantly improves model fusion and pruning robustness, boosting single-task generalization and training efficiency. It maintains high performance across various tasks and models.
Enhanced Model Fusion
- Linear interpolation accuracy: In reasoning and image classification, standard LoRA performance drops sharply at the midpoint (λ=0.5). CoTo models show smooth, high-quality performance along the entire interpolation path, demonstrating superior linear mode connectivity.
- Multi-task fusion: On GLUE, CoTo-trained LoRA modules outperform baseline models in multi-task merging, whether using LLaMA-2 (7B, 13B) or DeBERTa-v3.
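The interpolation experiment above can be described concretely. As a hedged sketch (helper names are ours, and real adapters would be merged per layer): each LoRA adapter contributes a low-rank weight update delta = scaling * B @ A, and linear mode connectivity is probed by evaluating the model at convex combinations of two such updates.

```python
import numpy as np

def merge_lora(A, B, scaling=1.0):
    """Merged weight update of one LoRA adapter: delta = scaling * B @ A,
    with A of shape (r, d_in) and B of shape (d_out, r)."""
    return scaling * (B @ A)

def interpolate(delta_a, delta_b, lam):
    """Point on the linear path between two fine-tuned updates:
    delta(lam) = (1 - lam) * delta_a + lam * delta_b."""
    return (1.0 - lam) * delta_a + lam * delta_b
```

A flat, high accuracy curve as lam sweeps from 0 to 1 (rather than a dip at lam = 0.5) is what the paper reports for CoTo-trained adapters.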

Figure 2: Accuracy of linear interpolation in reasoning tasks.

Figure 3: Accuracy of linear interpolation in image classification tasks.

Figure 4: Multi-task LoRA fusion accuracy with LLaMA-2 models.
More Robust Pruning
CoTo training inherently improves fault tolerance to pruning. It outperforms standard LoRA in structured pruning (removing the low-rank adapters of entire layers) and unstructured sparsification, maintaining performance even at high sparsity levels.
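For concreteness, unstructured pruning of a merged LoRA update is typically magnitude-based. The sketch below (our own illustration, with a hypothetical function name) zeroes the smallest-magnitude entries until a target sparsity is reached; structured pruning would instead drop whole adapters, much like the masks CoTo samples during training.

```python
import numpy as np

def magnitude_prune(delta, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude
    entries of a merged LoRA update (unstructured pruning)."""
    flat = np.abs(delta).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return delta.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(delta) <= threshold, 0.0, delta)
```

The intuition for CoTo's robustness is that adapters are already trained under random deactivation, so the model has learned not to depend on any single adapter surviving.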

Figure 5: Comparison of structured and unstructured pruning.
Performance and Efficiency Gains
- Stronger performance: Across vision, reasoning, and math benchmarks, CoTo consistently improves LoRA and its variants such as DoRA and HiRA.
- Faster training: Skipping deactivated adapters early in training reduces cost; for example, applying CoTo to HiRA speeds up training by over 24%.

Figure 6: Performance improvements in reasoning tasks with different LoRA variants and training strategies.
Ablation Studies
Extensive experiments validate the effectiveness of CoTo’s design choices, including training ratio, activation curve, nested dropout, learning rate, and LoRA rank, providing insights into regularizing LoRA.
The implementation is simple: an existing LoRA training workflow can be adapted to CoTo in three steps, and the code repository is available for experimentation.
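To show where the stochastic mask enters a training loop, here is a hedged sketch of a single layer's forward pass; it is not the repository's code, and the names (`lora_forward`, `active`) are ours. The LoRA branch contributes only when the sampled mask marks this layer's adapter as active for the current step.

```python
import numpy as np

def lora_forward(x, W, A, B, active, scaling=1.0):
    """One linear layer with a CoTo-masked LoRA branch:
    y = x @ W.T + scaling * (x @ A.T) @ B.T, the second term
    added only when this layer's adapter is active this step."""
    out = x @ W.T
    if active:
        out = out + scaling * (x @ A.T) @ B.T
    return out
```

In a full loop, one would sample the per-layer mask once per step from the linearly increasing activation probability, so that late in training every adapter is always active.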
Summary
CoTo’s progressive training strategy effectively addresses layer imbalance and lazy optimization in LoRA, enhancing model fusion, pruning, and robustness. It’s a plug-and-play approach with broad applicability, inspiring future research in parameter-efficient fine-tuning.