Fully Exploiting Modal Collaboration: MokA Crafts a New Paradigm for Multimodal LLM Fine-Tuning
MokA introduces a novel, efficient multimodal fine-tuning approach that balances unimodal and cross-modal learning, boosting performance across diverse tasks and models.


The first author, Wei Yake, is a fourth-year PhD student at Renmin University of China, focusing on multimodal learning mechanisms and large multimodal models, supervised by Associate Professor Hu Di. The authors are affiliated with Renmin University and Shanghai Artificial Intelligence Laboratory.
In recent years, multimodal large language models (MLLMs) have made significant progress on tasks such as vision-language and audio-language understanding. However, mainstream fine-tuning methods are often transferred directly from text-only large language models (LLMs), with LoRA as a typical example. Is this "copy-paste" approach truly suitable for multimodal models?
Research teams from GeWu-Lab at Renmin University of China and the Shanghai AI Laboratory offer a new perspective in their latest paper. They point out that most current MLLM fine-tuning schemes simply port unimodal strategies to multimodal scenarios without deeply considering the characteristics of multimodal learning. In multimodal settings, independently modeling unimodal information (Unimodal Adaptation) and modeling inter-modal interactions (Cross-modal Adaptation) are equally important, yet existing paradigms often neglect both, limiting how fully unimodal information and cross-modal interactions can be exploited.
To address this, the team proposes MokA (Multimodal Low-Rank Adaptation), which accounts for both unimodal independence and cross-modal interaction within a parameter-efficient fine-tuning framework. Experiments cover audio-visual-text, vision-text, and speech-text scenarios on mainstream LLM backbones such as LLaMA and Qwen, and show significant performance improvements across multiple benchmarks.

- Paper Title: MokA: Multimodal Low-Rank Adaptation for MLLMs
- Paper Link: https://arxiv.org/abs/2506.05191
- Project Homepage: https://gewu-lab.github.io/MokA
Figure: Performance improvements across multiple backbones and scenarios.

Overlooked Modality Characteristics
The team highlights a key limitation of current efficient multimodal fine-tuning methods: they directly borrow designs from unimodal LLMs. For example, LoRA uses the same learnable parameters for all modalities, as shown in the formula below:

$$h_i = W_0 x_i + \Delta W x_i = W_0 x_i + B A x_i,$$

where $x_i$ represents the input of the $i$-th modality, $W_0$ is the frozen pretrained weight, and $\Delta W = BA$ is the shared low-rank update.
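For concreteness, here is a minimal PyTorch sketch of this shared-parameter setup (module and variable names are ours, purely illustrative): a single pair of low-rank matrices A and B is trained and applied to tokens of every modality.

```python
# Minimal sketch of vanilla LoRA applied uniformly to multimodal inputs
# (illustrative only; names and shapes are not from the paper).
import torch
import torch.nn as nn


class SharedLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # pretrained weight W0
        self.base.weight.requires_grad_(False)           # kept frozen
        self.A = nn.Linear(d_in, rank, bias=False)       # shared across modalities
        self.B = nn.Linear(rank, d_out, bias=False)      # shared across modalities
        nn.init.zeros_(self.B.weight)                    # update starts at zero

    def forward(self, x_i: torch.Tensor) -> torch.Tensor:
        # x_i: (batch, seq, d_in) tokens of the i-th modality (audio, visual, or text);
        # the same low-rank update B(A(x_i)) is applied regardless of modality.
        return self.base(x_i) + self.B(self.A(x_i))
```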

In real-world scenarios, different modalities are heterogeneous. Directly applying unimodal methods ignores these differences, making it hard for the model to fully exploit the information in each modality. The team therefore argues that effective multimodal fine-tuning must include both unimodal modeling (Unimodal Adaptation) and cross-modal interaction modeling (Cross-modal Adaptation).

Concretely, this requires modality-specific parameters so that unimodal adaptation proceeds without cross-modal interference, together with cross-modal parameters that explicitly model the interactions between modalities.
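One way to write down this requirement, purely as a sketch in our own notation (the paper's exact formulation may differ), is to split the adaptation for modality $i$ into a unimodal term and a cross-modal term:

$$\Delta h_i = \underbrace{f_{\theta_i}(x_i)}_{\text{unimodal adaptation}} + \underbrace{g_{\phi}\big(x_i, \{x_j\}_{j \neq i}\big)}_{\text{cross-modal adaptation}},$$

where $\theta_i$ are parameters reserved for modality $i$ and $\phi$ are parameters that couple the modalities.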
MokA: A Multimodal Fine-Tuning Method Focused on Modal Features
Based on this idea, the team introduces MokA, which balances unimodal independence and inter-modal interaction.

MokA inherits the core idea of LoRA and retains its high efficiency, but redefines the roles of the A and B projection matrices for multimodal scenarios through three key components: modality-specific A matrices, cross-modal attention, and a shared B matrix.
- Modality-specific A matrices: each modality gets its own A matrix, so compressing one modality's information does not interfere with the others; this preserves modality independence and is crucial for unimodal modeling.
- Cross-modal attention: this module explicitly enhances interaction between modalities. During instruction tuning, the text modality typically carries the specific question or task while the other modalities provide the context; the module models the relationships between text and non-text tokens to strengthen task and scene relevance.
- Shared B matrix: finally, all modalities are projected into a common space via a shared low-rank B matrix, enabling implicit cross-modal alignment through shared parameters.

Combining these three components gives MokA's formal expression, which ensures that both unimodal features and inter-modal interactions are effectively modeled in multimodal scenarios.
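To make the three components concrete, below is a rough PyTorch sketch of how such a layer could be wired up. It is our own illustration under simplifying assumptions (single-head attention, text tokens as queries over the other modalities, a residual update in the rank-r space), not the authors' implementation or exact formulation.

```python
# Illustrative sketch of MokA's three components: modality-specific A matrices,
# cross-modal attention between text and non-text tokens, and a shared B matrix.
# Names, shapes, and the attention details are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MokALinearSketch(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int,
                 modalities=("audio", "visual", "text")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # pretrained weight W0
        self.base.weight.requires_grad_(False)           # kept frozen
        # Modality-specific A matrices: each modality is compressed independently.
        self.A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False) for m in modalities})
        # Shared B matrix: one low-rank projection back to the output space.
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)                    # update starts at zero

    def forward(self, tokens: dict) -> dict:
        # tokens: {modality: (batch, seq_m, d_in)}
        low_rank = {m: self.A[m](x) for m, x in tokens.items()}      # unimodal adaptation

        # Cross-modal attention (simplified, single head): text tokens query the
        # non-text tokens in the rank-r space, so the task stated in the text is
        # grounded in the accompanying audio/visual context.
        q = low_rank["text"]
        kv = torch.cat([z for m, z in low_rank.items() if m != "text"], dim=1)
        attn = F.softmax(q @ kv.transpose(1, 2) / kv.shape[-1] ** 0.5, dim=-1)
        low_rank["text"] = q + attn @ kv                              # cross-modal adaptation

        # Shared B projects every modality into a common space and adds the update.
        return {m: self.base(tokens[m]) + self.B(z) for m, z in low_rank.items()}
```

A quick usage check with random tensors (hypothetical shapes):

```python
layer = MokALinearSketch(d_in=64, d_out=64, rank=8)
tokens = {"audio": torch.randn(2, 5, 64),
          "visual": torch.randn(2, 7, 64),
          "text": torch.randn(2, 9, 64)}
outputs = layer(tokens)   # per-modality outputs with shape (2, seq_m, 64)
```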
Experimental Results
Evaluations on three representative multimodal settings (audio-visual-text, vision-text, and speech-text) show that MokA delivers significant improvements across multiple benchmarks on popular LLM backbones such as LLaMA and Qwen, demonstrating strong generalization and effectiveness.

Table 1: Experimental results on audio-visual-text.

Table 2: Experimental results on vision-text.

Table 3: Experimental results on speech-text.
Summary
In conclusion, MokA is an efficient multimodal fine-tuning method that balances unimodal feature modeling with inter-modal interaction, overcoming the limitations of directly transferring unimodal strategies. It retains LoRA's efficiency while explicitly modeling modality-specific features and cross-modal interactions, achieves significant performance gains across a range of tasks and models, and points to a new direction for multimodal large-model fine-tuning.