Fully Exploiting Modal Collaboration: MokA Crafts a New Paradigm for Multimodal LLM Fine-Tuning

MokA introduces a novel, efficient multimodal fine-tuning approach that balances unimodal and cross-modal learning, boosting performance across diverse tasks and models.


The first author, Wei Yake, is a fourth-year PhD student at Renmin University of China, focusing on multimodal learning mechanisms and large multimodal models, supervised by Associate Professor Hu Di. The authors are affiliated with Renmin University and Shanghai Artificial Intelligence Laboratory.

In recent years, multimodal large language models (MLLMs) have made significant progress on vision-language and audio-language tasks. However, mainstream fine-tuning methods often transfer strategies such as LoRA directly from text-only large language models (LLMs). Is this "copy-paste" approach truly suitable for multimodal models?

Research teams from GeWu-Lab at Renmin University and Shanghai AI Lab propose a new perspective in their latest paper. They point out that most current MLLM fine-tuning schemes simply adapt single-modal strategies to multimodal scenarios without deep consideration of multimodal learning characteristics. In fact, in multimodal settings, independent modeling of unimodal information (Unimodal Adaptation) and modeling of inter-modal interactions (Cross-modal Adaptation) are equally important. Yet, existing paradigms often neglect these factors, limiting the full utilization of unimodal data and cross-modal interactions.

To address this, the team proposes MokA (Multimodal Low-Rank Adaptation), which emphasizes both unimodal independence and cross-modal interaction within a parameter-efficient fine-tuning framework. Experiments cover audio-visual-text, vision-text, and speech-text scenarios on mainstream LLM backbones such as LLaMA and Qwen, and results show significant performance improvements across multiple benchmarks.

Performance improvements across multiple LLM backbones and scenarios


Overlooked Modal Features

The team highlights a key limitation of current efficient multimodal fine-tuning methods: they directly borrow from unimodal LLM designs. LoRA, for example, applies the same learnable low-rank parameters to every modality:

$$h_i = W_0 x_i + \Delta W x_i = W_0 x_i + B A x_i,$$

where $x_i$ denotes the input of the $i$-th modality, $W_0$ is the frozen pretrained weight, and the low-rank matrices $A$ and $B$ are shared across all modalities.
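For contrast, here is a minimal sketch of this uniform treatment (PyTorch-style; the class name and hyperparameters are illustrative assumptions, not from the paper), in which a single A/B pair serves the tokens of every modality:

```python
import torch
import torch.nn as nn

class VanillaLoRALinear(nn.Module):
    """Standard LoRA layer: a single A/B pair shared by every modality's tokens."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)  # frozen pretrained weight W0
        self.base.weight.requires_grad_(False)
        self.A = nn.Linear(d_in, rank, bias=False)      # shared down-projection
        self.B = nn.Linear(rank, d_out, bias=False)     # shared up-projection
        nn.init.zeros_(self.B.weight)                   # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x contains audio, visual, and text tokens alike; the same A and B
        # are applied to all of them, with no notion of modality boundaries.
        return self.base(x) + self.scale * self.B(self.A(x))
```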

In real scenarios, different modalities exhibit heterogeneity. Directly applying single-modal methods ignores these differences, making it hard for models to fully leverage all modal information. Therefore, the team proposes that effective multimodal fine-tuning must include both unimodal modeling (Unimodal Adaptation) and cross-modal interaction modeling (Cross-modal Adaptation).


Concretely, this requires modality-specific parameters that adapt each modality without interference, and cross-modal parameters that capture and align interactions between modalities.

MokA: A Multimodal Fine-Tuning Method Focused on Modal Features

Based on this idea, the team introduces MokA, which balances unimodal independence and inter-modal interaction.


MokA inherits the core low-rank design of LoRA, so it stays parameter-efficient, but it redefines the roles of the A and B projection matrices for the multimodal setting. It consists of three key modules: modality-specific A matrices, cross-modal attention, and a shared B matrix (a code sketch combining the three follows the module descriptions below).

Modality-specific A matrices: Each modality has its own A matrix, so compressing one modality's information does not interfere with the others, which is crucial for unimodal modeling.

Cross-modal attention: This module explicitly enhances interaction between modalities. During instruction tuning, text often contains specific questions or tasks, while other modalities provide context. This module models the relationships between text and non-text modalities to strengthen task and scene relevance.

Shared B matrix: Finally, all modalities are projected into a common space via a single shared low-rank B matrix, so that the shared parameters provide implicit cross-modal alignment.
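To make the design concrete, here is a minimal sketch of how the three modules could fit together (PyTorch-style; the class name, attention configuration, default hyperparameters, and the choice of text tokens as queries are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class MokAStyleAdapter(nn.Module):
    """Sketch of a MokA-style adapter: modality-specific A matrices,
    cross-modal attention in the low-rank space, and one shared B matrix."""

    def __init__(self, d_model: int, rank: int = 8, num_modalities: int = 3, n_heads: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)  # frozen pretrained weight W0
        self.base.weight.requires_grad_(False)
        # One A per modality keeps each modality's compression independent.
        self.A = nn.ModuleList(
            nn.Linear(d_model, rank, bias=False) for _ in range(num_modalities)
        )
        # Cross-modal attention: text tokens query the other modalities for context.
        self.cross_attn = nn.MultiheadAttention(rank, n_heads, batch_first=True)
        # A single B projects every modality back into the common space.
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)  # update starts at zero, as in LoRA

    def forward(self, xs: list[torch.Tensor]) -> torch.Tensor:
        # xs: per-modality token tensors of shape (batch, seq_m, d_model);
        # by convention here, the last entry is the text modality.
        zs = [A_m(x_m) for A_m, x_m in zip(self.A, xs)]  # modality-specific compression
        z_text, z_context = zs[-1], torch.cat(zs[:-1], dim=1)
        # Text queries gather task-relevant context from the non-text modalities.
        attended, _ = self.cross_attn(z_text, z_context, z_context)
        zs[-1] = z_text + attended
        z = torch.cat(zs, dim=1)   # low-rank representations of all tokens
        x = torch.cat(xs, dim=1)   # original tokens in the same order
        return self.base(x) + self.B(z)  # shared B gives implicit cross-modal alignment
```

With an audio, a visual, and a text token sequence, the adapter compresses each through its own A matrix, lets the text tokens attend to the audio-visual context, and maps everything back through the shared B.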


Taken together, these components give MokA its formal expression, which models both unimodal features and inter-modal interactions in multimodal scenarios.
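Read as equations, the description above suggests an update of roughly the following form (a hedged reconstruction consistent with the sketch above; the paper's exact notation and the placement of the attention term may differ):

$$z_m = A_m x_m, \qquad \tilde{z}_t = z_t + \mathrm{CA}\big(z_t,\ \{z_m\}_{m \neq t}\big), \qquad h_m = W_0 x_m + B\,\tilde{z}_m,$$

where $A_m$ is the modality-specific down-projection, $\mathrm{CA}$ is the cross-modal attention with text tokens as queries and the other modalities as context ($\tilde{z}_m = z_m$ for non-text modalities), and $B$ is the shared up-projection.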

Experimental Results

Evaluations on three representative multimodal tasks (audio-visual-text, vision-text, and speech-text) demonstrate that MokA significantly improves performance across multiple benchmarks on popular LLM backbones such as LLaMA and Qwen, showing strong generalization and effectiveness.

Table 1: Experimental results on audio-visual-text.

Table 2: Experimental results on vision-text.

Table 3: Experimental results on speech-text.

Summary

In conclusion, MokA is an efficient multimodal fine-tuning method that balances unimodal feature modeling and inter-modal interaction, overcoming the limitations of directly transferring single-modal strategies. It maintains LoRA’s efficiency advantages while explicitly modeling modality-specific features and cross-modal interactions, achieving significant performance gains across various tasks and models, and providing a new direction for multimodal large model fine-tuning.
