ICCV 2025 | One Image Is All You Need: Multimodal Instruction Data Synthesis, Leave the Rest to Oasis


In recent years, multimodal instruction data synthesis has relied heavily on manually designed prompts, consuming significant human effort and time. In text data synthesis, MAGPIE [1] is a highly successful method that requires no user-provided prompts: it feeds only special tokens such as <|im_start|> to the model to complete data synthesis.
Inspired by MAGPIE, researchers from Tongji University, ByteDance, and the University of Edinburgh proposed Oasis, a new multimodal instruction data synthesis approach. Users only need to provide images (fed to the VLM via the <image> token), and Oasis automatically completes instruction synthesis, quality control, and reply generation, producing high-quality data.
To support further research, the authors released an open-source codebase, MM-INF, which includes Oasis and other common multimodal data synthesis methods and is continuously updated and maintained. Everyone is welcome to try it out and provide feedback.

- Paper link: https://arxiv.org/abs/2503.08741
- Code link: https://github.com/Letian2003/MM_INF
- Dataset link: https://huggingface.co/datasets/WonderThyme/Oasis
Research Motivation

As shown above, common data synthesis pipelines and Oasis differ significantly. Typical pipelines face three main issues:
- Lack of Diversity: Relying on fixed prompts and synthesis processes limits coverage and complexity, leading to homogeneous data.
- Low Quality: Existing methods struggle to produce high-quality, representative multimodal data, often resorting to secondary caption-based strategies with subpar results.
- Manual Dependence: The process is complex and costly, requiring extensive manual design of data patterns and prompts, making data synthesis inefficient and cumbersome.
The proposed Oasis method generates data solely from images, breaking the dependency on preset text prompts (<image> + [text prompt]). It leverages the knowledge and autoregressive capabilities of powerful MLLMs, which autonomously generate diverse, image-related instructions without manual prompts. Researchers analyzed attributes of high-quality instructions and designed filtering standards to automatically exclude low-quality data.
Method Overview
Summary

The method involves three main steps:
- Constructing a “hooking prompt” to induce autoregressive sampling;
- Classifying sampling results, keeping only instruction-following data;
- Performing instruction quality control and reply generation.
Using Qwen2-VL as an example, the process is detailed below.
Step 1: Constructing the Hooking Prompt
For image description, a typical MLLM input is <|im_start|>User <image>Describe the image.<|im_end|> <|im_start|>Assistant. Once the model sees that the current role is Assistant, it generates a response that follows the instruction.
We extract the prefix <|im_start|>User <image> from the full input and feed it into the VLM for sampling. Because no textual instruction is provided, the generated continuation is unbiased and conditioned only on the image itself.
The sampled outputs fall roughly into two types: instruction-following data and captions. This behavior is explained by the interleaved image-text data used to train multimodal models.
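As a concrete illustration, here is a minimal sketch of the hooking-prompt sampling step using Hugging Face transformers and the Qwen2-VL-7B-Instruct checkpoint; the exact special tokens, checkpoint name, and sampling settings are assumptions and may differ from the paper's implementation.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint for illustration
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")

# "Hooking prompt": only the user-turn prefix plus the image placeholder,
# with no textual instruction, so the continuation is conditioned on the image alone.
# (Qwen2-VL's template spells the role lowercase and wraps the image in vision tokens.)
prompt = "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, do_sample=True, max_new_tokens=256)
continuation = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(continuation)  # either an instruction-like query or a plain caption
```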
Step 2: Classifying Results
To keep only instruction data, a large language model (LLM) is used as a classifier. If the sampled text contains an instruction, the sample is classified as instruction-following and the instruction is extracted; otherwise it is treated as caption data and discarded. Few-shot prompting improves classification accuracy; the prompt templates are provided in the paper's appendix.
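A minimal sketch of this classification step is shown below; the few-shot prompt wording and the chat callable are illustrative assumptions rather than the paper's actual template (which is given in its appendix).

```python
FEW_SHOT_CLASSIFIER_PROMPT = """You are given text produced by a vision-language model after seeing an image.
Decide whether it contains an instruction or question about the image.

Text: "The image shows a golden retriever lying on a wooden floor."
Answer: CAPTION

Text: "What safety hazards can you identify in this construction site photo?"
Answer: INSTRUCTION | What safety hazards can you identify in this construction site photo?

Text: "{sample}"
Answer:"""


def classify_sample(sample: str, chat) -> str | None:
    """Return the extracted instruction, or None if the sample is caption-style.

    `chat` is any callable that sends a prompt to a text-only LLM and returns its reply.
    """
    reply = chat(FEW_SHOT_CLASSIFIER_PROMPT.format(sample=sample)).strip()
    if reply.startswith("INSTRUCTION"):
        return reply.split("|", 1)[-1].strip()  # keep only the extracted instruction
    return None  # caption-style samples are discarded
```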
Filtered instruction data undergoes quality control and reply generation.
Step 3: Quality Control & Reply Generation
Instructions are evaluated on four dimensions (solvability, clarity, hallucination, and nonsense), each rated from 1 (poor) to 5 (excellent). Instructions that pass this check are fed to Qwen2-VL to generate replies, forming instruction-reply training pairs:
- Solvability: Whether the image provides enough info to answer fully.
- Clarity: How well the question’s intent is conveyed.
- Hallucination: Consistency between question and image content.
- Nonsense: Grammatical and semantic coherence.
Details are in the appendix. Ablation studies show that reply quality control has limited effect; focusing on instruction quality is sufficient.
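The following sketch shows how such a quality gate could be wired up; the rating prompt, JSON output format, and the threshold of 4 are assumptions for illustration, not the paper's exact setup.

```python
import json

RATING_PROMPT = """Rate the following instruction about the given image from 1 (poor) to 5 (excellent)
on four dimensions: solvability, clarity, hallucination, nonsense.
Return JSON, e.g. {{"solvability": 5, "clarity": 4, "hallucination": 5, "nonsense": 5}}.

Instruction: {instruction}"""


def passes_quality_control(instruction: str, rate, threshold: int = 4) -> bool:
    """`rate` is any callable that sends the prompt (together with the image) to a scorer model."""
    scores = json.loads(rate(RATING_PROMPT.format(instruction=instruction)))
    return all(
        scores[dim] >= threshold
        for dim in ("solvability", "clarity", "hallucination", "nonsense")
    )
```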
Oasis-500k
Using images from Cambrian-10M [2], Oasis synthesizes about 500k training samples, called Oasis-500k. Since Oasis depends only on images, scaling is straightforward: the amount of synthesized data grows linearly with the number of images processed.
Data Characteristics Analysis
The researchers compared Oasis data with the open-source LLaVA-NeXT data on attributes such as instruction and reply length, language diversity, and verb-noun combinations.
Instruction & Reply Length

Oasis data generally has longer instructions and replies than LLaVA-NeXT, with higher variability, indicating richer and more diverse information.
Language Diversity

Oasis’s autoregressive process avoids explicit language bias, resulting in broad language coverage, including English (78.52%), Chinese (18.66%), Korean, Norwegian, Vietnamese, French, German, and others.
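How such a distribution might be tallied is sketched below; the paper does not specify its language detector, so the langdetect package here is an assumption.

```python
from collections import Counter

from langdetect import detect  # pip install langdetect


def language_distribution(instructions):
    """Return the percentage of instructions detected per language code."""
    counts = Counter()
    for text in instructions:
        try:
            counts[detect(text)] += 1
        except Exception:  # detection can fail on very short or empty strings
            counts["unknown"] += 1
    total = sum(counts.values()) or 1
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}
```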
Verb-Noun Combinations

Analysis using spaCy shows that the root verb-noun combinations in Oasis data are more natural and expressive, with richer semantic diversity than those in LLaVA-NeXT, which relies heavily on "answer question" patterns.
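A minimal sketch of this kind of verb-noun analysis with spaCy is shown below; the exact dependency labels and pipeline used in the paper's analysis are assumptions.

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline


def top_verb_noun_pairs(instructions, k=20):
    """Count (root verb, direct object) pairs across a list of instructions."""
    counts = Counter()
    for doc in nlp.pipe(instructions):
        for token in doc:
            if token.dep_ == "ROOT" and token.pos_ == "VERB":
                objects = [c.lemma_ for c in token.children if c.dep_ in ("dobj", "obj")]
                if objects:
                    counts[(token.lemma_, objects[0])] += 1
    return counts.most_common(k)
```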
Sample Data

Oasis’s sample data demonstrates strong instruction generation capabilities, producing detailed, high-density instructions based on image themes across various tasks like object recognition, scene description, and code comprehension.
Experimental Results

Oasis Effectiveness
Compared to the LLaVA-NeXT baseline, adding Oasis as incremental training data significantly improves performance across 14 benchmarks, with average gains of 3.1%, 1.8%, and 3.2% on Vicuna1.5, Qwen2.5, and Llama3, respectively.
On OCR and document-analysis tasks in particular, improvements of 2.7%, 2.1%, 4.3%, and 6.3% are observed, confirming Oasis's role in enhancing model generalization.
Comparison with Other Synthesis Methods
Additional data sources such as MMEvol [3] and DenseFusion-1M [4] were tested as incremental data under the same setting; Oasis data proved superior in both diversity and downstream effectiveness.
Data Scaling Experiments
Scaling the amount of Oasis data from 0 to 500k yields consistent performance improvements, with a 5.2% gain after adding all 500k samples, demonstrating strong scalability.
Vertical Data Synthesis Capabilities

Oasis's ability to synthesize domain-specific data is validated on OCR tasks, where it produces high-quality, context-aware data that significantly improves OCR-related benchmarks.
Ablation Studies

Filtering out low-quality data based on multiple criteria improves overall model performance, especially in tasks like DocVQA and InfoVQA, confirming the importance of data quality control.
Response Quality Control
Two methods were tested: rejection sampling based on negative log-likelihood and scoring with a multimodal LLM. Both resulted in slight performance drops, indicating that high-quality instructions alone suffice for effective training.
Open-Source Codebase MM-INF

- Code link: https://github.com/Letian2003/MM_INF
The project also open-sources a data synthesis engine built on ms-swift [6] that can chain multiple LLM/VLM-based steps, such as image captioning followed by description-based QA, to support diverse multimodal data generation. Feedback is welcome.
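As a rough illustration of the chaining idea (not MM-INF's actual API), the sketch below wires an image-captioning step into a description-based QA step; both model callables are hypothetical placeholders.

```python
def run_caption_qa_chain(image_path, caption_model, qa_model):
    """Chain two synthesis steps: image -> caption -> QA pair derived from the caption.

    `caption_model` and `qa_model` are hypothetical callables wrapping a VLM and an LLM.
    """
    caption = caption_model(image_path, "Describe the image in detail.")
    qa = qa_model(
        "Write one question and its answer based on this description:\n" + caption
    )
    return {"image": image_path, "caption": caption, "qa": qa}
```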
References
[1] Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, ICLR 2025.
[2] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, NeurIPS 2024.
[3] MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct, arXiv:2409.05840.
[4] DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception, NeurIPS 2024.
[5] SimPO: Simple preference optimization with a reference-free reward, NeurIPS 2024.
[6] https://github.com/modelscope/ms-swift