Stream-Omni: Supporting Multi-Modal Interactions with a Unified Text-Visual-Speech Large Model
Stream-Omni is a versatile multimodal large model that supports combined text, visual, and speech interactions, enabling flexible and efficient multimodal communication.


GPT-4o-style multimodal large models (LMMs) demonstrate comprehensive capabilities across the text, vision, and speech modalities. GPT-4o's online speech service can also provide intermediate text transcripts during speech interaction, offering users a flexible “see and hear” experience. Constructing multimodal large models that support text, vision, and speech has therefore become a recent research focus. Existing multimodal models typically use separate encoders to extract representations for each modality, concatenate these representations along the sequence dimension, and feed them into a large language model (LLM) to generate responses. While this concatenation-based approach simplifies modality integration, it relies heavily on large-scale data for modality alignment and lacks flexibility: unlike GPT-4o, such models cannot produce intermediate transcripts during speech interaction.

To address this challenge, the Natural Language Processing team at the Institute of Computing Technology, Chinese Academy of Sciences, proposed Stream-Omni, a multimodal large model that supports flexible text-vision-speech interactions. By modeling the relationships between modalities in a more targeted way, Stream-Omni achieves more efficient and flexible modality alignment: with only 23,000 hours of speech data, it supports interactions grounded in text, speech, and vision. Its speech modeling design also allows it to output intermediate text transcripts during speech interaction, similar to GPT-4o, offering users a more comprehensive multimodal experience.

- Paper: https://arxiv.org/abs/2506.13642
- Open-source code: https://github.com/ictnlp/Stream-Omni
- Model download: https://huggingface.co/ICTNLP/stream-omni-8b
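For a quick start, the released checkpoint can be fetched with the standard huggingface_hub client. The snippet below is a minimal sketch; the local directory name is an arbitrary choice, and the setup instructions in the official repository remain the authoritative guide.

```python
# Minimal sketch: download the Stream-Omni checkpoint from the Hugging Face Hub.
# The local_dir value is an illustrative choice, not mandated by the project.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ICTNLP/stream-omni-8b",
    local_dir="./stream-omni-8b",
)
```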
Modality Alignment in Stream-Omni

Existing multimodal models concatenate the representations of different modalities along the sequence dimension to achieve modality alignment. To reduce reliance on large-scale data, Stream-Omni models the relationships between modalities in a more targeted way, as illustrated by the sketches that follow:
- Visual-text alignment: sequence-dimension concatenation of visual and text representations.
- Speech-text alignment: hierarchical mapping from speech to text.
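For the visual branch, sequence-dimension concatenation simply places projected visual tokens and text token embeddings in one sequence before it enters the LLM. The PyTorch sketch below illustrates the idea with made-up tensor sizes and a hypothetical linear projector; it is not Stream-Omni's actual module.

```python
# Conceptual sketch of visual-text alignment via sequence-dimension concatenation.
# All sizes and the linear projector are illustrative, not Stream-Omni's real components.
import torch
import torch.nn as nn

batch, num_patches, vis_dim, llm_dim = 2, 256, 1024, 4096

visual_features = torch.randn(batch, num_patches, vis_dim)  # output of a visual encoder
text_embeddings = torch.randn(batch, 32, llm_dim)           # output of the LLM embedding layer

projector = nn.Linear(vis_dim, llm_dim)    # maps visual features into the LLM's hidden space
visual_tokens = projector(visual_features)

# Concatenate along the sequence dimension and feed the combined sequence to the LLM.
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 288, 4096])
```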
Stream-Omni uses the LLM as its core, adds speech layers at the bottom and the top of the LLM, and employs Connectionist Temporal Classification (CTC) to model the speech-to-text mapping. This design lets the speech layers handle external speech interaction while the LLM controls the generated content internally, enables the transfer of the LLM's text capabilities to the speech modality with minimal data, and makes it possible to output intermediate text transcripts during speech interaction.
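The speech-text mapping can be pictured as a standard CTC objective: frame-level representations produced by the bottom speech layers are trained to align with the text token sequence, which is what later makes an intermediate transcript recoverable. The snippet below is only a conceptual illustration with arbitrary sizes, not the paper's training code.

```python
# Conceptual sketch of CTC-based speech-to-text alignment (arbitrary sizes, not training code).
import torch
import torch.nn as nn

vocab_size, blank_id = 32000, 0
frames, batch, text_len = 120, 2, 20

# Frame-level log-probabilities over the text vocabulary, e.g. from the bottom speech layers.
speech_log_probs = torch.randn(frames, batch, vocab_size, requires_grad=True).log_softmax(dim=-1)
text_targets = torch.randint(1, vocab_size, (batch, text_len))

ctc_loss = nn.CTCLoss(blank=blank_id)
loss = ctc_loss(
    speech_log_probs,
    text_targets,
    torch.full((batch,), frames),    # input (frame) lengths
    torch.full((batch,), text_len),  # target (text) lengths
)
loss.backward()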

Stream-Omni thus aligns the visual and speech modalities with text efficiently: visual representations are extracted by a visual encoder and concatenated with the text representations, while speech is discretized into units by the CosyVoice tokenizer, mapped to the text space via CTC, and used both to generate the text response and to produce the synchronized speech output.
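Because the speech frames are aligned to text with CTC, an intermediate transcript can be read off by greedy CTC decoding, i.e. collapsing repeated frame predictions and dropping blanks, while the speech is still being processed. The following self-contained sketch shows only that collapse step, with illustrative token IDs rather than Stream-Omni's real outputs.

```python
# Minimal sketch of greedy CTC collapsing used to recover an intermediate transcript.
# Token IDs are illustrative; 0 is assumed to be the CTC blank symbol.
from itertools import groupby

def ctc_collapse(frame_predictions, blank_id=0):
    """Remove consecutive duplicates, then drop blank tokens."""
    deduped = [token for token, _ in groupby(frame_predictions)]
    return [token for token in deduped if token != blank_id]

frame_predictions = [0, 7, 7, 0, 0, 15, 15, 15, 0, 42, 0]
print(ctc_collapse(frame_predictions))  # [7, 15, 42]
```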
By combining the visual encoder, the speech layers, and the LLM as needed, Stream-Omni supports flexible combinations of modalities during interaction, and it outputs intermediate text transcripts during speech-to-speech generation, providing a comprehensive multimodal experience. More examples are available at https://github.com/ictnlp/Stream-Omni.
Experimental Results
Visual Understanding

Stream-Omni achieves visual understanding performance comparable to vision-oriented large models of similar scale and training data volume.
Speech Interaction

On factual speech-interaction tasks, Stream-Omni outperforms existing methods, thanks to its hierarchical speech-text mapping, which transfers the LLM's text capabilities to the speech modality.
Vision-based Speech Interaction

When the same instruction is issued via text or via speech, Stream-Omni produces more consistent responses across the two modalities, thanks to its precise semantic alignment between speech and text. More demos are available at https://github.com/ictnlp/Stream-Omni.
Summary
- Stream-Omni is a GPT-4o-like multimodal model supporting combined text, vision, and speech interactions.
- It can output intermediate text during speech interactions for a more comprehensive experience.
- Stream-Omni focuses on modality alignment; it still has limitations in the human-like expressiveness and voice diversity of its generated speech.