Alibaba's Tongyi Open-Sources Audio Generation Model Capable of Imitating Sound Engineers for Video Dubbing

Alibaba Tongyi has open-sourced its first audio generation model, ThinkSound, which applies Chain-of-Thought reasoning to produce high-fidelity, synchronized audio like a professional sound engineer.

On July 1st, Alibaba Tongyi Laboratory announced the open-source release of its first audio generation model, ThinkSound. The model is the first to incorporate Chain-of-Thought (CoT) reasoning into audio generation, enabling the AI to think step by step like a professional sound engineer, capture visual details, and generate high-fidelity audio synchronized with the video. The code and model are currently available for free download and experimentation on GitHub, HuggingFace, and the ModelScope community.
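For anyone who wants to try the release locally, the sketch below shows one way to fetch the published weights with the `huggingface_hub` client; the repository id is a placeholder assumption, so check the official GitHub, HuggingFace, or ModelScope pages for the actual location.

```python
# Minimal download sketch using the huggingface_hub client.
# NOTE: the repo_id below is a placeholder assumption, not the confirmed
# location of the ThinkSound release; consult the official pages.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/ThinkSound",           # placeholder, replace with the real repo id
    local_dir="./thinksound_checkpoint",  # where to store the downloaded files
)
print("Model files downloaded to:", local_dir)
```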

Alibaba open-sources audio generation model ThinkSound

Video-to-Audio (V2A) technology is one of the most important technologies in multimedia editing and content creation, but it still faces significant technical challenges. Existing V2A methods lack a deep understanding of visual-acoustic detail and often produce generic sounds that are misaligned with key visual events, failing to meet the strict temporal and semantic coherence required in professional creative scenarios.

To address these issues, the Tongyi team was the first to introduce Chain-of-Thought reasoning into multimodal large models, allowing the AI to mimic a sound engineer's multi-stage creative process. This enables precise modeling of the deep connection between visual events and their corresponding sounds: analyzing visual dynamics, inferring acoustic properties, and synthesizing environment-appropriate sound effects in chronological order. The team also built AudioCoT, the first audio dataset with Chain-of-Thought annotations, which integrates 2,531.8 hours of multi-source heterogeneous data and helps the model understand not just what is happening but why.
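As a rough illustration of this multi-stage idea only (the type and function names below are assumptions made for explanation, not the actual ThinkSound API), the chain of "analyze visual dynamics → infer acoustic properties → synthesize in chronological order" could be organized like this:

```python
from dataclasses import dataclass

# Illustrative sketch of a multi-stage V2A chain; every name here is an
# assumption for explanation, not the ThinkSound API.

@dataclass
class VisualEvent:
    start_s: float      # when the event becomes visible
    end_s: float
    description: str    # e.g. "glass shatters on a tile floor"

@dataclass
class AcousticPlan:
    event: VisualEvent
    timbre: str         # e.g. "sharp, bright transient with short decay"
    loudness_db: float

def analyze_visual_dynamics(video_path: str) -> list[VisualEvent]:
    """Stage 1: detect and time-stamp the visually salient events."""
    return []           # placeholder: a real system would run a video encoder here

def infer_acoustic_properties(events: list[VisualEvent]) -> list[AcousticPlan]:
    """Stage 2: reason (CoT-style) about how each event should sound."""
    return [AcousticPlan(e, timbre="unknown", loudness_db=-20.0) for e in events]

def synthesize_in_order(plans: list[AcousticPlan]) -> bytes:
    """Stage 3: render each planned sound and mix them in chronological order."""
    return b""          # placeholder: a real system would call an audio decoder

def video_to_audio(video_path: str) -> bytes:
    events = sorted(analyze_visual_dynamics(video_path), key=lambda e: e.start_s)
    return synthesize_in_order(infer_acoustic_properties(events))
```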

On the open-source VGGSound test set, ThinkSound outperformed mainstream methods such as MMAudio, V2A-Mapper, and V-AURA, improving key metrics by more than 15%. For example, for Fréchet Distance (FD) in the OpenL3 embedding space, ThinkSound lowered the FD from 43.26 (MMAudio) to 34.56, an improvement of over 20% and a distribution closer to that of real audio. It also scored 1.52 and 1.32 on the KL-divergence metrics for sound-event classification and feature discrimination respectively, the best results among the compared models.
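For reference, the FD number above is the standard Fréchet distance between Gaussians fitted to embeddings of reference and generated audio (OpenL3 embeddings in this benchmark). A minimal computation, assuming the embeddings have already been extracted, looks like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    real_emb, gen_emb: arrays of shape (num_clips, embedding_dim),
    e.g. OpenL3 embeddings of reference and generated audio clips.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the covariance product; small imaginary parts
    # from numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```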

ThinkSound ranks first in multiple core metrics on the open-source VGGSound test set

On the MovieGen Audio Bench, ThinkSound significantly outperformed Meta's MovieGen Audio, demonstrating its potential for film sound effects, audio post-production, and sound generation for games and virtual reality.

ThinkSound surpasses Meta’s MovieGen Audio by a large margin

Tongyi Laboratory has also released large models for speech synthesis (CosyVoice) and multimodal audio understanding (MinMo), covering scenarios such as speech synthesis, audio generation, and audio comprehension.

Open-source links:
