Cutting-Edge Speech Generation Technology: PresentAgent Converts Text into Presentation Videos
PresentAgent, developed by a collaborative team, can transform long documents into synchronized presentation videos with voice explanations, showcasing advanced multimodal AI capabilities.


This project is a joint effort by AI Geeks, the Australian Institute for Artificial Intelligence, the University of Liverpool, and La Trobe University.
We introduce PresentAgent, a multimodal intelligent system capable of converting lengthy documents into narrated presentation videos. Unlike existing methods that are limited to static slides or text summaries, our approach generates highly synchronized visual content and speech, realistically simulating human-style presentations.

- Paper Title: PresentAgent: Multimodal Agent for Presentation Video Generation
- Paper Link: https://arxiv.org/abs/2507.04036
- Code Repository: https://github.com/AIGeeksGroup/PresentAgent
To realize this integrated process, PresentAgent adopts a modular workflow, as shown in Figure 1, consisting of four steps:
1. Systematically segmenting the document;
2. Planning and rendering slide-style visual frames;
3. Generating contextual narration with large language models and text-to-speech systems;
4. Precisely aligning audio and visual content to produce the complete video.
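The sketch below illustrates how these four stages might be wired together. Every function name and the naive heuristics inside (heading-based segmentation, a fixed words-per-minute estimate for narration length) are our own illustrative assumptions, not PresentAgent's actual implementation, which delegates these steps to LLMs, a slide renderer, and a TTS engine.

```python
# Minimal sketch of the four-stage workflow (names and heuristics are
# illustrative assumptions, not the actual PresentAgent implementation).
from dataclasses import dataclass


@dataclass
class Slide:
    title: str
    bullets: list[str]


@dataclass
class NarratedSlide:
    slide: Slide
    script: str        # narration text for this slide
    duration_s: float  # audio length, used for audio-visual alignment


def segment_document(doc: str) -> list[tuple[str, str]]:
    """Step 1: split a long document into (heading, body) sections.
    Here: a naive split on markdown-style '#' headings."""
    sections, title, body = [], "Introduction", []
    for line in doc.splitlines():
        if line.startswith("#"):
            if body:
                sections.append((title, "\n".join(body)))
            title, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        sections.append((title, "\n".join(body)))
    return sections


def plan_slide(title: str, body: str) -> Slide:
    """Step 2: turn a section into a slide outline (an LLM call in practice)."""
    bullets = [s.strip() for s in body.split(".") if s.strip()][:3]
    return Slide(title=title, bullets=bullets)


def generate_narration(slide: Slide) -> str:
    """Step 3a: produce a spoken-style script (an LLM call in practice)."""
    return f"Let's look at {slide.title}. " + ". ".join(slide.bullets) + "."


def estimate_duration(script: str) -> float:
    """Step 3b: a TTS engine would return real audio; here we estimate
    duration from a ~150 words-per-minute speaking rate."""
    return len(script.split()) / 150 * 60


def compose_presentation(doc: str) -> list[NarratedSlide]:
    """Step 4: pair each slide with its narration and duration, so a
    video renderer can hold every frame for exactly its audio length."""
    narrated = []
    for title, body in segment_document(doc):
        slide = plan_slide(title, body)
        script = generate_narration(slide)
        narrated.append(NarratedSlide(slide, script, estimate_duration(script)))
    return narrated


if __name__ == "__main__":
    demo = "# Parallel Workflow\nAgents run subtasks concurrently. Results are merged."
    for item in compose_presentation(demo):
        print(f"{item.slide.title}: {item.duration_s:.1f}s of narration")
```

The point the sketch preserves is that alignment (step 4) only needs per-slide narration durations, so each visual frame can be held on screen for exactly as long as its audio plays.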

Figure 1: Overview of PresentAgent. The system takes documents (e.g., web pages) as input and follows steps including document processing, structured slide generation, synchronized subtitles, and speech synthesis to output a combined presentation video. The highlighted purple parts indicate key intermediate outputs during the process.
Given the difficulty in evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework driven by vision-language models, which scores across three key dimensions: Content Fidelity, Visual Clarity, and Audience Comprehension.
All scoring is prompt-based: vision-language models judge the generated videos directly from fixed evaluation prompts. We tested on a curated dataset of 30 document-video pairs covering education, product descriptions, research summaries, and policy briefs. Results show PresentAgent approaches human performance on all metrics and surpasses it on some.
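To make "prompt-based" concrete, here is one plausible form such a scoring prompt could take. The wording below is our paraphrase of the three PresentEval dimensions; the paper's exact prompt may differ.

```python
import json

# A paraphrase of PresentEval's three dimensions as a scoring prompt;
# the exact wording used in the paper may differ.
SCORING_PROMPT = """You are shown frames and audio from a presentation video
generated from a source document. Rate each dimension from 1 to 5:

1. Content Fidelity: does the narration faithfully reflect the source document?
2. Visual Clarity: are the slides legible, well laid out, and uncluttered?
3. Audience Comprehension: could a first-time viewer follow and retain the content?

Reply as JSON, e.g. {"content_fidelity": 4, "visual_clarity": 5, "audience_comprehension": 4}
"""


def parse_scores(model_reply: str) -> dict[str, int]:
    """Parse the judge model's JSON reply into a dimension -> score map."""
    return json.loads(model_reply)
```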
These results demonstrate the enormous potential of controllable multimodal agents to transform static texts into dynamic, accessible, and effective presentations.
Our main contributions include:
- Introducing a new task: First to propose "Document to Presentation Video Generation," aiming to automatically produce structured slide videos with narration from long texts.
- Designing PresentAgent: A modular framework covering document parsing, layout-aware slide creation, script generation, and audiovisual synchronization, enabling controllable and interpretable video synthesis.
- Developing PresentEval: A multi-dimensional evaluation framework driven by vision-language models, assessing content accuracy, visual quality, and comprehension.
- Building a high-quality dataset: Created a dataset of 30 document-video pairs. Experiments and ablations show PresentAgent performs close to or better than human benchmarks.
Presentation Benchmark

To support document-to-video generation evaluation, we built the Doc2Present Benchmark, a multi-domain, multi-style dataset with paired documents and videos. Unlike previous benchmarks focusing only on summaries or slides, our dataset includes:
- Business reports
- Product manuals
- Policy briefs
- Educational tutorials
Each document is paired with a manually produced video explanation, as shown in Figure 2.
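For concreteness, one benchmark entry might be represented as below. This record layout is our assumption for illustration, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Doc2PresentPair:
    """Hypothetical record layout for one Doc2Present benchmark entry."""
    domain: str         # e.g. "business report", "educational tutorial"
    document_path: str  # the long-form source document
    video_path: str     # the manually produced reference presentation video
    quiz: list[dict] = field(default_factory=list)  # multiple-choice questions
                                                    # for the objective track
```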

We evaluate all generated and human-produced videos with the PresentEval framework. Because current models cannot reliably judge long multimodal videos in one pass, we adopt a segmented, two-track evaluation (sketched after this list):
- Objective assessment: Qwen-VL-2.5-3B answers multiple-choice comprehension questions about the video content;
- Subjective scoring: Qwen-Omni-7B scores extracted video and audio segments on content quality, visual and audio design, and clarity.
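Below is a sketch of how this two-track protocol could be driven, reusing the SCORING_PROMPT string from the earlier sketch. Here `ask_vlm` stands in for an inference call to Qwen-VL-2.5-3B or Qwen-Omni-7B; its name and signature are our assumptions, not a real API.

```python
import json
from statistics import mean


def objective_accuracy(segments, questions, ask_vlm) -> float:
    """Objective track: the judge VLM answers one multiple-choice question
    per video segment; report the fraction answered correctly."""
    correct = sum(
        ask_vlm(seg, q["prompt"]).strip().upper().startswith(q["gold"])
        for seg, q in zip(segments, questions)
    )
    return correct / len(questions)


def subjective_scores(segments, ask_vlm) -> dict[str, float]:
    """Subjective track: rate each extracted segment on the three
    PresentEval dimensions (1-5) and average across segments."""
    dims = ("content_fidelity", "visual_clarity", "audience_comprehension")
    replies = [json.loads(ask_vlm(seg, SCORING_PROMPT)) for seg in segments]
    return {d: mean(r[d] for r in replies) for d in dims}
```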
Main Results

On quiz accuracy, most PresentAgent variants match or outperform the human benchmark (0.56). Claude-3.7-sonnet achieved the highest accuracy at 0.64, indicating strong content consistency, while models such as Qwen-VL-Max and Gemini-2.5-flash scored around 0.52, leaving room for improvement in factual alignment.
On subjective quality, human demonstrations still lead overall. However, some PresentAgent variants came close: GPT-4o-Mini reached near-top scores (~4.8) in content and visual appeal, Claude-3.7-sonnet delivered well-balanced audio quality (~4.53), and Gemini-2.5-flash excelled in visual quality (5.0) but scored lower on comprehension, illustrating the design trade-offs involved. These results underline the effectiveness of our modular generation process and the practical value of PresentEval in capturing multiple quality dimensions.
Case Studies

Figure 5 shows a complete example of a presentation video automatically generated by PresentAgent from a technical blog post. The system identifies structural segments (introduction, technical explanation, and so on), then produces narration and spoken-style subtitles synchronized with the slides, covering topics such as "Parallel Workflow" and "Agent System Architecture". The example demonstrates PresentAgent's ability to deliver information accurately, clearly, and conversationally.