Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Preacher: Paper-to-Video Agentic System

Created by
  • Haebom

Authors

Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang

Outline

This paper addresses the "paper-to-video" task of converting research papers into structured video summaries. Existing state-of-the-art video generation models suffer from limited context windows, fixed video duration constraints, limited style diversity, and an inability to represent domain-specific knowledge. To address these limitations, the authors propose Preacher, the first paper-to-video agentic system. Preacher uses a top-down approach to decompose, summarize, and reconstruct papers, then uses bottom-up video generation to synthesize diverse video segments into a coherent summary. To align cross-modal representations, the authors define key scenes and introduce Progressive Chain of Thought (P-CoT) for fine-grained, iterative planning. Preacher successfully generates high-quality video summaries across five research areas, demonstrating expertise beyond existing video generation models. The code will be made available at https://github.com/GenVerse/Paper2Video.
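The top-down/bottom-up pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual implementation: every function name, data structure, and the fixed scene style are hypothetical stand-ins, and the real system would call video generation models and an LLM-based P-CoT planner at the marked steps.

```python
# Hypothetical sketch of a paper-to-video agentic pipeline.
# All names here are illustrative; see the authors' repository for the real system.

from dataclasses import dataclass


@dataclass
class KeyScene:
    """One planned video segment, aligned to a part of the paper."""
    section: str
    summary: str
    style: str  # e.g. "diagram animation", "equation walkthrough"


def decompose(paper: dict) -> list:
    """Top-down step: split the paper into (section, text) units."""
    return list(paper.items())


def plan_scenes(units: list, max_rounds: int = 3) -> list:
    """Stand-in for P-CoT planning: start from a coarse plan and
    iteratively refine it into fine-grained key scenes."""
    scenes = [
        KeyScene(section=sec, summary=text[:60], style="diagram animation")
        for sec, text in units
    ]
    for _ in range(max_rounds):
        # In the real system, each round would ask an LLM to critique
        # and refine the plan; here the plan is left unchanged.
        pass
    return scenes


def synthesize(scenes: list) -> list:
    """Bottom-up step: generate one segment per key scene, then stitch.
    Here a segment is just a descriptive string, not actual video."""
    return [f"[{s.style}] {s.section}: {s.summary}" for s in scenes]


paper = {
    "Introduction": "Paper-to-video converts research papers into video summaries.",
    "Method": "Top-down decomposition followed by bottom-up video synthesis.",
}
segments = synthesize(plan_scenes(decompose(paper)))
```

The point of the sketch is the division of labor: decomposition and planning operate on text only, so the limited context window of a video model never sees the whole paper; generation happens per key scene, which is what removes the fixed-duration constraint.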

Takeaways, Limitations

Takeaways:
  • Proposes Preacher, the first paper-to-video agentic system.
  • Generates effective video summaries through combined top-down and bottom-up approaches.
  • Aligns cross-modal representations and refines planning using P-CoT.
  • Succeeds in generating high-quality video summaries across five research areas.
  • Overcomes limitations of existing models: limited context windows, fixed video duration, limited style diversity, and difficulty representing domain-specific knowledge.
  • Improves research reproducibility and scalability through the planned code release.
Limitations:
  • Performance evaluation was conducted in only five research areas.
  • Further research may be needed on generating videos in a wider range of styles.
  • The system may need improvement based on feedback from actual users.
  • The generalizability of P-CoT requires further study.