This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Outline
This paper addresses the "paper-to-video" task of converting research papers into structured video summaries. Existing state-of-the-art video generation models suffer from limitations such as limited context windows, fixed video duration constraints, limited style diversity, and an inability to represent domain-specific knowledge. To address these limitations, we propose "Preacher," the first paper-to-video agent system. Preacher uses a top-down approach to decompose, summarize, and reconstruct papers, then uses bottom-up video generation to synthesize diverse video segments into coherent summaries. To align cross-modal representations, we define key scenes and introduce Progressive Chain of Thought (P-CoT) for fine-grained iterative planning. Preacher successfully generates high-quality video summaries across five research areas, demonstrating expertise beyond existing video generation models. The code will be made available at https://github.com/GenVerse/Paper2Video.
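The top-down decomposition, P-CoT refinement, and bottom-up assembly described above can be sketched as a simple pipeline. This is a minimal illustration only: all function names, data fields, and the stubbed quality check are hypothetical and are not taken from the paper's actual codebase.

```python
# Hypothetical sketch of Preacher's two-stage pipeline.
# All names (decompose, plan_scene, assemble, KeyScene) are illustrative.
from dataclasses import dataclass

@dataclass
class KeyScene:
    title: str
    plan: str       # textual description of the scene to render
    revisions: int  # how many P-CoT refinement passes were applied

def decompose(paper: str) -> list[str]:
    """Top-down step: split the paper into sections (here, by blank lines)."""
    return [s.strip() for s in paper.split("\n\n") if s.strip()]

def plan_scene(section: str, max_steps: int = 3) -> KeyScene:
    """Progressive Chain of Thought: refine the scene plan iteratively,
    stopping early once a (stubbed) quality check passes."""
    plan = f"summarize: {section[:40]}"
    steps = 0
    for steps in range(1, max_steps + 1):
        plan = f"{plan} | refined@{steps}"
        if len(plan) > 60:  # stand-in for a real critique/quality check
            break
    return KeyScene(title=section.split("\n")[0], plan=plan, revisions=steps)

def assemble(scenes: list[KeyScene]) -> str:
    """Bottom-up step: stitch per-scene segments into one summary script."""
    return "\n".join(f"[scene {i+1}] {s.title}: {s.plan}"
                     for i, s in enumerate(scenes))

paper = "Introduction\nWe study paper-to-video.\n\nMethod\nTop-down decomposition."
scenes = [plan_scene(sec) for sec in decompose(paper)]
print(assemble(scenes))
```

In the real system each stage would be driven by a multimodal model rather than string operations, but the control flow is the same: decompose first, plan each key scene with iterative refinement, then assemble segments bottom-up.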
Takeaways, Limitations
• Takeaways:
◦ Proposes Preacher, the first paper-to-video agent system.
◦ Generates effective video summaries by combining top-down paper decomposition with bottom-up video synthesis.
◦ Aligns cross-modal representations and enables fine-grained iterative planning via P-CoT.
◦ Successfully generates high-quality video summaries across five research areas.
◦ Overcomes limitations of existing models: limited context windows, fixed video durations, limited style diversity, and difficulty representing domain-specific knowledge.
◦ Improves research reproducibility and extensibility through open-sourced code.
• Limitations:
◦ Performance was evaluated in only a limited number of research areas (five).
◦ Generating videos in a wider variety of styles may require further research.
◦ The system may need refinement based on feedback from real users.
◦ The generalizability of P-CoT requires further study.