Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Outline
This paper addresses the "paper-to-video" task of converting research papers into structured video abstracts. To overcome the limitations of existing state-of-the-art video generation models (limited context windows, fixed video durations, limited style diversity, and inability to represent domain-specific knowledge), the authors propose Preacher, the first "paper-to-video" agent system. Preacher decomposes, summarizes, and reconstructs a paper in a top-down manner, then synthesizes diverse video segments into a coherent abstract through bottom-up video generation. Key scenes are defined to align cross-modal representations, and a Progressive Chain of Thought (P-CoT) is introduced for fine-grained iterative planning. Preacher generates high-quality video abstracts across five research areas, demonstrating expertise beyond existing video generation models. The code will be made available at https://github.com/GenVerse/Paper2Video.
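As a rough illustration of the pipeline described above, the sketch below shows how a top-down decomposition into key scenes followed by bottom-up clip synthesis could be organized. This is not Preacher's actual implementation; all class and function names (`KeyScene`, `decompose_paper`, `plan_key_scene`, `synthesize_video`) are hypothetical, and the placeholder logic stands in for the LLM and video-generation calls.

```python
# Hypothetical sketch of the paper-to-video pipeline described above.
# None of these names come from the Preacher codebase; they only illustrate
# the top-down decomposition / bottom-up synthesis structure.
from dataclasses import dataclass


@dataclass
class KeyScene:
    """A unit that aligns a text summary with a planned visual segment."""
    section_title: str
    summary: str          # condensed content of the paper section
    visual_plan: str      # P-CoT-refined description of the video segment


def decompose_paper(paper_text: str) -> list[str]:
    """Top-down step: split the paper into sections to be summarized."""
    return [s for s in paper_text.split("\n\n") if s.strip()]


def plan_key_scene(section: str) -> KeyScene:
    """Summarize a section and draft a visual plan (placeholder logic)."""
    summary = section[:200]                      # stand-in for an LLM summary
    visual_plan = f"Animate: {summary[:60]}..."  # stand-in for P-CoT planning
    return KeyScene(section_title=section.split("\n")[0],
                    summary=summary,
                    visual_plan=visual_plan)


def synthesize_video(scenes: list[KeyScene]) -> list[str]:
    """Bottom-up step: turn each key scene into a clip, then concatenate."""
    return [f"clip for '{s.section_title}'" for s in scenes]


if __name__ == "__main__":
    paper = "Introduction\nWe study ...\n\nMethod\nOur agent ..."
    scenes = [plan_key_scene(sec) for sec in decompose_paper(paper)]
    print(synthesize_video(scenes))
```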
Takeaways, Limitations
• Takeaways:
◦ Proposes Preacher, a novel agent system that overcomes the limitations of existing video generation models, such as limited context windows, fixed video durations, and limited style diversity.
◦ Effectively converts the main content of a paper into a video by combining a top-down decomposition with bottom-up video generation.
◦ Aligns cross-modal representations via key scenes and performs fine-grained iterative planning using the Progressive Chain of Thought (P-CoT); a minimal sketch of such an iterative loop follows this list.
◦ Succeeds in generating high-quality video abstracts across a variety of research fields.
◦ Supports reproducibility and extensibility of the research through the promised open-source code release.
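The P-CoT bullet above can be read as an iterative plan-critique-revise loop. The sketch below is a hypothetical outline of that idea only, not Preacher's planning code; the `llm_draft`, `llm_critique`, and `llm_revise` callables are assumed stand-ins for model calls.

```python
# Hypothetical sketch of progressive, fine-grained planning (P-CoT style):
# a draft scene plan is refined round by round until a critic accepts it.
from typing import Callable


def progressive_plan(
    section_summary: str,
    llm_draft: Callable[[str], str],
    llm_critique: Callable[[str], str],
    llm_revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Iteratively refine a scene plan from coarse to fine detail."""
    plan = llm_draft(section_summary)
    for _ in range(max_rounds):
        feedback = llm_critique(plan)
        if feedback == "OK":          # critic is satisfied; stop refining
            break
        plan = llm_revise(plan, feedback)
    return plan
```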
• Limitations:
◦ There may be a lack of specific metrics and analyses for evaluating the Preacher system's performance.
◦ Generalization across a wider range of research fields requires further validation.
◦ Applicability and performance may be limited for papers with extremely complex or highly specialized terminology.
◦ The analysis of errors and biases that may occur during video generation may be insufficient.