This paper addresses the "paper-to-video" task: converting a research paper into a structured video summary. We highlight the limitations of existing state-of-the-art video generation models, which suffer from constrained context windows, fixed video durations, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we present "Preacher," the first paper-to-video agent system. Preacher decomposes, summarizes, and reconstructs a paper in a top-down manner, then combines diverse video segments into a coherent summary video. To align cross-modal representations, we define key scenes and introduce Progressive Chain of Thought (P-CoT) for fine-grained, iterative planning. Preacher generates high-quality video summaries across five research areas, exhibiting domain expertise beyond that of existing video generation models.
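To make the decompose-summarize-reconstruct pipeline concrete, the sketch below shows one plausible reading of it in Python. It is a minimal illustration, not Preacher's actual interface: every name here (`decompose`, `summarize_pcot`, `KeyScene`, `build_summary_video`) is a hypothetical placeholder, and the refinement loop only mimics the shape of P-CoT's iterative planning rather than calling a real planner.

```python
# Minimal sketch of a top-down paper-to-video pipeline.
# All names and logic are illustrative assumptions, not Preacher's API.
from dataclasses import dataclass


@dataclass
class KeyScene:
    title: str        # heading of the source section
    script: str       # narration summarized from the section
    visual_plan: str  # visual description, refined over P-CoT rounds


def decompose(paper_text: str) -> list[str]:
    """Top-down step 1: split the paper into sections (naive heuristic)."""
    return [s.strip() for s in paper_text.split("\n\n") if s.strip()]


def summarize_pcot(section: str, rounds: int = 3) -> KeyScene:
    """Steps 2-3: summarize a section into a key scene, then iteratively
    refine its visual plan, mimicking progressive chain-of-thought planning."""
    scene = KeyScene(
        title=section.split("\n")[0][:60],
        script=section[:200],
        visual_plan="rough storyboard",
    )
    for r in range(rounds):
        # A real system would call a planner model each round; here we
        # only record that a refinement pass happened.
        scene.visual_plan += f" -> refined (round {r + 1})"
    return scene


def build_summary_video(paper_text: str) -> list[KeyScene]:
    """Reconstruct: order the per-section key scenes into one video plan."""
    return [summarize_pcot(sec) for sec in decompose(paper_text)]


if __name__ == "__main__":
    demo_paper = "Introduction\nWe study X.\n\nMethod\nWe propose Y."
    for scene in build_summary_video(demo_paper):
        print(scene.title, "|", scene.visual_plan)
```

The top-down structure is the point of the sketch: planning happens per section before any segment is rendered, so the final video inherits the paper's organization instead of a single model's fixed-duration output.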