This page organizes papers related to artificial intelligence published around the world. This page is summarized using Google Gemini and is operated on a non-profit basis. The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.
VGoT is a framework that automatically synthesizes multi-shot videos from a single sentence. It aims to overcome the limitations of existing video generation models, which are restricted to short clips with fragmented visual dynamics and disjointed storylines. VGoT leverages dynamic storyline modeling, ID-aware cross-shot propagation, and an adjacent latent transition mechanism to address the challenges of storytelling, visual consistency, and transition artifacts, outperforming strong baselines without additional training.
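To make the "adjacent latent transition" idea concrete, the sketch below shows one common way such a mechanism can work: linearly blending the last few latent frames of one shot toward the first frames of the next, so the decoded video crosses the shot boundary without an abrupt jump. This is a minimal illustrative sketch under that assumption; the function name, the overlap window, and the linear blending schedule are all hypothetical, not VGoT's actual method.

```python
def blend_shot_boundary(prev_shot, next_shot, overlap=4):
    """Blend the tail of `prev_shot` toward the head of `next_shot`.

    Each shot is a list of latent frames (lists of floats).
    Hypothetical sketch of an adjacent-latent-transition mechanism:
    the last `overlap` frames of the previous shot are linearly
    interpolated with the first `overlap` frames of the next shot.
    """
    # Copy so the caller's latents are not mutated.
    blended = [frame[:] for frame in prev_shot]
    for i in range(overlap):
        # Weight ramps from near 0 toward 1 across the overlap window,
        # pulling the tail frames progressively toward the next shot.
        w = (i + 1) / (overlap + 1)
        tail = len(prev_shot) - overlap + i
        blended[tail] = [(1 - w) * a + w * b
                         for a, b in zip(prev_shot[tail], next_shot[i])]
    return blended
```

With an 8-frame shot of zeros followed by a shot of ones and `overlap=4`, the tail frames ramp smoothly (0.2, 0.4, 0.6, 0.8) instead of jumping from 0 to 1 at the cut.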
Takeaways, Limitations
• Takeaways:
◦ Automatic multi-shot video generation from a single sentence.
◦ Structured storytelling through dynamic storyline modeling.
◦ Character consistency maintained through ID-aware cross-shot propagation.
◦ Smooth visual flow through the adjacent latent transition mechanism.
◦ Improved performance over strong baselines (20.4% better face consistency, 17.4% better style consistency).
◦ Roughly 10x less manual tuning required.
• Limitations:
◦ Specific limitations are not discussed in the paper.