Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models

Created by
  • Haebom

Authors

Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu

Outline

This paper presents the first publicly available dataset for slide animation generation and shows how it can be used to improve a Vision-Language Model (VLM). The dataset contains 12,000 triplets of natural language descriptions, animation JSON files, and rendered videos. Fine-tuning Qwen-2.5-VL-7B on it with Low-Rank Adaptation (LoRA) yields a model that outperforms GPT-4.1 and Gemini-2.5-Pro on BLEU-4, ROUGE-L, SPICE, and the newly proposed CODA metric, which evaluates the motion coverage, temporal order, and detail fidelity of an animation. The authors also report that the LoRA-tuned model shows reliable temporal reasoning and generalizes beyond synthetic data. Together, the dataset, the LoRA-tuned model, and the CODA metric offer a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
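To make the three aspects that CODA is described as measuring more concrete, here is a toy scoring sketch. It is not the paper's CODA formula, which this summary does not give. The step schema (`effect`, `target`, `details`), the effect names, and the scoring rules are all assumptions chosen for illustration.

```python
from itertools import combinations


def toy_animation_score(reference, predicted):
    """Toy three-part score; illustrative only, NOT the paper's CODA formula.

    Each step is a dict with hypothetical keys "effect", "target", "details".
    """
    def key(step):
        return (step["effect"], step["target"])

    # Position of the first predicted step for each (effect, target) pair.
    first_pos = {}
    for pos, step in enumerate(predicted):
        first_pos.setdefault(key(step), pos)

    # Reference steps that also appear in the prediction, in reference order.
    matched = [(step, first_pos[key(step)]) for step in reference
               if key(step) in first_pos]

    # Motion coverage: share of reference steps found in the prediction.
    coverage = len(matched) / len(reference) if reference else 0.0

    # Temporal order: share of matched pairs that keep their relative order.
    pairs = list(combinations([pos for _, pos in matched], 2))
    if pairs:
        order = sum(1 for earlier, later in pairs if earlier < later) / len(pairs)
    else:
        order = 1.0 if matched else 0.0

    # Detail fidelity: share of reference detail values reproduced exactly.
    ratios = []
    for step, pos in matched:
        ref_details = step.get("details", {})
        pred_details = predicted[pos].get("details", {})
        if ref_details:
            hits = sum(1 for k, v in ref_details.items() if pred_details.get(k) == v)
            ratios.append(hits / len(ref_details))
        else:
            ratios.append(1.0)
    fidelity = sum(ratios) / len(ratios) if ratios else 0.0

    return {"coverage": coverage, "order": order, "fidelity": fidelity}


# Hypothetical example data; effect and target names are made up.
reference = [
    {"effect": "fade_in", "target": "title", "details": {"duration": 0.5}},
    {"effect": "fly_in", "target": "image",
     "details": {"duration": 1.0, "direction": "left"}},
    {"effect": "zoom", "target": "chart", "details": {}},
]
predicted = [
    {"effect": "fly_in", "target": "image",
     "details": {"duration": 1.0, "direction": "right"}},
    {"effect": "fade_in", "target": "title", "details": {"duration": 0.5}},
]

print(toy_animation_score(reference, predicted))
```

In this example the prediction misses one of three steps, swaps the order of the two it does find, and gets one detail wrong, so each component is penalized separately. Keeping the three components apart, rather than folding them into a single n-gram overlap, is the kind of distinction that text metrics such as BLEU-4 do not make explicit.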

Takeaways, Limitations

Takeaways:
  • Releases the first public dataset for slide animation generation.
  • Presents an effective LoRA-based fine-tuning method for VLMs.
  • Introduces CODA, a new metric that evaluates generated animations on motion coverage, temporal order, and detail fidelity.
  • Shows that VLM-based slide animation generation is feasible by outperforming existing models (GPT-4.1, Gemini-2.5-Pro).
  • Confirms improved temporal reasoning and generalization after LoRA fine-tuning.
Limitations:
  • The dataset needs further expansion.
  • Because the dataset is restricted to PowerPoint's built-in effects, support for other animation effects is limited.
  • The objectivity and generalizability of the evaluation metrics, including CODA, require further verification.
  • The results may not fully reflect real user experiences and requirements.
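For readers unfamiliar with LoRA, the fine-tuning technique named in the summary, the sketch below shows the general idea: the pretrained weight matrix stays frozen and only a low-rank correction is trained. It is a minimal pure-Python illustration of the standard LoRA update, not the authors' training code; the function names, matrix shapes, and numbers are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Compute x @ w + (alpha / r) * (x @ a) @ b.

    x: (n, d_in) inputs
    w: (d_in, d_out) frozen pretrained weights
    a: (d_in, r) and b: (r, d_out) small trainable matrices, with r << d_in
    """
    r = len(b)  # rank of the low-rank update
    scale = alpha / r
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(base_row, delta_row)]
            for base_row, delta_row in zip(base, delta)]


def trainable_params(d_in, d_out, r):
    """Parameters trained by LoRA versus by full fine-tuning of one layer."""
    return {"lora": r * (d_in + d_out), "full": d_in * d_out}


# Made-up toy values: a 2x2 identity layer with a rank-1 adapter.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[0.5, -0.5]]

print(lora_forward(x, w, a, b, alpha=2.0))
print(trainable_params(d_in=4096, d_out=4096, r=8))
```

When `b` is all zeros, which is the usual starting point, the output equals that of the frozen layer, so training starts from the pretrained model's behavior. The parameter count shows why the approach is cheap: for a 4096 by 4096 layer, a rank-8 adapter trains 65,536 values instead of roughly 16.8 million.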