This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
DreamStory is an open-domain story visualization framework that combines a Large Language Model (LLM) with a novel Multi-Subject Consistency Diffusion Model (MSD). The LLM generates descriptive prompts for the subjects and scenes in the story and annotates which subjects appear in each scene, so that subjects can later be generated consistently. MSD uses the detailed subject descriptions from the LLM to create subject portraits, then uses these portraits and their corresponding text as multimodal anchors (guidance). MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules, which ensure appearance consistency with the reference images and semantic consistency with the reference text, respectively. Both modules use a masking mechanism to prevent subjects from blending into one another. The study also establishes the DS-500 benchmark for evaluation and verifies the effectiveness of DreamStory through subjective and objective evaluations.
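The masking idea behind the mutual self-attention module can be illustrated with a small sketch. This is not the authors' implementation: it only assumes the general pattern in which scene queries attend to their own tokens plus reference-portrait tokens, and a boolean mask hides reference tokens outside the subject region. The function name `masked_mutual_self_attention`, the argument names, and the toy vectors are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_mutual_self_attention(scene_q, scene_k, scene_v, ref_k, ref_v, ref_mask):
    """Hypothetical sketch: each scene query attends to all scene tokens plus
    the reference tokens whose ref_mask entry is True (assumed subject region).
    Reference tokens with a False mask entry are excluded from attention."""
    dim = len(scene_q[0])
    keys = scene_k + ref_k
    values = scene_v + ref_v
    # Scene tokens are always visible; reference tokens follow the mask.
    visible = [True] * len(scene_k) + list(ref_mask)
    outputs = []
    for q in scene_q:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if keep else -math.inf
            for k, keep in zip(keys, visible)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy data: one scene token, two reference tokens (subject, then background).
scene_q = [[1.0, 0.0]]
scene_k = [[1.0, 0.0]]
scene_v = [[0.0, 0.0]]
ref_k = [[1.0, 0.0], [1.0, 0.0]]
ref_v = [[2.0, 2.0], [100.0, 100.0]]

out = masked_mutual_self_attention(scene_q, scene_k, scene_v, ref_k, ref_v, [True, False])
print(out)  # [[1.0, 1.0]]
```

With the background token masked, the output mixes only the scene value and the subject value. Unmasking it (`[True, True]`) lets the background value of 100 leak into the result, which is the kind of blending the mask is meant to prevent.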
Takeaways, Limitations
• Takeaways:
◦ Presents a new story visualization framework that combines an LLM with MSD.
◦ Generates images that maintain multi-subject consistency.
◦ Introduces the DS-500 benchmark for evaluating story visualization performance.
◦ Validates the effectiveness of DreamStory through subjective and objective evaluations.
• Limitations:
◦ The scale and diversity of the DS-500 benchmark need further study.
◦ Visualization performance on complex or ambiguous stories needs improvement.
◦ Generalization to diverse real-world stories still needs to be evaluated.