Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Created by
  • Haebom

Authors

Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang

Outline

This paper addresses the threat that advances in AI-generated content, particularly human-centric video synthesis, pose to information authenticity and public trust. Unlike earlier DeepFake techniques that focus on facial manipulation, recent methods can control the motion of the entire body and synthesize complex interactions with the environment, objects, and other people. Existing detection methods tend to overlook the risks of such full-body synthetic content. The authors propose AvatarShield, a multimodal framework for human-centric synthetic video detection that uses Group Relative Policy Optimization (GRPO) to let the LLM develop reasoning capabilities without dense text supervision. AvatarShield combines a discrete vision tower for detecting high-level semantic inconsistencies with a residual extractor for fine-grained artifact analysis. The paper also presents FakeHumanVid, a large-scale benchmark of 15,000 real and synthetic videos, covering nine state-of-the-art human-generation methods driven by text, pose, or audio. Extensive experiments show that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
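
The core idea of Group Relative Policy Optimization can be illustrated with a minimal sketch: several responses are sampled for the same input, each is scored, and every score is normalized against the mean and standard deviation of its own group, so no separate value model is needed. The code below is not the authors' implementation. The binary real/fake reward, the function names, and the use of the population standard deviation are all assumptions made for illustration; the summary does not describe the paper's actual reward design.

```python
from statistics import mean, pstdev


def detection_reward(predicted_label, true_label):
    # Hypothetical rule-based reward (an assumption, not taken from the paper):
    # 1.0 when the sampled response names the correct label, else 0.0.
    return 1.0 if predicted_label == true_label else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the statistics of its own group.
    # eps keeps the division finite when every reward in the group is equal.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four sampled verdicts for a single video whose true label is "fake".
sampled_predictions = ["fake", "real", "fake", "fake"]
rewards = [detection_reward(p, "fake") for p in sampled_predictions]
advantages = group_relative_advantages(rewards)

for prediction, advantage in zip(sampled_predictions, advantages):
    print(f"{prediction}: {advantage:+.3f}")
```

Responses that beat the group average receive a positive advantage and are reinforced, while the incorrect one receives a negative advantage. Because only a scalar reward per response is required, this kind of training signal does not depend on dense textual annotations.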

Takeaways, Limitations

Takeaways:
Shows that Group Relative Policy Optimization can be used to build an effective synthetic video detection model without dense text supervision.
Achieves higher accuracy than existing methods by integrating multimodal information (visual and semantic).
Provides FakeHumanVid, a large-scale benchmark dataset, to support future research.
Advances detection technology for AI-generated human-centric videos.
Limitations:
Further validation of the diversity and generalizability of the FakeHumanVid dataset is needed.
Continuous monitoring and model updates are needed to keep pace with new synthetic video generation technologies.
Further research is needed on the efficiency and scalability of Group Relative Policy Optimization.
Performance evaluation is needed in complex real-world situations.