This paper addresses the threat to information authenticity, and the erosion of public trust, posed by advances in AI-generated content, particularly human-centered video synthesis. Unlike existing DeepFake techniques that focus on facial manipulation, recent generation methods can control the motion of the entire body and synthesize complex interactions with the environment, objects, and other people, yet existing detection methods tend to overlook the risks of such full-body synthetic content. In this paper, we propose AvatarShield, a novel multimodal framework for detecting human-centered synthetic videos that employs Group Relative Policy Optimization (GRPO) to enable large language models to develop reasoning capabilities without dense text supervision. AvatarShield combines a discrete vision tower, which detects high-level semantic mismatches, with a residual extractor for fine-grained artifact analysis. We also present FakeHumanVid, a large-scale benchmark of 15,000 real and synthetic videos, whose synthetic portion is produced by nine state-of-the-art human video generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
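For context on the optimization named above, here is a minimal sketch of the standard GRPO formulation; the reward design specific to AvatarShield is not detailed in this abstract, so the notation below is generic. For each input, the policy samples a group of $G$ candidate responses with scalar rewards $r_1, \dots, r_G$, and each response is assigned a group-normalized advantage

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)},
$$

which drives a PPO-style clipped policy update with a KL penalty toward a reference model. Because the group statistics replace a learned value critic, training requires only a scalar reward per sampled response rather than dense per-token text supervision.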