To overcome the limitations of existing frame-by-frame vertex generation methods in audio-based 3D facial animation, this paper proposes 3DFacePolicy, which introduces the concept of an "action." We define an action as the change in a vertex's trajectory between consecutive frames, and predict the action sequence of each vertex with a diffusion policy, a control mechanism originally developed for robot manipulation, conditioned on audio and vertex states. This reformulates vertex generation as an action-based control problem, enabling smoother and more natural facial movements. Experimental results on the VOCASET and BIWI datasets demonstrate that our approach outperforms existing state-of-the-art methods and is particularly effective for producing dynamic, expressive, and natural facial animation.
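For concreteness, the action definition can be stated with the following illustrative notation (the symbols below are assumptions introduced for exposition, not necessarily those used in the original text). Let $v_t \in \mathbb{R}^{3N}$ denote the stacked positions of the $N$ mesh vertices at frame $t$. The action at frame $t$ is the per-vertex displacement
$$
a_t = v_{t+1} - v_t ,
$$
and the diffusion policy $\pi_\theta$ predicts a short horizon of actions $(a_t, a_{t+1}, \dots, a_{t+H-1})$ conditioned on an observation $o_t$ that collects recent audio features and vertex states. The animation is then rolled out by integrating $v_{t+1} = v_t + a_t$, so the network controls motion increments rather than regressing absolute vertex positions frame by frame.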