While CLIP-based face anti-spoofing (FAS) methods have demonstrated remarkable cross-domain performance, existing models do not fully exploit CLIP's patch embedding tokens and therefore miss important spoofing cues. Moreover, their reliance on a single text prompt limits generalization. To address these challenges, we propose MVP-FAS, a novel framework that integrates two core modules: Multi-View Slot Attention (MVS) and Multi-Text Patch Alignment (MTPA). MVS leverages multiple texts written from different perspectives to extract fine-grained local spatial features and global context from patch embeddings, while MTPA aligns patches with multiple text representations to enhance semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance compared to existing state-of-the-art methods. The code is available on GitHub.
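To make the two modules concrete, the PyTorch sketch below illustrates (a) a slot-attention-style pooling over CLIP patch tokens in which each slot is initialized from a text embedding written from a different perspective, and (b) a patch-level alignment loss that averages similarities over multiple prompts per class. This is a minimal sketch under assumed tensor shapes, iteration counts, and aggregation choices; it is not the authors' exact implementation, and all names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewSlotAttention(nn.Module):
    """Slot attention over CLIP patch tokens, with slots initialized from
    multiple text embeddings (one per perspective). Shapes and the number
    of refinement iterations are illustrative assumptions."""

    def __init__(self, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, patches: torch.Tensor, text_slots: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) CLIP patch tokens; text_slots: (B, S, D) text-initialized slots
        B, S, D = text_slots.shape
        patches = self.norm_inputs(patches)
        k, v = self.to_k(patches), self.to_v(patches)
        slots = text_slots
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, S, N)
            attn = attn.softmax(dim=1)                     # slots compete for each patch
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = attn @ v                             # (B, S, D) per-slot summaries
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, S, D)
        return slots  # per-perspective local/global summaries of the patches


def multi_text_patch_alignment(patches, text_embs, labels, temperature=0.07):
    """Align every patch token with multiple text prompts per class.
    Prompt-averaging per class is an assumed aggregation scheme.
    patches: (B, N, D); text_embs: (C, T, D) = classes x prompts x dim; labels: (B,) long."""
    patches = F.normalize(patches, dim=-1)
    text = F.normalize(text_embs, dim=-1).mean(dim=1)      # (C, D) prompt-averaged class texts
    logits = patches @ text.t() / temperature              # (B, N, C) patch-to-class similarity
    tgt = labels.view(-1, 1).expand(-1, patches.size(1))   # each patch inherits the image label
    return F.cross_entropy(logits.flatten(0, 1), tgt.flatten())
```

Initializing slots from several differently phrased prompts, rather than one, is what lets the pooled representations cover complementary spoofing cues; the alignment loss then supervises patches against all prompts jointly rather than against a single text anchor.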