This paper addresses the limitations of existing static, text-based ToM benchmarks, which cannot simulate real-world social interactions, and proposes the SoMi-ToM benchmark to assess multi-perspective ToM in complex social interactions. Drawing on the rich multimodal interaction data generated in the SoMi environment, we comprehensively evaluate models' ToM capabilities through both first-person and third-person evaluations. We construct a dataset of 35 third-person-perspective videos, 363 first-person-perspective images, and 1,225 expert-annotated multiple-choice questions, and compare the performance of humans with that of state-of-the-art LVLMs. The results show that LVLMs significantly underperform humans, highlighting the need for further improvement of LVLMs' ToM capabilities.