Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Created by
  • Haebom

Authors

Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap

Outline

The paper addresses a limitation of existing Theory of Mind (ToM) benchmarks: they are static and text-based, so they cannot simulate real-world social interaction. It proposes the SoMi-ToM benchmark to assess multi-perspective ToM in complex social interactions. Using rich multimodal interaction data generated in the SoMi environment, the benchmark evaluates a model's ToM capabilities from both first-person and third-person perspectives. The dataset consists of 35 third-person videos, 363 first-person images, and 1,225 expert-annotated multiple-choice questions, and the authors compare human performance against state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans, highlighting the need to improve their ToM capabilities.
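
As a rough illustration of how multiple-choice results of this kind could be tallied separately for each perspective, here is a minimal sketch. The record fields (`perspective`, `answer`, `prediction`) and the sample rows are hypothetical; they are not taken from the SoMi-ToM dataset or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_perspective(records):
    """Return {perspective: fraction of correct answers} for multiple-choice records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["perspective"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["perspective"]] += 1
    return {p: correct[p] / total[p] for p in total}


# Hypothetical toy records, made up for illustration only.
sample = [
    {"perspective": "first_person", "answer": "A", "prediction": "A"},
    {"perspective": "first_person", "answer": "B", "prediction": "C"},
    {"perspective": "third_person", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_perspective(sample)
```

Reporting the two perspectives separately, rather than as one pooled score, keeps a strong result on one view from hiding a weak result on the other.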

Takeaways, Limitations

Takeaways:
Presents a new benchmark for assessing ToM abilities in complex social interactions.
Assesses a model's ToM abilities comprehensively, from both first-person and third-person perspectives.
Shows, by comparing human and LVLM performance, that the ToM abilities of LVLMs need to improve.
Limitations:
Because the data comes only from the SoMi environment, further research is needed on how well the findings generalize.
The dataset is small (35 third-person videos), which limits the evaluation.
Performance was evaluated only on specific LVLMs, so the results may not generalize across models.