Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Can Vision Language Models Understand Mimed Actions?

Created by
  • Haebom

Authors

Hyundong Cho, Spencer Lin, Tejas Srinivasan, Michael Saxon, Deuksin Kwon, Natali T. Chavez, Jonathan May

Outline

This paper focuses on mime, a subset of nonverbal communication (NVC), and proposes MIME, a new benchmark for evaluating how well vision-language models understand NVC. MIME is a video-based question-answering benchmark covering 86 mimed actions. Built from motion capture data, it applies various transformations and noise to assess model robustness. Experimental results show that existing vision-language models perform significantly worse than humans on MIME, suggesting the need for models with a more robust understanding of human gestures.

Takeaways, Limitations

Takeaways:
Introduces MIME, a new benchmark for assessing nonverbal communication understanding through mime.
Clearly demonstrates the gap in nonverbal communication understanding in existing vision-language models and suggests directions for future research.
Evaluates model robustness by applying various transformations and noise to motion-capture-based data.
Limitations:
As a mime-specific benchmark, MIME may not fully cover general NVC understanding.
Because it is based on motion capture data, it may not perfectly reflect the variety of real-world NVC situations.
The specific vision-language models evaluated and their performance figures are not provided, which may make generalization difficult.