Daily Arxiv

This page curates AI-related papers published around the world.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Zero-shot Emotion Annotation in Facial Images Using Large Multimodal Models: Benchmarking and Prospects for Multi-Class, Multi-Frame Approaches

Created by
  • Haebom

Authors

He Zhang, Xinyi Fu

Outline

This study investigated the feasibility and performance of automatically annotating human emotions in everyday scenarios using large multimodal models (LMMs). We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, using the GPT-4o-mini model for rapid zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion scheme ("anger," "disgust," "fear," "happiness," "neutral," "sadness," and "surprise"), the LMM achieved an average precision of approximately 50%; when the task was restricted to a three-class scheme (negative/neutral/positive), average precision rose to approximately 64%. We also explored a strategy of merging multiple frames within 1-2 second video clips to improve labeling performance and reduce costs, and found that this approach can slightly improve annotation accuracy. Overall, these preliminary results highlight the potential of zero-shot LMMs for human facial emotion annotation, offering a way to reduce labeling costs and broaden the applicability of LMMs in complex multimodal environments.
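
Below is a minimal Python sketch of how such a zero-shot key-frame labeling pipeline might be set up, using the OpenAI Python client with the gpt-4o-mini model. The frame-sampling step, the prompt wording, and the mapping from the seven emotion classes to the three coarse classes are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch: zero-shot emotion labeling of video key frames with GPT-4o-mini.
# The prompt text, key-frame sampling, and 7-to-3 class mapping are assumptions
# made for illustration; they do not reproduce the paper's exact setup.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

EMOTIONS_7 = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
TO_3_CLASS = {  # assumed coarse mapping to negative/neutral/positive
    "anger": "negative", "disgust": "negative", "fear": "negative", "sadness": "negative",
    "neutral": "neutral",
    "happiness": "positive", "surprise": "positive",
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def sample_key_frames(video_path: str, num_frames: int = 3) -> list[str]:
    """Uniformly sample a few frames from a short clip and return them as base64 JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames


def label_clip(video_path: str) -> tuple[str, str]:
    """Ask GPT-4o-mini for one emotion label covering all sampled frames of a clip."""
    images = [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for b64 in sample_key_frames(video_path)
    ]
    prompt = (
        "These frames come from the same short video of one person. "
        f"Reply with exactly one word from: {', '.join(EMOTIONS_7)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}, *images]}],
    )
    label_7 = resp.choices[0].message.content.strip().lower()
    return label_7, TO_3_CLASS.get(label_7, "neutral")
```

Sending several frames from the same 1-2 second clip in a single request mirrors the frame-merging idea described above: one API call yields one label per clip, which lowers cost relative to labeling each frame separately.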

Takeaways, Limitations

Takeaways:
We demonstrate the feasibility of automatic human emotion annotation using zero-shot LMMs.
The three-class scheme (negative/neutral/positive) achieved higher accuracy than the seven-class scheme.
We show the potential for improving annotation accuracy and efficiency through a multi-frame integration strategy.
We suggest that LMM-based emotion analysis can reduce labeling costs and broaden its range of applications.
Limitations:
Average precision is relatively low, at around 50%, under the seven-class classification.
Results are for a specific dataset (the DailyLife subset of FERV39k) and further research is needed to determine generalizability.
The performance improvement of the multi-frame integration strategy is minimal.
Possible performance degradation due to limitations of the GPT-4o-mini model.
Further research is needed on more diverse and extensive datasets and models.