This study investigated the feasibility and performance of automatically annotating human emotions in everyday scenarios using large multimodal models (LMMs). We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, using GPT-4o-mini to perform rapid zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion taxonomy ("anger," "disgust," "fear," "happiness," "neutral," "sadness," and "surprise"), the LMM achieved an average precision of approximately 50%. When the task was reduced to three classes (negative/neutral/positive), average precision increased to approximately 64%. We also explored a strategy of merging multiple key frames from 1-2 second video clips to improve labeling performance and reduce cost; the results indicate that this approach yields a slight improvement in annotation accuracy. Overall, these preliminary results highlight the potential of zero-shot LMMs for facial emotion annotation, offering a way to reduce labeling costs and broaden the applicability of LMMs in complex multimodal settings.
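The abstract does not include implementation details; as a minimal sketch of the zero-shot key-frame labeling step, the snippet below queries GPT-4o-mini through the OpenAI Python SDK with a seven-label prompt. The prompt wording, fallback handling, and file names are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (not the authors' code): zero-shot emotion labeling of one key frame
# with GPT-4o-mini via the OpenAI Python SDK. Prompt text, fallback logic, and paths
# are assumptions for illustration.
import base64
from openai import OpenAI

LABELS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def label_frame(image_path: str) -> str:
    """Ask the model to pick exactly one of the seven emotion labels for a key frame."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Classify the facial emotion of the main person in this frame. "
                            f"Answer with exactly one word from: {', '.join(LABELS)}."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        max_tokens=5,
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # fall back on an unparsable reply


if __name__ == "__main__":
    print(label_frame("frame_0001.jpg"))  # hypothetical key frame extracted from a clip
```

The three-class setting mentioned above can be obtained from the same output by mapping the seven labels onto negative/neutral/positive; the frame-merging strategy would instead attach several key frames from a 1-2 second clip to a single request.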