This paper addresses the importance and challenges of recognizing abstract concepts (e.g., justice, freedom, and solidarity) in the automatic understanding of video content. Whereas previous research has focused on recognizing concrete objects, actions, and events, we focus on understanding abstract concepts in video by mimicking human abstract reasoning. We argue that recently developed foundation models offer a promising way to approach this problem, examine related work and datasets, and suggest future research directions informed by past research experience. This line of work is significant not only for technological advancement but also for making models more consistent with human reasoning and values.