3D scene understanding is a core competency in embodied AI and robotics, supporting reliable perception for interaction and exploration. This paper presents a zero-shot, open-vocabulary 3D semantic mapping approach that assigns embedding vectors derived from Vision-Language Models (VLMs) to class-agnostic 2D masks and projects them into 3D. We use SemanticSAM with progressive granularity refinement to generate a larger number of accurate object-level masks, mitigating the oversegmentation commonly observed in mask generation models. Furthermore, a context-aware CLIP encoding strategy aggregates multiple contextual views of each mask to enrich its visual representation. We evaluate the proposed approach on a range of 3D scene understanding tasks and demonstrate significant improvements over existing methods.
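
For concreteness, the sketch below illustrates one plausible form of such a context-aware CLIP encoding: each mask is encoded from several crops with progressively more surrounding context, and the resulting CLIP embeddings are averaged. The helper names, expansion ratios, and mean-fusion step are illustrative assumptions, not the exact implementation described in this paper.

```python
# Minimal sketch (assumptions noted above) of context-aware CLIP encoding:
# encode a 2D mask from multiple crops with increasing context, then fuse.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def mask_bbox(mask: np.ndarray):
    """Bounding box (x0, y0, x1, y1) of a binary HxW mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def context_crops(image: Image.Image, mask: np.ndarray, ratios=(1.0, 1.5, 2.0)):
    """Crops of the mask region with progressively larger context (illustrative ratios)."""
    x0, y0, x1, y1 = mask_bbox(mask)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    crops = []
    for r in ratios:
        half_w, half_h = r * w / 2, r * h / 2
        box = (max(0, cx - half_w), max(0, cy - half_h),
               min(image.width, cx + half_w), min(image.height, cy + half_h))
        crops.append(image.crop(box))
    return crops

@torch.no_grad()
def encode_mask_with_context(image: Image.Image, mask: np.ndarray) -> torch.Tensor:
    """Average the L2-normalized CLIP embeddings of the contextual crops."""
    batch = torch.stack([preprocess(c) for c in context_crops(image, mask)]).to(device)
    feats = model.encode_image(batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    emb = feats.mean(dim=0)
    return emb / emb.norm()
```

The per-mask embedding produced this way can then be attached to the mask's 3D projection and compared against CLIP text embeddings of arbitrary category names for open-vocabulary queries.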