Existing dense visual recognition methods rely on predefined categories, which limits their real-world applicability. Vision-Language Models (VLMs) such as CLIP are promising for open-vocabulary tasks, but their weak local feature representations make direct application to dense visual recognition challenging. We observe that CLIP's image tokens struggle to aggregate information from spatially or semantically related regions, yielding features that lack local identifiability and spatial consistency. To address this, we propose DeCLIP, a novel framework that decouples the self-attention mechanism to obtain "content" and "context" features. The context features are enhanced for spatial consistency by jointly distilling object-integrity cues from a vision foundation model (VFM) and diffusion models, while the content features are aligned with image-crop representations and constrained by local correlations from the VFM to improve local identifiability. Extensive experiments demonstrate that DeCLIP achieves state-of-the-art performance across a wide range of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
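To make the decoupling idea concrete, the sketch below shows one plausible way to split a ViT-style self-attention layer into a "content" path (per-token features without cross-token mixing, preserving local identifiability) and a "context" path (attention-weighted aggregation over tokens, where spatial-consistency cues could be distilled). This is a minimal, hypothetical illustration under assumed design choices; the class name `DecoupledSelfAttention` and the exact split are not taken from the paper.

```python
import torch
import torch.nn as nn


class DecoupledSelfAttention(nn.Module):
    """Illustrative split of a self-attention block into two paths.

    NOTE: this is a sketch of the general idea; DeCLIP's actual
    decoupling may be formulated differently.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim) image token features from a CLIP ViT layer
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # "Content" path: keep each token's own information without
        # cross-token mixing, so features stay locally identifiable.
        content = self.proj(v.transpose(1, 2).reshape(B, N, C))

        # "Context" path: aggregate spatially/semantically related tokens via
        # attention; this is where object-integrity cues could be distilled.
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        context = self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))

        return content, context


# Example usage on dummy CLIP-like tokens (16x16 patches, 768-dim):
tokens = torch.randn(1, 256, 768)
content_feats, context_feats = DecoupledSelfAttention(dim=768)(tokens)
```

Under this reading, the two outputs would be supervised separately: the context features against VFM/diffusion guidance for spatial consistency, and the content features against image-crop embeddings and VFM local correlations for identifiability.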