Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Created by
  • Haebom

Authors

Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian

Outline

This paper highlights that existing dense visual recognition tasks rely on predefined categories, limiting their real-world applicability. While Vision-Language Models (VLMs) such as CLIP are promising for open-vocabulary tasks, their weak local feature representations make direct application to dense visual recognition challenging. We observe that CLIP's image tokens struggle to aggregate information from spatially or semantically related regions, yielding features that lack local discriminability and spatial consistency. We therefore propose DeCLIP, a novel framework that decouples the self-attention mechanism to obtain separate "content" and "context" features. The context features jointly distill object-integrity cues from a Vision Foundation Model (VFM) and diffusion models to enhance spatial consistency, while the content features are aligned with image-crop representations and constrained by local correlations from the VFM to enhance local discriminability. Extensive experiments demonstrate that DeCLIP achieves state-of-the-art performance across a variety of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
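To make the decoupling idea concrete, below is a minimal PyTorch sketch of a self-attention block split into a "content" path and a "context" path. The q-q / k-k attention choices, module names, and shapes are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class DecoupledAttention(nn.Module):
    """Toy decoupled self-attention: a single block that returns separate
    "content" and "context" features instead of one fused output."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)

    def _attend(self, a, b, v):
        # Generic attention: softmax(a @ b^T / sqrt(d)) @ v, per head.
        attn = (a @ b.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

    def forward(self, x):
        # x: (batch, num_tokens, dim) image tokens from a CLIP-like ViT.
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split heads: (B, num_heads, N, head_dim).
        def split(t):
            return t.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # "Content" path (q-q attention, an illustrative choice): each token
        # attends to tokens with similar queries, sharpening local identifiability.
        content = self._attend(q, q, v)
        # "Context" path (k-k attention, also illustrative): aggregates over
        # semantically related regions, to be aligned with VFM/diffusion cues.
        context = self._attend(k, k, v)

        def merge(t):
            return t.transpose(1, 2).reshape(B, N, C)
        return merge(content), merge(context)


# Usage: both feature maps come from the same tokens but would be trained
# with separate distillation targets (content vs. context supervision).
tokens = torch.randn(2, 196, 768)  # e.g. 14x14 patch tokens
content_feats, context_feats = DecoupledAttention(768, num_heads=12)(tokens)
print(content_feats.shape, context_feats.shape)
```

In DeCLIP's setup, each output would be supervised separately: the content features against image-crop representations and VFM local correlations, and the context features against object-integrity cues distilled from the VFM and diffusion models.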

Takeaways, Limitations

Takeaways:
By addressing CLIP's weaknesses in local feature representation, DeCLIP significantly improves performance on open-vocabulary dense visual recognition tasks.
The framework is general-purpose, applying to 2D and 3D object detection and segmentation, video instance segmentation, and 6D object pose estimation.
Distilling cues from both a Vision Foundation Model and diffusion models yields complementary gains in spatial consistency and local discriminability.
The open-source code release improves reproducibility.
Limitations:
The relative contribution of each component to DeCLIP's performance gains may lack quantitative analysis.
Additional experiments may be needed to evaluate generalization to other datasets.
The paper may not analyze potential performance degradation on certain types of images or objects.