Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Created by
  • Haebom

Author

Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang

Outline

Recent visual grounding research has shifted from the inefficient two-stage, proposal-based pipeline toward an efficient end-to-end direct-referencing paradigm, but this shift overlooks the benefits of retaining latent-target proposals. This paper presents PropVG, an end-to-end proposal-based framework that seamlessly integrates foreground object proposal generation and referent object comprehension without requiring an additional detector. A Contrastive-based Refer Scoring (CRS) module uses sentence- and word-level contrastive learning to strengthen the model's ability to understand and distinguish referent objects, while a Multi-granularity Target Discrimination (MTD) module fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on the gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness of PropVG. The code and models are publicly available on GitHub.
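The CRS idea — scoring each proposal against the referring expression at both sentence and word granularity — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, equal-weight fusion, and temperature value are all assumptions.

```python
import numpy as np

def contrastive_refer_scores(proposal_feats, sent_emb, word_embs, tau=0.07):
    """Score proposals against a referring expression at two granularities.

    proposal_feats: (N, D) features of N object proposals
    sent_emb:       (D,)   sentence-level text embedding
    word_embs:      (W, D) word-level text embeddings
    Returns an (N,) array of refer scores (illustrative fusion).
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    p, s, w = l2norm(proposal_feats), l2norm(sent_emb), l2norm(word_embs)

    # Sentence level: cosine similarity of each proposal to the whole expression.
    sent_sim = (p @ s) / tau                        # (N,)
    # Word level: each proposal's best-matching word in the expression.
    word_sim = (p @ w.T).max(axis=-1) / tau         # (N,)

    # Equal-weight fusion of the two granularities (an assumed choice).
    return 0.5 * (sent_sim + word_sim)
```

A higher score indicates a proposal that matches the expression both holistically and on its most discriminative word.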

Takeaways, Limitations

Takeaways:
An end-to-end proposal-based framework addresses the inefficiency of the traditional two-stage approach.
Foreground object proposal generation and referent object comprehension are integrated without an additional detector.
The CRS module improves the ability to understand and distinguish referent objects through sentence- and word-level contrastive learning.
The MTD module strengthens multi-granularity target discrimination, improving the recognition of absent targets.
PropVG demonstrates strong performance across various benchmarks.
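The MTD takeaway — deciding that an expression refers to no object by combining object-level and semantic-level evidence — can be sketched as below. The function name, average fusion, and threshold are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def is_absent_target(refer_scores, semantic_score, thresh=0.5):
    """Fuse object- and semantic-level evidence to flag a no-target expression.

    refer_scores:   (N,) per-proposal refer scores in [0, 1]
    semantic_score: scalar image-level score in [0, 1] that the expression
                    matches anything in the scene
    Returns True when the fused score falls below `thresh`, i.e. the
    expression is judged to refer to no object in the image.
    """
    # Object-level evidence: the best-matching proposal, if any exist.
    object_level = float(refer_scores.max()) if refer_scores.size else 0.0
    # Average fusion of the two granularities (an assumed, simple choice).
    fused = 0.5 * (object_level + semantic_score)
    return fused < thresh
```

Benchmarks such as gRefCOCO include no-target expressions, which is why a discrimination step of this kind matters alongside per-proposal scoring.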
Limitations:
The paper does not explicitly discuss its Limitations. Additional experiments or analyses could suggest future research directions (e.g., vulnerability to specific types of referring expressions, or generalization across diverse visual environments).