Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification

Created by
  • Haebom

Author

Jeffrey Liu, Rongbin Hu

Outline

This paper demonstrates that a zero-shot approach with no REC-specific training can achieve competitive or superior performance on the Referring Expression Comprehension (REC) task, without relying on task-specific grounding models. The authors reformulate REC as box-wise visual-language verification: a general-purpose vision-language model (VLM) independently answers a true/false query for each box proposed by a generic detector (YOLO-World). This simple procedure reduces inter-box interference, supports abstention and multiple matches, and requires no fine-tuning. On the RefCOCO, RefCOCO+, and RefCOCOg datasets, it outperforms a zero-shot GroundingDINO baseline, as well as the reported results of GroundingDINO and GroundingDINO+CRG models trained on REC data. Controlled studies using the same proposals show that verification significantly outperforms choice-based prompting, and the results also hold for open-source VLMs. In conclusion, the paper demonstrates that workflow design, rather than task-specific pretraining, determines robust zero-shot REC performance.
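The workflow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect_boxes` and `vlm_true_false` are hypothetical stand-ins for the generic detector (e.g., YOLO-World) and the general-purpose VLM, stubbed out here so the sketch runs end to end.

```python
# Sketch of box-wise true/false verification for zero-shot REC.
# Each proposed box is verified independently, which reduces inter-box
# interference and naturally supports abstention and multiple matches.

def detect_boxes(image):
    # Stub: a real system would run a generic detector such as YOLO-World
    # and return candidate boxes as (x0, y0, x1, y1) tuples.
    return [(10, 10, 50, 50), (60, 20, 120, 90), (5, 70, 40, 110)]

def vlm_true_false(image, box, expression):
    # Stub: a real system would present the image with the box highlighted
    # (or cropped) and ask the VLM "Does this box match: <expression>?
    # Answer true or false." Here, a placeholder rule stands in for the VLM.
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0) > 2000

def verify_referring_expression(image, expression):
    """Return every box the VLM independently verifies as matching.

    An empty list means the system abstains; more than one element
    means multiple matches -- both outcomes the paper's design allows.
    """
    return [
        box
        for box in detect_boxes(image)
        if vlm_true_false(image, box, expression)
    ]

matches = verify_referring_expression("img.jpg", "the large object")
print(matches)
```

Because each box is judged in isolation, the procedure avoids the choice-based prompting setup in which all candidates compete inside a single prompt, which the paper's controlled studies found to perform significantly worse.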

Takeaways, Limitations

Takeaways:
Demonstrates that superior performance on the REC task is achievable in a zero-shot manner, without task-specific training.
Box-by-box verification reduces inter-box interference and supports abstention and multiple matches.
Emphasizes the importance of workflow design: it has a greater impact on performance than task-specific pretraining.
Shows that efficient systems can be built by combining existing models, such as a general-purpose VLM and YOLO-World.
Limitations:
The performance of the proposed method may depend on the quality of the underlying detector (e.g., YOLO-World).
Further research is needed on generalization to complex or ambiguous referring expressions.
Possible bias toward specific domains or datasets.
Further experiments are needed to measure how performance changes with other VLMs or detectors.