Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Created by
  • Haebom

Author

Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu

Outline

This paper presents a comprehensive study of fine-grained multimodal capabilities in multimodal large language models (MLLMs), focusing on the visual grounding (VG) task. While previous work has adopted a variety of design choices, systematic validation to support those choices has been lacking. Using LLaVA-1.5, this study analyzes the design choices that affect the VG performance of MLLMs. Through an exploration of VG paradigms in MLLMs and an ablation study of grounding-data design, the authors propose a recipe for optimizing VG performance, achieving gains of +5.6%, +6.9%, and +7.0% on RefCOCO/+/g over LLaVA-1.5.
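For context, visual grounding in MLLMs such as LLaVA-1.5 is typically posed as text generation: the model receives a referring expression and emits bounding-box coordinates as tokens, which are then scored with Acc@0.5 (IoU ≥ 0.5) on RefCOCO/+/g. The sketch below illustrates this evaluation loop under those common conventions; the prompt template and the helper names (build_grounding_prompt, parse_box) are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of a RefCOCO-style grounding evaluation loop.
# Assumes the model emits a normalized [x1, y1, x2, y2] box as text,
# a common convention for LLaVA-1.5-style grounding; the exact prompt
# and parsing used in the paper may differ.
import re

def build_grounding_prompt(referring_expression: str) -> str:
    """Ask the model to localize the object described by the expression."""
    return ("Please provide the bounding box coordinate of the region "
            f"this sentence describes: {referring_expression}")

def parse_box(response: str):
    """Extract the first four numbers from the model's text as a box."""
    nums = re.findall(r"[-+]?\d*\.\d+|\d+", response)
    if len(nums) < 4:
        return None
    return tuple(float(n) for n in nums[:4])

def iou(a, b) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_05(predicted_boxes, ground_truth_boxes) -> float:
    """RefCOCO-style Acc@0.5: a prediction counts as correct if IoU >= 0.5."""
    hits = sum(1 for p, g in zip(predicted_boxes, ground_truth_boxes)
               if p is not None and iou(p, g) >= 0.5)
    return hits / len(ground_truth_boxes)
```

The percentage gains reported above correspond to improvements in exactly this kind of box-accuracy metric across the three RefCOCO variants.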

Takeaways, Limitations

Takeaways:
Provides a systematic analysis of the design choices that affect the visual grounding (VG) performance of MLLMs.
Offers insights into effective VG paradigms and grounding-data design.
Results obtained with LLaVA-1.5 are likely applicable to other architectures as well.
Achieves notable performance improvements on the RefCOCO/+/g benchmarks.
Limitations:
The study is based on LLaVA-1.5, so further research is needed to determine whether the findings generalize to more recent models.
The range of design choices used in the analysis may be limited.
Further verification of generalizability to other MLLM architectures is required.