This paper presents a comprehensive study of fine-grained multimodal features in multimodal large language models (MLLMs), focusing on the visual grounding (VG) problem. While previous studies have adopted a variety of design choices, systematic validation supporting these choices has been lacking. Using LLaVA-1.5, this study analyzes the design choices that affect the VG performance of MLLMs. By exploring VG paradigms for MLLMs and ablating the grounding design, we propose a recipe that optimizes VG performance. As a result, we achieve performance gains of +5.6%, +6.9%, and +7.0% on RefCOCO, RefCOCO+, and RefCOCOg, respectively, compared to LLaVA-1.5.
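For context, grounding performance on RefCOCO/+/g is conventionally reported as Acc@0.5: a prediction counts as correct when the bounding box decoded from the model's text output overlaps the annotated box with IoU ≥ 0.5. The sketch below illustrates this evaluation under the assumption that the model emits normalized `[x1, y1, x2, y2]` coordinates (as LLaVA-1.5 does); the function names are illustrative and not taken from the paper's actual evaluation code.

```python
import re

def parse_box(text):
    """Extract a box of the form "[x1, y1, x2, y2]" (normalized coordinates)
    from the model's text response; returns None if no box is found."""
    m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", text)
    return tuple(float(g) for g in m.groups()) if m else None

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def refexp_accuracy(predictions, ground_truth_boxes, threshold=0.5):
    """Acc@0.5: fraction of referring expressions whose predicted box
    matches the annotated box with IoU >= threshold."""
    hits = 0
    for pred_text, gt_box in zip(predictions, ground_truth_boxes):
        pred_box = parse_box(pred_text)
        if pred_box is not None and iou(pred_box, gt_box) >= threshold:
            hits += 1
    return hits / len(ground_truth_boxes)
```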