This paper addresses the limitations of multimodal large language models (MLLMs) in visual text grounding, particularly in document images. Unlike existing benchmarks that focus on natural images, we introduce TRIG, a new benchmark task that targets the complex layouts and textual content of text-rich document images, such as scanned forms and infographics. Using a novel instruction dataset containing 800 manually annotated question-answer pairs and 90,000 synthetic examples generated from four diverse datasets, we evaluate and improve the text-rich image grounding capabilities of MLLMs. Furthermore, we propose two simple and effective TRIG methods: general instruction tuning and a plug-and-play efficient embedding. Fine-tuning MLLMs on the synthetic dataset improves their spatial reasoning and grounding capabilities.