In this paper, we propose a DOcument Grounding and Referring data engine (DOGR-Engine) to strengthen the grounding and referring capabilities of multimodal large language models (MLLMs), which remain underdeveloped in visual document understanding due to the lack of fine-grained datasets and comprehensive benchmarks. DOGR-Engine generates two types of high-quality fine-grained document data: multi-granular parsing data, which improves text localization and recognition, and instruction-tuning data, which activates the grounding and referring capabilities of MLLMs during dialogue and reasoning. Using the generated data, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (charts, posters, and PDF documents), and develop DOGR, a strong baseline model that excels at text localization and recognition and accurately grounds and refers to key textual information during dialogue and reasoning. As a result, DOGR enables more fine-grained document understanding and more flexible interaction paradigms.