This paper highlights that the grounding and referring capabilities of multimodal large language models (MLLMs), which are essential for fine-grained understanding and flexible user interaction, remain underdeveloped in the field of visual document understanding. To address this, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality, fine-grained document data: (1) multi-granular parsing data for enhancing text localization and recognition, and (2) instruction-tuning data to strengthen the grounding and referring capabilities of MLLMs in conversation and reasoning. Based on this data, we construct DOGR-Bench, a benchmark encompassing seven grounding and referring tasks across three document types (charts, posters, and PDF documents). Leveraging the generated data, we further develop DOGR, a strong baseline model that excels at text localization and recognition and accurately grounds and refers to key textual information during conversation and reasoning. DOGR advances document understanding to a finer-grained level and enables flexible interaction paradigms.