This paper proposes LIRA, a novel framework for improving both segmentation and understanding in large multimodal models (LMMs). Although LMMs have made strong progress on segmentation and understanding, they still suffer from two limitations: inaccurate segmentation and hallucinated understanding. LIRA addresses both by leveraging the complementary relationship between visual understanding and segmentation. Its first component, the Semantic-Enhanced Feature Extractor (SEFE), fuses semantic and pixel-level features to improve object attribute inference and enable more accurate segmentation. Its second component, Interleaved Local Visual Coupling (ILVC), extracts local features based on segmentation masks and then autoregressively generates local descriptions, providing fine-grained supervision that mitigates hallucination. To quantify the correlation between segmentation accuracy and the semantics implicitly carried by segmentation tokens, we introduce the Attributes Evaluation (AttrEval) dataset. Experimental results show that LIRA achieves state-of-the-art performance on both segmentation and understanding tasks.
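
As a rough illustration only, the PyTorch sketch below shows one plausible way a module like SEFE could fuse semantic features (from an understanding encoder) with pixel-level features (from a segmentation encoder). The class name, projection layers, dimensions, and MLP fusion are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SEFEFusionSketch(nn.Module):
    """Hypothetical sketch of semantic/pixel feature fusion (not the paper's exact SEFE).

    Both feature streams are projected to a shared width and mixed with a small MLP,
    yielding fused tokens that a downstream decoder could consume.
    """

    def __init__(self, sem_dim: int, pix_dim: int, hidden_dim: int):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, hidden_dim)   # project semantic features
        self.pix_proj = nn.Linear(pix_dim, hidden_dim)   # project pixel-level features
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, sem_feats: torch.Tensor, pix_feats: torch.Tensor) -> torch.Tensor:
        # sem_feats: (B, N, sem_dim) tokens from a semantic (understanding) encoder
        # pix_feats: (B, N, pix_dim) tokens from a pixel-level (segmentation) encoder
        fused = torch.cat([self.sem_proj(sem_feats), self.pix_proj(pix_feats)], dim=-1)
        return self.mlp(fused)  # (B, N, hidden_dim) fused representation


if __name__ == "__main__":
    # Dimensions chosen arbitrarily for the demo.
    fusion = SEFEFusionSketch(sem_dim=1024, pix_dim=256, hidden_dim=512)
    sem = torch.randn(2, 196, 1024)
    pix = torch.randn(2, 196, 256)
    print(fusion(sem, pix).shape)  # torch.Size([2, 196, 512])
```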