This paper proposes MLLMSeg, a novel approach to the Referring Expression Segmentation (RES) problem. Existing RES methods suffer from a trade-off between performance and cost. We instead exploit the fine-grained visual features already present in the visual encoder of a multimodal large language model (MLLM), achieving strong performance without an additional visual encoder. Specifically, we propose a detail-enhanced and semantic-consistent feature fusion (DSFF) module that fully integrates these detailed visual features with the semantic features output by the large language model (LLM) of the MLLM. Accurate mask prediction is then performed by a lightweight mask decoder with only 34M parameters. Experimental results demonstrate that MLLMSeg outperforms both SAM-based and SAM-free competing methods, striking a good balance between performance and cost.
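To make the overall pipeline concrete, the sketch below illustrates one plausible way to fuse detail features from an MLLM's visual encoder with semantic features from its LLM and decode them with a small convolutional mask head. It is a minimal illustration under assumed dimensions and an assumed fusion scheme, not the paper's actual DSFF module or decoder; all class names and shapes here are hypothetical.

```python
# Hypothetical sketch: fuse per-patch visual detail features with an LLM
# semantic feature, then predict a mask with a small head. Dimensions and
# the fusion scheme are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class DetailSemanticFusion(nn.Module):
    """Fuses patch-level visual features with a pooled LLM semantic feature."""
    def __init__(self, vis_dim=1024, llm_dim=4096, fused_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, fused_dim)   # project visual-encoder features
        self.sem_proj = nn.Linear(llm_dim, fused_dim)   # project LLM hidden state
        self.fuse = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.GELU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, vis_feats, sem_feat):
        # vis_feats: (B, N, vis_dim) patch features from the MLLM's visual encoder
        # sem_feat:  (B, llm_dim) semantic feature taken from the LLM output
        v = self.vis_proj(vis_feats)                           # (B, N, fused_dim)
        s = self.sem_proj(sem_feat).unsqueeze(1).expand_as(v)  # broadcast to every patch
        return self.fuse(torch.cat([v, s], dim=-1))            # (B, N, fused_dim)

class LightMaskDecoder(nn.Module):
    """Small convolutional head that upsamples fused features into mask logits."""
    def __init__(self, fused_dim=256, grid=24):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.ConvTranspose2d(fused_dim, 128, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.GELU(),
            nn.Conv2d(64, 1, 1),  # per-pixel mask logits
        )

    def forward(self, fused):
        B, N, C = fused.shape
        x = fused.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        return self.head(x)  # (B, 1, 4*grid, 4*grid)

# Usage with random tensors standing in for MLLM outputs.
vis = torch.randn(2, 24 * 24, 1024)   # assumed 24x24 patch grid
sem = torch.randn(2, 4096)            # assumed LLM hidden size
mask_logits = LightMaskDecoder()(DetailSemanticFusion()(vis, sem))
print(mask_logits.shape)  # torch.Size([2, 1, 96, 96])
```

The design choice illustrated here is only that the heavy MLLM stays frozen as a feature provider while a compact fusion-plus-decoder head does the dense prediction, which is consistent with the abstract's emphasis on a small (34M-parameter) mask decoder.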