This paper proposes MLLMSeg, a novel approach to the Referring Expression Segmentation (RES) problem. Existing RES methods trade off performance against cost, either relying on the heavily parameterized Segment Anything Model (SAM) or adopting lightweight SAM-free pipelines that sacrifice accuracy. MLLMSeg achieves high performance without an additional vision encoder by fully exploiting the visual detail features already embedded in the vision encoder of the Multimodal Large Language Model (MLLM). Accurate mask prediction is achieved through a detail-enhanced and semantic-consistent feature fusion (DSFF) module that combines detail and semantic information, together with a lightweight mask decoder of only 34M parameters. Experimental results demonstrate that MLLMSeg outperforms both SAM-based and SAM-free methods, striking a good balance between performance and cost.
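To make the described pipeline concrete, the sketch below illustrates one plausible reading of it in PyTorch: detail tokens from the MLLM vision encoder are fused with semantic features via a gated fusion block, and a small convolutional decoder upsamples the fused tokens into a mask. The module names, dimensions, and the gating recipe are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a DSFF-style fusion plus a lightweight mask decoder.
# Dimensions and the fusion mechanism are assumptions for illustration only.
import torch
import torch.nn as nn


class DSFFBlock(nn.Module):
    """Fuses high-resolution detail features with semantic features (assumed design)."""

    def __init__(self, detail_dim=1024, semantic_dim=4096, fused_dim=256):
        super().__init__()
        self.detail_proj = nn.Linear(detail_dim, fused_dim)      # project vision-encoder tokens
        self.semantic_proj = nn.Linear(semantic_dim, fused_dim)  # project LLM semantic tokens
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, detail_tokens, semantic_tokens):
        # detail_tokens:   (B, N, detail_dim)   from the MLLM vision encoder
        # semantic_tokens: (B, N, semantic_dim) aligned language-conditioned features
        d = self.detail_proj(detail_tokens)
        s = self.semantic_proj(semantic_tokens)
        g = self.gate(torch.cat([d, s], dim=-1))  # semantic-consistent gating
        return g * d + (1 - g) * s                # detail-enhanced fusion


class LightMaskDecoder(nn.Module):
    """Small convolutional decoder that upsamples fused tokens into mask logits."""

    def __init__(self, fused_dim=256, grid=24):
        super().__init__()
        self.grid = grid
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(fused_dim, 128, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.GELU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, fused_tokens):
        b, n, c = fused_tokens.shape
        x = fused_tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        return self.decode(x)  # (B, 1, 4*grid, 4*grid) mask logits


if __name__ == "__main__":
    b, n = 2, 24 * 24
    detail = torch.randn(b, n, 1024)     # assumed vision-encoder feature dim
    semantic = torch.randn(b, n, 4096)   # assumed LLM hidden dim
    fused = DSFFBlock()(detail, semantic)
    mask_logits = LightMaskDecoder()(fused)
    print(mask_logits.shape)  # torch.Size([2, 1, 96, 96])
```

In this reading, the decoder stays small because all heavy visual processing is reused from the MLLM's existing vision encoder, which is what lets the method avoid a SAM-scale backbone while keeping detailed spatial cues for the mask.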