In this paper, we propose a distributed Encode-Prefill-Decode (EPD) framework to address performance degradation in large multimodal models (LMMs). LMMs process diverse inputs such as images, audio, and video, but their multimodal encoding step adds substantial computation and memory overhead, degrading key service-level objectives (SLOs) such as response time. The EPD framework addresses these issues by disaggregating the encoding, prefill, and decoding stages onto dedicated resources. Through multimedia token caching, parallelized encoding, an optimal resource-allocation module, and a role-switching mechanism, it significantly improves memory efficiency, enabling larger batch sizes, more images per request, and larger KV caches, thereby improving SLO attainment and response time.
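To illustrate the stage separation described above, the following is a minimal sketch of EPD-style disaggregation as a three-stage pipeline with dedicated workers connected by queues, including a simple multimedia token cache. All names (`encoder`, `prefiller`, `decoder`, `run_epd`) and the toy request format are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import queue
import threading

def encoder(in_q, out_q, token_cache):
    # Encode stage: turn multimodal inputs into tokens, caching results so
    # repeated media are not re-encoded (multimedia token caching).
    while True:
        req = in_q.get()
        if req is None:            # shutdown signal, forwarded downstream
            out_q.put(None)
            return
        key = req["media"]
        if key not in token_cache:
            token_cache[key] = [f"tok({key},{i})" for i in range(3)]
        req["tokens"] = token_cache[key]
        out_q.put(req)

def prefiller(in_q, out_q):
    # Prefill stage: build the KV cache from media tokens plus the prompt.
    while True:
        req = in_q.get()
        if req is None:
            out_q.put(None)
            return
        req["kv_cache"] = list(req["tokens"]) + [req["prompt"]]
        out_q.put(req)

def decoder(in_q, results):
    # Decode stage: generate output using the prefilled KV cache.
    while True:
        req = in_q.get()
        if req is None:
            return
        results[req["id"]] = f"answer(len_kv={len(req['kv_cache'])})"

def run_epd(requests):
    # Each stage runs on its own worker, standing in for dedicated resources.
    q_in, q_ep, q_pd = queue.Queue(), queue.Queue(), queue.Queue()
    results, cache = {}, {}
    threads = [
        threading.Thread(target=encoder, args=(q_in, q_ep, cache)),
        threading.Thread(target=prefiller, args=(q_ep, q_pd)),
        threading.Thread(target=decoder, args=(q_pd, results)),
    ]
    for t in threads:
        t.start()
    for r in requests:
        q_in.put(r)
    q_in.put(None)
    for t in threads:
        t.join()
    return results, cache

reqs = [
    {"id": 0, "media": "img_a", "prompt": "describe"},
    {"id": 1, "media": "img_a", "prompt": "caption"},  # cache hit on img_a
]
out, cache = run_epd(reqs)
print(out)         # one answer per request
print(len(cache))  # → 1: img_a encoded only once
```

In a real deployment the three roles would run on separate devices and the queues would be network channels; the cache hit on the repeated image shows how token caching avoids redundant encoding work.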