This paper addresses the problem of learning and fine-tuning expressive policies with online reinforcement learning (RL) using offline datasets. Expressive policies (such as diffusion and flow-matching policies) are parameterized by long denoising chains, which makes stable value maximization difficult. To address this, the paper proposes constructing an online RL policy that maximizes the Q-value, rather than directly optimizing the value through the expressive policy. Specifically, the proposed algorithm, expressive policy optimization (EXPO), combines a pre-trained expressive base policy, trained with a stable imitation learning objective, with a lightweight Gaussian edit policy that enhances the value of the sampled actions. EXPO refines actions sampled from the base policy using the learned edit policy, and selects the value-maximizing action among the base and edited actions for both sampling and temporal difference (TD) backups.
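
To make the action-selection step concrete, below is a minimal sketch of how candidates from a base policy and an edit policy could be compared by Q-value. The names `base_policy`, `edit_policy`, and `q_function`, their signatures, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(obs, n_samples=4):
    # Placeholder for the pre-trained expressive (e.g., diffusion) base policy:
    # returns a batch of candidate actions for the given observation.
    return rng.normal(size=(n_samples, 2))

def edit_policy(obs, actions):
    # Placeholder for the lightweight Gaussian edit policy: proposes a small
    # perturbation of each base action intended to increase its value.
    mean, std = 0.1 * actions, 0.05
    return actions + mean + std * rng.normal(size=actions.shape)

def q_function(obs, actions):
    # Placeholder critic: one Q-value per candidate action.
    return -np.sum(actions ** 2, axis=-1)

def select_action(obs):
    """Pick the highest-Q action among base and edited candidates.

    In EXPO, the same selection rule is used both for sampling actions
    during interaction and for forming the TD backup target.
    """
    base_actions = base_policy(obs)
    edited_actions = edit_policy(obs, base_actions)
    candidates = np.concatenate([base_actions, edited_actions], axis=0)
    q_values = q_function(obs, candidates)
    return candidates[np.argmax(q_values)]

obs = np.zeros(3)  # dummy observation
print(select_action(obs))
```

This is only a sketch of the selection rule described in the summary; the actual method additionally trains the critic and the edit policy online.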