This paper addresses motion customization: generating videos that exhibit a motion concept specified by a set of reference video clips sharing that motion, using a diffusion model (DM). Previous studies have explored various ways of representing and embedding motion concepts into large-scale pretrained text-to-video diffusion models (e.g., learning motion LoRAs or using latent noise residuals). However, these methods inevitably also encode the appearance of the reference videos, which weakens the model's appearance generation capability. This paper follows the common approach of learning motion LoRAs to encode motion concepts, but proposes two novel strategies, temporal attention purification (TAP) and appearance highway (AH), to improve the separation of motion from appearance. In TAP, we assume that the pretrained value embeddings are sufficient building blocks for generating new motions, and we reshape the temporal attention solely with motion LoRAs so that new combinations of these value embeddings produce the new motion. In AH, we move the starting point of each skip connection in the U-Net from the output of each temporal attention module to the output of each spatial attention module. Experimental results show that, compared with existing methods, the proposed method generates videos whose appearance is more consistent with the text descriptions and whose motion is more consistent with the reference videos.
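To make the two strategies concrete, the following is a minimal PyTorch sketch rather than the authors' implementation. It assumes TAP can be illustrated by attaching motion LoRAs only to the query/key projections of a temporal attention layer while keeping the pretrained value projection frozen, and AH by taking a block's skip feature from the spatial attention output instead of the temporal attention output. All class, module, and parameter names here are hypothetical.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity-preserving residual
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class TemporalAttentionTAP(nn.Module):
    """Temporal self-attention where only Q/K carry motion LoRAs (TAP-style).

    The value projection stays frozen, so the LoRAs can only reshape the
    attention weights over the pretrained value embeddings.
    """

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.to_q = LoRALinear(nn.Linear(dim, dim), rank)
        self.to_k = LoRALinear(nn.Linear(dim, dim), rank)
        self.to_v = nn.Linear(dim, dim)  # frozen, no LoRA
        for p in self.to_v.parameters():
            p.requires_grad_(False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch*pixels, frames, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)


class VideoBlockAH(nn.Module):
    """Toy spatial+temporal block illustrating the appearance highway (AH).

    The skip feature is taken after spatial attention (appearance) rather
    than after temporal attention (motion).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal_attn = TemporalAttentionTAP(dim)

    def forward(self, x):  # x: (batch, frames, pixels, dim)
        b, f, p, d = x.shape
        # Spatial attention: attend over pixels within each frame.
        xs = x.reshape(b * f, p, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        skip = xs.reshape(b, f, p, d)  # AH: skip taken here, before temporal attention
        # Temporal attention: attend over frames at each pixel location.
        xt = skip.permute(0, 2, 1, 3).reshape(b * p, f, d)
        xt = xt + self.temporal_attn(xt)
        out = xt.reshape(b, p, f, d).permute(0, 2, 1, 3)
        return out, skip


if __name__ == "__main__":
    block = VideoBlockAH(dim=64)
    video = torch.randn(2, 8, 16, 64)  # (batch, frames, pixels, channels)
    out, skip = block(video)
    print(out.shape, skip.shape)  # both torch.Size([2, 8, 16, 64])
```

In this sketch, only the LoRA parameters would be trained on the reference clips; the frozen value projection and the appearance-carrying skip path are meant to mirror, under the stated assumptions, how TAP and AH limit what the motion LoRAs can alter.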