This paper addresses multi-sensory human motion generation, a critical challenge in fields such as computer vision, human-computer interaction, and animation. While diffusion-based text-to-motion synthesis has produced high-quality motion, fine-grained expressive control remains challenging. This is largely because existing datasets lack stylistic diversity and because quantitative movement characteristics are difficult to express in natural language. This study aims to generate interpretable and expressive human motion by integrating methods for quantifying Laban effort and shape components into a text-based motion generation model. The proposed method requires no additional motion data: it is a zero-shot, inference-time optimization that updates the text embeddings of a pretrained diffusion model during the sampling phase, guiding the generated motion toward the desired Laban effort and shape components.
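The following is a minimal sketch of the inference-time guidance loop described above, assuming a PyTorch text-to-motion diffusion model. The callables `denoise_step`, `decode_motion`, and `laban_loss` are hypothetical stand-ins for the pretrained denoiser, the motion decoder, and a differentiable estimator of Laban effort/shape components; they are not part of any specific library, and the tensor shapes are illustrative only.

```python
# Hedged sketch: inference-time optimization of text embeddings toward
# target Laban effort/shape values. All model callables are hypothetical.
import torch


def guided_sampling(denoise_step, decode_motion, laban_loss, text_emb,
                    target_laban, num_steps=50, opt_iters=5, lr=1e-2):
    """Sample motion while nudging the text embedding toward target Laban values."""
    x = torch.randn(1, 196, 263)                    # noisy motion sequence (frames x features), illustrative shape
    text_emb = text_emb.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([text_emb], lr=lr)

    for t in reversed(range(num_steps)):
        # Update the text embedding so the predicted clean motion matches
        # the desired Laban effort/shape quantification.
        for _ in range(opt_iters):
            optimizer.zero_grad()
            x0_pred = denoise_step(x, t, text_emb)  # predicted clean motion
            motion = decode_motion(x0_pred)         # e.g. joint positions
            loss = laban_loss(motion, target_laban) # gap to target Laban components
            loss.backward()
            optimizer.step()

        # Take the regular reverse diffusion step with the updated embedding.
        with torch.no_grad():
            x = denoise_step(x, t, text_emb)

    return decode_motion(x)
```

Because only the text embedding is optimized and the diffusion model's weights stay frozen, this kind of guidance stays zero-shot with respect to motion data, at the cost of extra gradient computations at each sampling step.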