This paper studies the problem of embodied agents that must interpret ambiguous and incomplete human instructions. We introduce the "Ask-to-Act" task, in which an agent must ask relevant clarifying questions to resolve ambiguity, navigate under partial observability, and perform single- or multi-object relocation. Our approach fine-tunes a multimodal large language model (MLLM) as a vision-language-action (VLA) policy using online reinforcement learning (RL) with LLM-generated rewards, eliminating the need for large-scale human demonstrations or manually engineered rewards. The proposed method outperforms strong zero-shot and supervised fine-tuned MLLM baselines, including GPT-4o, and generalizes well to unseen scenes and tasks.
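To make the training setup concrete, the toy sketch below illustrates the kind of loop the abstract describes: a policy that may ask a clarifying question before acting on an ambiguous instruction, with the rollout scored by an LLM-style reward instead of a hand-designed one. This is a minimal illustration under assumed names (AskToActEnv, MLLMPolicy, llm_reward are hypothetical stand-ins), not the paper's implementation; the actual MLLM, simulator, and policy-gradient update are elided.

```python
# Hypothetical sketch of RL with LLM-generated rewards for Ask-to-Act;
# none of these classes or functions come from the paper's codebase.
import random
from dataclasses import dataclass


@dataclass
class Step:
    observation: str
    action: str


class AskToActEnv:
    """Toy stand-in for an ambiguous object-relocation environment."""

    def __init__(self):
        self.instruction = "Bring the cup to the table"  # ambiguous: which cup?
        self.target = random.choice(["red cup", "blue cup"])
        self.done = False

    def step(self, action: str) -> str:
        if action.startswith("ask:"):
            return f"user: the {self.target}"  # simulated clarification answer
        self.done = True
        return f"picked {action}"


class MLLMPolicy:
    """Toy policy: ask when the referent is uncertain, then act on the answer."""

    def act(self, observation: str) -> str:
        if "user:" in observation:
            return observation.split("user: the ")[-1]  # act on the clarified object
        return "ask: which cup do you mean?"  # resolve ambiguity first


def llm_reward(trajectory: list[Step], target: str) -> float:
    """Stand-in for an LLM-generated reward: +1 for relocating the right
    object, with a small penalty per question to discourage over-asking."""
    asked = sum(1 for s in trajectory if s.action.startswith("ask:"))
    success = any(s.action == target for s in trajectory)
    return float(success) - 0.1 * asked


def rollout() -> None:
    env, policy, trajectory = AskToActEnv(), MLLMPolicy(), []
    obs = env.instruction
    while not env.done:
        action = policy.act(obs)
        obs = env.step(action)
        trajectory.append(Step(obs, action))
    ret = llm_reward(trajectory, env.target)
    # In actual online RL this return would drive a policy-gradient update
    # of the MLLM's parameters; here we only report it.
    print(f"target={env.target}  return={ret:.1f}")


if __name__ == "__main__":
    rollout()
```

In this sketch the reward trades off task success against the number of questions asked, which captures the core tension of the task: asking too little risks acting on the wrong object, while asking too much burdens the user.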