This paper addresses the challenge of building intelligent agents that autonomously interact with their environment to perform routine tasks following human-level instructions. Such a system requires a fundamental understanding of the world to accurately interpret these instructions, as well as precise low-level movement and interaction skills to execute the derived actions. We present the first complete system that synthesizes physically plausible, long-term human-object interactions for object manipulation in contextual environments. Leveraging a large language model (LLM), we interpret input instructions into detailed execution plans. Unlike previous work, we generate finger-object interactions that seamlessly coordinate with full-body movements. Furthermore, we train a policy via reinforcement learning (RL) to track the generated motions in physics simulation, ensuring their physical plausibility. Experimental results demonstrate the system's effectiveness in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for practical applications.