This paper proposes JARVIS, a neuro-symbolic commonsense reasoning framework for building conversational embodied agents that complete real-world tasks. To overcome the limitations of purely symbolic methods and end-to-end deep learning models, JARVIS acquires symbolic representations by prompting a large language model (LLM) for language understanding and subgoal planning, and constructs semantic maps from visual observations. A symbolic reasoning module then performs subgoal planning and action generation based on task-level and action-level common sense. Experiments on the TEACh dataset show that JARVIS achieves state-of-the-art performance on all three dialogue-based embodied tasks (EDH, TfD, and TATC), significantly improving the success rate on the EDH task from 6.1% to 15.8%. We also systematically analyze the key factors that affect task performance and demonstrate the superiority of our framework in few-shot settings. Finally, our JARVIS-based agent won first place in the Alexa Prize SimBot Public Benchmark Challenge.
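The neuro-symbolic pipeline summarized above can be illustrated with a minimal sketch: a language-model component proposes symbolic subgoals from dialogue, and a symbolic module expands each subgoal into primitive actions via commonsense rules. This is not the authors' implementation; the LLM call is stubbed out, and all predicate names, object names, and rules here are hypothetical.

```python
# Illustrative sketch of an LLM-to-symbolic planning pipeline.
# All predicates, objects, and rules are hypothetical stand-ins.

def llm_propose_subgoals(dialogue: str) -> list[tuple[str, str]]:
    """Stand-in for prompting an LLM: maps a task instruction to
    (predicate, object) subgoals. A real system would query a model here."""
    if "coffee" in dialogue.lower():
        return [("isPickedUp", "Mug"), ("isFilledWithCoffee", "Mug")]
    return []

# Hypothetical action-level commonsense: how each subgoal predicate
# expands into a sequence of primitive agent actions.
SUBGOAL_TO_ACTIONS = {
    "isPickedUp": lambda obj: [("Navigate", obj), ("Pickup", obj)],
    "isFilledWithCoffee": lambda obj: [
        ("Navigate", "CoffeeMachine"),
        ("Place", obj),
        ("ToggleOn", "CoffeeMachine"),
    ],
}

def plan_actions(dialogue: str) -> list[tuple[str, str]]:
    """Symbolic module: expand LLM-proposed subgoals into primitive actions."""
    actions: list[tuple[str, str]] = []
    for predicate, obj in llm_propose_subgoals(dialogue):
        actions.extend(SUBGOAL_TO_ACTIONS[predicate](obj))
    return actions

plan = plan_actions("Robot, please make me a cup of coffee.")
# First step is navigating to the mug before picking it up.
```

In a full system, the semantic map built from visual observations would ground object names like `Mug` to locations, and the rule table would be replaced by a richer commonsense knowledge base.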