In this paper, we propose a multi-scenario inference architecture that addresses key limitations in multi-modal understanding and enhances the cognitive autonomy of humanoid robots. We evaluate the architecture through a simulation-based experimental design, building a simulator called Maha that synthesizes multi-modal inputs including vision, hearing, and touch. The experimental results demonstrate the feasibility of the proposed architecture for processing multi-modal data and provide a reference for exploring cross-modal interaction strategies for humanoid robots in dynamic environments. In addition, at the cognitive level, multi-scenario inference emulates the high-level reasoning mechanisms of the human brain, facilitating task transfer and semantics-based action planning across scenarios. This points toward the future development of humanoid robots that learn and act autonomously in changing environments.