This paper addresses the problem of ensuring the safety of physical agents in vision-language navigation (VLN). In particular, we focus on drone navigation driven by human-computer interaction, where the drone must understand natural language commands, perceive its environment, and avoid hazards in real time. To this end, we propose a novel scene-aware CBF that exploits egocentric observations from an RGB-D camera, building on control barrier functions (CBFs) and model predictive control (MPC). The baseline system, which uses no CBF, plans paths with a vision-language encoder and an object detection model. On top of this baseline, we propose an adaptive safety margin algorithm (ASMA) that tracks moving objects and evaluates the scene-aware CBF online, imposing the resulting condition as an additional constraint within the MPC framework. Applied to a Parrot Bebop2 quadrotor in a Gazebo environment, ASMA increases the success rate by 64%-67% over the baseline while lengthening the path by only 1.4%-5.8%.
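For context, a minimal sketch of how a discrete-time CBF condition can enter an MPC problem as a constraint is shown below; the barrier $h$, decay rate $\gamma$, dynamics $f$, and stage cost $\ell$ are generic placeholders, not the specific scene-aware formulation developed in this paper.
\[
\begin{aligned}
\min_{u_0,\dots,u_{N-1}} \quad & \sum_{k=0}^{N-1} \ell(x_k, u_k) \\
\text{s.t.} \quad & x_{k+1} = f(x_k, u_k), \\
& h(x_{k+1}) \ge (1-\gamma)\, h(x_k), \qquad 0 < \gamma \le 1,
\end{aligned}
\]
Here $h(x) \ge 0$ encodes the safe set, and the final inequality is the standard discrete-time CBF condition that renders this set forward invariant along the predicted trajectory. An adaptive safety margin can, for instance, be realized by shifting the barrier, e.g., $h(x) = d(x) - \delta_t$ with obstacle distance $d(x)$ and a time-varying margin $\delta_t$; this is an illustrative choice rather than the paper's exact construction.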