Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Created by
  • Haebom

Authors

Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang

VLN-PE: Physically Realistic Vision-and-Language Navigation Platform

Outline

To close the gap between existing vision-and-language navigation (VLN) research and real robots, this paper introduces VLN-PE, a physically realistic VLN platform that supports humanoid, quadrupedal, and wheeled robots. Using VLN-PE, the authors systematically evaluate several ego-centric VLN methods: classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning.
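
To make the evaluated setup concrete, here is a minimal Python sketch of the episode loop a physics-enabled VLN platform of this kind exposes, with single-step discrete action prediction as the policy interface. Every name in it (VLNEnv, RandomAgent, the four-action set) is a hypothetical illustration, not the actual VLN-PE API.

```python
import random
from dataclasses import dataclass

ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]  # discrete action space

@dataclass
class Observation:
    rgb: list        # placeholder for the ego-centric camera frame
    done: bool       # episode over (STOP issued, step budget hit, collision, fall)
    success: bool    # robot stopped within the goal region

class VLNEnv:
    """Toy stand-in for a physics-enabled simulator hosting one embodiment."""

    def __init__(self, embodiment: str):
        self.embodiment = embodiment  # e.g. "humanoid", "quadruped", "wheeled"
        self.steps = 0

    def reset(self, instruction: str) -> Observation:
        self.instruction = instruction
        self.steps = 0
        return Observation(rgb=[], done=False, success=False)

    def step(self, action: str) -> Observation:
        self.steps += 1
        done = action == "STOP" or self.steps >= 50
        # A real platform would simulate physics here and flag collisions or
        # falls; this toy env just ends the episode with a random outcome.
        return Observation(rgb=[], done=done,
                           success=done and random.random() < 0.5)

class RandomAgent:
    """Baseline that ignores its inputs; a real agent conditions on rgb + text."""

    def act(self, obs: Observation, instruction: str) -> str:
        return random.choice(ACTIONS)

def success_rate(agent, env: VLNEnv, instructions) -> float:
    """Run one episode per instruction and return the fraction that succeed."""
    wins = 0
    for text in instructions:
        obs = env.reset(text)
        while not obs.done:
            # Single-step discrete action prediction: one ego-centric view
            # plus the instruction in, one discrete action out, per tick.
            obs = env.step(agent.act(obs, text))
        wins += obs.success
    return wins / len(instructions)

if __name__ == "__main__":
    env = VLNEnv("quadruped")
    print(success_rate(RandomAgent(), env, ["go to the kitchen",
                                            "turn left at the sofa"]))
```

A diffusion-based waypoint predictor or an LLM planner would plug into the same loop by swapping agent.act for a policy that outputs dense waypoints or a planned path rather than one discrete action per tick.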

Takeaways, Limitations

Performance degrades significantly due to the robot's limited observation space, environmental lighting variations, and physical challenges such as collisions and falls.
The experiments also expose the locomotion constraints of legged robots in complex environments.
VLN-PE seamlessly integrates new scenes beyond MP3D, enabling more comprehensive VLN evaluation.
Current models generalize poorly in real-world deployment.
VLN-PE offers a novel way to improve cross-embodiment adaptability.
The results and tools of this study support rethinking the limitations of VLN and developing more robust, practical VLN models.