Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning

Created by
  • Haebom

Author

Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Hanwang Zhang, Liang Lin, Bokui Chen, Cewu Lu, Xiaodan Liang

Outline

This paper proposes a novel approach to improving the performance of vision-language navigation (VLN) with open-source large language models (LLMs). The proposed model, EvolveNav, uses a two-stage training framework: first, supervised fine-tuning with formalized Chain-of-Thought (CoT) labels activates the model's reasoning ability and speeds up inference on VLN tasks; second, a self-reflective post-training stage uses the model's own reasoning outputs as self-enriched CoT labels to increase supervision diversity, while contrasting correct with incorrect reasoning patterns encourages the model to learn accurate navigational reasoning. Experimental results demonstrate the superiority of EvolveNav over existing LLM-based VLN approaches on various benchmarks.
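The self-reflective post-training stage can be pictured as a loss with two terms: an imitation term that fits the model's own correct reasoning trace, and a contrastive term that pushes incorrect reasoning patterns away. The sketch below is a toy illustration under assumed forms (per-token log-probabilities as inputs, a margin-based contrastive term, and the `alpha` weighting are my assumptions, not the paper's exact loss):

```python
def nll(logprobs):
    """Negative log-likelihood of a label sequence, given per-token log-probs."""
    return -sum(logprobs)

def self_reflective_loss(pos_logprobs, neg_logprobs, margin=1.0, alpha=0.5):
    """Toy sketch of a self-reflective training objective (assumed form):
    imitate the self-generated correct CoT label, and add a margin-based
    contrastive penalty when the incorrect reasoning pattern is not
    sufficiently less likely than the correct one.
    """
    imitation = nll(pos_logprobs)  # fit the correct (self-enriched) CoT label
    # Penalize only when the NLL gap between wrong and right is below the margin.
    contrast = max(0.0, margin - (nll(neg_logprobs) - nll(pos_logprobs)))
    return imitation + alpha * contrast
```

When the incorrect trace is already much less likely than the correct one, the contrastive term vanishes and only the imitation term remains; when the two are close in likelihood, the extra penalty drives them apart.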

Takeaways, Limitations

Takeaways:
  • Presents a novel two-stage training framework that improves the reasoning ability of LLMs on VLN tasks.
  • Improves the accuracy and interpretability of navigation decisions through the CoT approach.
  • Increases supervision diversity and generalization ability through self-reflective post-training.
  • Outperforms existing LLM-based VLN approaches on various benchmarks.
  • Open-sourced code improves the reproducibility and usability of the research.
Limitations:
  • The approach mitigates, but does not fully resolve, overfitting caused by the absence of perfect CoT labels.
  • The model may be specialized to particular VLN environments; further study is needed on its performance in other settings.
  • The model's complexity and computational cost are not discussed, so its practical applicability requires further analysis.