
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

Created by
  • Haebom

Author

Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

Outline

This paper addresses Vision-Language Navigation (VLN), in which an agent follows natural-language instructions to navigate to a target location. Recent studies have shown that the reasoning ability of open-source large language models (LLMs) can improve navigation performance while reducing the domain gap between LLM training corpora and the VLN task. However, existing approaches mostly adopt a direct input-output mapping, which makes the mapping hard to learn and leaves navigation decisions unexplainable. This paper proposes EvolveNav, a novel self-improving embodied reasoning framework for LLM-based VLN. EvolveNav consists of two stages: formalized Chain-of-Thought (CoT) supervised fine-tuning and self-reflective post-training. In the first stage, the model is fine-tuned with formalized CoT labels to activate its navigational reasoning capability and increase inference speed. In the second stage, the model's own reasoning outputs are iteratively used as self-enriched CoT labels, increasing the diversity of supervision. A self-reflective auxiliary task is also introduced that encourages learning correct reasoning patterns by contrasting them with incorrect ones. Experimental results show that EvolveNav outperforms previous LLM-based VLN approaches on popular VLN benchmarks.
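To make the two-stage idea concrete, here is a minimal sketch of how formalized CoT labels and self-reflective data collection might be structured. All names (`format_cot_label`, `self_improving_round`, the observation/action fields) are hypothetical illustrations, not the authors' actual code or templates.

```python
# Hedged sketch of EvolveNav-style training data handling.
# Assumptions: a fixed CoT template and a toy success signal; the real
# paper's label format and training loop are not reproduced here.

def format_cot_label(observation: str, reasoning: str, action: int) -> str:
    """Stage 1: build a formalized CoT label. A fixed template keeps the
    reasoning structured, which makes the mapping easier to learn and the
    output shorter (hence faster) than free-form chains of thought."""
    return f"Observation: {observation}\nThought: {reasoning}\nAction: {action}"

def self_improving_round(model_outputs, ground_truth_actions, label_pool):
    """Stage 2: one round of self-reflective data collection. Outputs whose
    predicted action matches the ground truth are kept as self-enriched CoT
    labels (diversifying supervision); mismatches are kept as negatives for
    a contrastive auxiliary task against the correct reasoning patterns."""
    positives, negatives = [], []
    for out, gt in zip(model_outputs, ground_truth_actions):
        (positives if out["action"] == gt else negatives).append(out)
    label_pool.extend(positives)  # reuse the model's own correct reasoning
    return positives, negatives
```

For example, an output predicting action 2 against ground truth 2 would be added to the label pool, while a mismatched prediction would be retained only as a negative for the contrastive objective.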

Takeaways, Limitations

Takeaways:
Presents a novel framework (EvolveNav) that improves both reasoning ability and navigation accuracy in LLM-based VLN.
Demonstrates an effective training strategy combining formalized CoT labels with self-reflective post-training.
Encourages learning of correct reasoning patterns through a self-reflective auxiliary task.
Outperforms existing LLM-based VLN approaches on standard benchmarks.
Limitations:
Due to the complexity of the navigation task, perfect CoT labels may be difficult to obtain, and purely supervised CoT fine-tuning may lead to overfitting.
Further validation of the generalization performance of the proposed framework is needed.
Robustness assessment for diverse environments and complex navigation tasks is needed.