Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Created by
  • Haebom

Authors

Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu

Outline

This paper proposes a self-evolving vision-language navigation framework (SE-VLN) to overcome the limitations of large language models (LLMs) in vision-language navigation (VLN), where current agents do not reuse experience from earlier episodes. SE-VLN consists of three modules: a hierarchical memory module that turns successful and failed cases into reusable knowledge; a retrieval-augmented, thought-based reasoning module that retrieves relevant experience and supports multi-step decision-making; and a reflection module that drives continual evolution. On the R2R and REVERIE datasets, it improves on the previous state-of-the-art methods by 23.9% and 15.0%, respectively, reaching success rates of 57% and 35.2% in unseen environments. The experiments also show that performance improves as the experience repository grows, which suggests considerable potential as a self-evolving VLN agent framework.
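The loop described above (store experience, retrieve it for a new instruction, reflect after each episode) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Experience`, `ExperienceRepository`, `reflect`, `retrieve`, and `build_prompt` are hypothetical, and simple word-overlap (Jaccard) similarity stands in for whatever retrieval method the paper actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    instruction: str  # navigation instruction the agent was given
    success: bool     # whether the episode reached the goal
    lesson: str       # reusable takeaway written after the episode


def _tokens(text: str) -> set[str]:
    """Lowercased bag of words; a stand-in for a real text encoder."""
    return set(text.lower().split())


@dataclass
class ExperienceRepository:
    """Hypothetical flat memory; the paper describes a hierarchical one."""
    items: list[Experience] = field(default_factory=list)

    def reflect(self, instruction: str, success: bool, note: str) -> Experience:
        """Turn a finished episode into a stored, reusable experience."""
        prefix = "Worked" if success else "Avoid"
        exp = Experience(instruction, success, f"{prefix}: {note}")
        self.items.append(exp)
        return exp

    def retrieve(self, instruction: str, k: int = 2) -> list[Experience]:
        """Return up to k stored experiences most similar to the instruction."""
        query = _tokens(instruction)

        def score(exp: Experience) -> float:
            other = _tokens(exp.instruction)
            union = query | other
            return len(query & other) / len(union) if union else 0.0

        return sorted(self.items, key=score, reverse=True)[:k]


def build_prompt(repo: ExperienceRepository, instruction: str) -> str:
    """Prepend retrieved lessons to the instruction before calling an LLM."""
    lessons = "\n".join(f"- {e.lesson}" for e in repo.retrieve(instruction))
    return f"Past experience:\n{lessons}\nInstruction: {instruction}"


repo = ExperienceRepository()
repo.reflect("walk past the sofa and stop at the kitchen door", True,
             "the kitchen door is left of the sofa")
repo.reflect("go upstairs and enter the bedroom", False,
             "turning right at the stairs leads to a dead end")
print(build_prompt(repo, "walk to the kitchen door"))
```

In this sketch both successes and failures are kept, so a larger repository gives the retriever more relevant lessons to draw on, which mirrors the summary's claim that performance improves as experience accumulates.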

Takeaways, Limitations

Takeaways:
According to the authors, this is the first attempt to bring experiential knowledge utilization and self-evolution capabilities to LLM-based VLN.
It delivers significant performance improvements over previous state-of-the-art methods on the R2R and REVERIE datasets.
It demonstrates the potential of self-evolving agents by showing that performance improves as experience accumulates.
Limitations:
The computational cost and complexity of the proposed framework are not analyzed.
Further evaluation of generalization performance across different environments and tasks is needed.
Further research and development is needed for real-world applications.