Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Created by
  • Haebom

Authors

Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

Outline

This paper proposes VIPER, a novel framework for visually guided planning. VIPER integrates perception based on a Vision-Language Model (VLM) with reasoning based on a Large Language Model (LLM). It uses a modular pipeline in which the VLM generates textual descriptions of image observations, and the LLM policy then predicts actions from those descriptions and the task goal. The reasoning module is fine-tuned using behavioral cloning and reinforcement learning to improve the agent's decision-making ability. Experimental results on the ALFWorld benchmark show that VIPER significantly outperforms existing state-of-the-art visually guided planners and narrows the performance gap with purely text-based oracles. Because text serves as the intermediate representation, the approach improves explainability and enables detailed analysis of the perception and reasoning components.
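The modular pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ModularAgent`, `stub_vlm`, and `stub_llm` are hypothetical, and the two stubs stand in for a real VLM and LLM so the example runs without any model. It shows only the control flow (image to text description to action) and how keeping the text description in a trace supports inspection of each stage.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type aliases for illustration; these names are not from the paper.
Perceiver = Callable[[bytes], str]              # image -> text description (VLM role)
Policy = Callable[[str, str, List[str]], str]   # (goal, description, actions) -> action (LLM role)


@dataclass
class Step:
    description: str  # what the perception module reported
    action: str       # what the policy chose


@dataclass
class ModularAgent:
    perceive: Perceiver
    decide: Policy
    trace: List[Step] = field(default_factory=list)

    def act(self, image: bytes, goal: str, actions: List[str]) -> str:
        # Stage 1: perception turns the image into text.
        description = self.perceive(image)
        # Stage 2: the policy picks an action from the text and the goal.
        action = self.decide(goal, description, actions)
        # Assumed safeguard: fall back to the first admissible action.
        if action not in actions:
            action = actions[0]
        # The text intermediate is stored so each stage can be inspected later.
        self.trace.append(Step(description, action))
        return action


def stub_vlm(image: bytes) -> str:
    # Stand-in for a VLM; always returns a fixed description.
    return "You see a mug on the countertop."


def stub_llm(goal: str, description: str, actions: List[str]) -> str:
    # Stand-in for an LLM policy: pick the action sharing the most words
    # with the goal and the description.
    context = set((goal + " " + description).lower().replace(".", "").split())
    return max(actions, key=lambda a: len(context & set(a.lower().split())))


agent = ModularAgent(stub_vlm, stub_llm)
chosen = agent.act(
    b"",
    "put the mug in the sink",
    ["go to sink", "take mug from countertop", "open fridge"],
)
print(chosen)
print(agent.trace[0].description)
```

Because the two stages communicate only through text, a failure can be attributed to either a wrong description (perception) or a wrong choice given a correct description (reasoning) by reading `agent.trace`.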

Takeaways, Limitations

Takeaways:
The paper presents a novel approach to visually guided planning that combines VLM-based perception with LLM-based reasoning.
VIPER outperforms existing state-of-the-art visually guided planners and narrows the performance gap with text-based oracles.
Using text as the intermediate representation makes the planning process more explainable.
The modular design allows the perception and reasoning components to be analyzed separately and in detail.
Limitations:
Results are reported only on the ALFWorld benchmark; generalization to other environments has not been verified.
The integration of the VLM and LLM and the fine-tuning process may not be explained in sufficient detail.
Applicability to real-world settings is not examined.