This page curates AI-related papers published worldwide. All content is summarized by Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
This paper introduces VIPER, a novel multimodal framework for visually guided planning that integrates Vision-Language Model (VLM)-based perception with Large Language Model (LLM)-based reasoning. It uses a modular pipeline in which a frozen VLM generates textual descriptions of image observations, which an LLM policy then combines with the task objective to predict actions. The reasoning module is fine-tuned with behavior cloning and reinforcement learning to strengthen the agent's decision-making. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visually guided planners and narrows the gap with purely text-based oracles. By using text as an intermediate representation, VIPER improves explainability and enables fine-grained analysis of its perception and reasoning components.
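Below is a minimal Python sketch of the perception-to-reasoning loop described above. The helper names (`describe_observation`, `propose_action`) and the environment interface are illustrative assumptions, not the paper's actual API; the sketch only shows how text serves as the intermediate representation between the frozen VLM and the LLM policy.

```python
# Minimal sketch of a VIPER-style pipeline: a frozen VLM turns image
# observations into text, and an LLM policy maps (goal, history, scene text)
# to the next action. All names and the env interface are illustrative.

def describe_observation(image) -> str:
    """Perception step: a frozen VLM would caption the image observation."""
    # Placeholder output standing in for a real VLM call.
    return "You are facing a countertop with a mug and a coffee machine."

def propose_action(goal: str, history: list[str], scene_text: str) -> str:
    """Reasoning step: an LLM policy would predict the next action from text."""
    prompt = (
        f"Goal: {goal}\n"
        f"Previous actions: {history}\n"
        f"Current observation: {scene_text}\n"
        f"Next action:"
    )
    # Placeholder output standing in for a real LLM call on `prompt`.
    return "take mug 1 from countertop 1"

def run_episode(env, goal: str, max_steps: int = 30) -> list[str]:
    """Roll out one episode: perceive, reason, act, repeat."""
    history: list[str] = []
    obs = env.reset()                                   # assumed ALFWorld-like env API
    for _ in range(max_steps):
        scene_text = describe_observation(obs)          # VLM: image -> text
        action = propose_action(goal, history, scene_text)  # LLM: text -> action
        obs, done = env.step(action)                    # assumed to return (next_obs, done)
        history.append(action)
        if done:
            break
    return history
```

In this setup, behavior cloning and reinforcement learning would fine-tune only the LLM policy (the `propose_action` role here), while the VLM remains frozen.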
Takeaways, Limitations
• Takeaways:
◦ Presents a novel framework that effectively addresses visually guided planning by integrating a VLM with an LLM.
◦ Uses text as an intermediate representation, improving model explainability and enabling separate analysis of the perception and reasoning components.
◦ Outperforms previous state-of-the-art models on the ALFWorld benchmark.
◦ Improves agent decision-making through behavior cloning and reinforcement learning.
• Limitations:
◦ Evaluation relies on the ALFWorld benchmark, so generalization to other environments requires further verification.
◦ Potential performance degradation and efficiency issues arising from integrating a VLM with an LLM need further study.
◦ A performance gap remains relative to purely text-based oracles.