Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Planning with Reasoning using Vision Language World Model

Created by
  • Haebom

Author

Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung

Outline

This paper presents the Vision Language World Model (VLWM), a foundation model trained to perform language-based world modeling on natural image data. Given visual observations, VLWM first infers the overall goal, then predicts a trajectory of interleaved actions and world-state changes. Goals are extracted through iterative LLM self-refinement conditioned on compressed future observations represented as caption trees. VLWM learns both an action policy and a dynamics model, which respectively enable reactive System-1 plan decoding and reflective System-2 planning via cost minimization. Costs are assigned by a critic model trained in a self-supervised manner, which evaluates the semantic distance between the hypothetical future states produced by VLWM roll-outs and the expected goal state. VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and the proposed PlannerArena human evaluation, where System-2 improves Elo scores by 27% over System-1. It also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
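To make the System-2 procedure concrete, here is a minimal, hypothetical sketch of planning by cost minimization: candidate action sequences are rolled out through a dynamics model, and a critic scores the semantic distance between each predicted final state and the goal state. All function names, the set-of-facts state representation, and the Jaccard-based distance are illustrative stand-ins, not the paper's actual learned models.

```python
import itertools

# Illustrative sketch only: states are sets of natural-language facts;
# the real VLWM predicts textual world states and scores them with a
# learned critic, not with the stubs below.

def world_model(state, action):
    """Stub dynamics model: each action simply adds its effect facts."""
    effects = {
        "crack eggs": {"eggs cracked"},
        "whisk eggs": {"eggs whisked"},
        "heat pan": {"pan hot"},
        "pour eggs": {"eggs in pan"},
    }
    return state | effects.get(action, set())

def critic_cost(state, goal_state):
    """Stub critic: semantic distance as 1 minus Jaccard similarity."""
    if not state and not goal_state:
        return 0.0
    return 1.0 - len(state & goal_state) / len(state | goal_state)

def system2_plan(init_state, goal_state, actions, horizon=3):
    """Enumerate candidate plans, roll each out through the world model,
    and keep the plan whose final state minimizes the critic's cost."""
    best_plan, best_cost = None, float("inf")
    for plan in itertools.permutations(actions, horizon):
        state = init_state
        for action in plan:
            state = world_model(state, action)
        cost = critic_cost(state, goal_state)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost

goal = {"eggs cracked", "eggs whisked", "pan hot"}
plan, cost = system2_plan(
    set(), goal,
    ["crack eggs", "whisk eggs", "heat pan", "pour eggs"],
    horizon=3,
)
print(plan, cost)  # a zero-cost plan covering all three goal facts
```

In the paper's setting the candidate plans come from the model's own roll-outs rather than exhaustive enumeration, but the selection principle is the same: minimize the critic's estimated distance to the goal.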

Takeaways, Limitations

Takeaways:
Demonstrates that language-based world modeling over natural image data enables effective planning.
Improves performance by combining reactive System-1 and reflective System-2 planning.
Achieves state-of-the-art results on benchmarks and in human evaluation.
Validates the effectiveness of compressed future-observation representations and iterative LLM self-refinement.
Limitations:
The paper does not explicitly discuss its own limitations.
Potential bias toward specific domains (natural images).
The computational cost and processing time of System-2 planning must be considered.
The model's generalization performance requires further validation.