Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

Created by
  • Haebom

Authors

Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi

Outline

This paper studies embodied agents that must interpret ambiguous and incomplete human instructions in a home environment. It introduces the "Ask-to-Act" task, which requires the agent to ask relevant clarifying questions to resolve ambiguity, navigate under partial observation, and perform single- or multi-object rearrangement. The proposed approach fine-tunes a multimodal large language model (MLLM) as a vision-language-action (VLA) policy using online reinforcement learning (RL) with rewards generated by an LLM, eliminating the need for large-scale human demonstrations or manually designed rewards. The method outperforms strong zero-shot and supervised MLLM baselines, including GPT-4o, and generalizes well to new scenes and tasks.
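
To make the training setup described above concrete, here is a minimal sketch of an online RL loop in which a policy (standing in for the fine-tuned MLLM) can either ask a clarifying question or act, and receives rewards from a stand-in LLM reward function. The environment, policy, reward values, and names (ToyAskToActEnv, ToyMLLMPolicy, llm_generated_reward) are hypothetical illustrations under assumed reward shaping, not the paper's actual implementation.

```python
# Toy sketch: an agent that may ask a clarifying question before acting,
# trained online with rewards from an LLM stand-in. All components are
# hypothetical placeholders for illustration only.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    observation: str
    action: str
    reward: float

class ToyAskToActEnv:
    """Stand-in for a partially observed rearrangement environment (hypothetical)."""
    def __init__(self):
        self.instruction = "bring the cup"               # ambiguous: which cup?
        self.target = random.choice(["red cup", "blue cup"])
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return f"instruction: {self.instruction}"

    def step(self, action: str):
        self.steps += 1
        done = action.startswith("pick") or self.steps >= 5
        obs = f"user answer: the {self.target}" if action.startswith("ask") else "scene unchanged"
        success = action == f"pick {self.target}"
        return obs, success, done

def llm_generated_reward(action: str, success: bool) -> float:
    """Placeholder for an LLM reward model: rewards success and relevant questions (assumed values)."""
    if success:
        return 1.0
    if action.startswith("ask"):
        return 0.1    # small bonus for a relevant clarifying question (assumed shaping)
    return -0.05      # mild step penalty (assumed)

class ToyMLLMPolicy:
    """Stand-in for the multimodal LLM acting as a VLA policy."""
    def act(self, observation: str) -> str:
        if "user answer" in observation:
            color = "red" if "red" in observation else "blue"
            return f"pick {color} cup"
        return "ask: which cup do you mean?"

    def update(self, trajectory: List[Transition]) -> None:
        # A real system would apply a policy-gradient update (e.g., PPO) here.
        pass

def run_episode(env: ToyAskToActEnv, policy: ToyMLLMPolicy) -> List[Transition]:
    obs, trajectory, done = env.reset(), [], False
    while not done:
        action = policy.act(obs)
        next_obs, success, done = env.step(action)
        trajectory.append(Transition(obs, action, llm_generated_reward(action, success)))
        obs = next_obs
    return trajectory

if __name__ == "__main__":
    env, policy = ToyAskToActEnv(), ToyMLLMPolicy()
    for episode in range(3):
        traj = run_episode(env, policy)
        policy.update(traj)
        print(f"episode {episode}: return = {sum(t.reward for t in traj):.2f}")
```

The key design point this sketch illustrates is that the reward signal comes from an LLM scoring function rather than hand-designed environment rewards or human demonstrations, which is what allows the online RL loop to run without large-scale annotation.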

Takeaways, Limitations

Takeaways:
  • Demonstrates that, in a home environment, agents can understand ambiguous instructions and perform tasks effectively by asking relevant questions.
  • The first attempt to adapt an MLLM as a VLA agent and train it with online RL using MLLM-generated rewards.
  • Shows significant performance improvements over strong existing baselines.
  • Generalizes well to new environments and tasks.
Limitations:
  • Specific limitations are not discussed in the paper.