Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

Created by
  • Haebom

Authors

Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang

Outline

This paper aims to develop a goal specification method for guiding agent interactions in 3D environments that is semantically clear, spatially sensitive, domain-agnostic, and intuitive for human users. Specifically, it proposes a novel cross-view goal alignment framework that lets users specify target objects using segmentation masks from their own camera views rather than from the agent's observations. The authors show that when the human and agent camera views differ significantly, behavior cloning alone fails to align the agent's behavior with human intent. To address this, they introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, that explicitly enhance the agent's spatial reasoning ability. Building on this, they develop ROCKET-2, a state-of-the-art agent trained in Minecraft that improves inference efficiency by 3x to 6x compared to ROCKET-1. ROCKET-2 can directly interpret goals from the human camera view, which enables better human-agent interaction. Notably, it also demonstrates zero-shot generalization: despite being trained exclusively on the Minecraft dataset, it can adapt and generalize to other 3D environments such as Doom, DMLab, and Unreal through a simple action space mapping.
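As a rough illustration of how a behavior cloning objective can be combined with the two auxiliary objectives named above, here is a minimal sketch. The summary does not give the loss definitions, so everything concrete here is an assumption: the consistency term is modeled as a squared error on a predicted target position in the agent's view, the visibility term as a binary cross-entropy, and the function names and weights (`w_consistency`, `w_visibility`) are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the expert action (behavior cloning term)."""
    return -math.log(max(probs[target_index], 1e-12))

def binary_cross_entropy(p, label):
    """Loss for a predicted probability p against a 0/1 label."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def squared_error(pred, target):
    """Mean squared error between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def total_loss(action_probs, expert_action,
               pred_center, true_center,
               pred_visible, is_visible,
               w_consistency=1.0, w_visibility=1.0):
    """Weighted sum of the main objective and two auxiliary terms.

    All term definitions and weights are illustrative assumptions,
    not the formulation used in the paper.
    """
    bc = cross_entropy(action_probs, expert_action)
    # Assumed stand-in for cross-view consistency: where the target
    # lies in the agent's view versus where the model predicts it.
    consistency = squared_error(pred_center, true_center)
    # Assumed stand-in for target visibility: is the target in view?
    visibility = binary_cross_entropy(pred_visible, is_visible)
    return bc + w_consistency * consistency + w_visibility * visibility

loss = total_loss(
    action_probs=[0.5, 0.5], expert_action=0,
    pred_center=(0.5, 0.5), true_center=(0.5, 0.5),
    pred_visible=0.5, is_visible=1,
)
```

The point of the sketch is only the structure: the auxiliary terms add supervision about where the target is and whether it can be seen, on top of imitating the expert's actions.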

Takeaways, Limitations

Takeaways:
Presents a cross-view goal alignment framework that accounts for the difference in viewpoints between humans and agents.
Improves the agent's spatial reasoning ability through two auxiliary objectives (cross-view consistency loss and target visibility loss).
Develops ROCKET-2, which improves inference efficiency by 3x to 6x over ROCKET-1 and demonstrates zero-shot generalization.
Contributes to improving human-agent interaction.
Limitations:
Dependency on the Minecraft dataset: additional experiments are needed to evaluate generalization performance across different environments.
Limits of zero-shot generalization: further research is needed on the dependency on action space mapping and on the limits of generalization performance.
Quantitative analysis of the effectiveness of the auxiliary objectives needs to be strengthened.
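The action space mapping that zero-shot transfer relies on can be pictured as a lookup from the source environment's actions to the target environment's. The sketch below is a minimal illustration under assumptions: the action names on both sides and the `map_action` helper are hypothetical, and this is not the mapping used in the paper.

```python
# Hypothetical button names for a Minecraft-style source action space
# and a Doom-style target action space (illustrative only).
MINECRAFT_TO_DOOM = {
    "forward": "MOVE_FORWARD",
    "back": "MOVE_BACKWARD",
    "left": "MOVE_LEFT",
    "right": "MOVE_RIGHT",
    "attack": "ATTACK",
    "use": "USE",
}

def map_action(source_action, mapping, default="NOOP"):
    """Translate pressed source buttons into target actions.

    source_action: dict of button name -> bool (pressed or not).
    Buttons with no counterpart in the target environment are dropped;
    if nothing survives, a single default no-op is returned.
    """
    mapped = [
        mapping[name]
        for name, pressed in source_action.items()
        if pressed and name in mapping
    ]
    return mapped or [default]

step = map_action({"forward": True, "jump": True, "attack": False},
                  MINECRAFT_TO_DOOM)
```

Actions without a counterpart (here the hypothetical `jump`) are silently dropped, which is one concrete way the dependency on the mapping could limit generalization performance.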