Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

VITA: Vision-to-Action Flow Matching Policy

Created by
  • Haebom

Author

Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani

VITA: Vision-To-Action Policy

Outline

This paper presents VITA (VIsion-To-Action policy), a noise-free, conditioning-free policy learning framework that directly maps visual information to actions. VITA uses flow matching with latent visual representations as the flow source, eliminating separate conditioning mechanisms and reducing time and memory overhead. Because actions are lower-dimensional, less structured, and sparser than visual representations, the authors introduce an action autoencoder that maps raw actions into a structured latent space aligned with the visual latents. To prevent latent space collapse, they further propose flow latent decoding, which backpropagates the action reconstruction loss through the flow-matching ODE steps. VITA outperforms existing generative policies in simulation and real-world environments, achieving 1.5-2.3 times faster inference than conditioning-based methods.
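The core idea above, using the visual latent itself as the flow source instead of Gaussian noise, can be illustrated with a toy numpy sketch. All names and dimensions here are illustrative assumptions, not from the paper: with a linear flow-matching path, the regression target for the velocity field is simply the difference between the target (action latent) and the source (vision latent), and a perfectly learned field transports the source to the target in one Euler step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for VITA's latents (dimension is illustrative, not from the paper)
d = 8
z_vision = rng.normal(size=d)   # latent visual representation: the flow *source*
z_action = rng.normal(size=d)   # structured action latent: the flow *target*

def interpolate(z0, z1, t):
    """Linear flow-matching path x_t = (1 - t) * z0 + t * z1."""
    return (1.0 - t) * z0 + t * z1

# Conditional flow-matching regression target along this path:
# the constant velocity z1 - z0 (what the learned velocity field is trained to predict)
velocity_target = z_action - z_vision

# Sanity check: the path really starts at the vision latent and ends at the action latent
assert np.allclose(interpolate(z_vision, z_action, 0.0), z_vision)
assert np.allclose(interpolate(z_vision, z_action, 1.0), z_action)

# With a perfectly learned velocity field, one Euler step of the ODE
# dx/dt = v(x, t) from t = 0 to t = 1 recovers the action latent exactly.
x = z_vision + 1.0 * velocity_target
assert np.allclose(x, z_action)
```

Because the source already carries the observation, no extra conditioning pathway (e.g. cross-attention on image features) is needed at inference; the ODE integration itself performs the vision-to-action mapping.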

Takeaways, Limitations

Takeaways:
  • Presents a novel approach that directly maps visual information to actions, eliminating conditioning and achieving faster inference than existing methods.
  • Proposes a way to bridge the vision-action gap using an action autoencoder and flow latent decoding.
  • Demonstrates state-of-the-art performance in simulation and real-world environments.
Limitations:
  • The paper does not explicitly state its limitations. (Based on the paper abstract.)