Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Created by
  • Haebom

Author

Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang

Outline

To address the challenge of building large-scale Vision-Language-Action (VLA) models that perform robotic manipulation tasks from natural language instructions, we developed a training framework focused on generating a small number of action tokens to reduce inference latency and training cost. We also introduced a voting-based ensemble strategy that combines current and previous action predictions, making better use of the generated actions and improving overall performance. As a result, we achieved superior performance compared to state-of-the-art VLA models, with inference 39x faster than OpenVLA and a throughput of 46 Hz on an edge platform, demonstrating both the fastest inference speed and practical deployability.
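
The voting idea can be pictured with a short sketch. The snippet below is a minimal illustration under assumptions, not the paper's implementation: the class name `ActionEnsembleVoter`, the distance threshold `tol`, and the cluster-averaging vote rule are all invented for illustration. It only shows how chunks of predicted actions could be buffered so that predictions made at earlier steps vote on the action executed at the current step.

```python
import numpy as np
from collections import deque


class ActionEnsembleVoter:
    """Minimal sketch of a voting-based action ensemble (hypothetical names
    and voting rule; not the paper's exact method).

    The model predicts a short chunk of future actions at every step.
    Overlapping predictions for the current timestep, made at different
    past steps, are collected and fused: each candidate scores by how many
    other candidates lie within `tol`, and the winning cluster is averaged.
    """

    def __init__(self, chunk_len: int, tol: float = 0.05):
        self.chunk_len = chunk_len
        self.tol = tol
        # Keep only the last `chunk_len` predicted chunks; older chunks can
        # no longer overlap with the current timestep.
        self.history = deque(maxlen=chunk_len)

    def step(self, new_chunk: np.ndarray) -> np.ndarray:
        """new_chunk: (chunk_len, action_dim) actions for steps t, t+1, ..."""
        self.history.append(new_chunk)
        # A chunk predicted `age` steps ago offers its `age`-th entry as a
        # candidate action for the current step t.
        candidates = np.stack(
            [chunk[age] for age, chunk in enumerate(reversed(self.history))]
        )
        # Vote: count, for each candidate, how many candidates agree with it.
        dists = np.linalg.norm(candidates[:, None] - candidates[None, :], axis=-1)
        votes = (dists < self.tol).sum(axis=1)
        winners = candidates[votes == votes.max()]
        return winners.mean(axis=0)
```

Averaging only the agreeing cluster (rather than all candidates) is one plausible way to keep a single stale or outlier prediction from dragging the executed action off the consensus trajectory.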

Takeaways, Limitations

Takeaways:
Reduced inference latency and training costs: generating fewer action tokens makes VLA models more efficient.
Performance improvement: the voting-based ensemble strategy makes better use of the generated actions and improves overall performance.
Practical deployment potential: 46 Hz throughput and 39x faster inference demonstrated on an edge platform.
Limitations:
The paper does not state its specific limitations.