This page curates AI-related papers published worldwide. All content here is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Created by
Haebom
Authors
Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Outline
This paper presents a training framework and an inference optimization technique that address two drawbacks of large-scale Vision-Language-Action (VLA) models: (i) high inference latency and increased training cost caused by generating a large number of tokens, and (ii) performance loss caused by underutilizing the generated actions. The framework fine-tunes the VLA model to generate far fewer action tokens with high parallelism, which reduces both inference latency and training cost. At inference time, a voting-based ensemble strategy combines the current and previous action predictions, improving the utilization of generated actions and overall performance. Experimental results show that the method outperforms state-of-the-art VLA models, achieving significantly higher success rates and 39x faster inference than OpenVLA, with 46 Hz throughput on edge platforms, which suggests it is practical for real-world deployment. The code is available on GitHub.
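The voting idea can be illustrated with a small sketch. This is not the authors' implementation: the summary only states that current and previous action predictions are combined by voting, so the cosine-similarity test, the threshold `tau`, the tie-breaking rule, and the function name `vote_ensemble` are all assumptions made for illustration.

```python
import numpy as np


def vote_ensemble(candidates, tau=0.5):
    """Combine candidate actions for one timestep by majority vote.

    candidates[0] is the newest prediction; the remaining rows are earlier
    predictions that were made for this same timestep.
    NOTE: the cosine-similarity vote and `tau` are illustrative assumptions,
    not details taken from the paper summary.
    """
    cands = np.asarray(candidates, dtype=float)
    current = cands[0]

    # Cosine similarity between each candidate and the current prediction.
    norms = np.linalg.norm(cands, axis=1) * np.linalg.norm(current)
    sims = cands @ current / np.maximum(norms, 1e-8)

    # Candidates that point in roughly the same direction "agree".
    agree = sims >= tau

    # Average the larger group; ties go to the agreeing group.
    if agree.sum() >= (~agree).sum():
        winners = cands[agree]
    else:
        winners = cands[~agree]
    return winners.mean(axis=0)


if __name__ == "__main__":
    # Two of three predictions agree with the newest one, so they are averaged.
    print(vote_ensemble([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]))
```

In this sketch an outlier prediction is discarded rather than averaged in, which is one plausible reason a vote can use past predictions more effectively than a plain mean over all of them.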
Takeaways, Limitations
• Takeaways:
◦ Presents an efficient training framework that significantly reduces the inference latency and training cost of VLA models.
◦ Improves the utilization of generated actions and overall performance through a voting-based ensemble strategy.
◦ Demonstrates high throughput (46 Hz) and the feasibility of real-world deployment on edge platforms.
◦ Achieves better performance than state-of-the-art VLA models.
• Limitations:
◦ The generalization performance of the proposed method needs further verification.
◦ Scalability across a wider range of robot manipulation tasks still needs to be evaluated.
◦ Optimizations targeting specific edge platforms may limit portability to other platforms.