This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Created by
Haebom
Author
Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Outline
This paper addresses the limited generalization ability of recent Vision-Language-Action (VLA) models, proposing an efficient action prediction method that needs no additional high-performance visual representations or diffusion techniques. Existing VLA models generalize poorly to novel objects or unfamiliar environments, and prior work improves this by integrating extra components such as depth estimation, segmentation, or diffusion, which substantially increases computational cost. VOTE is an efficient, general framework that avoids these add-ons: a novel tokenizer-free fine-tuning technique combined with an ensemble voting strategy over predicted action trajectories reduces computational cost, increases inference speed, and improves generalization. Experiments show 35x faster inference and 145 Hz throughput while outperforming the previous state of the art. Full details and code will be released in the future.
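The paper's exact voting rule is not given in this summary, but the core idea of trajectory ensemble voting can be illustrated with a minimal sketch: generate several candidate action trajectories, pick the candidate most agreed upon by the ensemble (a medoid), and average its majority of nearest neighbors. The function name, the medoid-based selection, and the majority size `k` are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def ensemble_vote(trajectories, k=None):
    """Fuse candidate action trajectories by majority-style voting.

    trajectories: array-like of shape (n, T, d) -- n candidate
    trajectories, each a chunk of T actions of dimension d.
    Illustrative sketch only; the paper's actual voting rule may differ.
    """
    trajs = np.asarray(trajectories, dtype=float)
    n = trajs.shape[0]
    if k is None:
        k = n // 2 + 1  # simple majority of candidates

    # Pairwise L2 distances between flattened trajectories.
    flat = trajs.reshape(n, -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)

    # Medoid: the candidate closest to all others (most "agreed upon").
    medoid = int(np.argmin(dists.sum(axis=1)))

    # Average the medoid's k nearest candidates (including itself),
    # which discards outlier trajectories.
    support = np.argsort(dists[medoid])[:k]
    return trajs[support].mean(axis=0)
```

With four candidates where three cluster tightly and one is an outlier, the fused trajectory follows the majority cluster and ignores the outlier.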
Takeaways, Limitations
•
Takeaways:
◦
VOTE is a new framework that significantly improves the efficiency and generalization performance of VLA models.
◦
Its tokenizer-free fine-tuning technique and ensemble voting strategy reduce computational cost and improve inference speed.
◦
It surpasses the previous state of the art (35x faster inference, 145 Hz throughput).
◦
The planned code release improves the reproducibility and extensibility of the research.
•
Limitations:
◦
Further validation is needed to determine how robust the generalization performance is across diverse environments and tasks.
◦
The method may be biased toward certain types of tasks or environments.
◦
Performance and stability on real robot systems still need to be evaluated.