
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Created by
  • Haebom

Author

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang

Outline

Large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robot learning, but they still struggle with the spatiotemporal dynamics of interactive robotics, which limits their effectiveness on complex tasks such as manipulation. This study introduces visual trace prompting, a simple yet effective approach that enhances the spatiotemporal awareness of VLA models by visually encoding state-action trajectories. Using a self-collected dataset of 150K robot manipulation trajectories, the authors develop TraceVLA by fine-tuning OpenVLA with visual trace prompting. Evaluations across 137 configurations in SimplerEnv and four tasks on a physical WidowX robot show state-of-the-art performance: TraceVLA outperforms OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of the method, the authors also present a compact 4B VLA model based on Phi-3-Vision, pretrained on Open-X-Embodiment and fine-tuned on their dataset, which matches the performance of the 7B OpenVLA baseline while substantially improving inference efficiency.
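The core idea of visual trace prompting, overlaying the robot's recent state trajectory onto the current camera frame so the model can "see" its motion history, can be sketched briefly. The following is a minimal, hypothetical illustration using OpenCV; `overlay_visual_trace`, `tracker`, and `vla_model` are assumed names for this sketch, not the authors' actual implementation or API.

```python
# Minimal sketch of visual trace prompting: draw the recent 2D trajectory of
# tracked points (e.g., the end-effector) onto the current observation image
# before passing it to the policy. Names here are illustrative assumptions.
import cv2
import numpy as np

def overlay_visual_trace(frame: np.ndarray,
                         trace: np.ndarray,
                         color=(0, 255, 0),
                         thickness=2) -> np.ndarray:
    """Draw a past trajectory as a polyline on a copy of `frame`.

    frame: HxWx3 uint8 image (current observation).
    trace: Tx2 array of pixel coordinates, ordered oldest -> newest.
    """
    annotated = frame.copy()
    pts = trace.astype(np.int32).reshape(-1, 1, 2)
    # Connect consecutive trace points so the model can read motion history.
    cv2.polylines(annotated, [pts], isClosed=False,
                  color=color, thickness=thickness)
    # Mark the most recent position so the temporal direction is unambiguous.
    center = tuple(int(v) for v in pts[-1, 0])
    cv2.circle(annotated, center, radius=4, color=(0, 0, 255), thickness=-1)
    return annotated

# Hypothetical usage: the policy receives both the raw frame and the
# trace-annotated frame, mirroring the paper's idea of visually encoding
# state-action history rather than feeding raw past frames.
# frame = camera.read(); trace = tracker.recent_points()
# obs = [frame, overlay_visual_trace(frame, trace)]
# action = vla_model.predict(obs, instruction="pick up the red block")
```

One appeal of this design is that the temporal context is compressed into the image itself, so a standard single-frame VLA backbone such as OpenVLA can be fine-tuned on it without architectural changes.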

Takeaways, Limitations

Takeaways: Visual trace prompting significantly improves VLA model performance on complex robotic manipulation tasks by enhancing spatiotemporal awareness. The compact 4B model suggests that smaller models can improve inference efficiency without sacrificing performance.
Limitations: Evaluation is limited to specific robot datasets and tasks, so generalization to other robot platforms and tasks requires further study. Whether visual trace prompting transfers to other types of VLA models remains an open question. The paper also lacks an analysis of how the size and diversity of the collected dataset affect model performance.