Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Created by
  • Haebom

Authors

Shresth Grover, Akshay Gopalkrishnan, Bo Ai, Henrik I. Christensen, Hao Su, Xuanlin Li

Outline

Vision-Language-Action (VLA) models promise to leverage rich pretrained representations to build generalist robots that work across diverse tasks and environments. While existing VLA models are fine-tuned from vision-language models (VLMs), direct fine-tuning on robot data often degrades these representations and limits generalization. This study presents a framework that retains pretrained VLM features while adapting them for robot manipulation. It consists of three components: (i) a dual-encoder design with a frozen vision encoder that preserves the pretrained features and a second trainable encoder for task adaptation; (ii) a string-based action tokenizer that converts continuous actions into character sequences aligned with the model's pretraining domain; and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and context. Evaluations in simulation and on real robots show that the proposed method improves robustness to visual distractions, generalization to new instructions and environments, and overall task success rates compared to baselines.
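To make the second component concrete, here is a minimal sketch of a string-based action tokenizer, assuming actions are normalized continuous vectors rendered as fixed-precision decimal text. The paper's exact numeric format, precision, and separators are not specified in this summary, so the helper names (`action_to_string`, `string_to_action`) and formatting choices below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def action_to_string(action, low=-1.0, high=1.0, decimals=3):
    """Clip a continuous action vector and render it as plain text,
    so the language model can emit and consume it like any string."""
    clipped = np.clip(np.asarray(action, dtype=np.float64), low, high)
    return " ".join(f"{v:+.{decimals}f}" for v in clipped)

def string_to_action(text):
    """Parse the character sequence back into a continuous action
    vector for the low-level robot controller."""
    return np.array([float(tok) for tok in text.split()])

# Round-trip example: a 7-DoF end-effector action in normalized units.
action = [0.120, -0.450, 0.030, 0.000, 0.250, -0.800, 1.000]
encoded = action_to_string(action)   # "+0.120 -0.450 +0.030 ..."
decoded = string_to_action(encoded)  # recovers the action within precision
```

The dual-encoder design can be sketched similarly. The version below assumes a PyTorch backbone and simple concatenation-based fusion; the actual backbone, fusion operator, and feature dimensions in the paper may differ.

```python
import copy
import torch
import torch.nn as nn

class DualVisionEncoder(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, feat_dim: int):
        super().__init__()
        # Frozen copy preserves the pretrained representations.
        self.frozen = copy.deepcopy(pretrained_encoder)
        for p in self.frozen.parameters():
            p.requires_grad = False
        # Trainable copy adapts to the robot-manipulation domain.
        self.adapted = copy.deepcopy(pretrained_encoder)
        # Project the concatenated features back to the original width.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            stable = self.frozen(image)   # pretrained features, fixed
        task = self.adapted(image)        # task-adapted features
        return self.fuse(torch.cat([stable, task], dim=-1))
```

Keeping one encoder frozen anchors the features to the pretraining distribution, while the trainable copy absorbs the domain shift introduced by robot data.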

Takeaways, Limitations

Takeaways:
Presents a novel framework that effectively leverages pretrained vision-language representations to improve robot manipulation performance.
Enhances robustness to visual distractions and generalization to new instructions and environments.
Improves success rates on robot manipulation tasks.
Demonstrates the effectiveness of the dual-encoder design, string-based action tokenizer, and co-training strategy.
Limitations:
Performance may depend on the quality of the pretrained VLM used.
Further research is needed on generalization across different robot platforms and tasks.
The framework may not fully capture the complexity of robot manipulation in real-world environments.
The expressive power of the string-based action tokenizer warrants further examination.