This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Align-Then-StEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Created by
Haebom
Author
Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li
Outline
This paper presents the Align-Then-StEer (ATE) framework to address the challenges of adapting Vision-Language-Action (VLA) models pre-trained on large, diverse datasets to downstream tasks. ATE first builds a unified latent space using a variational autoencoder constrained by a reverse KL divergence, which embeds adaptation actions into the modes of the pre-trained action latent distribution. It then steers the generation process of a diffusion- or flow-based VLA during fine-tuning via a guidance mechanism that shifts the model's output distribution toward the target domain. Extensive experiments on cross-embodiment and cross-task manipulation in both simulation and real environments show that, compared to directly fine-tuning conventional VLAs, the proposed approach improves the average multi-task success rate by up to 9.8% in simulation and by 32% in real-world cross-embodiment settings.
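The two stages described above can be sketched in a few lines. This is an illustrative reduction (not the authors' released code), with all function names, shapes, and the diagonal-Gaussian simplification assumed for exposition: the "Align" stage is shown as a mode-seeking reverse-KL term pulling the adaptation-action posterior toward the pre-trained latent prior, and the "StEer" stage as a classifier-guidance-style shift of a denoiser's prediction.

```python
import numpy as np

def reverse_kl_diag_gaussian(mu_q, logvar_q, mu_p, logvar_p):
    """Reverse KL(q || p) between diagonal Gaussians, summed over latent dims.

    "Align" stage sketch: minimizing this mode-seeking divergence pulls the
    adaptation-action posterior q toward modes of the pre-trained action
    latent prior p, rather than spreading mass over the whole prior.
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

def steer(eps_pred, z_t, score_grad_fn, scale=1.0):
    """"StEer" stage sketch: shift a diffusion denoiser's prediction along
    the gradient of an alignment score (supplied by the caller), nudging
    generated actions toward the target-domain region of the latent space.
    """
    return eps_pred - scale * score_grad_fn(z_t)
```

For example, with identical Gaussians the alignment penalty is zero, and it grows as the adaptation posterior drifts away from the pre-trained modes; the guidance scale then trades off between fidelity to the base model and the strength of the target-domain shift.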
Takeaways, Limitations
•
Takeaways:
◦
It provides a general, lightweight solution that significantly enhances the applicability of VLA models to real-world robotic platforms and tasks.
◦
VLA models can be adapted to new robot platforms and tasks in a data-efficient manner.
◦
Significantly improves cross-embodiment and cross-task manipulation performance in both simulation and real-world environments.
•
Limitations:
◦
The generalization of the ATE framework requires further validation; testing on a wider variety of tasks and robotic platforms may be necessary.
◦
Further research is needed to determine whether the reverse KL divergence constraint is optimal, or whether other constraint formulations could achieve better performance.
◦
The evaluation may not fully account for limited sample sizes or environmental factors in real-world deployments.