Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

Created by
  • Haebom

Authors

Christos Ziakas, Alessandra Russo

Outline

Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their fixed pre-trained representations limit generalization and temporal reasoning. We introduce VITA, which strengthens both capabilities through test-time adaptation. At inference time, VITA updates a lightweight adaptation module with gradient steps on a meta-learned self-supervised loss, which improves value estimation. Updating sequentially along a trajectory addresses the temporal reasoning limitation, and a dissimilarity-based sampling strategy that selects semantically diverse trajectory segments mitigates shortcut learning. On real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming state-of-the-art zero-shot methods that use autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be used for reward shaping in offline reinforcement learning, yielding multi-task policies on the Meta-World benchmark that outperform policies trained with the simulation's fuzzy-logic dense rewards.
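The sequential test-time update described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random vectors standing in for frozen VLM embeddings, the projections `P_k` and `P_v`, the reconstruction-style loss `||W k - v||^2`, the linear value head `w_head`, and the function name `adapt_and_estimate` are all assumptions made for illustration. In the paper the self-supervised loss is meta-learned; here it is fixed by hand.

```python
import numpy as np


def adapt_and_estimate(features, P_k, P_v, w_head, lr=0.1):
    """Sequential test-time adaptation over one trajectory (illustrative only).

    features: (T, d) frozen per-frame embeddings (stand-ins for VLM outputs).
    P_k, P_v: (d, d) fixed projections defining a self-supervised task
              (hypothetical; the paper meta-learns its loss).
    w_head:   (d,) frozen linear value head (hypothetical).
    Returns per-step value estimates and the loss measured before each update.
    """
    d = features.shape[1]
    W = np.zeros((d, d))  # lightweight adapter, reset for every trajectory
    values, losses = [], []
    for z in features:
        k, v = P_k @ z, P_v @ z
        residual = W @ k - v
        losses.append(float(residual @ residual))
        # One gradient step on ||W k - v||^2 with respect to W.
        W -= lr * 2.0 * np.outer(residual, k)
        # The adapter now carries information from all frames seen so far.
        adapted = z + W @ k
        values.append(float(1.0 / (1.0 + np.exp(-(w_head @ adapted)))))
    return np.array(values), np.array(losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 8, 16
    feats = rng.normal(size=(T, d)) / np.sqrt(d)
    P_k = rng.normal(size=(d, d)) / np.sqrt(d)
    P_v = rng.normal(size=(d, d)) / np.sqrt(d)
    vals, losses = adapt_and_estimate(feats, P_k, P_v, rng.normal(size=d))
    print(vals.round(3))
```

Because the adapter `W` is updated frame by frame and never reset within a trajectory, each value estimate depends on the frames that came before it. That is the sense in which sequential updating can compensate for a frozen encoder that sees one frame at a time.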

Takeaways, Limitations

Takeaways:
Improves the generalization of VLMs used as zero-shot value functions.
Improves temporal reasoning by updating sequentially along trajectories.
Zero-shot value estimates can be used for reward shaping in offline reinforcement learning.
Generalizes from a single training environment to a variety of environments and tasks.
Limitations:
The paper does not explicitly state its limitations.