To leverage the potential of Vision-Language Models (VLMs) as zero-shot, objective-based value functions while overcoming the limitations of their fixed pre-trained representations, we introduce VITA, which enhances generalization and temporal inference through test-time adaptation. VITA improves value estimation by updating a lightweight adaptive module at inference time with gradient steps on a meta-learned self-supervised loss. To address the limitations of temporal inference, we update sequentially along trajectories and propose a difference-based sampling strategy that selects semantically diverse trajectory segments to mitigate shortcut learning. In real-world robotic manipulation, VITA trained in a single environment generalizes across diverse out-of-distribution tasks, environments, and embodiments, outperforming state-of-the-art zero-shot methods based on autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be used for reward shaping in offline reinforcement learning, yielding multi-task policies that outperform policies trained with fuzzy-logic dense rewards in simulation on the Meta-World benchmark.
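As a rough illustration of the test-time adaptation loop described above (not the paper's implementation), the following PyTorch sketch assumes a frozen encoder standing in for the pre-trained VLM, a hypothetical lightweight `Adapter` module, and a simple temporal-monotonicity surrogate in place of the meta-learned self-supervised loss; difference-based segment selection is abstracted into a pre-chosen list of trajectory segments.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for the frozen, pre-trained VLM feature extractor (hypothetical)."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(64, dim)
        for p in self.parameters():
            p.requires_grad_(False)  # representations stay fixed at test time

    def forward(self, x):
        return self.net(x)

class Adapter(nn.Module):
    """Lightweight adaptive module whose parameters are updated at inference time."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z):
        return self.net(z)

def self_supervised_loss(values):
    # Placeholder surrogate for the meta-learned self-supervised objective:
    # penalize value decreases along time (later frames should score higher).
    return torch.relu(values[:-1] - values[1:]).mean()

def adapt_and_evaluate(encoder, adapter, segments, steps=3, lr=1e-3):
    """Sequentially adapt along trajectory segments, then read out values."""
    opt = torch.optim.SGD(adapter.parameters(), lr=lr)
    for segment in segments:          # segments assumed pre-selected (e.g., by difference-based sampling)
        for _ in range(steps):        # a few gradient steps per segment
            values = adapter(encoder(segment)).squeeze(-1)
            loss = self_supervised_loss(values)
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():             # final zero-shot value estimates for the full trajectory
        return adapter(encoder(torch.cat(segments))).squeeze(-1)

# Usage with random stand-in features for video frames.
encoder, adapter = FrozenEncoder(), Adapter()
segments = [torch.randn(8, 64) for _ in range(4)]   # 4 segments of 8 frames each
print(adapt_and_evaluate(encoder, adapter, segments))
```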