Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Created by
  • Haebom

Authors

Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

Outline

This paper proposes Temporal Token Fusion (TTF), a training-free method that addresses a key limitation of Vision-Language-Action (VLA) models: their inability to exploit temporal information in robotic manipulation tasks. TTF improves the quality of VLA inference by selectively integrating visual representations from past and current timesteps. It performs selective temporal token fusion through a hard fusion strategy with keyframe anchoring, driven by dual-dimensional detection that combines efficient grayscale pixel-disparity analysis with attention-based semantic relevance assessment. Experiments on LIBERO, SimplerEnv, and real-world robotic tasks show consistent performance gains, and the method is model-agnostic, applying to both OpenVLA and VLA-Cache architectures. TTF also demonstrates that selectively reusing query matrices within the attention mechanism improves performance, pointing to potential computational acceleration via a KQV matrix reuse strategy.
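The summary above implies a per-token decision rule: reuse a past visual token when it looks unchanged and carries low semantic weight, otherwise take the current one, with periodic keyframes forcing a full refresh. The following Python sketch illustrates one plausible reading of that rule. All names, tensor shapes, and thresholds (`pixel_thresh`, `attn_thresh`, `keyframe_interval`) are assumptions for illustration, not the authors' implementation.

```python
import torch

# Hypothetical sketch of TTF-style selective token fusion, based only on
# the summary above. Shapes assumed: grayscale patches (B, T, P), attention
# scores (B, T), visual tokens (B, T, D), for T tokens of dimension D.

def detect_static_tokens(prev_gray, curr_gray, attn_scores,
                         pixel_thresh=0.05, attn_thresh=0.1):
    """Dual-dimensional detection: a token is 'static' (safe to reuse from
    the previous step) only if its grayscale patch barely changed AND it
    has low attention-based semantic relevance at the current step."""
    pixel_diff = (curr_gray - prev_gray).abs().mean(dim=-1)  # per-token disparity
    low_pixel_change = pixel_diff < pixel_thresh
    low_relevance = attn_scores < attn_thresh
    return low_pixel_change & low_relevance                  # bool mask (B, T)

def temporal_token_fusion(prev_tokens, curr_tokens, static_mask,
                          step, keyframe_interval=8):
    """Hard fusion: copy the previous step's visual tokens where the mask
    is True, keep current tokens elsewhere. Keyframe anchoring forces a
    full refresh every `keyframe_interval` steps to stop drift."""
    if step % keyframe_interval == 0:
        return curr_tokens  # keyframe: no reuse at all
    return torch.where(static_mask.unsqueeze(-1), prev_tokens, curr_tokens)
```

Since the fusion is a pure tensor operation with no learned parameters, it matches the summary's claim that TTF needs no training and can wrap different backbone architectures.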

Takeaways, Limitations

Takeaways:
Presents TTF, a training-free methodology for improving the performance of VLA models.
Effectively integrates temporal information, enhancing robustness to visual noise.
Model-agnostic approach applicable to a variety of environments and architectures.
Suggests that query matrix reuse in attention mechanisms could improve computational efficiency (see the sketch after this list).
Limitations:
The paper does not explicitly discuss its limitations.
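To make the query-reuse takeaway concrete, here is a minimal sketch of how selective query matrix reuse across timesteps might look, assuming the same static-token mask as in the earlier sketch. The class, caching policy, and module layout are illustrative assumptions; the paper only suggests this direction.

```python
import torch
import torch.nn.functional as F

class ReusableQAttention(torch.nn.Module):
    """Illustrative single-head attention that reuses last-step queries
    for tokens judged static. An assumed sketch, not the authors' code."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.cached_q = None  # queries cached from the previous timestep

    def forward(self, x, static_mask):
        # x: (B, T, dim); static_mask: (B, T) bool. For clarity this sketch
        # projects every token and then overwrites the static ones; a real
        # speedup would project only the non-static rows.
        q = self.q_proj(x)
        if self.cached_q is not None:
            q = torch.where(static_mask.unsqueeze(-1), self.cached_q, q)
        self.cached_q = q.detach()
        k, v = self.k_proj(x), self.v_proj(x)
        return F.scaled_dot_product_attention(q, k, v)
```

The same masking idea would extend to the key and value projections, which is presumably what the summary's "KQV matrix reuse strategy" refers to.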