
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

Created by
  • Haebom

Author

Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen

Outline

In this paper, we propose EgoPrune, a training-free token pruning method for efficient egomotion video reasoning. Egomotion videos are first-person videos whose viewpoint changes continuously with the agent's movement, and they serve as the primary visual input for embodied agents operating in real environments. Existing vision-language models offer strong multimodal reasoning capabilities but incur excessive computational cost on long, redundant video inputs. EgoPrune exploits the spatiotemporal continuity and motion constraints of the egomotion setting and consists of three components: a keyframe selector borrowed from EmbodiedR, Perspective-Aware Redundancy Filtering (PARF), and a Maximal Marginal Relevance (MMR)-based token selector. Experiments show that EgoPrune outperforms existing training-free methods at various pruning ratios while significantly reducing FLOPs, memory usage, and latency. In addition, EgoPrune was deployed on an embodied agent powered by a Jetson Orin NX 16GB edge device, demonstrating its efficiency in real-world settings and its suitability for on-device egomotion video reasoning.
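The summary gives no implementation details, but the MMR-based token selector can be illustrated with the standard Maximal Marginal Relevance rule, which scores each candidate token by its relevance to the query minus its maximum similarity to tokens already kept. The sketch below is a minimal illustration under that assumption only; the function name mmr_select, the cosine-similarity features, and the trade-off parameter lam are hypothetical and not taken from the paper.

```python
import numpy as np

def mmr_select(token_feats, query_feat, k, lam=0.5):
    """Pick k token indices via Maximal Marginal Relevance (illustrative sketch).

    token_feats: (N, D) array of candidate visual-token features (hypothetical)
    query_feat:  (D,) array representing the text/query feature (hypothetical)
    lam:         trade-off between query relevance and diversity
    """
    # Normalize so dot products become cosine similarities
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    relevance = t @ q          # (N,) similarity of each token to the query
    pairwise = t @ t.T         # (N, N) token-to-token similarity

    # Start with the single most query-relevant token
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(token_feats))) - set(selected)

    while len(selected) < k and candidates:
        cand = np.array(sorted(candidates))
        # Redundancy term: max similarity to anything already kept
        redundancy = pairwise[np.ix_(cand, selected)].max(axis=1)
        scores = lam * relevance[cand] - (1 - lam) * redundancy
        best = int(cand[np.argmax(scores)])
        selected.append(best)
        candidates.remove(best)
    return selected
```

Calling mmr_select on per-frame visual-token features with a query embedding would return the indices of tokens to keep; EgoPrune's actual selector may differ in its feature space, scoring details, and how it interacts with PARF and the keyframe selector.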

Takeaways, Limitations

Takeaways:
EgoPrune is a new training-free token pruning method that significantly improves the efficiency of egomotion video reasoning.
Superior performance and efficiency over existing training-free methods are verified experimentally across various pruning ratios.
Deployment on an edge device demonstrates practical feasibility and real-world applicability.
Limitations:
EgoPrune is evaluated only on specific egomotion video benchmarks; generalization to other types of videos or tasks requires further study.
Parameter choices in the keyframe selection and token selection steps may need further optimization.
The characteristics of diverse egomotion videos may not be fully covered, so experiments on a wider range of datasets are needed.