This paper proposes Aha, a novel framework for real-time video stream understanding. Aha is an autoregressive highlight detection method that predicts the relevance of each incoming video frame to a task described in natural language, using a multimodal vision-language model with a lightweight, decoupled head and no access to future frames; it is trained on a large, refined dataset of labeled human-centric videos. Scalability is achieved through a Dynamic SinkCache mechanism that maintains constant memory usage even for infinite-length streams. Aha outperforms existing offline methods and video-language models on the TVSum and Mr.HiSum benchmarks, and its potential as a real-time inference module for robotics applications is also experimentally confirmed.
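To make the constant-memory claim concrete, the sketch below illustrates one way a sink-style cache can bound memory on an unbounded stream: a few early "sink" entries are kept permanently while the remaining slots form a rolling window over the most recent entries. This is a minimal illustrative assumption, not the paper's actual implementation; the class name `DynamicSinkCache` and parameters `num_sink` and `window` are hypothetical.

```python
# Hypothetical sketch of a sink-style KV cache with constant memory usage.
# Names (DynamicSinkCache, num_sink, window) are illustrative assumptions,
# not the paper's actual API.
from collections import deque


class DynamicSinkCache:
    """Keep the first `num_sink` entries plus a rolling window of the most
    recent `window` entries, so total size stays constant for any stream length."""

    def __init__(self, num_sink: int = 4, window: int = 512):
        self.num_sink = num_sink
        self.sink = []                       # earliest entries, never evicted
        self.recent = deque(maxlen=window)   # rolling window of newest entries

    def append(self, kv_entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)     # oldest non-sink entry drops off

    def entries(self):
        # Bounded by num_sink + window regardless of how many frames arrived.
        return self.sink + list(self.recent)


if __name__ == "__main__":
    cache = DynamicSinkCache(num_sink=2, window=3)
    for frame_idx in range(10_000):          # simulate a very long stream
        cache.append(f"kv_frame_{frame_idx}")
    print(len(cache.entries()))              # -> 5, constant memory footprint
```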