This paper presents HybridMamba, a novel architecture for traffic accident detection that integrates visual transformers with state-space temporal modeling to achieve high-precision temporal localization of accidents. Multi-layer token compression and hierarchical temporal processing maintain computational efficiency without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds on 2-minute videos (p < 0.01 versus baseline models), with 65.2% of predictions falling within one second of the ground truth. Despite using significantly fewer parameters (3 billion vs. 13.72 billion), it outperforms state-of-the-art video-language models such as TimeChat and VideoLLaMA-2 by up to 3.95 seconds. HybridMamba localizes accidents effectively across a range of video durations (2 to 40 minutes) and environmental conditions, highlighting the potential of fine-grained temporal localization for traffic surveillance while also surfacing challenges for scaled deployment.
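The abstract's pipeline (spatial token compression per frame, then a linear-time state-space recurrence over frames, then frame-level accident scoring) can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: the dimensions, the average-pooling compressor, the scalar-decay recurrence, and the random scoring weights are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64 frames, 196 patch tokens per frame, 32-dim features.
T, N, D = 64, 196, 32
frame_tokens = rng.standard_normal((T, N, D))

def compress_tokens(tokens, stride=4):
    """Toy multi-layer token compression: average-pool groups of `stride`
    spatial tokens, shrinking the token count at each layer (a stand-in
    for the paper's compression module)."""
    t, n, d = tokens.shape
    n_out = n // stride
    return tokens[:, :n_out * stride].reshape(t, n_out, stride, d).mean(axis=2)

# Two compression layers: 196 -> 49 -> 12 tokens per frame.
level1 = compress_tokens(frame_tokens, stride=4)
level2 = compress_tokens(level1, stride=4)

# Collapse remaining spatial tokens to one feature vector per frame.
frame_feats = level2.mean(axis=1)  # shape (T, D)

def ssm_scan(x, decay=0.9):
    """Toy state-space recurrence h_t = decay * h_{t-1} + x_t, capturing
    temporal context in O(T) time, unlike quadratic attention."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

temporal = ssm_scan(frame_feats)

# Score each frame; the argmax is the predicted accident frame
# (weights are random here, trained in the real model).
w = rng.standard_normal(temporal.shape[1])
pred_frame = int(np.argmax(temporal @ w))
print(level2.shape, temporal.shape, pred_frame)
```

The linear recurrence is what lets the approach scale to long (2- to 40-minute) videos without the quadratic cost of full temporal attention.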