With the proliferation of surveillance cameras, the demand for automatic violence detection is increasing. Existing approaches based on CNNs and Transformers struggle with spatio-temporal feature extraction. To overcome these limitations, we propose Dual-Branch VideoMamba with Gated Class Token Fusion (GCTF), which combines a dual-branch design with a State-Space Model (SSM) backbone. One branch captures spatial features while the other focuses on temporal dynamics, and a gating mechanism fuses the two, improving the detection of violent acts even in challenging surveillance scenarios. Furthermore, we present a new benchmark formed by merging the RWF-2000, RLVS, SURV, and VioPeru datasets, and we achieve state-of-the-art performance on the DVD dataset while balancing accuracy and computational efficiency.
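To make the idea of gated fusion between two branches concrete, the following is a minimal sketch and not the implementation used in this work. It assumes each branch emits one class token per clip, that the gate is a sigmoid over a linear projection of the concatenated tokens, and that the fused token is an element-wise convex combination of the two. The function name `gated_class_token_fusion`, the parameters `W` and `b`, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_class_token_fusion(cls_spatial, cls_temporal, W, b):
    """Fuse two class tokens with a sigmoid gate (illustrative assumption).

    cls_spatial, cls_temporal: arrays of shape (batch, dim)
    W: gate weights of shape (2 * dim, dim); b: gate bias of shape (dim,)
    Returns the fused class token of shape (batch, dim).
    """
    # Let the gate see both branches by concatenating their tokens.
    concat = np.concatenate([cls_spatial, cls_temporal], axis=-1)
    # Per-feature gate in (0, 1).
    gate = sigmoid(concat @ W + b)
    # Convex combination: gate near 1 favours the spatial token,
    # gate near 0 favours the temporal token.
    return gate * cls_spatial + (1.0 - gate) * cls_temporal


# Toy usage with random stand-ins for the two branch outputs.
rng = np.random.default_rng(0)
batch, dim = 4, 8
cls_spatial = rng.normal(size=(batch, dim))
cls_temporal = rng.normal(size=(batch, dim))
W = rng.normal(scale=0.1, size=(2 * dim, dim))  # hypothetical learned weights
b = np.zeros(dim)                               # hypothetical learned bias

fused = gated_class_token_fusion(cls_spatial, cls_temporal, W, b)
```

Under these assumptions, every fused feature lies between the corresponding spatial and temporal features, so the gate decides per feature which branch dominates; a classifier head would then operate on `fused`.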