This paper proposes a Two-stage Cross-modal Video Anomaly Detection System (TCVADS) to address the Weakly Supervised Monitoring Anomaly Detection (WSMAD) problem in smart city surveillance. The system enables efficient, accurate, and interpretable anomaly detection on edge devices. TCVADS operates in two stages: coarse-grained classification and fine-grained analysis. In the first stage, a time-series analysis module (teacher model) extracts features that are transferred, via knowledge distillation, to a lightweight convolutional neural network (student model) for binary classification. Once an anomaly is detected, the second stage is activated: it performs fine-grained multi-class classification through cross-modal contrastive learning with CLIP and enhances interpretability through specially designed triplet text relationships. Experimental results demonstrate that TCVADS outperforms existing methods in model performance, detection efficiency, and interpretability.
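The abstract does not specify the distillation objective, so the following is only a minimal numpy sketch of a standard temperature-scaled teacher-student loss of the kind commonly used for such first-stage binary classifiers; the function names, temperature, and blending weight are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft-target KL term (scaled by T^2, as in standard knowledge
    distillation) with hard-label cross-entropy on the student's own logits."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=-1,
    )
    hard = -np.log(
        softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12
    )
    return np.mean(alpha * (T ** 2) * kl + (1 - alpha) * hard)

# Toy batch: 2 clips, binary normal/anomalous logits (hypothetical values).
teacher = np.array([[2.0, -1.0], [-0.5, 1.5]])
student = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

A student that reproduces the teacher's logits zeroes the KL term, so the loss drops as the student's (softened) distribution approaches the teacher's.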
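For the second stage, the abstract names CLIP-style cross-modal contrastive matching against triplet text descriptions but gives no implementation details. Below is a hedged numpy sketch of the generic zero-shot scoring pattern CLIP popularized: cosine similarity between a clip embedding and per-class text embeddings, turned into class probabilities by a temperature-scaled softmax. The embeddings, dimension, and temperature here are placeholders, not values from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between vector `a` and each row of matrix `b`."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def classify_clip_style(video_emb, text_embs, temperature=0.07):
    """Softmax over cosine similarities, as in CLIP-style zero-shot scoring."""
    z = cosine_sim(video_emb, text_embs) / temperature
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical embeddings standing in for encoder outputs: 3 anomaly
# classes, each described by a triplet prompt (e.g. subject-action-object)
# already passed through a text encoder.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
video_emb = text_embs[1] + 0.1 * rng.normal(size=8)  # closest to class 1
probs = classify_clip_style(video_emb, text_embs)
print(int(np.argmax(probs)))  # -> 1
```

Because the clip embedding was constructed near the class-1 text embedding, class 1 receives the highest probability; in the actual system the interpretability would come from reading off which triplet description matched.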