This paper highlights the potential of event-based semantic segmentation in autonomous driving and robotics, leveraging the advantages of event-based cameras (high dynamic range, low latency, and low power consumption). Existing ANN-based segmentation methods suffer from high computational demands, reliance on image frames, and large energy consumption, limiting their efficiency and applicability on resource-constrained edge/mobile platforms. To address these issues, we present SLTNet, a lightweight spike-based Transformer network designed for event-based semantic segmentation. SLTNet is built on efficient spike-based convolutional blocks (SCBs) that extract rich semantic features while reducing model parameters, and it enhances long-range contextual feature interactions through spike-based Transformer blocks (STBs) with binary mask operations. Extensive experiments on the DDD17 and DSEC-Semantic datasets show that SLTNet achieves up to 9.06% and 9.39% mIoU improvements, respectively, over state-of-the-art SNN-based methods, while consuming 4.58x less energy and running at an inference speed of 114 FPS. The source code is publicly available.
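For readers unfamiliar with spike-based layers, the sketch below illustrates the general idea behind a spike-based convolutional block: a convolution and batch normalization applied per time step, followed by a leaky integrate-and-fire (LIF) neuron that emits binary spikes. This is a minimal illustration only; the block layout, neuron parameters, and tensor shapes are assumptions for exposition and are not taken from SLTNet's actual SCB/STB implementation.

```python
# Minimal sketch of a spike-based convolutional block, assuming a
# Conv -> BatchNorm -> LIF layout over an event tensor of shape (T, B, C, H, W).
# Forward pass only; training an SNN would additionally require a surrogate
# gradient for the hard spike threshold.
import torch
import torch.nn as nn


class LIFNeuron(nn.Module):
    """Simple leaky integrate-and-fire neuron emitting binary spikes."""

    def __init__(self, tau: float = 2.0, v_threshold: float = 1.0):
        super().__init__()
        self.tau = tau
        self.v_threshold = v_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: input currents over T time steps, shape (T, B, C, H, W).
        v = torch.zeros_like(x[0])                 # membrane potential
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau          # leaky integration
            s = (v >= self.v_threshold).float()    # binary spike emission
            v = v * (1.0 - s)                      # hard reset where spiked
            spikes.append(s)
        return torch.stack(spikes)


class SpikeConvBlock(nn.Module):
    """Per-time-step Conv + BatchNorm followed by a LIF activation."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.lif = LIFNeuron()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, b, _, h, w = x.shape
        # Fold time into the batch dimension so Conv/BN see ordinary 4D tensors.
        y = self.bn(self.conv(x.flatten(0, 1))).view(t, b, -1, h, w)
        return self.lif(y)


if __name__ == "__main__":
    block = SpikeConvBlock(2, 16)              # e.g. two event-polarity channels in
    events = torch.rand(4, 1, 2, 64, 64)       # (T, B, C, H, W) event voxel grid
    out = block(events)
    print(out.shape)                           # torch.Size([4, 1, 16, 64, 64])
```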