This paper proposes the Unified Spatio-Temporal State-Space Model (UST-SSM) to address the spatio-temporal disorder of points in point cloud videos. UST-SSM extends the Selective State-Space Model (SSM) to point cloud videos and introduces Spatio-Temporal Selective Scanning (STSS), which reorganizes unordered points into semantically coherent sequences through prompt-based clustering. It further employs Spatio-Temporal Structure Aggregation (STSA) to compensate for missing 4D geometric and motion information, and proposes Temporal Interaction Sampling (TIS) to strengthen fine-grained temporal dependencies by leveraging non-anchor frames and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets demonstrate the effectiveness of the proposed method. The source code is publicly available.
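The core idea behind STSS, reordering an unordered point set into a sequence that a state-space model can scan, can be illustrated with a toy sketch. The sketch below is not the paper's method: it substitutes plain k-means on xyz coordinates for the prompt-based clustering, and the cluster count, iteration budget, and within-cluster ordering rule are all illustrative assumptions.

```python
import numpy as np

def cluster_ordered_scan(points, num_clusters=4, seed=0):
    """Toy sketch of cluster-based point reordering: group points by a
    simple k-means on xyz (a stand-in for the paper's prompt-based
    clustering), then concatenate clusters, ordering each cluster's
    points by distance to its center. Returns a permutation of indices
    that can serve as a 1D scan order for an SSM."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), num_clusters, replace=False)]
    for _ in range(10):  # a few Lloyd iterations are enough for a sketch
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    order = []
    for k in range(num_clusters):
        idx = np.where(labels == k)[0]
        d = np.linalg.norm(points[idx] - centers[k], axis=1)
        order.extend(idx[np.argsort(d)].tolist())
    return np.array(order)

points = np.random.default_rng(1).random((64, 3))
order = cluster_ordered_scan(points)
# The result is a permutation of all point indices, i.e. a scan sequence.
assert sorted(order.tolist()) == list(range(64))
```

In the actual model, such a sequence would be fed to the selective scan so that nearby, semantically related points are processed contiguously rather than in arbitrary sensor order.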