This paper addresses the challenges of mixed multimodal data distributions, activity heterogeneity, and complex model deployment in sensor-based human activity recognition (HAR). To this end, we propose a spatiotemporal-attention modal decomposition, alignment, and fusion strategy: it tackles the mixed-distribution problem of sensor data, captures key discriminative activity features through separate spatiotemporal representations of each modality, and applies gradient modulation to mitigate data heterogeneity. In addition, we build a wearable deployment simulation system and demonstrate the effectiveness of the model through experiments on several public datasets.
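To make the gradient-modulation idea concrete, the sketch below shows one common way to balance two sensor modalities during training. It is a minimal illustration only: the encoder architecture, the logit-sum late fusion, the confidence ratio `rho`, and the coefficient `alpha` are assumptions for the example, not the method specified in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy 1D-CNN encoder for one sensor stream (channels x time)."""
    def __init__(self, in_ch, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)

batch, num_classes, alpha = 32, 6, 0.5  # alpha: assumed modulation strength
acc_enc, gyro_enc = ModalityEncoder(3), ModalityEncoder(3)
head_a, head_g = nn.Linear(64, num_classes), nn.Linear(64, num_classes)
params = (list(acc_enc.parameters()) + list(gyro_enc.parameters())
          + list(head_a.parameters()) + list(head_g.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

x_acc = torch.randn(batch, 3, 128)   # synthetic accelerometer windows
x_gyro = torch.randn(batch, 3, 128)  # synthetic gyroscope windows
y = torch.randint(0, num_classes, (batch,))

logits_a = head_a(acc_enc(x_acc))
logits_g = head_g(gyro_enc(x_gyro))
loss = F.cross_entropy(logits_a + logits_g, y)  # late fusion by logit sum

opt.zero_grad()
loss.backward()

with torch.no_grad():
    # Per-modality "contribution": mean softmax score of the true class.
    idx = torch.arange(batch)
    s_a = F.softmax(logits_a, dim=1)[idx, y].mean()
    s_g = F.softmax(logits_g, dim=1)[idx, y].mean()
    rho = s_a / (s_g + 1e-8)  # rho > 1: accelerometer branch dominates
    # Shrink the dominant encoder's gradients so the weaker modality
    # takes a proportionally larger effective optimization step.
    if rho > 1:
        k = 1.0 - torch.tanh(alpha * (rho - 1.0))
        for p in acc_enc.parameters():
            if p.grad is not None:
                p.grad.mul_(k)
    else:
        k = 1.0 - torch.tanh(alpha * (1.0 / rho - 1.0))
        for p in gyro_enc.parameters():
            if p.grad is not None:
                p.grad.mul_(k)
opt.step()
```

In this sketch the modulation coefficient `k` shrinks toward zero as the dominance ratio grows, which is one standard design choice for keeping a strong modality from monopolizing the shared objective; the actual coefficient and contribution measure used in the proposed strategy may differ.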