In this paper, we propose a novel 1D-CNN-based speech emotion recognition (SER) framework that captures subtle emotional changes and improves generalization across diverse datasets. We extract Mel-Frequency Cepstral Coefficients (MFCCs) as features and feed them to a 1D Convolutional Neural Network (CNN) trained with data augmentation and equipped with channel and spatial attention mechanisms to improve model performance. Experiments on six benchmark datasets demonstrate that the proposed method achieves high accuracy, surpassing existing state-of-the-art results: SAVEE 97.49%, RAVDESS 99.23%, CREMA-D 89.31%, TESS 99.82%, EMO-DB 99.53%, and EMOVO 96.39%. These results suggest that integrating such advanced deep learning techniques can significantly improve generalization across datasets, and they highlight the potential of SER for assistive technologies and human-computer interaction in real-world environments.
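To make the attention components concrete, the following is a minimal NumPy sketch of CBAM-style channel and spatial attention applied to a 1D feature map (channels x MFCC frames). All weights, shapes, and the moving-average stand-in for a learned 1-D convolution are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, T). Squeeze the time axis with average and max pooling,
    # pass both through a shared 2-layer MLP, sum, and gate each channel.
    avg = x.mean(axis=1)                                   # (C,)
    mx = x.max(axis=1)                                     # (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))         # (C,)
    return x * att[:, None]

def spatial_attention(x, k=7):
    # x: (C, T). Pool across channels, smooth with a length-k moving
    # average (a stand-in for a learned 1-D conv), then gate each frame.
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])     # (2, T)
    kernel = np.ones(k) / k
    conv = sum(np.convolve(p, kernel, mode="same") for p in pooled)
    return x * sigmoid(conv)[None, :]

# Toy feature map: 8 channels over 40 MFCC frames (shapes are assumptions).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 40))
w1 = rng.standard_normal((4, 8)) * 0.1   # reduction layer (C -> C//2)
w2 = rng.standard_normal((8, 4)) * 0.1   # expansion layer (C//2 -> C)
y = spatial_attention(channel_attention(x, w1, w2))
print(y.shape)  # (8, 40) -- attention preserves the feature-map shape
```

Both modules are multiplicative gates, so they refine the CNN's feature maps without changing their shape and can be inserted between convolutional blocks.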