In this paper, we propose UniEmoX, a novel large-scale pre-training framework grounded in psychological theories that addresses the generalization problem in visual sentiment analysis. UniEmoX integrates scene-centric and person-centric low-level spatial structural information to derive more nuanced and discriminative emotional representations, and distills rich semantic knowledge from the CLIP model to enhance emotion embeddings. We also present Emo8, a new emotion dataset containing images in diverse styles (cartoon, natural, realistic, science-fiction, and advertising). Experimental results on multiple benchmark datasets demonstrate the effectiveness of UniEmoX.