This study develops and deploys a multimodal hierarchical classification framework that addresses industrial challenges in e-commerce product classification, including platform heterogeneity and the structural limitations of existing category systems. Using a dataset of 271,700 products collected from 40 international fashion e-commerce platforms, we integrate textual features (RoBERTa), visual features (ViT), and joint vision-language representations (CLIP). We explore early, late, and attention-based fusion strategies within a hierarchical architecture, and incorporate dynamic masking to enforce consistency with the category hierarchy. The CLIP embedding combined with an MLP-based late-fusion strategy achieves the highest hierarchical F1 score (98.59%), outperforming single-modality baselines. To address shallow or inconsistent categories, we introduce a self-supervised "product reclassification" pipeline based on SimCLR, UMAP, and cascade clustering, which discovers new, fine-grained categories (e.g., subtypes of "shoes") with cluster purity above 86%. Cross-platform experiments reveal deployment trade-offs: complex late-fusion methods maximize accuracy when diverse training data are available, whereas simple early-fusion methods generalize better to unseen platforms. Finally, we demonstrate industrial scalability by deploying the framework on EURWEB's commercial transaction information platform via a two-stage inference pipeline that combines a lightweight RoBERTa stage with a GPU-accelerated multimodal stage.
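The MLP-based late fusion and dynamic masking described above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the embedding dimensions, category names, taxonomy, and randomly initialized weights are all hypothetical placeholders standing in for trained components. The key idea shown is that child-category logits whose parent disagrees with the predicted top-level category are masked out, so the two prediction levels can never be inconsistent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: assume 512-d CLIP text and image embeddings.
D_TXT, D_IMG, HID = 512, 512, 256
PARENTS = ["tops", "shoes"]                           # level-1 categories (illustrative)
CHILDREN = ["t-shirt", "sweater", "sneaker", "boot"]  # level-2 categories (illustrative)
CHILD_PARENT = np.array([0, 0, 1, 1])                 # child index -> parent index

# Toy random weights standing in for the trained late-fusion MLP head.
W1 = rng.normal(scale=0.02, size=(D_TXT + D_IMG, HID))
W_parent = rng.normal(scale=0.02, size=(HID, len(PARENTS)))
W_child = rng.normal(scale=0.02, size=(HID, len(CHILDREN)))

def classify(txt_emb, img_emb):
    """Late fusion: concatenate per-modality embeddings, pass through a
    shared MLP, then dynamically mask child logits that contradict the
    predicted parent category."""
    fused = np.concatenate([txt_emb, img_emb])
    h = np.maximum(fused @ W1, 0.0)                   # ReLU hidden layer
    parent = int(np.argmax(h @ W_parent))             # level-1 prediction
    child_logits = h @ W_child
    # Dynamic masking: children outside the predicted parent get -inf,
    # guaranteeing a hierarchy-consistent level-2 prediction.
    child_logits[CHILD_PARENT != parent] = -np.inf
    child = int(np.argmax(child_logits))
    return PARENTS[parent], CHILDREN[child]

parent, child = classify(rng.normal(size=D_TXT), rng.normal(size=D_IMG))
# The masked argmax ensures the child always belongs to the chosen parent.
assert CHILD_PARENT[CHILDREN.index(child)] == PARENTS.index(parent)
```

In a real deployment the masking matrix would be derived from the platform's full category tree rather than a hard-coded array, but the mechanism is the same: infeasible branches are excluded before the final argmax.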