Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Improving the quality of labeled training data and of cross-modal fusion yields significant gains in model performance, which in turn improve key business metrics such as quality view-through rate and advertising revenue. High-quality annotations are crucial for advancing content modeling, yet existing statistics-based active learning (AL) methods struggle to detect overconfident misclassifications and to distinguish semantically similar items in the latent space of deep neural networks. Furthermore, audio plays an increasingly important role, especially on short-form video platforms, while most pre-trained multimodal architectures focus primarily on text and images. Although training all three modalities from scratch is feasible, doing so sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based latent space broadening (LSB), which improves active learning efficiency, and visual-language modeling with audio enhancement (VLMAE), an intermediate fusion approach that integrates audio into the VL model. Both methods have been deployed in real-world production systems and have achieved significant business results.
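As a concrete illustration of the kNN-based latent space broadening idea, the sketch below shows one plausible selection step, assuming (as the abstract suggests but does not detail) that annotation candidates are drawn from the unlabeled pool by nearest-neighbor search around misclassified samples in the model's embedding space. The function name `select_lsb_candidates`, the cosine metric, and the parameter `k` are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of kNN-based latent space broadening (LSB) for active
# learning, under the assumption that the annotation batch is broadened with
# unlabeled items lying close, in the model's latent space, to samples the
# model misclassified (including overconfident misclassifications).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def select_lsb_candidates(
    error_embeddings: np.ndarray,      # embeddings of known misclassified samples, shape (m, d)
    unlabeled_embeddings: np.ndarray,  # embeddings of the unlabeled pool, shape (n, d)
    k: int = 10,
) -> np.ndarray:
    """Return indices into the unlabeled pool to send for human annotation."""
    # Index the unlabeled pool in the model's latent space (cosine distance).
    index = NearestNeighbors(n_neighbors=k, metric="cosine")
    index.fit(unlabeled_embeddings)

    # For each misclassified sample, pull its k nearest unlabeled neighbors:
    # semantically similar items that uncertainty-based AL alone may miss
    # because the model can be confidently wrong on them.
    _, neighbor_ids = index.kneighbors(error_embeddings)

    # Deduplicate across queries to form the broadened annotation batch.
    return np.unique(neighbor_ids.ravel())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    errors = rng.normal(size=(5, 128))    # e.g., 5 misclassified items
    pool = rng.normal(size=(1000, 128))   # e.g., 1000 unlabeled items
    print(select_lsb_candidates(errors, pool, k=10))
```

The abstract likewise does not specify the fusion mechanism behind VLMAE, so the following is only a sketch of one common form of intermediate fusion: a small cross-attention block that lets VL tokens attend over projected audio tokens before the ranking head. The module `AudioEnhancedFusion` and its dimensions are hypothetical.

```python
# Minimal PyTorch sketch of an intermediate-fusion block in the spirit of
# VLMAE, assuming pre-trained VL and audio encoders each emit token
# embeddings and only this small fusion block is newly trained.
import torch
import torch.nn as nn


class AudioEnhancedFusion(nn.Module):
    def __init__(self, vl_dim: int = 768, audio_dim: int = 512, heads: int = 8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, vl_dim)   # map audio tokens into the VL space
        self.cross_attn = nn.MultiheadAttention(vl_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vl_dim)

    def forward(self, vl_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # VL tokens attend over projected audio tokens; the residual connection
        # preserves the pre-trained VL representation.
        audio = self.audio_proj(audio_tokens)
        fused, _ = self.cross_attn(query=vl_tokens, key=audio, value=audio)
        return self.norm(vl_tokens + fused)


if __name__ == "__main__":
    vl = torch.randn(2, 196, 768)   # e.g., VL encoder output tokens
    au = torch.randn(2, 50, 512)    # e.g., audio encoder output tokens
    print(AudioEnhancedFusion()(vl, au).shape)  # torch.Size([2, 196, 768])
```

Fusion of this kind keeps the pre-trained VL and audio encoders intact, which reflects the benefit the abstract contrasts against training all three modalities from scratch.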