Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Created by
  • Haebom

Authors

Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong

Outline

Transformer-based multimodal models are widely deployed in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Improving the quality of labeled training data and cross-modal fusion significantly improves model performance, impacting key metrics such as quality view-through rate and advertising revenue. High-quality annotations are crucial for advancing content modeling, but conventional statistics-based active learning (AL) methods struggle to detect overconfident misclassifications and are less effective at distinguishing semantically similar items in deep neural networks. Furthermore, audio information plays an increasingly important role, especially on short-form video platforms, yet most pre-trained multimodal architectures focus primarily on text and images. While training all three modalities from scratch is feasible, doing so sacrifices the benefits of existing pre-trained vision-language (VL) and audio models.

To address these challenges, the paper proposes kNN-based Latent Space Broadening (LSB) to improve active learning efficiency, and audio-enhanced vision-language modeling (VLMAE), an intermediate-fusion approach that integrates audio into the VL model. The system has been deployed in production and has delivered significant business results.
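The summary does not detail either method (see Limitations below), so the two sketches that follow illustrate the general techniques only, not the authors' implementations; all function and class names and parameters are hypothetical.

First, a minimal sketch of kNN-based broadening of an active-learning seed set: samples flagged by a conventional criterion (e.g., uncertainty sampling) are expanded with their nearest neighbors in the model's latent space, pulling in semantically similar items that an overconfident model would otherwise miss.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def broaden_seed_set(embeddings, seed_idx, k=10):
    """Expand an active-learning seed set with latent-space neighbors.

    embeddings: (N, D) latent vectors of the unlabeled pool.
    seed_idx:   indices already flagged for labeling (e.g., by uncertainty).
    Returns the union of the seeds and their k nearest neighbors.
    """
    index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)
    _, neighbor_idx = index.kneighbors(embeddings[seed_idx])  # (len(seed_idx), k)
    return np.union1d(seed_idx, neighbor_idx.ravel())

# Usage: broaden a handful of uncertainty-sampled seeds before annotation.
pool = np.random.randn(1000, 64).astype(np.float32)  # stand-in latent vectors
to_label = broaden_seed_set(pool, seed_idx=np.array([3, 42, 77]), k=5)
```

Second, a sketch of intermediate fusion in the spirit of VLMAE: rather than training all three modalities from scratch, audio tokens from a pre-trained audio encoder are injected into intermediate features of a pre-trained VL backbone via cross-attention, with a residual connection preserving the original VL representation.

```python
import torch
import torch.nn as nn

class AudioFusionBlock(nn.Module):
    """Cross-attention block that injects audio tokens into intermediate
    vision-language features (a hypothetical intermediate-fusion layer)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vl_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # VL tokens act as queries over audio keys/values; the residual
        # keeps the pre-trained VL features intact when audio adds little.
        fused, _ = self.cross_attn(self.norm(vl_tokens), audio_tokens, audio_tokens)
        return vl_tokens + fused

# Example shapes: 196 VL tokens and 32 audio tokens, both projected to dim 768.
fused = AudioFusionBlock(dim=768)(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
```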

Takeaways, Limitations

Takeaways:
Improved active learning efficiency through kNN-based Latent Space Broadening (LSB).
Improved multimodal model performance by integrating audio information via VLMAE.
Demonstrated business impact through deployment in real-world production systems.
Limitations:
Lack of detail on the specific LSB and VLMAE implementations.
Lack of comparison and performance evaluation against other multimodal models.
Lack of detail on the characteristics of the audio data and the audio processing pipeline.