Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Created by
  • Haebom

Authors

Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li

Outline

Motivated by the growing use of diverse multimedia in e-commerce, including images, short videos, and live streams, this paper proposes a method for learning vectorized product representations that unifies these domains. The authors point out that visual information alone is insufficient in this broad setting, where intra-product variation is high and inter-product similarity is strong, and propose to additionally exploit automatic speech recognition (ASR) text obtained from short videos and live streams. Specifically, they introduce AMPere (ASR-enhanced Multimodal Product Representation Learning), which uses an LLM-based ASR text summarizer to extract product-relevant information from noisy ASR transcripts and feeds it, together with visual data, into a multi-branch network that produces compact multimodal embeddings. Experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere and show that it improves cross-domain product retrieval.
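The sketch below illustrates the general shape of such a pipeline, not the authors' actual implementation: a placeholder LLM summarization step for noisy ASR text, a two-branch encoder that fuses visual and text features into a normalized embedding, and cosine-similarity retrieval across domains. All module names, dimensions, and the `summarize_asr` stub are illustrative assumptions; AMPere's real architecture and training objective may differ.

```python
# Minimal sketch of an AMPere-style pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def summarize_asr(asr_text: str) -> str:
    """Placeholder for the LLM-based ASR summarizer.
    In AMPere this step prompts an LLM to strip noise from the raw ASR
    transcript and keep only product-related phrases; here we simply
    truncate as a stand-in. The summary would then be encoded by a text
    encoder to produce txt_feat below."""
    return asr_text[:256]

class MultiBranchEncoder(nn.Module):
    """Fuses a visual feature and an ASR-summary text feature into one
    compact product embedding used for cross-domain retrieval."""
    def __init__(self, vis_dim=768, txt_dim=768, embed_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)    # visual branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)    # text branch
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fusion head

    def forward(self, vis_feat, txt_feat):
        v = F.relu(self.vis_proj(vis_feat))
        t = F.relu(self.txt_proj(txt_feat))
        z = self.fuse(torch.cat([v, t], dim=-1))
        return F.normalize(z, dim=-1)  # unit-norm embedding for cosine retrieval

# Usage: embed a query (e.g., a live-stream clip) and gallery items
# (e.g., catalog images), then rank the gallery by cosine similarity.
encoder = MultiBranchEncoder()
query = encoder(torch.randn(1, 768), torch.randn(1, 768))
gallery = encoder(torch.randn(100, 768), torch.randn(100, 768))
scores = query @ gallery.T      # cosine similarity (embeddings are normalized)
topk = scores.topk(5).indices   # indices of the 5 most similar products
```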

Takeaways, Limitations

Takeaways:
  • An LLM-based text summarizer effectively extracts product information from noisy ASR transcripts.
  • AMPere, a multimodal representation learning model, represents products comprehensively across domains.
  • Experiments on a large-scale dataset validate the superiority of AMPere and confirm improved cross-domain product retrieval performance.
Limitations:
  • Performance may depend heavily on the quality of the LLM-based summarizer.
  • Generalization may be limited by the characteristics of the dataset used.
  • Further comparative analysis against other multimodal learning models is needed.