Motivated by the growing use of diverse media in e-commerce, such as images, short videos, and live streams, this paper proposes a method for learning vectorized product representations that remain consistent across these domains. We observe that visual information alone is insufficient when products exhibit high intra-product variance and inter-product similarity, and we therefore exploit automatic speech recognition (ASR) texts obtained from short videos and live streams. Specifically, we propose AMPere (ASR-enhanced Multimodal Product Representation Learning), which distills product-specific information from noisy ASR texts with an LLM-based summarizer and feeds the result, together with the visual input, into a multi-branch network that produces compact multimodal embeddings. Experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere and show that it improves cross-domain product retrieval performance.
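The pipeline sketched in the abstract, a visual branch and an ASR-text branch fused into one compact retrieval embedding, can be illustrated in miniature as follows. This is an illustrative sketch only, not the paper's implementation: the encoders are stood in for by random features, and the dimensions (`D_VIS`, `D_TXT`, `D_JOINT`) and function names are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions, not taken from the paper.
D_VIS, D_TXT, D_JOINT = 512, 768, 128

def visual_branch(n_frames: int) -> np.ndarray:
    """Stand-in visual encoder: mean-pool per-frame features into one vector."""
    frame_feats = rng.standard_normal((n_frames, D_VIS))
    return frame_feats.mean(axis=0)

def text_branch(summary_tokens: int) -> np.ndarray:
    """Stand-in text encoder for the LLM-summarized ASR text."""
    token_feats = rng.standard_normal((summary_tokens, D_TXT))
    return token_feats.mean(axis=0)

# Fusion head: concatenate both branches, project to a compact joint space,
# then L2-normalize so retrieval can use cosine similarity.
W = rng.standard_normal((D_VIS + D_TXT, D_JOINT)) / np.sqrt(D_VIS + D_TXT)

def product_embedding(n_frames: int, summary_tokens: int) -> np.ndarray:
    fused = np.concatenate([visual_branch(n_frames),
                            text_branch(summary_tokens)])
    z = fused @ W
    return z / np.linalg.norm(z)

emb = product_embedding(n_frames=8, summary_tokens=32)
print(emb.shape)  # (128,)
```

In a cross-domain retrieval setting, embeddings computed this way for image, short-video, and live-stream items would all live in the same `D_JOINT`-dimensional space, so nearest-neighbor search by cosine similarity works across domains.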