Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMs

Created by
  • Haebom

Authors

Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip JB Jackson, Imran Razzak, Muhammad Awais

Outline

This paper presents an efficient method for transferring rich audio semantics from an audio encoder into a large language model (LLM), with a focus on integrating audio perception into the LLM. To address the inefficiency of the existing PLITS (Prepend to the LLM's Input Token Space) approach, the authors propose a new integration method, Lightweight Audio LLM Integration (LAL). LAL integrates audio representations through the LLM's attention mechanism, which reduces computational cost. They also propose Probing Audio Encoders via LLMs (PAL), an approach that applies PLITS to speech encoders such as Whisper and LAL to general audio encoders. Experimental results show that LAL performs on par with or better than existing approaches across multiple LLMs and tasks while improving memory usage and throughput.
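The difference between the two integration styles can be illustrated with a toy single-head attention layer. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function names (`plits_layer`, `attention_only_layer`), the vector sizes, the absence of learned projections and causal masking, and the reading of "integration through attention" as "audio vectors serve only as keys and values" are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Single-head scaled dot-product attention on lists of vectors.
    # Simplifications: no learned projections and no causal mask.
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def plits_layer(audio, text):
    # PLITS-style: audio vectors are prepended to the text sequence, so every
    # position (audio and text) issues a query and produces an output.
    seq = audio + text
    return attend(seq, seq, seq)


def attention_only_layer(audio, text):
    # Hypothetical attention-side integration (an assumption, see lead-in):
    # only text positions issue queries; audio appears only as keys/values.
    context = audio + text
    return attend(text, context, context)


def random_vectors(n, d, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
audio = random_vectors(6, 4, rng)  # 6 audio vectors of width 4 (arbitrary sizes)
text = random_vectors(3, 4, rng)   # 3 text vectors of width 4

plits_out = plits_layer(audio, text)
attn_out = attention_only_layer(audio, text)
print(len(plits_out), len(attn_out))  # prints: 9 3
```

In this toy setting the text-position outputs are the same in both variants; what changes is that the second variant never computes outputs for the audio positions, so the layer processes 3 rows instead of 9. How the actual LAL design achieves its savings is described in the paper itself.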

Takeaways, Limitations

Takeaways:
The LAL method improves computational efficiency over the existing PLITS method.
The PAL approach combines the two methods, using PLITS for speech encoders such as Whisper and LAL for general audio encoders.
On general audio tasks, LAL shows up to a 30% performance improvement over existing methods, along with up to a 64.1% reduction in memory usage and up to a 247.5% increase in throughput.
PAL achieves comparable performance to PLITS-based systems while significantly improving computational and memory efficiency.
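To make the reported percentages easier to read, the short calculation below converts them into multiples of the baseline. It is a sketch of the arithmetic only: the helper names are hypothetical, and the inputs are the "up to" figures quoted above, which are best cases and may not all occur in the same experiment.

```python
def remaining_fraction(percent_reduction):
    # A reduction of p percent leaves (1 - p/100) of the baseline.
    return 1.0 - percent_reduction / 100.0


def speedup_factor(percent_increase):
    # An increase of p percent corresponds to (1 + p/100) times the baseline.
    return 1.0 + percent_increase / 100.0


memory_left = remaining_fraction(64.1)   # share of baseline memory still used
throughput_x = speedup_factor(247.5)     # throughput as a multiple of baseline
print(f"{memory_left:.3f} {throughput_x:.3f}")  # prints: 0.359 3.475
```

In other words, the best reported case uses roughly 36% of the baseline memory and runs at roughly 3.5 times the baseline throughput.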
Limitations:
Specific performance comparisons and improvements may vary depending on the experimental environment and task.
Generalizability to models beyond the specific LLMs and audio encoders evaluated in the paper requires further research.
An analysis of the LAL method's potential limitations and of possible further performance improvements is not included.