Daily Arxiv

This page curates AI-related papers published around the world.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Created by
  • Haebom

Authors

Ruiyang Qin, Dancheng Liu, Gelei Xu, Zheyu Yan, Chenhui Xu, Yuting Hu, X. Sharon Hu, Jinjun Xiong, Yiyu Shi

Outline

This paper proposes a personal assistant system that enables personalized voice-based interaction by combining automatic speech recognition (ASR) with a large language model (LLM) on edge devices (edge ASR-LLM). Existing ASR-LLM models are trained in high-performance computing environments and are large, which makes them difficult to deploy on edge devices. Rather than fine-tuning the ASR model or the LLM individually, the paper presents a resource-efficient framework for cross-modal alignment between the two. The framework makes ASR-LLM alignment feasible even on resource-constrained devices such as the NVIDIA Jetson Orin (8GB RAM), reducing training time by a factor of 50 while improving alignment quality by more than 50%. According to the authors, this is the first study to investigate efficient ASR-LLM alignment on resource-constrained edge devices.
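
The outline does not describe how the alignment is implemented. As a generic illustration only, and not the authors' method, the sketch below shows one common way to align two frozen models: train a small projector that maps features from one space into the other. Every name here (`project`, `train_projector`, `TRUE_MAP`) and all of the data are hypothetical stand-ins; a real ASR encoder and LLM are replaced by random vectors and a fixed linear map.

```python
import random

def project(W, x):
    """Apply a linear projector: one dot product per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(W, pairs):
    """Mean squared error between projected features and their targets."""
    total = 0.0
    for x, y in pairs:
        total += sum((p - t) ** 2 for p, t in zip(project(W, x), y))
    return total / len(pairs)

def train_projector(pairs, in_dim, out_dim, lr=0.05, epochs=200):
    """Fit only the projector weights with plain SGD.

    The two 'models' that produced the features and targets are never
    updated, which is the point of the illustration.
    """
    W = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = project(W, x)
            for i in range(out_dim):
                err = pred[i] - y[i]
                for j in range(in_dim):
                    # Gradient of (pred_i - y_i)^2 with respect to W[i][j]
                    W[i][j] -= lr * 2 * err * x[j]
    return W

# Synthetic stand-ins (assumptions): 3-dim "ASR features" and 2-dim
# "LLM-space" targets related by a fixed linear map.
rng = random.Random(0)
TRUE_MAP = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
asr_features = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(32)]
pairs = [(x, project(TRUE_MAP, x)) for x in asr_features]

untrained = [[0.0] * 3 for _ in range(2)]
W = train_projector(pairs, in_dim=3, out_dim=2)
loss_before = mse(untrained, pairs)
loss_after = mse(W, pairs)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.2e}")
```

Only the six projector weights are trained in this toy setup, which is why approaches of this kind can be far cheaper than fine-tuning either model. The framework in the paper may differ substantially from this sketch.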

Takeaways, Limitations

Takeaways:
Presents an ASR-LLM framework for efficient, personalized voice-based interaction on edge devices.
Reduces training time and improves alignment quality in resource-constrained environments (50x training speedup, more than 50% quality improvement).
Suggests that personalized voice input can be processed effectively.
Opens new possibilities for cross-modal alignment research on edge devices.
Limitations:
Results are presented only for specific edge devices, such as NVIDIA Jetson Orin (8GB RAM), and generalizability to other hardware environments needs to be verified.
Further research is needed on robustness to different types of voice data and user characteristics.
Further evaluation of the performance and stability of the proposed framework in real-world usage environments is needed.
Energy efficiency is not specifically analyzed.