In this paper, we propose a personal assistant system that enables personalized voice-based interaction by combining large language models (LLMs) with automatic speech recognition (ASR) running on edge devices (edge ASR-LLM). Existing ASR-LLM models are trained in high-performance computing environments and have large model sizes, making them difficult to deploy on edge devices. Rather than fine-tuning the ASR model or the LLM individually, we present a resource-efficient framework for cross-modal alignment on edge devices. Our framework enables efficient ASR-LLM alignment even on resource-constrained hardware such as the NVIDIA Jetson Orin (8GB RAM), reducing training time by 50x while improving alignment quality by more than 50%. This is the first study to investigate efficient ASR-LLM alignment on resource-constrained edge devices.
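The summary does not spell out the alignment mechanism, so the sketch below is only illustrative: it assumes an adapter-style approach in which a small trainable projector maps features from a frozen ASR encoder into a frozen LLM's embedding space, so that only the projector is updated on-device (a common pattern for resource-efficient cross-modal alignment, not necessarily the paper's exact method). The module names, dimensions, and the Hugging Face-style causal-LM interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    """Maps ASR encoder features (d_asr) into the LLM token-embedding space (d_llm).

    This is the only trainable component in this sketch; the ASR encoder and the
    LLM are assumed to be frozen (requires_grad_(False)) elsewhere.
    """
    def __init__(self, d_asr: int = 512, d_llm: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_asr, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, asr_features: torch.Tensor) -> torch.Tensor:
        # asr_features: (batch, time, d_asr) -> (batch, time, d_llm)
        return self.proj(asr_features)


def alignment_step(projector, asr_encoder, llm, audio, target_ids, optimizer):
    """One training step in which only the projector receives gradient updates."""
    with torch.no_grad():                              # frozen ASR encoder
        feats = asr_encoder(audio)                     # (B, T_audio, d_asr)
    prefix = projector(feats)                          # (B, T_audio, d_llm)

    # Embed the supervision text with the frozen LLM's input embeddings and
    # prepend the projected audio features as a soft prefix.
    tok_embeds = llm.get_input_embeddings()(target_ids)        # (B, T_text, d_llm)
    inputs_embeds = torch.cat([prefix, tok_embeds], dim=1)

    # Supervise only the text positions; ignore the audio-prefix positions.
    ignore = torch.full(prefix.shape[:2], -100,
                        dtype=target_ids.dtype, device=target_ids.device)
    labels = torch.cat([ignore, target_ids], dim=1)

    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()          # gradients flow only into the projector
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the small projector is optimized while both backbone models stay frozen, memory and compute per step stay low, which is the kind of setup that could plausibly fit an 8GB edge device; the actual mechanism and numbers reported in the paper may differ.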
Takeaways, Limitations
• Takeaways:
◦ Presents an ASR-LLM framework for efficient, personalized voice-based interaction on edge devices.
◦ Reduces training time and improves alignment quality in resource-constrained environments (50x speedup, more than 50% quality improvement).
◦ Demonstrates the potential for effective processing of personalized voice input.
◦ Opens new directions for cross-modal alignment research on edge devices.
• Limitations:
◦ Results are reported only for a specific edge device, the NVIDIA Jetson Orin (8GB RAM); generalizability to other hardware environments remains to be verified.
◦ Further research is needed on robustness to different types of voice data and user characteristics.
◦ Further evaluation of the framework's performance and stability in real-world usage environments is needed.