Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

LLMs on a Budget? Say HOLA

Created by
  • Haebom

Author

Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem

HOLA: Efficient LLM Deployment on Edge Devices

Outline

HOLA is an end-to-end optimization framework for efficiently deploying large language models (LLMs) on edge devices. It uses Hierarchical Speculative Decoding (HSD) for faster inference without loss of quality, AdaComp-RAG to adapt retrieval complexity to the context, and LoBi, which combines structured pruning via LoRA with quantization, to improve efficiency. As a result, HOLA achieves a 17.6% EMA improvement on GSM8K and a 10.5% MCA improvement on ARC, while reducing latency and memory usage on edge devices such as the Jetson Nano.
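The summary above does not include the authors' code, so the following is only a minimal sketch of the generic building blocks it names: weight quantization plus a LoRA adapter (in the spirit of LoBi) and plain speculative decoding with a small draft model (a stand-in for HSD; the hierarchical scheme and AdaComp-RAG are not reproduced). It assumes Hugging Face transformers, peft, and bitsandbytes; the model names and hyperparameters are illustrative, not those used in the paper.

```python
# Sketch only: generic quantization + LoRA + speculative decoding.
# This is NOT the HOLA implementation; names and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

TARGET_ID = "meta-llama/Llama-2-7b-hf"            # illustrative target model
DRAFT_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # illustrative draft model (same tokenizer family)

# Compression side (LoBi-like): 4-bit NF4 quantization to shrink the memory footprint.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, quantization_config=bnb, device_map="auto"
)

# A LoRA adapter keeps task adaptation cheap on top of the quantized weights
# (the adapter would be fine-tuned before deployment).
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
target = get_peft_model(target, lora)

# Speed side: a small draft model proposes tokens that the target model verifies
# (ordinary speculative decoding; Transformers calls this assisted generation).
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT_ID, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Q: A pen costs 2 dollars and a notebook costs 3. How much do 4 of each cost? A:"
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In this kind of setup, quantization and LoRA mainly reduce memory, while the draft-and-verify loop reduces latency; how HOLA organizes these pieces hierarchically and couples them with adaptive retrieval is specific to the paper.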

Takeaways, Limitations

Takeaways:
HOLA provides a comprehensive solution for the efficient deployment of LLMs on edge devices.
The combination of HSD, AdaComp-RAG, and LoBi improves both speed and accuracy.
It shows performance improvements even in resource-constrained environments such as the Jetson Nano.
It increases the potential for using LLMs in real-world applications.
Limitations:
There is a lack of information on how HOLA's performance varies with specific datasets and model sizes.
Comparative analysis with other optimization techniques is limited.
A detailed analysis of the individual contributions of each component of HOLA (HSD, AdaComp-RAG, LoBi) is lacking.