Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Unplug and Play Language Models: Decomposing Experts in Language Models at Inference Time

Created by
  • Haebom

Authors

Nakyeong Yang, Jiwon Moon, Junseok Kim, Yunah Jang, Kyomin Jung

Outline

This paper proposes Decomposition of Experts (DoE), a novel framework for reducing the inference cost of large language models (LLMs). DoE defines the neurons that play a crucial role in a specific task as "experts" and dynamically identifies and activates them per task to accelerate inference. Upon receiving a user request, DoE identifies the experts for that task, performs inference using only those experts, and reverts to the original model once the task is complete. With this four-step process, DoE achieves up to a 1.73x inference speedup and a 65% parameter reduction while maintaining accuracy. The authors validate DoE's effectiveness and the importance of its components through comparisons with various expert-identification methods and ablation studies, and analyze how batch size, token count, and layer type affect inference speed. DoE is a practical, highly scalable framework applicable to Transformer-based architectures.
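To make the unplug-and-play flow concrete, below is a minimal PyTorch sketch of the four steps on a toy feed-forward block. The importance criterion here (mean activation magnitude on task examples) and all function names are illustrative assumptions, not the paper's actual expert-identification method.

```python
# Hedged sketch of a DoE-style "unplug and play" loop on a toy FFN block.
# Assumption: experts are scored by mean |activation| on task data; the
# paper's real criterion may differ.
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Toy Transformer feed-forward block: up-projection, GELU, down-projection."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def identify_experts(ffn, task_inputs, keep_ratio=0.35):
    """Step 1: score each hidden neuron on task examples and keep the top fraction."""
    with torch.no_grad():
        h = ffn.act(ffn.up(task_inputs))       # (batch, d_hidden) activations
        scores = h.abs().mean(dim=0)           # one importance score per neuron
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices              # indices of "expert" neurons

def unplug(ffn, expert_idx):
    """Step 2: shrink the FFN to expert neurons only; return originals for later restore."""
    backup = (ffn.up.weight.data, ffn.up.bias.data, ffn.down.weight.data)
    ffn.up.weight.data = ffn.up.weight.data[expert_idx]
    ffn.up.bias.data = ffn.up.bias.data[expert_idx]
    ffn.down.weight.data = ffn.down.weight.data[:, expert_idx]
    return backup

def replay(ffn, backup):
    """Step 4: revert to the original dense model after the task completes."""
    ffn.up.weight.data, ffn.up.bias.data, ffn.down.weight.data = backup

torch.manual_seed(0)
ffn = FFN()
task_inputs = torch.randn(32, 64)              # stand-in examples for one task

experts = identify_experts(ffn, task_inputs)   # step 1: identify task experts
backup = unplug(ffn, experts)                  # step 2: deactivate non-experts
out = ffn(torch.randn(4, 64))                  # step 3: faster task inference
replay(ffn, backup)                            # step 4: restore the full model
print(out.shape, ffn.up.weight.shape)          # (4, 64) and restored (256, 64)
```

In this sketch the speedup comes from the smaller matrix multiplications during step 3; restoring the backed-up weights in step 4 is what makes the pruning reversible per request, mirroring the paper's "unplug and play" idea.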

Takeaways, Limitations

Takeaways:
  • Presents a novel method for effectively reducing the inference cost of large language models.
  • Achieves up to a 1.73x inference speedup and a 65% parameter reduction without compromising accuracy.
  • Offers a scalable framework applicable to various Transformer-based architectures.
  • Provides practical insight into how batch size, token count, and layer type affect inference speed.
Limitations:
  • Experimental results are reported on only five natural language understanding benchmarks; additional experiments on a wider variety of tasks and datasets are needed.
  • The computational cost of expert identification may grow with model size, so more efficient identification methods warrant further research.
  • Application and performance evaluation in real service environments require further study.