Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

Created by
  • Haebom

Author

Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi

Outline

In this paper, we present a cache-aware routing strategy for the efficient deployment of Mixture-of-Experts (MoE) large language models (LLMs) in memory-constrained environments. MoE LLMs improve performance by selectively activating specialized experts for each input, but they are difficult to deploy on memory-constrained devices, especially for sequential token generation at batch size 1. To optimize MoE inference on devices where only a subset of expert weights can be held in DRAM, the proposed strategy exploits expert reuse across consecutive tokens to improve cache locality. We present on-device results demonstrating a 2x speedup on mobile devices across language modeling, MMLU, and GSM8K benchmarks, extending the applicability of MoE to real-world applications as a flexible, training-free solution.
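To make the idea concrete, below is a minimal Python sketch of what cache-aware routing could look like: the router adds a fixed bias to the gate logits of experts whose weights are already resident in an LRU cache before taking the top-k, so consecutive tokens tend to reuse DRAM-resident experts. All names here (ExpertCache, cache_aware_top_k, the bias parameter) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class ExpertCache:
    """LRU cache of expert IDs, standing in for DRAM-resident expert weights.
    (Hypothetical helper for illustration; not from the paper.)"""
    def __init__(self, capacity):
        self.capacity = capacity
        self.order = []  # most recently used at the end

    def __contains__(self, expert_id):
        return expert_id in self.order

    def touch(self, expert_id):
        if expert_id in self.order:
            self.order.remove(expert_id)
        elif len(self.order) >= self.capacity:
            self.order.pop(0)  # evict the least recently used expert
        self.order.append(expert_id)

def cache_aware_top_k(router_logits, cache, k=2, bias=1.0):
    """Select top-k experts after biasing the logits of cached experts.

    router_logits: 1-D array of per-expert gate logits for one token.
    bias: how strongly to favor cache hits; 0.0 recovers vanilla top-k.
    """
    adjusted = router_logits.copy()
    for e in range(len(adjusted)):
        if e in cache:
            adjusted[e] += bias  # prefer experts already in DRAM
    top_k = np.argsort(adjusted)[-k:][::-1]
    for e in top_k:
        cache.touch(int(e))  # selected experts become most recently used
    return top_k

# Usage: 8 experts, room for 4 in the cache, top-2 routing per token.
rng = np.random.default_rng(0)
cache = ExpertCache(capacity=4)
for step in range(5):
    logits = rng.normal(size=8)
    experts = cache_aware_top_k(logits, cache, k=2, bias=1.0)
    print(f"token {step}: experts {experts.tolist()}, cached {cache.order}")
```

Because the bias only nudges routing toward already-loaded experts rather than retraining the gate, a scheme like this stays training-free, which matches the flexibility the paper emphasizes.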

Takeaways, Limitations

Takeaways: We demonstrate the feasibility of efficiently deploying MoE LLMs in memory-constrained environments, showing that the performance benefits of MoE can be realized even on resource-constrained hardware such as mobile devices. Because the method requires no training, it offers a flexible solution that broadens practical applicability. The on-device performance gain (2x speedup) is demonstrated experimentally.
Limitations: Further research is needed on the generality of the proposed cache-aware routing strategy and its applicability to diverse MoE architectures. Since results are reported only for a specific mobile device, generalizability to other hardware platforms remains to be verified. The experiments are limited to batch size 1, so performance at larger batch sizes should be evaluated.