Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; please cite the source when sharing.

MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Created by
  • Haebom

Author

Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

Outline

The generation quality of large language models (LLMs) is often improved with sequence-level inference-time scaling methods (e.g., Chain-of-Thought). In this paper, we introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models with a method called Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. It injects controlled stochasticity into the expert routing mechanism, allowing it to sample multiple diverse experts for each token and aggregate their outputs into a more accurate final prediction. To keep the computational cost manageable, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimize compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute at inference. These gains are achieved without fine-tuning any model parameters.
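
The routing idea can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch-style illustration, not the paper's implementation: names such as `roe_forward`, `num_rosters`, and `temperature` are my own assumptions. It perturbs the router logits with Gumbel noise to sample several distinct expert rosters for the same tokens and averages the resulting expert outputs.

```python
# Minimal sketch of RoE-style stochastic routing (illustrative only; names,
# noise scheme, and aggregation are assumptions, not the paper's method).
import torch
import torch.nn.functional as F


def roe_forward(x, router, experts, top_k=2, num_rosters=4, temperature=0.5):
    """Aggregate several stochastically routed MoE forward passes.

    x:        (batch, d_model) token representations
    router:   nn.Linear(d_model, num_experts) producing routing logits
    experts:  list of callables, each mapping (n, d_model) -> (n, d_model)
    """
    logits = router(x)                              # (batch, num_experts)
    proposals = []
    for _ in range(num_rosters):
        # Inject controlled randomness: Gumbel noise on the routing logits so
        # each roster can select a different set of experts for the same token.
        gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
        noisy_logits = logits + temperature * gumbel
        weights, idx = torch.topk(noisy_logits, top_k, dim=-1)   # (batch, top_k)
        weights = F.softmax(weights, dim=-1)

        y = torch.zeros_like(x)
        for k in range(top_k):
            for e in range(len(experts)):
                mask = idx[:, k] == e               # tokens routed to expert e in slot k
                if mask.any():
                    y[mask] += weights[mask, k : k + 1] * experts[e](x[mask])
        proposals.append(y)

    # Aggregate the roster proposals; simple averaging stands in for whatever
    # aggregation rule the paper actually uses.
    return torch.stack(proposals).mean(dim=0)
```

In the paper these extra roster passes are made cheap through a batching strategy and a specialized KV-caching mechanism; the naive loop above is only meant to show the sample-multiple-rosters-and-aggregate idea.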

Takeaways, Limitations

Takeaways:
Introduces hyper-parallel scaling, a novel framework that improves prediction quality at the token level during inference.
Develops Roster of Experts (RoE), a training-free inference algorithm for Mixture-of-Experts (MoE) models.
Reduces compute and memory overhead through an efficient batching strategy and a specialized KV-caching mechanism.
Lets smaller models match the performance of larger models without any model fine-tuning.
Limitations:
Specific experimental results and performance comparisons are not detailed in this summary.
The general applicability of RoE across different model architectures is not discussed.
Compatibility and potential synergy with other inference-scaling techniques are not discussed.