Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Created by
  • Haebom

Authors

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Outline

Jet-Nemotron is a novel hybrid-architecture language model that matches or exceeds the accuracy of existing full-attention models while significantly improving generation throughput. It was developed with PostNAS (Post Neural Architecture Search), a novel neural architecture search pipeline that, unlike approaches that train from scratch, starts from a pre-trained full-attention model, freezes its MLP weights, and efficiently explores attention-block designs. The pipeline's key components are full-attention layer placement and removal, linear attention block selection, new attention block design, and hardware-aware hyperparameter search. Compared to Qwen3, Qwen2.5, Gemma3, and Llama3.2, the Jet-Nemotron-2B model achieves comparable or superior accuracy across multiple benchmarks while delivering up to 53.6x higher generation throughput and up to 6.1x faster prefilling. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models such as DeepSeek-V3-Small and Moonlight, even though those models are larger, with 15B total parameters and 2.2B activated parameters.
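
The core PostNAS move, freezing the MLP weights of a pre-trained full-attention model and searching only over the attention blocks, can be illustrated with a small sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the elu-based linear-attention kernel, the toy Block layout, the MSE proxy objective, and the brute-force placement loop are all illustrative assumptions, written in PyTorch.

import copy, itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Toy non-causal linear attention (feature map elu(x)+1), a stand-in
    for the linear-attention candidates the search chooses among."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bsd,bse->bde", k, v)              # O(T) key/value summary
        den = torch.einsum("btd,bd->bt", q, k.sum(1)).unsqueeze(-1)
        return self.out(torch.einsum("btd,bde->bte", q, kv) / (den + 1e-6))

class FullAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return self.mha(x, x, x, need_weights=False)[0]

class Block(nn.Module):
    """Transformer block: the attention sub-module is searchable,
    the MLP is inherited (and frozen) from the pre-trained model."""
    def __init__(self, dim):
        super().__init__()
        self.attn = FullAttention(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(x + self.attn(x))

def hybridize(pretrained, keep_full, dim):
    """Swap linear attention into every layer not in keep_full, then
    freeze everything except the attention blocks."""
    model = copy.deepcopy(pretrained)
    for i, blk in enumerate(model):
        if i not in keep_full:
            blk.attn = LinearAttention(dim)
    for name, p in model.named_parameters():
        p.requires_grad = ".attn." in name   # MLPs and everything else stay frozen
    return model

def adapt_and_score(model, batches, steps=30):
    """Briefly tune only the trainable (attention) weights, then report
    the average loss as the candidate's search score."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-3)
    for x, y in itertools.islice(itertools.cycle(batches), steps):
        loss = F.mse_loss(model(x), y)       # MSE stands in for the real LM objective
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return sum(F.mse_loss(model(x), y).item() for x, y in batches) / len(batches)

if __name__ == "__main__":
    dim, n_layers = 64, 4
    torch.manual_seed(0)
    pretrained = nn.Sequential(*[Block(dim) for _ in range(n_layers)])  # stand-in "teacher"
    data = [(torch.randn(8, 16, dim), torch.randn(8, 16, dim)) for _ in range(4)]
    best = None
    for keep in [set(), {0}, {1}, {2}, {3}]:   # toy placement search: at most one full-attention layer
        score = adapt_and_score(hybridize(pretrained, keep, dim), data)
        print(f"full attention at {sorted(keep)}: loss {score:.4f}")
        if best is None or score < best[0]:
            best = (score, keep)
    print("best placement:", sorted(best[1]))

Because the frozen MLPs are shared across every candidate, each placement only trains the comparatively small attention parameters, which is what makes exploring many attention-block designs affordable.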

Takeaways and Limitations

Takeaways:
• We demonstrate that a hybrid architecture can dramatically improve generation throughput while maintaining the accuracy of a full-attention model.
• We present PostNAS, an efficient model design pipeline that builds on pre-trained models rather than training from scratch.
• The Jet-Nemotron-2B model matches or outperforms existing state-of-the-art models on several benchmarks.
• This suggests that higher performance can be achieved with far fewer activated parameters than larger MoE models require.
Limitations:
• Further research is needed on the generalization of the PostNAS pipeline and its applicability to other types of models.
• The paper lacks an analysis of the energy efficiency of the Jet-Nemotron model.
• The reported gains may be biased toward the specific benchmarks evaluated.
• A more in-depth analysis of the relationship between model size and performance is needed.