Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Created by
  • Haebom

Authors

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Outline

Jet-Nemotron is a hybrid-architecture language model that significantly improves generation throughput while matching or exceeding the accuracy of existing full-attention models. It was developed with PostNAS (Post Neural Architecture Search), a novel architecture-search pipeline that, unlike conventional approaches, starts from a pre-trained full-attention model and freezes its MLP weights, so that attention-block designs can be explored efficiently. The pipeline's key components are the placement and removal of full-attention layers, the selection of a linear attention block, the design of a new attention block, and a hardware-aware hyperparameter search. Across a range of benchmarks, the Jet-Nemotron-2B model achieves accuracy similar to or better than Qwen3, Qwen2.5, Gemma3, and Llama3.2 while delivering up to 53.6x faster generation throughput and 6.1x faster prefilling. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models such as DeepSeek-V3-Small and Moonlight.
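To make the PostNAS idea concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' code: the MLP weights of each block are frozen, and different attention designs (a full-attention block and a crude linear-attention stand-in) are swapped in and scored. The `evaluate` function here is a placeholder; the real pipeline scores candidates by task accuracy and hardware throughput.

```python
# Illustrative PostNAS-style sketch. Names such as LinearAttentionStub,
# build_model, and evaluate are hypothetical, not the paper's implementation.
import torch
import torch.nn as nn

DIM, LAYERS = 64, 4

class FullAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.mha(x, x, x)
        return out

class LinearAttentionStub(nn.Module):
    """Crude stand-in for a linear attention block (O(n) token mixing)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        # Running-mean mixer as a toy linear-complexity substitute.
        return self.proj(torch.cumsum(x, dim=1) / x.size(1))

class Block(nn.Module):
    """One transformer block: a searchable attention module plus an MLP
    whose weights are reused from the pre-trained model and frozen."""
    def __init__(self, dim, attn):
        super().__init__()
        self.attn = attn
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
    def forward(self, x):
        x = x + self.attn(x)
        return x + self.mlp(x)

def build_model(attn_factory):
    return nn.Sequential(*[Block(DIM, attn_factory(DIM)) for _ in range(LAYERS)])

def freeze_mlps(model):
    # PostNAS keeps pre-trained MLP weights fixed; only attention is searched.
    for name, p in model.named_parameters():
        if "mlp" in name:
            p.requires_grad = False

def evaluate(model):
    # Placeholder proxy score on random data; the actual search measures
    # benchmark accuracy and generation/prefilling throughput on hardware.
    x = torch.randn(2, 16, DIM)
    with torch.no_grad():
        return -model(x).pow(2).mean().item()

scores = {}
for factory in (FullAttention, LinearAttentionStub):
    model = build_model(factory)
    freeze_mlps(model)
    scores[factory.__name__] = evaluate(model)
print(scores)
```

Because the frozen MLPs dominate the parameter count, each candidate attention design can be trained and scored at a small fraction of full pre-training cost, which is what makes the search tractable.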

Takeaways, Limitations

Takeaways:
  • A hybrid architecture can dramatically improve throughput while maintaining the accuracy of a full-attention model.
  • PostNAS is presented as an efficient model-design pipeline.
  • Despite its small size (2B parameters), the model outperforms larger models.
  • Both generation speed and prefilling speed improve substantially.
Limitations:
  • The generalizability of the PostNAS pipeline and its applicability to other model architectures require further research.
  • The paper lacks an analysis of the Jet-Nemotron model's energy efficiency.
  • Because the work focuses on improving performance on specific benchmarks, generalization to other tasks and datasets needs further validation.