Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright for each paper belongs to its authors and institutions; when sharing, please cite the source.

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Created by
  • Haebom

Authors

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Jet-Nemotron: High-Speed Language Model

Outline

This paper presents Jet-Nemotron, a hybrid-architecture language model that matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed with Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. PostNAS starts from a pre-trained full-attention model, freezes its MLP weights, and efficiently searches over attention block designs. The pipeline comprises four main components: (1) optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) new attention block design, and (4) hardware-aware hyperparameter search. The Jet-Nemotron-2B model achieves accuracy comparable to or higher than Qwen3, Qwen2.5, Gemma3, and Llama3.2, while delivering up to a 53.6x speedup in generation throughput and a 6.1x speedup in prefilling. It also achieves higher MMLU and MMLU-Pro accuracy than state-of-the-art MoE full-attention models such as DeepSeek-V3-Small and Moonlight.
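The final PostNAS stage, hardware-aware hyperparameter search, can be illustrated with a toy sketch: enumerate candidate attention-block configurations, discard any that miss a throughput budget on the target hardware, and keep the most accurate survivor. The block names, dimensions, and scoring functions below are hypothetical stand-ins, not the paper's actual search space or measurements.

```python
from itertools import product

# Hypothetical candidate attention-block families and hyperparameters;
# the real PostNAS evaluates trained frozen-MLP models on benchmarks and
# measures throughput on target GPUs. All scores here are illustrative.
CANDIDATE_BLOCKS = ["gated_linear_attn", "sliding_window", "jet_block"]
KEY_DIMS = [64, 128]
NUM_HEADS = [8, 16]

def toy_accuracy(block, key_dim, heads):
    """Stand-in for benchmark accuracy of a candidate model."""
    base = {"gated_linear_attn": 0.60, "sliding_window": 0.58, "jet_block": 0.63}
    return base[block] + 0.01 * (key_dim // 64) + 0.005 * (heads // 8)

def toy_throughput(block, key_dim, heads):
    """Stand-in for measured generation throughput (tokens/s) on target hardware."""
    return 1e5 / (key_dim * heads)

def postnas_search(min_throughput=50.0):
    """Return the most accurate candidate meeting the throughput budget."""
    best = None
    for block, kd, h in product(CANDIDATE_BLOCKS, KEY_DIMS, NUM_HEADS):
        if toy_throughput(block, kd, h) < min_throughput:
            continue  # hardware-aware pruning: too slow on the target device
        acc = toy_accuracy(block, kd, h)
        if best is None or acc > best[0]:
            best = (acc, block, kd, h)
    return best
```

The key design point this sketch captures is that accuracy is never compared across candidates that violate the hardware budget, so the search optimizes accuracy subject to a throughput constraint rather than a weighted trade-off.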

Takeaways, Limitations

Takeaways:
Development of Jet-Nemotron, a new hybrid-architecture language model designed via PostNAS.
Accuracy comparable to or higher than existing full-attention models, with significantly improved generation throughput.
Despite its small size, it outperforms larger MoE models such as DeepSeek-V3-Small on MMLU and MMLU-Pro.
Limitations:
Specific details of the model architecture and the PostNAS pipeline are not fully described in this summary.
Information on the benchmarks and evaluation settings used for comparisons with other models is limited.
Further research is needed to establish the model's practical applicability and scalability.