Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

DInfer: An Efficient Inference Framework for Diffusion Language Models

Created by
  • Haebom

Author

Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng

Outline

We present dInfer, an efficient and scalable framework for inference with diffusion-based large language models (dLLMs). dInfer decomposes the inference pipeline into four modular components (the model, the diffusion iteration manager, the decoding strategy, and the KV-cache manager) and integrates novel algorithms and system-level optimizations for each component. On LLaDA-MoE, it achieves significant efficiency gains without compromising output quality: at batch size 1 on 8× H800 GPUs, it processes over 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks. dInfer is more than 10× faster than Fast-dLLM and 2-3× faster than Qwen2.5-3B, a highly optimized AR model served by the state-of-the-art vLLM inference engine.
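The four-component decomposition described above can be sketched in miniature. The class and method names below are purely illustrative (they are not dInfer's actual API): a toy model stands in for the dLLM forward pass, an iteration manager bounds the diffusion refinement steps, a decoding strategy commits positions, and a KV-cache manager avoids recomputing states it has already seen.

```python
# Hypothetical sketch of a four-component dLLM inference pipeline.
# All names are illustrative assumptions, not dInfer's real interfaces.
from dataclasses import dataclass, field

@dataclass
class KVCacheManager:
    """Stores per-step states so repeated work can be skipped."""
    cache: dict = field(default_factory=dict)

    def lookup(self, step):
        return self.cache.get(step)

    def update(self, step, state):
        self.cache[step] = state

class DecodingStrategy:
    """Decides which position to commit at each iteration (greedy toy rule)."""
    def select(self, scores):
        return max(range(len(scores)), key=lambda i: scores[i])

class IterationManager:
    """Controls how many diffusion refinement steps to run."""
    def __init__(self, max_steps):
        self.max_steps = max_steps

    def steps(self):
        return range(self.max_steps)

class ToyModel:
    """Stand-in for the dLLM forward pass: returns per-position scores."""
    def forward(self, tokens):
        return [t * 0.1 for t in tokens]

def run_pipeline(tokens, max_steps=3):
    model, it = ToyModel(), IterationManager(max_steps)
    dec, kv = DecodingStrategy(), KVCacheManager()
    committed = []
    for step in it.steps():
        # Reuse cached scores when available; otherwise run the model.
        scores = kv.lookup(step)
        if scores is None:
            scores = model.forward(tokens)
            kv.update(step, scores)
        committed.append(dec.select(scores))
    return committed
```

Because each component sits behind a small interface, any one of them (e.g. the decoding strategy) can be swapped out or optimized independently, which is the modularity the outline attributes to dInfer.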

Takeaways, Limitations

Takeaways:
  • Provides an efficient, scalable framework for dLLM inference.
  • Achieves efficiency gains through a combination of algorithmic innovation and system-level optimization.
  • Over 10× faster than Fast-dLLM and 2-3× faster than Qwen2.5-3B.
  • Open-source implementation ( https://github.com/inclusionAI/dInfer )
Limitations:
  • No limitations are specified in the paper itself (this summary is based on the abstract).