Daily Arxiv

This page collects and organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Created by
  • Haebom

Authors

Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, Bowen Zhou

SDAR: Synergistic Diffusion-Autoregression for Scalable, High-Throughput Reasoning

Outline

SDAR is a synergistic diffusion-autoregression paradigm that combines the training efficiency of autoregressive (AR) models with the parallel decoding capabilities of diffusion models. Instead of expensive end-to-end diffusion training, SDAR converts a well-trained AR model into a block-wise diffusion model through a simple, data-efficient adaptation. During inference, SDAR generates autoregressively across blocks to preserve global consistency, while decoding all tokens within each block in parallel via a discrete diffusion process. Because AR training is substantially more compute-efficient than masked diffusion training, this AR-to-diffusion conversion comes at minimal cost, enabling parallel generation while preserving AR-level performance. Studies across model scales show that SDAR is robust to the choice of block size and decoding threshold, delivering significant speedups without loss of accuracy. SDAR also demonstrates enhanced reasoning capability and domain adaptability: the 30B MoE model outperforms its AR counterpart on demanding scientific reasoning benchmarks such as GPQA and ChemBench, and improves further with test-time scaling methods such as majority voting and pass@k.
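
The two-level decoding loop described above can be made concrete with a short sketch: autoregression over blocks, iterative confidence-thresholded unmasking within each block. This is a minimal illustration under stated assumptions, not the authors' implementation; `model` (a callable returning per-position logits over the full sequence), `MASK_ID`, `block_size`, `threshold`, and the greedy commit rule are all placeholders.

```python
import torch

MASK_ID = 0          # placeholder id for the [MASK] token (assumed)
BLOCK_SIZE = 16      # tokens decoded in parallel per block (assumed)
THRESHOLD = 0.9      # confidence needed to commit a token (assumed)

def decode_block(model, prefix, block_size=BLOCK_SIZE, threshold=THRESHOLD):
    """Fill one block by iterative parallel unmasking (discrete diffusion)."""
    block = torch.full((block_size,), MASK_ID, dtype=torch.long)
    for _ in range(block_size):  # at most one forced commit per step
        masked = block == MASK_ID
        if not masked.any():
            break
        # One forward pass predicts every masked position in parallel.
        logits = model(torch.cat([prefix, block]))[-block_size:]
        conf, pred = logits.softmax(-1).max(-1)
        # Commit tokens whose confidence clears the threshold; always commit
        # at least the most confident masked token to guarantee progress.
        commit = masked & (conf >= threshold)
        if not commit.any():
            commit[conf.masked_fill(~masked, -1.0).argmax()] = True
        block[commit] = pred[commit]
    return block

def generate(model, prompt, num_blocks):
    """Autoregress over blocks; diffuse in parallel within each block."""
    seq = prompt.clone()
    for _ in range(num_blocks):
        block = decode_block(model, seq)
        seq = torch.cat([seq, block])  # committed blocks condition the next
    return seq
```

Raising `THRESHOLD` trades speed for caution: more diffusion steps per block, but each committed token is higher-confidence, which is consistent with the paper's finding that accuracy is robust across decoding thresholds.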

Takeaways, Limitations

• Combines the training efficiency of autoregressive models with the parallel decoding of diffusion models.
• A simple, data-efficient AR-to-diffusion adaptation keeps the conversion cost minimal.
• Robust to the choice of block size and decoding threshold.
• Shows enhanced reasoning capability and domain adaptability.
• The 30B MoE model excels on scientific reasoning benchmarks such as GPQA and ChemBench, with further gains from test-time scaling (a minimal sketch of these methods follows below).
• Further research is needed on the performance gap between diffusion and AR models.
• The details and generalizability of the conversion and adaptation process warrant further review.
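
For context on the test-time scaling methods named in the outline, here is a minimal sketch of majority voting and the simple empirical form of pass@k, assuming a hypothetical list of `k` sampled answers and a caller-supplied `is_correct` checker (neither is from the paper).

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Majority voting: return the most frequent final answer among k samples."""
    return Counter(answers).most_common(1)[0][0]

def pass_at_k(answers: list[str], is_correct) -> bool:
    """Empirical pass@k: does any of the k sampled answers solve the problem?"""
    return any(is_correct(a) for a in answers)
```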