Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Created by
  • Haebom

Authors

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

Outline

This paper proposes Spiffy, a novel inference algorithm that accelerates diffusion language model (dLLM) inference by 2.8-3.1x while losslessly preserving the model's output distribution. Spiffy drafts future states auto-speculatively, using the dLLM's own distribution as the drafter, and organizes the drafts into a novel directed draft graph that exploits the bidirectional, block-wise nature of dLLM generation, allowing multiple drafts to be verified in parallel. An efficient offline calibration algorithm selects high-quality graph configurations, further raising the acceptance rate. Because these gains multiply with those of other parallel decoding optimizations such as KV-caching and multi-token unmasking, the combination reaches total speedups of up to 7.9x.
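The following is a minimal, illustrative Python sketch of the draft-graph idea, not the authors' implementation: draft states are expanded auto-speculatively from the dLLM's own (cheap) draft distribution, arranged in a directed graph, and the deepest exactly-verified state is accepted so the output distribution is unchanged. All names here (DraftNode, build_draft_graph, accept_losslessly, draft_fn, verify_fn) are hypothetical stand-ins; in the real algorithm, verification of all graph nodes is batched into a single dLLM forward pass and graph shapes are calibrated offline.

```python
from dataclasses import dataclass, field

@dataclass
class DraftNode:
    tokens: tuple                       # candidate unmasking of the current block
    depth: int = 0
    children: list = field(default_factory=list)

def build_draft_graph(draft_fn, root_tokens, depth, branching):
    """Auto-speculative expansion: each child refines its parent's block
    using the dLLM's own cheap draft distribution (no separate draft model)."""
    root = DraftNode(tuple(root_tokens))
    frontier = [root]
    for d in range(1, depth + 1):
        next_frontier = []
        for node in frontier:
            for _ in range(branching):
                child = DraftNode(draft_fn(node.tokens), d)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root

def accept_losslessly(root, verify_fn):
    """Accept the deepest draft state whose tokens the verifier would itself
    produce; rejecting any mismatching branch keeps decoding lossless.
    (In practice, all graph nodes are verified in one batched forward pass.)"""
    best = root
    stack = [root]
    while stack:
        node = stack.pop()
        if not verify_fn(node.tokens):
            continue                    # prune: this draft diverges from the model
        if node.depth > best.depth:
            best = node
        stack.extend(node.children)     # only descend through accepted states
    return best.tokens

if __name__ == "__main__":
    # Toy stand-ins: the drafter increments every token; the verifier
    # accepts any state whose tokens have not overshot a target value.
    draft = lambda toks: tuple(t + 1 for t in toks)
    verify = lambda toks: all(t <= 3 for t in toks)
    graph = build_draft_graph(draft, (0, 0), depth=3, branching=2)
    print(accept_losslessly(graph, verify))  # -> (3, 3)
```

Accepting a node at depth d commits d drafted steps for the cost of one verification pass, which is where the 2.8-3.1x speedup comes from when the acceptance rate is high.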

Takeaways, Limitations

Takeaways:
Proposes Spiffy, which accelerates dLLM inference by 2.8-3.1x while losslessly preserving the output distribution.
Generates draft states auto-speculatively, using the dLLM's own distribution as the drafter.
Introduces a novel directed draft graph design that exploits the bidirectional, block-wise dLLM generation process.
Optimizes graph configurations for high acceptance rates via an efficient offline calibration algorithm.
Achieves up to 7.9x total speedup in synergy with other parallel decoding techniques such as KV-caching and multi-token unmasking.
Limitations:
No specific Limitations are mentioned in the paper.