Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fine-Grained Perturbation Guidance via Attention Head Selection

Created by
  • Haebom

Authors

Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim

Outline

This paper studies guidance methods for controlling the reverse sampling process in diffusion models. In particular, it focuses on attention perturbation, which has shown strong empirical performance in unconditional generation settings where classifier-free guidance is not applicable. Existing attention perturbation methods lack principled criteria for deciding where to apply the perturbation, especially in the Diffusion Transformer (DiT) architecture, where quality-relevant computations are distributed across many layers. The paper investigates the granularity of attention perturbation, going from the layer level down to individual attention heads, and finds that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, the authors propose "HeadHunter," a systematic framework that iteratively selects attention heads aligned with user-defined objectives, enabling fine-grained control over generation quality and visual attributes. They also introduce SoftPAG, which linearly interpolates each selected head's attention map toward the identity matrix, providing a continuous knob for tuning perturbation strength while suppressing artifacts. This approach not only alleviates the over-smoothing problem of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. The method is validated on state-of-the-art large-scale DiT-based text-to-image models, including Stable Diffusion 3 and FLUX.1, showing strong performance in both general quality improvement and style-specific guidance. This work provides the first head-level analysis of attention perturbation in diffusion models, revealing interpretable specialization within attention layers and enabling the practical design of effective perturbation strategies.
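The SoftPAG interpolation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a row-stochastic (softmax-normalized) self-attention map and a single scalar `strength`, where 0 leaves the map unchanged and 1 replaces it with the identity matrix (each token attends only to itself).

```python
import numpy as np

def softpag_attention(attn_map: np.ndarray, strength: float) -> np.ndarray:
    """Linearly interpolate an attention map toward the identity matrix.

    Hypothetical sketch of SoftPAG's perturbation:
      attn_map : (seq_len, seq_len) row-stochastic attention matrix
      strength : interpolation weight in [0, 1]
    Because both endpoints are row-stochastic, the result is too.
    """
    n = attn_map.shape[-1]
    identity = np.eye(n)
    return (1.0 - strength) * attn_map + strength * identity

# Toy example: uniform attention over 4 tokens, perturbed at strength 0.5.
attn = np.full((4, 4), 0.25)
perturbed = softpag_attention(attn, 0.5)
```

In a real pipeline this would be applied inside the attention computation of the selected heads only, with `strength` acting as the continuous perturbation knob.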

Takeaways, Limitations

Takeaways:
  • Provides new insights into the role of individual attention heads in Diffusion Transformer models.
  • Presents the HeadHunter and SoftPAG frameworks, which enable fine-grained control over the quality and style of generated images by manipulating individual attention heads.
  • Overcomes the limitations of conventional layer-level perturbation, enabling more effective and precise image generation.
  • Experimentally validates performance improvements on state-of-the-art large-scale models such as Stable Diffusion 3 and FLUX.1.
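The iterative head selection at the core of HeadHunter can be illustrated with a simple greedy loop. This is a hypothetical sketch, not the paper's actual procedure: `score_fn` stands in for a user-defined objective evaluated on samples generated with the candidate head set, and the stopping rule (no further improvement) is an assumption for illustration.

```python
def headhunter_select(candidate_heads, score_fn, k):
    """Greedily select up to k attention heads that maximize score_fn.

    Hypothetical sketch of iterative head selection:
      candidate_heads : iterable of head identifiers
      score_fn        : maps a list of selected heads to a scalar score
                        (stand-in for a user-centric quality objective)
      k               : maximum number of heads to select
    Stops early if adding the best remaining head does not improve the score.
    """
    selected = []
    remaining = list(candidate_heads)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda h: score_fn(selected + [h]))
        if score_fn(selected + [best]) <= score_fn(selected):
            break  # no remaining head improves the objective
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example with integer "heads" and sum as a stand-in score:
chosen = headhunter_select([1, 2, 3, 4], sum, 2)  # → [4, 3]
```

In practice each `score_fn` evaluation would involve generating images with perturbation applied to the candidate set, which is why the summary notes that the selection process can be computationally expensive.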
Limitations:
  • HeadHunter's head selection process can still be computationally expensive.
  • How specific attention heads relate to specific visual concepts requires further study.
  • The generalization of the proposed method needs broader evaluation across models and datasets.
  • SoftPAG's linear interpolation may not always guarantee optimal performance.