Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Created by
  • Haebom

Authors

Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar

ZeroTuning: Training-Free LLM Improvement via Initial Token Tuning

Outline

This paper proposes ZeroTuning, a training-free method that improves LLM performance by lightly biasing attention toward the initial token, addressing the limitations of existing token-level attention-tuning methods such as Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT). The authors show theoretically that biasing the initial token modulates the entropy of downstream attention distributions, particularly in the early layers, and that different attention heads prefer different scaling directions. ZeroTuning applies head-specific attention scaling to the initial token so as to minimize the model's output entropy, and can be implemented with just four lines of modification to the LlamaAttention code. Two variants (supervised and unsupervised) are presented, and the method outperforms existing approaches on 15 datasets. With Llama-3.1-8B, it achieves relative gains of 19.9% on classification tasks, 4.5% on question answering tasks, and 2.1% on conversational tasks, while remaining effective under quantized inference and long context lengths.
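The central operation, rescaling how strongly each attention head attends to the initial token and then renormalizing, can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' actual patch: the function name and the per-head factor `gamma` are assumptions, and in ZeroTuning the equivalent scaling would sit inside the LlamaAttention forward pass, applied to the post-softmax attention weights.

```python
import torch

def scale_initial_token_attention(attn_weights: torch.Tensor,
                                  gamma: torch.Tensor) -> torch.Tensor:
    """Rescale the attention each head pays to the initial (first) token.

    attn_weights: post-softmax attention, shape (batch, num_heads, q_len, k_len)
    gamma:        per-head scaling factors, shape (num_heads,);
                  gamma > 1 strengthens attention on the initial token,
                  gamma < 1 weakens it.
    """
    scaled = attn_weights.clone()
    # Scale the column that attends to the initial token, independently per head.
    scaled[:, :, :, 0] = scaled[:, :, :, 0] * gamma.view(1, -1, 1)
    # Renormalize so every query's attention distribution still sums to 1.
    return scaled / scaled.sum(dim=-1, keepdim=True)


# Example: weaken the initial token on half the heads of a toy attention tensor.
if __name__ == "__main__":
    batch, heads, q_len, k_len = 1, 8, 4, 4
    attn = torch.softmax(torch.randn(batch, heads, q_len, k_len), dim=-1)
    gamma = torch.cat([torch.full((4,), 0.7), torch.full((4,), 1.3)])
    tuned = scale_initial_token_attention(attn, gamma)
    print(tuned.sum(dim=-1))  # all ones: distributions stay normalized
```

Under the description above, the unsupervised variant would choose the per-head factors without labels, for example by searching for values that minimize the entropy of the model's output distribution, while the supervised variant would select them using labeled validation data.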

Takeaways, Limitations

Takeaways:
• LLM performance can be effectively improved by tuning the initial token alone, without any training.
• The method outperforms existing approaches across a wide range of datasets with a simple, easy-to-implement modification.
• Performance is maintained under quantized inference and long context lengths, improving practicality.
• Two variants, supervised and unsupervised, provide flexibility in how the tuning factors are chosen.
Limitations:
• The optimal way to tune the initial token may depend on model architecture and hyperparameters, which calls for further study.
• Generalizability to other architectures and model sizes requires further evaluation.
• The impact of adjusting the initial token's attention bias on other aspects of model behavior needs further analysis.