
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

Created by
  • Haebom

Author

Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan

Outline

In this paper, we propose Divergence-driven Zeroth-Order optimization (DiZO), a novel optimization technique that addresses the limitations of memory-efficient zeroth-order (ZO) optimization for fine-tuning large language models (LLMs). Existing ZO methods are memory-efficient because they estimate gradients using only forward passes, but they converge much more slowly and less accurately than first-order (FO) methods. DiZO analyzes how FO and ZO update patterns diverge and introduces a layer-wise, divergence-driven adaptation that adjusts the update magnitude to each layer's optimization needs. Experiments show that DiZO significantly reduces the number of iterations required to converge, cutting training GPU time by up to 48% across various datasets. It also outperforms existing ZO techniques when fine-tuning RoBERTa-large, the OPT series, and the Llama series, and in some cases even surpasses memory-intensive FO fine-tuning.
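For intuition, below is a minimal, illustrative sketch of the kind of SPSA-style zeroth-order update this line of work builds on: the gradient is estimated from two forward passes under a random perturbation, with no backward pass. The names `zo_step`, `loss_fn`, and `batch`, and the norm-based layer-wise scaling, are assumptions for illustration only; the scaling is a stand-in for DiZO's divergence-driven, layer-wise adaptation, not the authors' actual rule.

```python
# Minimal sketch of a zeroth-order (ZO) update with a hypothetical layer-wise
# scaling step. NOT the authors' DiZO algorithm; the scaling rule is illustrative.
import torch
import torch.nn as nn


def zo_step(model: nn.Module, loss_fn, batch, eps: float = 1e-3, lr: float = 1e-6):
    """One SPSA-style ZO update: two forward passes, no backward pass."""
    params = [p for p in model.parameters() if p.requires_grad]
    # One random perturbation direction per parameter tensor.
    # (Memory-efficient implementations regenerate this from a saved seed
    # instead of storing it.)
    zs = [torch.randn_like(p) for p in params]

    with torch.no_grad():
        # Forward pass at theta + eps * z
        for p, z in zip(params, zs):
            p.add_(z, alpha=eps)
        loss_plus = loss_fn(model, batch)

        # Forward pass at theta - eps * z (move back by 2 * eps)
        for p, z in zip(params, zs):
            p.add_(z, alpha=-2 * eps)
        loss_minus = loss_fn(model, batch)

        # Restore the original parameters.
        for p, z in zip(params, zs):
            p.add_(z, alpha=eps)

        # Scalar projected-gradient estimate shared by all parameters.
        grad_scalar = (loss_plus - loss_minus) / (2 * eps)

        # Hypothetical layer-wise adaptation: scale each tensor's update by its
        # parameter norm so layers of different size are not uniformly updated.
        # This stands in for DiZO's divergence-driven, layer-wise adjustment.
        for p, z in zip(params, zs):
            layer_scale = p.norm() / (z.norm() + 1e-12)
            p.add_(z, alpha=-(lr * grad_scalar * layer_scale).item())

    return (loss_plus + loss_minus) / 2
```

In practice, memory-efficient ZO methods such as MeZO regenerate each perturbation from a stored random seed rather than keeping it in memory, which is what makes the forward-only approach so cheap in memory compared with FO fine-tuning.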

Takeaways, Limitations

Takeaways:
• Demonstrates that memory-efficient fine-tuning of large language models is feasible via zeroth-order optimization.
• Introduces the DiZO algorithm to address the slow convergence and lower accuracy of existing zeroth-order optimization.
• Shows superior performance over existing ZO methods across various LLMs and datasets.
• Delivers training time and cost savings (up to a 48% reduction in training GPU time).
Limitations:
• The code is released only through an anonymous link, so access to and verification of the implementation may be limited.
• The paper may lack analysis of how performance varies across different hyperparameter settings.
• The results may be biased toward certain types of LLMs or datasets; more extensive experiments may be needed.