Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, please cite the source.

FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Created by
  • Haebom

Authors

Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh

Outline

This paper proposes a low-rank optimization method that restricts learning to a low-dimensional subspace in order to reduce the running time of large language model (LLM) training and the memory footprint of adaptive optimizers. Prior work projects the gradients of linear layers into low-rank subspaces obtained via singular value decomposition (SVD) or QR decomposition. However, applying these decompositions to each layer individually is computationally expensive, and storing the projection matrices incurs additional memory cost.

This study instead proposes a computationally efficient and conceptually simple two-step procedure that approximates SVD/QR-based gradient projection using the predefined orthogonal matrix of the discrete cosine transform (DCT). For each layer, the columns of the DCT matrix best aligned with that layer's gradient are selected dynamically: an effective projection matrix is obtained in O(n³) time via a single matrix multiplication (matmul) with the DCT matrix, followed by a lightweight alignment step that identifies the most relevant basis vectors. For large layers, the DCT can be computed in O(n² log n) time using Makhoul's N-point algorithm based on the fast Fourier transform (FFT). Because the orthogonal basis is predefined, it is computed only once at the start of training.

Experiments on pretraining and fine-tuning tasks show performance on par with the more expensive SVD/QR-based methods, while achieving rank-independent running times, up to 25% faster execution, and reduced memory usage across a range of model sizes.
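To make the two-step procedure concrete, here is a minimal PyTorch sketch under stated assumptions: the function names (dct_matrix, dct_via_fft, select_subspace) and the energy-based alignment score are illustrative choices, not the paper's actual implementation, and the adaptive optimizer update itself is omitted.

```python
import math
import torch

def dct_matrix(n: int, device=None, dtype=torch.float32) -> torch.Tensor:
    """Orthonormal DCT-II basis; rows are basis vectors.
    Predefined, so it is built only once at the start of training."""
    k = torch.arange(n, dtype=dtype, device=device).unsqueeze(1)  # frequency index
    i = torch.arange(n, dtype=dtype, device=device).unsqueeze(0)  # sample index
    C = math.sqrt(2.0 / n) * torch.cos(math.pi * k * (2 * i + 1) / (2 * n))
    C[0] /= math.sqrt(2.0)  # rescale first row so that C @ C.T == I
    return C

def dct_via_fft(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized DCT-II of each column of x (shape (n, m)) via Makhoul's
    N-point FFT algorithm: O(n log n) per column instead of O(n^2)."""
    n = x.shape[0]
    v = torch.cat([x[0::2], x[1::2].flip(0)])  # Makhoul's even/odd reordering
    V = torch.fft.fft(v, dim=0)                # a single N-point FFT
    k = torch.arange(n, device=x.device)
    phase = torch.exp(-1j * math.pi * k / (2 * n)).unsqueeze(-1)
    return 2 * (phase * V).real

def select_subspace(G: torch.Tensor, C: torch.Tensor, rank: int) -> torch.Tensor:
    """Two-step selection: one matmul with the DCT matrix, then a lightweight
    alignment step that keeps the `rank` best-aligned basis vectors."""
    coeffs = C @ G                   # step 1: coefficients of every basis vector
    energy = coeffs.norm(dim=1)      # step 2: alignment score per basis vector
    idx = torch.topk(energy, rank).indices
    return C[idx]                    # (rank, n) projection matrix for this layer

# Usage on one linear layer's gradient:
n, m, rank = 1024, 4096, 128
C = dct_matrix(n)                # built once, reused for every training step
G = torch.randn(n, m)            # gradient of an (n, m) weight matrix
P = select_subspace(G, C, rank)
G_low = P @ G                    # (rank, m) gradient fed to the adaptive optimizer
update = P.T @ G_low             # lift the low-rank update back to full shape
```

Because C is fixed, only the selected row indices change across steps; no per-layer decomposition is recomputed and no separate projection matrices are stored, which is where the runtime and memory savings described above come from.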

Takeaways, Limitations

Takeaways:
  • Proposes a computationally efficient method for approximating SVD/QR-based gradient projection.
  • Reduces training time and memory usage by leveraging the DCT.
  • Achieves rank-independent execution times.
  • Matches the performance of SVD/QR-based methods.
Limitations:
  • Specific limitations are not stated in the abstract.