Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Created by
  • Haebom

Authors

Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova

Outline

This paper presents a theoretical framework for applying the minimum description length (MDL) principle to machine learning, addressing the lack of a universal measure of model complexity for neural networks such as the Transformer. Drawing on Kolmogorov complexity theory, it introduces the notion of an asymptotically optimal description length objective: minimizing such an objective achieves optimal compression on any dataset, up to additive constants, as model resources increase. By proving the computational universality of the Transformer, the paper shows that an asymptotically optimal objective exists for this architecture. It then constructs and analyzes a practical, differentiable variational objective based on an adaptive Gaussian mixture prior. In experiments on algorithmic tasks, this variational objective selects low-complexity solutions with high generalization performance, but standard optimizers fail to find such solutions from random initialization, highlighting a key optimization challenge. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, the work suggests a potential path toward neural network training that achieves better compression and generalization.
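
The outline mentions a practical, differentiable variational objective built from an adaptive Gaussian mixture prior over the model's parameters. The snippet below is a minimal PyTorch sketch of that general idea, not the paper's actual objective: the function names are illustrative, and charging each parameter its negative log prior probability is a simplification of a full variational bound (which would also involve a posterior over the weights).

```python
import torch
import torch.nn.functional as F

def mixture_log_prob(w, means, log_stds, logit_weights):
    # log p(w) under a K-component Gaussian mixture whose parameters are learned.
    log_mix = F.log_softmax(logit_weights, dim=0)               # (K,) mixture weights
    comps = torch.distributions.Normal(means, log_stds.exp())   # K Gaussian components
    log_probs = comps.log_prob(w.unsqueeze(-1)) + log_mix       # (..., K)
    return torch.logsumexp(log_probs, dim=-1)                   # (...,)

def description_length_objective(model, nll, means, log_stds, logit_weights):
    # Two-part code, in nats: L(data | model) + L(model).
    # Data cost: negative log-likelihood of the training data under the model.
    # Model cost: negative log prior probability of the flattened parameters
    # under the adaptive mixture (an illustrative simplification, not the
    # paper's exact formulation).
    params = torch.cat([p.flatten() for p in model.parameters()])
    model_cost = -mixture_log_prob(params, means, log_stds, logit_weights).sum()
    return nll + model_cost
```

In such a sketch, the mixture parameters (means, log_stds, logit_weights) would be registered as learnable tensors and optimized jointly with the model, so the prior adapts to the empirical distribution of the weights and keeps the objective differentiable end to end.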

Takeaways, Limitations

Takeaways:
Provides a theoretical basis for applying the MDL principle to neural networks, especially Transformers.
Introduces an asymptotically optimal description length objective that guarantees optimal compression up to additive constants.
Strengthens the theoretical foundation by proving the computational universality of the Transformer.
Presents a practical, differentiable variational objective based on an adaptive Gaussian mixture prior and validates it experimentally.
Shows the potential to find low-complexity solutions with better generalization performance.
Limitations:
Standard optimizers struggle to find the low-complexity solutions favored by the objective from random initialization, highlighting a key optimization challenge.
Experiments were limited to specific algorithmic tasks; generalizability to a broader range of tasks requires further research.