Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling

Created by
  • Haebom

Authors

Yukun Zhang, Xueqing Zhou

PDE-Transformer: A New Paradigm for Sequence Modeling

Outline

This paper proposes the PDE-Transformer, a novel sequence modeling paradigm that views the forward pass of a Transformer as a numerical discretization of a continuous reaction-diffusion system derived from a variational energy function. In this framework, token embeddings evolve according to partial differential equations in which nonlocal integral terms model self-attention, local reaction terms model the feed-forward layer, diffusion terms encode positional smoothing, and stability control terms correspond to layer normalization. From this integrated perspective, the authors design an adaptive PDE-diffusion layer, an efficient, learnable finite-difference stencil with linear time complexity that applies local smoothing in feature space and complements the global routing of self-attention. Through a systematic theoretical analysis based on four pillars—stability, diffusion geometry, multiscale dynamics, and component coupling—the authors derive principled guidelines for integrating PDE layers at seven candidate locations within the Transformer. On the Long Range Arena benchmark, placing the layer immediately after the embedding improved accuracy by an average of 4.1% over a strong baseline, and adaptive multi-scale transformation provided further gains. The study thus offers a principled, lightweight mechanism for enhancing long-range dependency modeling by combining continuous PDE smoothing with discrete self-attention.
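The abstract describes these continuous dynamics only in words. A schematic reaction-diffusion form consistent with that description might read as follows, where u(x, t) denotes the token-embedding field and D, K, f, and lambda are illustrative symbols chosen here, not notation taken from the paper:

$$\frac{\partial u(x,t)}{\partial t} \;=\; \underbrace{D\,\nabla^{2} u(x,t)}_{\text{diffusion (positional smoothing)}} \;+\; \underbrace{\int K(x,y)\,u(y,t)\,dy}_{\text{nonlocal term (self-attention)}} \;+\; \underbrace{f\!\big(u(x,t)\big)}_{\text{local reaction (feed-forward)}} \;-\; \underbrace{\lambda\,u(x,t)}_{\text{stability control (normalization)}}$$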

Takeaways, Limitations

Takeaways:
Presents a new perspective that interprets the Transformer forward pass as a numerical discretization of partial differential equations (PDEs).
Provides an integrated view by mapping self-attention, the feed-forward layer, positional encoding, and layer normalization to corresponding PDE terms.
Proposes an efficient adaptive PDE-diffusion layer with linear time complexity to improve long-range dependency modeling (see the sketch at the end of this section).
Demonstrates strong performance on the Long Range Arena benchmark.
Theoretical analysis yields principled guidelines for where to integrate PDE layers.
Limitations:
Specific limitations are not directly mentioned in the abstract; further analysis would be needed to identify them.
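To make the adaptive PDE-diffusion layer concrete, the following is a minimal PyTorch sketch of a learnable finite-difference stencil applied along the sequence axis, placed for example right after the embedding. This is not the authors' implementation; all class, parameter, and variable names are hypothetical, and only the general idea (a per-channel 3-point stencil plus an explicit Euler diffusion step, linear in sequence length) is taken from the summary above.

# Hypothetical sketch of a learnable finite-difference diffusion layer.
# Not the paper's code; names and initialization choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionStencil1D(nn.Module):
    """Per-channel 3-point finite-difference stencil along the sequence axis,
    applied as a discrete diffusion step x + alpha * Lap(x).
    Cost is O(sequence_length), unlike quadratic self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        # Learnable stencil per feature channel, initialized to the
        # standard discrete Laplacian [1, -2, 1]; shape (dim, 1, 3).
        stencil = torch.tensor([1.0, -2.0, 1.0]).repeat(dim, 1, 1)
        self.stencil = nn.Parameter(stencil)
        # Per-channel diffusion coefficient (step size), kept small at init.
        self.alpha = nn.Parameter(torch.full((dim,), 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> conv1d expects (batch, dim, seq_len)
        h = x.transpose(1, 2)
        # Depthwise convolution applies one stencil per channel.
        lap = F.conv1d(h, self.stencil, padding=1, groups=h.size(1))
        h = h + self.alpha.view(1, -1, 1) * lap  # explicit Euler diffusion step
        return h.transpose(1, 2)

if __name__ == "__main__":
    layer = DiffusionStencil1D(dim=64)
    tokens = torch.randn(2, 128, 64)   # e.g. embeddings right after the embedding layer
    print(layer(tokens).shape)         # torch.Size([2, 128, 64])

Because the stencil only touches neighboring positions, this layer provides the local smoothing described in the outline, while global token mixing is still left to self-attention.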