In this paper, we propose StackTrans to address a known limitation of the Transformer architecture: its failure to effectively capture the Chomsky hierarchy (e.g., regular expressions or deterministic context-free grammars). Inspired by pushdown automata, StackTrans explicitly integrates hidden state stacks between Transformer layers. Its stack operations (push and pop) are differentiable, end-to-end trainable, and compatible with existing frameworks such as flash-attention. StackTrans outperforms existing Transformer models and other baselines on Chomsky-hierarchy tasks and large-scale natural language benchmarks, and scales from 360 million to 7 billion parameters. Notably, StackTrans-360M outperforms several open-source LLMs with 2–3 times more parameters, demonstrating its efficiency and reasoning capability.
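To make the stack mechanism concrete, the following is a minimal sketch of a differentiable stack module that could sit between Transformer layers. It assumes a soft mixture of push, pop, and no-op operations weighted by learned gates; the class name, gating scheme, and update rule are illustrative assumptions, not the implementation described in the paper.

```python
# Hypothetical sketch: a soft (differentiable) stack between Transformer layers.
# The push/pop/no-op mixture keeps the update end-to-end trainable.
import torch
import torch.nn as nn


class DifferentiableStack(nn.Module):
    """Soft stack: each step mixes push, pop, and no-op, weighted by learned gates."""

    def __init__(self, d_model: int, depth: int = 8):
        super().__init__()
        self.depth = depth
        self.gate = nn.Linear(d_model, 3)          # logits for (push, pop, no-op)
        self.write = nn.Linear(d_model, d_model)   # value written on a push

    def forward(self, h: torch.Tensor, stack: torch.Tensor):
        # h:     (batch, d_model)        hidden state from a Transformer layer
        # stack: (batch, depth, d_model) current stack contents, top at index 0
        p_push, p_pop, p_noop = self.gate(h).softmax(-1).unbind(-1)
        v = self.write(h)

        # Push: shift contents down and write the new value on top.
        pushed = torch.cat([v.unsqueeze(1), stack[:, :-1]], dim=1)
        # Pop: shift contents up, padding the bottom with zeros.
        popped = torch.cat([stack[:, 1:], torch.zeros_like(stack[:, :1])], dim=1)

        new_stack = (
            p_push[:, None, None] * pushed
            + p_pop[:, None, None] * popped
            + p_noop[:, None, None] * stack
        )
        # The stack top can be fed back into the next layer's residual stream.
        return new_stack[:, 0], new_stack


if __name__ == "__main__":
    d_model, depth, batch = 64, 8, 2
    layer = DifferentiableStack(d_model, depth)
    h = torch.randn(batch, d_model)
    stack = torch.zeros(batch, depth, d_model)
    top, stack = layer(h, stack)
    print(top.shape, stack.shape)  # torch.Size([2, 64]) torch.Size([2, 8, 64])
```

Because the update is a convex combination of tensor shifts, gradients flow through the gate weights, which is one plausible way to keep push and pop trainable without discrete choices.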