Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

Created by
  • Haebom

Authors

Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross

Outline

In this paper, we investigate whether the gains in mathematical reasoning that large language models obtain from post-training (instruction tuning, reinforcement learning, and knowledge distillation) come from substantial changes to the transformer layers, or from small adjustments that leave the base model's relative layer-importance structure intact. Using layer-wise removal (ablation) experiments, we compare base, instruction-tuned, knowledge-distilled, and reinforcement-learned models on mathematical reasoning benchmarks. We show that mathematical reasoning relies on a specific layer-importance structure that persists across all post-training methods: removing one of these critical layers degrades accuracy by up to 80%, whereas no such critical layers appear for non-mathematical tasks such as factual recall. This suggests that mathematical reasoning depends on particular layers forged during pre-training, while other, non-reasoning tasks do not. From an information-theoretic perspective, we find that these critical layers coincide with the layers where the main representational transformations occur.
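The layer-removal probe described above can be sketched in a few lines of code. The snippet below is a hypothetical illustration, not the authors' implementation: it uses a small GPT-2 checkpoint from Hugging Face and average token log-probability on a toy prompt as a stand-in for benchmark accuracy, simply to show how deleting one transformer block at a time exposes which layers matter.

```python
# Minimal layer-ablation sketch (illustrative only; the paper's models,
# benchmarks, and evaluation metric are not reproduced here).
import copy
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ablate_layer(model, layer_idx):
    """Return a copy of the model with one transformer block removed."""
    ablated = copy.deepcopy(model)
    ablated.transformer.h = nn.ModuleList(
        [blk for i, blk in enumerate(ablated.transformer.h) if i != layer_idx]
    )
    return ablated

@torch.no_grad()
def avg_logprob(model, text):
    """Average token log-probability; a toy stand-in for benchmark accuracy."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return -loss.item()

prompt = "Q: What is 17 * 24? A: 408"
baseline = avg_logprob(base, prompt)
for idx in range(base.config.n_layer):
    score = avg_logprob(ablate_layer(base, idx), prompt)
    print(f"removed layer {idx:2d}: log-prob drop = {baseline - score:.3f}")
```

In the paper's setting, the same loop would run over a math-capable model and a reasoning benchmark; the layers whose removal causes the largest accuracy drop are the "important" layers whose structure is then compared across base and post-trained variants.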

Takeaways, Limitations

Takeaways: Mathematical reasoning relies on specific transformer layers that are formed during pre-training and remain consistently important across different post-training methods. This finding offers useful guidance for designing efficient post-training strategies to improve the mathematical reasoning ability of large language models. The information-theoretic analysis, which links these critical layers to the main representational transformations, also contributes to understanding the model's inner workings.
Limitations: The study may be limited to the specific benchmarks and models examined, so validation on a broader range of benchmarks and models is needed, along with a closer look at how "important layers" are defined and measured. How these critical layers emerge during pre-training also remains an open question for further research.