Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Input Time Scaling

Created by
  • Haebom

Authors

Rapheal Huang (Yuming), Weilong Guo

Outline

This paper presents Input Time Scaling (ITS), a new scaling paradigm for large language models (LLMs) that complements existing data-and-training scaling and inference-time scaling. The authors propose improving inputs with various strategies by drawing on LLM meta-knowledge during both training and testing, and they identify a phenomenon they call training-testing co-design: applying query strategies to both training and testing significantly improves performance, while applying them to only one side significantly degrades it.

Interestingly, datasets of low quality can still yield high performance, and randomly selected examples or added irrelevant information sometimes produce the best results, refuting the common inductive bias of "garbage in, garbage out." In fact, datasets composed solely of high-quality data can constrain performance, and models trained on more data of similar quality (15k vs. 1k examples) sometimes perform worse, suggesting caution when simply scaling up datasets. These findings are consistent with the "Less is More" phenomenon, showing that advanced reasoning capabilities can be elicited from a small number of examples.

In experiments with models based on Qwen2.5-32B-Instruct, the method achieves state-of-the-art pass@1 scores of 76.7% on AIME24 and 76.7% on AIME25, and reaches 76.7% on AIME24 and 80% on AIME25 with a three-model majority vote. With DeepSeek-R1-Distill-Qwen-32B as the base, it achieves 86.7% on AIME24 and 76.7% on AIME25. The authors plan to open-source the dataset, data pipeline, evaluation results, and checkpoints for reproducibility and further research.
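To make the training-testing co-design idea concrete, below is a minimal Python sketch of applying one and the same input-augmentation strategy at training time and at test time, combined with a simple majority vote over several models' final answers. The strategy, function names, and data format here are illustrative assumptions; they do not reproduce the paper's actual pipeline or prompts.

```python
# Minimal sketch of training-testing co-design with a hypothetical query-augmentation
# strategy. All names, strategies, and data formats are illustrative assumptions,
# not the paper's actual implementation.
import random
from collections import Counter


def augment_query(question: str, distractors: list[str], k: int = 2, seed: int = 0) -> str:
    """Rewrite an input by prepending randomly chosen, possibly irrelevant context.

    The paper reports that randomly selected examples or irrelevant additions can
    help, provided the same strategy is applied at both training and test time.
    """
    rng = random.Random(seed)
    extra = rng.sample(distractors, k=min(k, len(distractors)))
    return "\n".join(extra + [f"Question: {question}"])


def build_training_example(question: str, answer: str, distractors: list[str]) -> dict:
    # Co-design: the training input goes through the same augmentation that
    # will be applied to queries at inference time.
    return {"input": augment_query(question, distractors), "target": answer}


def majority_vote(final_answers: list[str]) -> str:
    # e.g. a three-model majority vote over extracted final answers
    # (ties are broken by whichever answer was seen first).
    return Counter(final_answers).most_common(1)[0][0]


if __name__ == "__main__":
    distractors = ["Note: today's weather is sunny.", "Fact: 7 is a prime number."]
    print(build_training_example("What is 2 + 2?", "4", distractors)["input"])
    print(majority_vote(["4", "4", "5"]))  # -> "4"
```

The point of the sketch is that `build_training_example` and any inference-time call reuse the identical `augment_query` transformation; per the paper's reported finding, applying the strategy on only one side would be expected to hurt performance.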

Takeaways, Limitations

Takeaways:
  • Proposes Input Time Scaling (ITS), a new scaling paradigm for LLMs.
  • Highlights the importance of training-testing co-design: query strategies must be applied at both training and test time.
  • Confirms that high performance is achievable even with low-quality datasets, refuting the conventional wisdom of "garbage in, garbage out."
  • Confirms that advanced reasoning capabilities can be elicited with small amounts of data, consistent with the "Less is More" phenomenon.
  • Achieves SOTA performance on AIME24 and AIME25.
Limitations:
  • Open-sourcing of the dataset, data pipeline, evaluation results, and checkpoints is still in progress.
  • Further research is needed on the effects of simply scaling up dataset size.
  • Generalizability needs to be verified across diverse LLM architectures and datasets.