Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Input-Time Scaling

Created by
  • Haebom

Author

Rapheal Huang (Yuming), Weilong Guo

Outline

This paper presents Input-Time Scaling, a new scaling paradigm for large language models (LLMs) that complements existing approaches such as scaling data and training compute and scaling inference-time compute. The method uses meta-knowledge to refine inputs with various strategies, and the authors identify a phenomenon they call "train-test co-design," in which input strategies are applied during both training and testing. Interestingly, they find that low-quality datasets sometimes perform better and that peak performance can be reached with as few as 1,000 randomly selected examples. This contradicts the common assumption of "garbage in, garbage out": training with more high-quality data does not always improve performance. It is also consistent with the "Less is More" observation that strong reasoning capabilities can emerge from as few as 1,000 examples. In experiments with the Qwen2.5-32B-Instruct model, the method achieved state-of-the-art performance (76.7%) on AIME24 and AIME25, and combining three models via majority vote reached 80% on AIME25. With the DeepSeek-R1-Distill-Qwen-32B model, it achieved 86.7% on AIME24 and 76.7% on AIME25. The authors plan to open-source the dataset, data pipeline, evaluation results, and checkpoints.
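The majority-vote step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `answers` values and the idea of collecting one final answer per model are assumptions for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across model outputs.

    Ties are broken by first occurrence, since Counter.most_common
    preserves insertion order for equal counts (Python 3.7+).
    """
    if not answers:
        raise ValueError("no answers to vote on")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from three independently trained models
# on a single AIME-style problem (values are illustrative only).
votes = ["204", "204", "117"]
print(majority_vote(votes))  # -> "204"
```

In practice each model's chain-of-thought output would be parsed down to a final numeric answer before voting; the vote itself is this simple plurality count.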

Takeaways, Limitations

Takeaways:
A new input-time scaling paradigm that complements existing data/training-scale scaling and inference-time scaling
Discovery of the importance of train-test co-design
Confirmation that low-quality datasets can outperform high-quality ones (countering the "garbage in, garbage out" assumption)
Consistency with the "Less is More" phenomenon (strong reasoning achievable even with small amounts of data)
Achieving SOTA performance on AIME24 and AIME25
Planned open-source release of datasets, code, and related artifacts
Limitations:
Results so far cover only two specific models (Qwen2.5-32B-Instruct, DeepSeek-R1-Distill-Qwen-32B), so further research on generalizability is needed.
Further validation is needed to determine whether the benefits of input-time scaling carry over to all LLMs.
Further analysis of the specific mechanisms behind train-test co-design is needed.
Open source release is not yet complete.