Daily Arxiv

This page collects papers on artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Predicting LLM Reasoning Performance with Small Proxy Model

Created by
  • Haebom

Author

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jamin Shin

Outline

To address the high cost of large-scale language model pre-training, this paper proposes optimizing pre-training datasets with small proxy models. Because strong reasoning emerges only at large scale, the authors introduce rBridge, showing that small proxy models (under 1B parameters) can effectively predict the reasoning performance of large models by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge weights the negative log-likelihood by task alignment and uses reasoning traces from frontier models as gold labels. Experiments show that rBridge reduces dataset-ranking costs by over 100x compared to conventional methods, achieves the strongest correlation across six reasoning benchmarks for models from 1B to 32B parameters, and transfers its predictive relationships zero-shot across pre-training datasets at the 1B-7B scale.
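The core proxy metric described above (a task-alignment-weighted negative log-likelihood scored on gold reasoning traces, then used to rank candidate datasets) can be sketched minimally as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the per-token weight form, and the toy scores are all assumptions, since the summary does not specify how alignment weights are computed.

```python
import numpy as np

def rbridge_score(token_nlls, alignment_weights):
    """Hypothetical sketch of a task-alignment-weighted NLL (lower is better).

    token_nlls: per-token negative log-likelihoods of the small proxy
        model evaluated on gold reasoning traces from a frontier model.
    alignment_weights: per-token weights reflecting relevance to the
        target task (the paper's exact weighting scheme is not given here).
    """
    nll = np.asarray(token_nlls, dtype=float)
    w = np.asarray(alignment_weights, dtype=float)
    # Weighted average: tokens more aligned with the task count more.
    return float((w * nll).sum() / w.sum())

def rank_datasets(scores):
    """Rank candidate pre-training datasets by proxy score, ascending:
    a lower weighted NLL predicts stronger large-model reasoning."""
    return sorted(scores, key=scores.get)
```

A toy usage: scoring two candidate datasets with illustrative numbers, the one whose proxy model assigns lower weighted NLL to the gold traces ranks first.

```python
scores = {
    "web_mix": rbridge_score([2.1, 1.8, 2.4], [1.0, 0.5, 1.0]),
    "math_heavy": rbridge_score([1.2, 1.0, 1.5], [1.0, 0.5, 1.0]),
}
rank_datasets(scores)  # ["math_heavy", "web_mix"]
```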

Takeaways, Limitations

Takeaways:
Demonstrates that small proxy models can be used to optimize datasets for training the reasoning capabilities of large language models.
rBridge reduces dataset-ranking costs significantly (by over 100x).
Establishes predictive relationships for reasoning performance across models of different scales.
Predictive relationships transfer zero-shot across pre-training datasets.
Limitations:
The summary lacks specific technical details of rBridge (e.g., how task alignment is computed and how the weighting is applied).
Experimental results are limited to specific benchmarks and model scales.
A single paper is not sufficient to establish that the method generalizes to other types of models.