Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Hierarchical Budget Policy Optimization for Adaptive Reasoning

Created by
  • Haebom

Author

Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang

Outline

This paper presents Hierarchical Budget Policy Optimization (HBPO), a framework that addresses the inefficiency of large reasoning models, which consistently over-reason even though computational demands vary with problem complexity. Unlike existing methods that rely on fixed constraints or discrete mode selection, HBPO partitions the exploration space into budget-constrained tiers (512-2560 tokens) with differentiated reward structures, preserving both efficiency and reasoning performance. Whereas conventional length penalties tend to push extended reasoning paths out of the exploration space, HBPO maintains exploration diversity through hierarchical sampling and budget-aware rewards, so the model learns to reason at length only when the problem requires it. Across four reasoning benchmarks, HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14%, and it automatically adapts reasoning depth to problem complexity. In conclusion, the paper demonstrates that appropriate hierarchical training can optimize reasoning efficiency and performance simultaneously.
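To make the mechanism concrete, below is a minimal Python sketch of how budget tiers and budget-aware rewards could interact. The tier values follow the 512-2560 token range mentioned above, but the function names (`sample_budget_tier`, `budget_aware_reward`), the reward shape, and the penalty coefficient `alpha` are illustrative assumptions, not the authors' implementation.

```python
import random

# Illustrative budget tiers spanning the 512-2560 token range described above.
BUDGET_TIERS = [512, 1024, 2048, 2560]

def sample_budget_tier(rng: random.Random) -> int:
    """Hierarchical sampling (simplified): each rollout is assigned one budget
    tier, so exploration covers both short and long reasoning instead of
    collapsing toward a single length."""
    return rng.choice(BUDGET_TIERS)

def budget_aware_reward(correct: bool, tokens_used: int, budget: int,
                        alpha: float = 0.5) -> float:
    """Assumed reward shape, not the paper's exact formula: a correct answer
    earns full reward, minus a penalty that grows only when the response
    exceeds its assigned budget."""
    base = 1.0 if correct else 0.0
    overshoot = max(0, tokens_used - budget) / budget
    return base - alpha * overshoot

# Usage: score a few hypothetical rollouts, each under its own budget tier.
rng = random.Random(0)
rollouts = [(True, 480), (True, 1900), (False, 2600)]  # (was_correct, tokens_used)
for correct, tokens in rollouts:
    budget = sample_budget_tier(rng)
    print(budget, round(budget_aware_reward(correct, tokens, budget), 3))
```

In the actual framework these per-rollout rewards would feed a policy-gradient update over groups sampled per tier; the sketch only illustrates how assigning a budget to each rollout differentiates the reward signal.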

Takeaways, Limitations

Takeaways:
  • Demonstrates that a model can learn to dynamically adjust its reasoning depth to problem complexity, yielding more efficient inference.
  • Overcomes the limitations of simple length-penalty methods and confirms that reasoning efficiency and accuracy can be improved simultaneously.
  • Hierarchical partitioning of the exploration space maintains search diversity while preventing excessive reasoning.
  • Suggests that reasoning efficiency and capability are not inherently in conflict.
Limitations:
  • Further research is needed to optimize HBPO's hierarchy structure and budget settings.
  • The results are reported on a specific set of benchmarks; generalization to other types of reasoning tasks remains to be verified.
  • Whether the 512-2560 token budget range is appropriate for all problems requires further review.