Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

StepWiser: Stepwise Generative Judges for Wiser Reasoning

Created by
  • Haebom

Authors

Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar

Outline

This paper proposes a process reward model that gives step-by-step feedback, addressing the problem of supervising the validity of intermediate reasoning steps in models that use multi-step reasoning strategies. Existing process reward models provide no explanations and rely on supervised learning over static datasets, which limits their generalization. The paper reframes stepwise reward modeling as a reasoning task rather than a classification task, and proposes a generative judge that reasons about the policy model's reasoning steps before delivering a verdict. The proposed model, StepWiser, is trained with reinforcement learning using the relative outcomes of rollouts. Compared with existing methods, it judges intermediate steps more accurately, can be used to improve the policy model during training, and improves inference-time search.
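To make "relative outcomes of rollouts" concrete, here is a minimal sketch of one way a per-step training signal could be derived. It is an illustration under assumptions, not the paper's exact procedure: the function names, the `rollout` callable, the number of rollouts `n`, and the `margin` threshold are all hypothetical. The idea is to estimate the success rate of completions before and after each step, and to mark a step as good when it does not lower that rate.

```python
from typing import Callable, List


def estimate_value(prefix: List[str],
                   rollout: Callable[[List[str]], bool],
                   n: int = 8) -> float:
    """Fraction of n rollouts from this prefix that end in a correct answer.

    `rollout` is a hypothetical callable: it completes the solution from
    the given prefix and returns True if the final answer is correct.
    """
    return sum(rollout(prefix) for _ in range(n)) / n


def label_steps(steps: List[str],
                rollout: Callable[[List[str]], bool],
                n: int = 8,
                margin: float = 0.0) -> List[bool]:
    """Label each step by comparing success rates before and after it.

    Assumed rule: a step is "good" if the estimated success rate after
    the step is at least the rate before it, plus `margin`.
    """
    labels = []
    prev = estimate_value([], rollout, n)  # value before any step
    for i in range(len(steps)):
        cur = estimate_value(steps[:i + 1], rollout, n)
        labels.append(cur - prev >= margin)
        prev = cur
    return labels
```

Labels of this kind could serve as the target that a judge's verdict is rewarded for matching during reinforcement learning. With a stochastic policy, `n` trades off estimation noise against rollout cost.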

Takeaways, Limitations

Takeaways:
Addresses the lack of explanation and poor generalization that limit existing process reward models.
Judges the validity of intermediate reasoning steps more accurately through generative judgment.
Improves the policy model during training and improves inference-time search.
Contributes to the performance and reliability of multi-step reasoning models.
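As an illustration of how a stepwise judge could guide inference-time search, the sketch below proposes one step at a time and resamples any step the judge rejects. This is an assumed, simplified loop rather than the search procedure evaluated in the paper; `propose`, `judge`, `is_done`, `max_steps`, and `max_retries` are hypothetical names standing in for the policy model, the judge, a stopping check, and search budgets.

```python
from typing import Callable, List


def judged_search(propose: Callable[[List[str]], str],
                  judge: Callable[[List[str], str], bool],
                  is_done: Callable[[List[str]], bool],
                  max_steps: int = 10,
                  max_retries: int = 3) -> List[str]:
    """Build a solution step by step, resampling steps the judge rejects."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidate = propose(steps)
        retries = 0
        # Resample while the judge rejects the candidate and budget remains.
        while not judge(steps, candidate) and retries < max_retries:
            candidate = propose(steps)
            retries += 1
        # If the budget runs out, the last candidate is kept as a fallback
        # so the search still makes progress.
        steps.append(candidate)
        if is_done(steps):
            break
    return steps
```

The retry budget bounds the extra compute spent per step. A larger budget gives the judge more chances to filter out a flawed step, at the cost of more policy and judge calls.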
Limitations:
The performance gains of StepWiser may be limited to specific problem domains.
Reinforcement learning-based training may increase computational cost and training time.
The generative judge may lack the ability to interpret the reasoning process.
Further validation of generalizability to complex real-world problems is needed.