Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Logical Reasoning with Outcome Reward Models for Test-Time Scaling

Created by
  • Haebom

Authors

Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi

Outline

This paper presents a novel approach to improving the deductive reasoning ability of large language models (LLMs). Building on prior work that combines test-time scaling with outcome or process reward models, the authors propose outcome reward models (ORMs) specialized for deductive reasoning. To train the ORMs, they generate data via Chain-of-Thought (CoT) prompting with single and multiple samples, and additionally propose a novel "echo generation" technique that exploits the error tendencies of LLMs to produce further training data. This technique yields training data covering a wider variety of error types than conventional CoT sampling alone. Experimental results show that ORMs trained on CoT and echo-augmented data improve the performance of four different LLMs on the FOLIO, JustLogic, and ProverQA datasets.
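At test time, an ORM is typically used to rerank sampled reasoning chains and select the most promising one (best-of-N selection). The sketch below illustrates this pattern; `generate_cot` and `orm_score` are hypothetical placeholders for an LLM sampler and a trained outcome reward model, not the paper's actual implementation.

```python
# Minimal sketch of ORM-guided best-of-N selection at test time.
# Both helper functions are hypothetical stand-ins: a real system would
# call an LLM to sample chains and a trained ORM to score them.

def generate_cot(question: str, n: int) -> list[str]:
    # Placeholder: an LLM would sample n chain-of-thought candidates here.
    return [f"candidate reasoning {i} for: {question}" for i in range(n)]

def orm_score(question: str, candidate: str) -> float:
    # Placeholder: a trained ORM would return the estimated probability
    # that the candidate's final answer is correct. Dummy score here.
    return sum(map(ord, candidate)) % 97 / 97.0

def best_of_n(question: str, n: int = 8) -> str:
    # Sample n candidate chains, then keep the one the ORM scores highest.
    candidates = generate_cot(question, n)
    return max(candidates, key=lambda c: orm_score(question, c))
```

The key design point is that the ORM judges only the final outcome of each chain, so it can rerank any number of sampled chains without supervising intermediate steps.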

Takeaways, Limitations

Takeaways:
The paper presents novel outcome reward models (ORMs) and training techniques that improve LLM performance on deductive reasoning.
The echo generation technique overcomes the limitations of standard CoT sampling by producing training data that covers a wider variety of error types.
Performance improvements are experimentally verified for multiple LLMs on the FOLIO, JustLogic, and ProverQA datasets.
Limitations:
Further research is needed on the generalizability of the echo generation technique and its applicability to other types of reasoning problems.
The performance gains of the proposed ORMs may be limited to specific datasets or LLMs.
Additional data augmentation techniques may be needed to cover error types more comprehensively.