Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, simply cite the source.

DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Created by
  • Haebom

Authors

Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang

Outline

Large language models (LLMs) have substantially improved their reasoning abilities through extensive training on massive datasets, but relying on additional data alone is becoming impractical. This paper highlights the need for models that improve their reasoning autonomously, without external supervision. It proposes Debate, Train, Evolve (DTE), a novel ground-truth-free training framework that uses multi-agent debate traces to evolve a single language model. It also introduces Reflect-Critique-Refine, a new prompting strategy that explicitly instructs agents to critique and refine their reasoning in order to improve debate quality. Extensive evaluations on six open-weight models across seven reasoning benchmarks show that the DTE framework achieves substantial gains, including an average accuracy improvement of 8.92% on the challenging GSM-PLUS dataset. Across all other benchmarks, DTE improves accuracy by an average of 5.8%, indicating strong cross-domain generalization.
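To make the loop concrete, below is a minimal, hypothetical sketch of how a debate-then-train cycle of this kind could be structured. The `generate` callable, the prompt wording, the majority-vote consensus rule, and the agent/round counts are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a Debate-Train-Evolve style loop (not the authors' code).
# `generate` stands in for any LLM call; consensus rule, round count, and prompt
# wording are assumptions made for illustration only.
from collections import Counter
from typing import Callable, List, Tuple


def debate(question: str,
           generate: Callable[[str], str],
           n_agents: int = 3,
           n_rounds: int = 2) -> Tuple[str, List[str]]:
    """Run a small multi-agent debate and return (consensus answer, debate traces)."""
    answers = [generate(f"Question: {question}\nAnswer with step-by-step reasoning:")
               for _ in range(n_agents)]
    traces = list(answers)
    for _ in range(n_rounds):
        new_answers = []
        for i in range(n_agents):
            peers = "\n\n".join(a for j, a in enumerate(answers) if j != i)
            # Reflect-Critique-Refine style prompt (wording is an assumption):
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer:\n{answers[i]}\n\n"
                f"Other agents' answers:\n{peers}\n\n"
                "Reflect on your reasoning, critique the answers above, "
                "and give a refined final answer."
            )
            new_answers.append(generate(prompt))
        answers = new_answers
        traces.extend(answers)
    # Majority vote stands in for consensus; no ground-truth labels are used.
    consensus, _ = Counter(a.strip() for a in answers).most_common(1)[0]
    return consensus, traces


def build_training_set(questions: List[str],
                       generate: Callable[[str], str]) -> List[dict]:
    """Collect consensus traces to fine-tune the single model on (the 'Train' step)."""
    data = []
    for q in questions:
        answer, debate_traces = debate(q, generate)
        data.append({"question": q, "target": answer, "traces": debate_traces})
    return data
```

The evolved model produced by fine-tuning on this data would then replace the agents in the next debate round, closing the self-evolution loop described in the paper.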

Takeaways, Limitations

Takeaways:
Proposes the DTE framework, which enhances the reasoning ability of a single language model through multi-agent debate without ground-truth labels.
Introduces the Reflect-Critique-Refine prompting strategy to improve debate quality.
Demonstrates strong performance and generalization, with an 8.92% accuracy improvement on the GSM-PLUS dataset and a 5.8% average improvement on the other benchmarks.
Contributes to reproducibility and dissemination of research by releasing open-source code and models.
Limitations:
The paper provides little information on the computational cost and training time of the DTE framework.
The specific principles and mechanisms by which the model improves are not explained in detail.
Further analysis of generalization performance on additional reasoning benchmarks is needed.