Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

Created by
  • Haebom

Authors

Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe

Outline

This paper aims to build an automated system that gives authors useful feedback during peer review. To address reviewers' time constraints, it proposes four key aspects that make reviews useful to authors: actionability, grounding and specificity, verifiability, and helpfulness. To evaluate these aspects and support model development, the authors introduce the RevUtil dataset, which contains 1,430 human-labeled review comments and 10,000 synthetically labeled ones; the synthetic data also includes rationales explaining the score on each aspect. Using RevUtil, they benchmark fine-tuned models that score these aspects and generate rationales. Experiments show that the fine-tuned models reach agreement with humans comparable to, and in some cases exceeding, powerful closed-source models such as GPT-4o. However, machine-generated reviews generally score lower than human-written ones on all four aspects.
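As a rough illustration of this evaluation setup (not the authors' actual pipeline), the sketch below scores one review comment on the four aspects with any text-in/text-out LLM callable and then measures agreement between model and human scores; the prompt wording, the 1-5 scale, and the use of weighted kappa are assumptions made here for illustration.

```python
# Minimal sketch, assuming a generic LLM callable; prompt wording and the
# 1-5 scale are illustrative assumptions, not the paper's configuration.
from sklearn.metrics import cohen_kappa_score

ASPECTS = [
    "actionability",
    "grounding and specificity",
    "verifiability",
    "helpfulness",
]

def score_comment(llm, comment: str) -> dict:
    """Ask an LLM to rate one review comment on each aspect (1-5)."""
    scores = {}
    for aspect in ASPECTS:
        prompt = (
            f"Rate the following peer-review comment for {aspect} on a "
            f"1-5 scale. First give a one-sentence rationale, then the "
            f"score on its own final line.\n\nComment: {comment}"
        )
        reply = llm(prompt)
        # Take the last line of the reply as the numeric score.
        scores[aspect] = int(reply.strip().splitlines()[-1])
    return scores

def agreement(human_scores: list, model_scores: list) -> float:
    """Chance-corrected agreement between human and model scores.

    Quadratic weights penalize large disagreements more than off-by-one.
    """
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")
```

A weighted kappa is one standard way to quantify how closely model scores track human labels; the paper's exact agreement metric and prompting setup may differ.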

Takeaways, Limitations

Takeaways:
  • Proposes four key aspects for evaluating review utility (Actionability, Grounding & Specificity, Verifiability, and Helpfulness), advancing the development of automated peer-review feedback systems.
  • Provides the RevUtil dataset, supporting further research in this area.
  • Shows that fine-tuned models can reach agreement with human judgments comparable to strong closed-source models.
Limitations:
  • Because the models were trained largely on synthetic labels, their generalization to real review data still needs verification.
  • There is little in-depth analysis of why machine-generated reviews score lower than human-written ones.
  • Aspects of review utility beyond these four may also be important.