Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation

Created by
  • Haebom

Authors

Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu

Outline

This paper focuses on assessing and improving the quality of AI-generated text. With the rapid increase in the volume of AI-generated text, evaluating "quality" beyond mere grammatical accuracy and coherence has become increasingly important. The authors present the Writing Quality Benchmark (WQ), a set of 4,729 writing-quality judgments consolidated from five existing datasets, and show that several baselines, including state-of-the-art LLMs, do not significantly outperform random chance on it. To address this, they train Writing Quality Reward Models (WQRM) of various sizes, which reach 74% accuracy on the WQ benchmark and generalize robustly to four out-of-distribution test sets. They further demonstrate that WQRM can be used at test time to generate and rank candidate revisions, selecting higher-quality outputs than the initial drafts. In a human evaluation with nine professional writers, the WQRM-based selection method produced writing preferred by the experts 66% of the time overall, and 72.2% of the time when the reward gap exceeded one point. The authors release the dataset and models publicly to support further work on AI writing systems.
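The candidate-generation-and-selection step described above amounts to best-of-N reranking with a writing-quality reward model. Below is a minimal sketch of that selection loop; the `select_best_revision` function, the `min_margin` parameter, and the toy scorer are illustrative assumptions, not the paper's actual WQRM, which would be plugged in via the `score` callable.

```python
# Sketch of WQRM-style best-of-N revision selection (test-time computation).
# The scorer below is a placeholder heuristic, NOT the actual WQRM; a trained
# reward model would be passed in through the `score` callable instead.

from typing import Callable, List, Tuple


def select_best_revision(
    draft: str,
    candidates: List[str],
    score: Callable[[str], float],
    min_margin: float = 0.0,
) -> Tuple[str, float]:
    """Rank candidate revisions with a quality reward model and keep the best,
    falling back to the original draft unless a candidate beats it by at least
    `min_margin` reward points."""
    draft_score = score(draft)
    best_text, best_score = draft, draft_score
    for cand in candidates:
        s = score(cand)
        if s > best_score and (s - draft_score) >= min_margin:
            best_text, best_score = cand, s
    return best_text, best_score


# Placeholder scorer (illustration only): rewards lexical variety.
def toy_score(text: str) -> float:
    words = text.split()
    return len(set(words)) / max(len(words), 1)


if __name__ == "__main__":
    draft = "The cat sat on the mat. The cat sat on the mat."
    candidates = [
        "A tabby cat settled onto the woven mat by the door.",
        "The cat sat on the mat again.",
    ]
    best, reward = select_best_revision(draft, candidates, toy_score, min_margin=0.05)
    print(f"Selected ({reward:.2f}): {best}")
```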

Takeaways, Limitations

Takeaways:
  • Introduces a new benchmark (WQ) and reward model (WQRM) for assessing the quality of AI-generated text.
  • WQRM outperforms existing models, including state-of-the-art LLMs, at writing quality assessment.
  • Shows that generating multiple candidate revisions and selecting among them with WQRM can improve the quality of AI-generated text.
  • The public release of the dataset and models encourages collaboration between academia and industry.
Limitations:
  • The WQ benchmark is still built from a limited range of datasets.
  • WQRM's performance is measured mainly through quantitative evaluation and may not fully capture qualitative aspects such as subtle word choice or style.
  • The human evaluation is relatively small in scale, so generalizability requires further study.
  • Subjective judgments about writing quality cannot be perfectly captured.