This paper focuses on assessing and improving the quality of AI-generated text. With the volume of AI-generated text increasing rapidly, evaluating writing quality beyond mere grammatical accuracy and consistency has become increasingly important. We present the Writing Quality Benchmark (WQ), a set of 4,729 writing quality judgments consolidated from five existing datasets. Baseline evaluations show that several strong models, including state-of-the-art LLMs, do not significantly outperform a random baseline on WQ. To address this, we train Writing Quality Reward Models (WQRM) of various sizes, which generalize robustly to four out-of-distribution test sets and achieve 74% accuracy on the WQ benchmark. We further show that WQRM can be used at inference time to generate and rank candidate revisions, allowing the selection of higher-quality outputs than an initial draft. In a human evaluation with nine professional writers, writing samples selected with WQRM were preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap was greater than one point. We release our dataset and models to support the development of AI writing systems.
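
The inference-time procedure summarized above (generate several candidate revisions, score each with the reward model, keep the best) can be sketched generically as a best-of-N selection loop. The sketch below is illustrative only: `generate_revisions` and `score_quality` are hypothetical placeholders standing in for an LLM reviser and a WQRM-style scorer, not the paper's actual API.

```python
# Minimal sketch of best-of-N revision selection with a writing-quality reward model.
# `generate_revisions` and `score_quality` are hypothetical stand-ins for an LLM
# reviser and a WQRM-style scorer; the real paper's interfaces may differ.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredCandidate:
    text: str
    reward: float


def select_best_revision(
    draft: str,
    generate_revisions: Callable[[str, int], List[str]],  # produce n revised versions of the draft
    score_quality: Callable[[str], float],                 # assign a scalar writing-quality reward
    n_candidates: int = 8,
) -> ScoredCandidate:
    """Generate candidate revisions, score each one, and return the highest-reward text.

    The original draft is kept in the candidate pool, so under the reward model the
    selected output is never scored lower than the input.
    """
    candidates = [draft] + generate_revisions(draft, n_candidates)
    scored = [ScoredCandidate(text=c, reward=score_quality(c)) for c in candidates]
    return max(scored, key=lambda s: s.reward)
```

Spending more test-time compute here simply means increasing `n_candidates`: a larger pool gives the reward model more chances to find a revision it scores above the initial draft.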