This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Created by
Haebom
Author
Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Outline
This paper presents a reinforcement learning-based approach to improving the creative writing ability of small language models (SLMs). Within the Reinforcement Learning from AI Feedback (RLAIF) framework, the authors study two AI-driven reward strategies, targeting Chinese greeting generation with a 7-billion-parameter SLM. The first strategy uses a reward model (RM) trained on high-quality preference data generated via a multi-agent rejection-sampling framework; the second uses a principle-based LLM-as-a-Judge optimized through adversarial training with a reflection mechanism. Experiments show that both approaches significantly improve creative output over baseline models, but the principle-based LLM-as-a-Judge delivers superior generation quality along with better training efficiency and reduced reliance on human-annotated data. The automated evaluation method shows high agreement with human judgment.
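To make the second strategy concrete, below is a minimal sketch of how a principle-based LLM-as-a-Judge can produce a scalar reward for RLAIF. The principles, prompt format, and scoring scale are illustrative assumptions, not the paper's actual rubric, and `call_judge_llm` is a hypothetical stand-in for a real judge-LLM API (stubbed here so the example runs).

```python
# Hedged sketch: principle-based LLM-as-a-Judge reward for RLAIF.
# All names and the rubric below are illustrative, not from the paper.

PRINCIPLES = [
    "The greeting is warm and contextually appropriate.",
    "The wording is creative rather than formulaic.",
    "The text is fluent and free of errors.",
]

def call_judge_llm(prompt: str) -> int:
    # Stub: a real system would send `prompt` to a judge LLM and parse
    # the integer score from its reply. Fixed value keeps this runnable.
    return 3

def judge_reward(candidate: str, scale: int = 5) -> float:
    """Score a candidate greeting against each principle and average,
    normalizing to [0, 1] for use as an RL reward signal."""
    scores = []
    for principle in PRINCIPLES:
        prompt = (
            f"Principle: {principle}\n"
            f"Candidate greeting: {candidate}\n"
            f"Rate adherence from 1 to {scale}. Answer with one integer."
        )
        scores.append(call_judge_llm(prompt))
    return sum(scores) / (len(scores) * scale)

reward = judge_reward("Wishing you a year of brightness and joy!")
```

Because the judge is driven by written principles rather than a trained reward model, updating the rubric requires no new preference data, which is the training-efficiency advantage the paper attributes to this strategy.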
Takeaways, Limitations
•
Takeaways:
◦
An efficient RLAIF framework for improving the creative writing ability of small-scale language models is presented.
◦
A scalable method for training creative SLMs that reduces dependence on human-annotated data is presented.
◦
The effectiveness of a principle-based LLM-as-a-Judge strategy is validated.
◦
Automated evaluation metrics show a high correlation with human evaluations.
•
Limitations:
◦
The approach is specialized to Chinese greeting generation, so further research is needed to determine its generalizability to other languages and tasks.
◦
Since the results are reported for a 7-billion-parameter SLM, generalizability to SLMs of other sizes remains to be verified.
◦
Complete objectivity of the automated evaluation metrics used cannot be guaranteed.