Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Post-training Large Language Models for Diverse High-Quality Responses

Created by
  • Haebom

Author

Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, Aldo Pacchiano

Outline

Reinforcement learning (RL) is widely used for post-training large language models (LLMs), but it tends to reduce the diversity of model outputs. Existing diversity-enhancing methods are limited: they operate only at inference time or target superficial differences. This paper proposes a novel training method, Diversity Quality Optimization (DQO), based on determinantal point processes (DPPs), to jointly optimize quality and semantic diversity. For each prompt, DQO samples a group of responses and embeds them, then measures diversity as the volume spanned by these embeddings, computed via the determinant of a kernel-based similarity matrix. DQO is flexible and can be combined with existing RL algorithms. Experiments on instruction-following, summarization, story generation, and reasoning tasks show that DQO significantly improves semantic diversity without compromising model quality.
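
To make the diversity term concrete, here is a minimal sketch of a DPP-style group diversity score: embed the sampled responses, build a cosine-similarity kernel over them, and score the group by the log-determinant of that kernel (larger values mean the responses span more semantic volume). The unit-normalized embeddings, the jitter term, and the use of the log-determinant rather than the raw determinant are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def dpp_diversity(embeddings: np.ndarray, jitter: float = 1e-6) -> float:
    """DPP-style diversity score for one group of responses.

    embeddings: (n, d) array, one row per sampled response.
    Returns the log-determinant of a kernel similarity matrix;
    larger values indicate the responses cover more semantic volume.
    """
    # Unit-normalize so kernel entries are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T                       # (n, n) similarity matrix
    K += jitter * np.eye(len(K))      # keep the matrix positive definite
    # log-det via Cholesky is numerically stabler than np.linalg.det.
    L = np.linalg.cholesky(K)
    return 2.0 * float(np.sum(np.log(np.diag(L))))

# Example: near-duplicate responses score lower than varied ones.
rng = np.random.default_rng(0)
varied = rng.normal(size=(4, 16))
duplicated = np.tile(rng.normal(size=(1, 16)), (4, 1)) + 0.01 * rng.normal(size=(4, 16))
print(dpp_diversity(varied) > dpp_diversity(duplicated))  # True
```

In DQO, a group-level score of this kind is optimized jointly with a quality reward during RL fine-tuning; how the two objectives are weighted and combined is the paper's contribution and is not reproduced in this sketch.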

Takeaways, Limitations

DQO presents a novel training methodology that addresses the output-diversity problem of LLMs.
DQO has been shown to improve semantic diversity without compromising quality.
DQO can be easily applied to existing RL algorithms.
This paper demonstrates the effectiveness of DQO across various tasks.
Although the paper does not state specific limitations of DQO, DPP-based methods may add computational cost (e.g., determinant computation over response groups) and may require careful hyperparameter tuning.