Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Using cognitive models to reveal value trade-offs in language models

Created by
  • Haebom

Authors

Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

Outline

Noting the lack of tools for interpreting value trade-offs in large language models (LLMs), this work evaluates LLMs' value trade-offs using cognitive models from cognitive science. Specifically, the authors use a cognitive model of polite language use to analyze models' reasoning effort and the dynamics of reinforcement learning (RL) post-training. They find that models' default behavior prioritizes informational utility over social utility, and that this pattern shifts in predictable ways when models are prompted to prioritize particular goals. Studying training dynamics further reveals that the choice of base model and pre-training data strongly influences how values change. The proposed framework can help identify value trade-offs across model types, generate hypotheses about social behaviors such as sycophancy, and inform training methods that control the balance between values during model development.
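
As an illustration of the kind of trade-off such a cognitive model captures, the sketch below follows the standard Rational Speech Act treatment of polite language, in which a speaker's utility is a weighted mixture of informational and social utility. This is a minimal, hypothetical example: the weight phi, the toy utterances, and the utility values are placeholders for illustration, not the paper's actual model or implementation.

```python
import math

# Hypothetical speaker utility: a weighted trade-off between informational
# and social utility, with weight phi in [0, 1]. (Illustrative sketch,
# not the paper's implementation.)
def speaker_utility(informational: float, social: float, phi: float) -> float:
    return phi * informational + (1.0 - phi) * social

# Softmax speaker: chooses utterances in proportion to exp(alpha * utility).
def speaker_choice(utterances: dict, phi: float, alpha: float = 5.0) -> dict:
    scores = {u: math.exp(alpha * speaker_utility(inf, soc, phi))
              for u, (inf, soc) in utterances.items()}
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}

# Toy utterances for describing a mediocre performance; each maps to
# (informational utility, social utility) for a fixed true state.
utterances = {
    "It was terrible":   (1.0, 0.0),  # truthful but unkind
    "It wasn't amazing": (0.7, 0.5),  # hedged
    "It was amazing":    (0.0, 1.0),  # kind but untruthful
}

# phi near 1 -> informationally driven (blunt); phi near 0 -> socially driven (kind).
print(speaker_choice(utterances, phi=0.9))
print(speaker_choice(utterances, phi=0.2))
```

With phi = 0.9 the truthful utterance dominates, while with phi = 0.2 the kind one does, mirroring the informational-versus-social trade-off described above.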

Takeaways, Limitations

Takeaways:
  • A new framework for assessing the value trade-offs of LLMs.
  • Analysis of how a model's reasoning effort and training dynamics affect its value balance.
  • Suggests that the balance between values can be controlled through model training methods.
Limitations:
  • Lacks details about the specific models and training methods used.
  • Generalizability to other social behaviors may be limited.
  • Further research is needed on how to quantitatively measure value trade-offs.