Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Created by
  • Haebom

Authors

Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim

Outline

This paper argues that existing approaches to personalizing large language models (LLMs) wrongly assume user preferences are static and consistent across tasks, whereas actual preferences shift dynamically with context. To assess this, the authors present CUPID, a benchmark of 756 human-curated interaction sessions between users and an LLM-based chat assistant. In each session, the user makes a request in a specific context and conveys their preferences through multiple rounds of feedback. Given a new user request together with the user's previous sessions, CUPID evaluates whether an LLM can infer the preference relevant to that request and generate a response that satisfies it. An evaluation of ten open-source and proprietary LLMs shows that even state-of-the-art models struggle to infer preferences from multi-turn interactions and to identify which prior contexts are relevant to a new request (under 50% precision and under 65% recall). The study highlights the need to improve LLMs' capabilities for context-sensitive, personalized interaction and proposes CUPID as a resource for driving such improvements.
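To make the evaluation setup concrete, below is a minimal sketch of the per-example protocol as described above: identify relevant prior sessions, infer the implied preference, then answer the new request. All names here (`Session`, `select_relevant_sessions`, `infer_preference`, `respond`, `judge`) are hypothetical stand-ins for illustration, not the paper's actual code; the reported precision/recall figures correspond to the context-identification step.

```python
from dataclasses import dataclass

@dataclass
class Session:
    context: str          # situation in which the request was made
    request: str          # the user's original request
    feedback: list[str]   # multi-turn feedback revealing a preference

def evaluate_example(model, history: list[Session], new_request: str,
                     gold_relevant: set[int], judge) -> dict:
    """One CUPID-style evaluation step (hypothetical interface)."""
    # (1) Context identification: which prior sessions are relevant
    #     to the new request? Scored against human-annotated gold labels.
    predicted = set(model.select_relevant_sessions(history, new_request))
    tp = len(predicted & gold_relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_relevant) if gold_relevant else 0.0

    # (2) Preference inference and (3) response generation.
    preference = model.infer_preference(history, new_request)
    response = model.respond(new_request, preference)

    # Whether the response satisfies the inferred preference is scored
    # by an external judge; `judge` is a stand-in for that step.
    satisfied = judge(response, history, gold_relevant)

    return {"precision": precision, "recall": recall, "satisfied": satisfied}
```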

Takeaways, Limitations

Takeaways:
  • Emphasizes the need for LLMs to support context-sensitive, personalized interaction tailored to the user's situation.
  • Presents CUPID, a new benchmark for assessing LLMs' contextual awareness and preference-inference abilities.
  • Empirically demonstrates that state-of-the-art LLMs struggle to infer context-dependent preferences and to identify relevant prior contexts.
Limitations:
  • The benchmark's size (756 sessions) may need to be expanded in future work.
  • Coverage of diverse user types and situations could be more comprehensive.
  • The set of evaluated LLMs (ten models) may be limited.