This paper highlights a key limitation of existing approaches to personalizing large language models (LLMs): they assume that user preferences are static and consistent across tasks, whereas actual user preferences change dynamically depending on context. To examine this, we present CUPID, a benchmark of 756 human-curated interaction sessions between users and LLM-based chat assistants. In each session, the user makes a request in a specific context and expresses their preference through multiple rounds of feedback. Given a new user request and the user's previous interaction sessions, CUPID assesses whether an LLM can infer the preference relevant to that request and generate a response that satisfies it. Our evaluation of ten open-source and proprietary LLMs reveals that even state-of-the-art models struggle to infer preferences from multi-turn interactions and to identify which prior contexts are relevant to a new request (precision below 50%, recall below 65%). These findings underscore the need to advance LLM capabilities for context-sensitive, personalized interaction, and we propose CUPID as a resource for driving such advances.
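To make the reported context-identification metrics concrete, the sketch below shows one way precision and recall over retrieved prior sessions could be computed. It is an illustrative assumption, not the paper's released evaluation code; the function name and inputs are hypothetical.

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Precision/recall for identifying which prior sessions are relevant.

    retrieved_ids: session IDs the model judged relevant to the new request
    relevant_ids: gold-standard session IDs annotated as relevant
    (Both are hypothetical field names used only for this sketch.)
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Example: the model retrieves sessions {2, 5, 7} but only {2, 9} are relevant.
print(context_precision_recall([2, 5, 7], [2, 9]))  # (0.333..., 0.5)
```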