Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

Created by
  • Haebom

Authors

Francesco Fabbri, Gustavo Penha, Edoardo D'Amico, Alice Wang, Marco De Nadai, Jackie Doremus, Paul Gigioli, Andreas Damianou, Oskar Stal, Mounia Lalmas.

Outline

This paper proposes a novel framework for evaluating the quality of personalized recommendations for long-form audio such as podcasts. Existing offline metrics suffer from exposure bias, while online methods such as A/B testing are costly and operationally constrained. To address these issues, the authors use a large language model (LLM) as an offline evaluator. Natural-language user profiles are generated from 90 days of listening history, giving the LLM high-level, semantically rich context for judging how well a recommended episode matches a user's interests. This profile-based representation reduces input complexity and improves interpretability; the LLM then makes fine-grained pointwise and pairwise judgments based on profile-to-episode matching. In a controlled study with 47 participants, the framework achieved high agreement with human judgments, matching or outperforming variants that use raw listening histories. The framework thus enables efficient, profile-based offline evaluation for iterative testing and model selection in recommender systems.
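The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`summarize_history`, `judge_pointwise`, `judge_pairwise`) are hypothetical, and a simple keyword-overlap scorer stands in for the real LLM call.

```python
def summarize_history(episodes):
    """Condense a listening history into a short natural-language profile
    (the paper generates such profiles from 90 days of history)."""
    topics = sorted({topic for ep in episodes for topic in ep["topics"]})
    return "Listener interested in: " + ", ".join(topics)

def judge_pointwise(profile, episode, llm=None):
    """Score profile-to-episode match on a 1-5 scale.
    By default a keyword-overlap heuristic stands in for the LLM judge."""
    if llm is None:
        overlap = sum(t in profile for t in episode["topics"])
        return min(5, 1 + overlap)
    # With a real model, one would prompt it with the profile and episode:
    return llm(f"Rate 1-5 how well this episode fits the profile.\n"
               f"Profile: {profile}\nEpisode: {episode['title']}")

def judge_pairwise(profile, ep_a, ep_b, llm=None):
    """Return whichever episode is judged the better match for the profile."""
    a = judge_pointwise(profile, ep_a, llm)
    b = judge_pointwise(profile, ep_b, llm)
    return ep_a if a >= b else ep_b

# Toy data to exercise the sketch.
history = [{"title": "AI Weekly", "topics": ["machine learning", "startups"]},
           {"title": "Deep Dive ML", "topics": ["machine learning"]}]
profile = summarize_history(history)
rec = {"title": "Intro to Neural Nets", "topics": ["machine learning"]}
print(profile)
print(judge_pointwise(profile, rec))
```

In practice `llm` would wrap an actual model call; the structure (profile in, pointwise score or pairwise preference out) is what the paper's evaluation framework relies on.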

Takeaways, Limitations

Takeaways:
Presents a novel framework that uses an LLM to evaluate podcast recommendation systems efficiently and interpretably.
Leverages user profiles to improve the LLM's judgment accuracy and interpretability.
Overcomes the cost and operational constraints of A/B testing through offline evaluation.
Provides an efficient evaluation loop for iterative testing and model selection.
Limitations:
The LLM's performance may depend on the quality of the generated user profiles.
The study size is small (47 participants).
Generalizability across podcast genres and user populations remains to be verified.
LLM biases may affect the evaluation results.