This paper proposes a novel framework for evaluating the quality of personalized recommendations for long-form audio content, such as podcasts. Existing offline metrics suffer from exposure bias, while online methods, such as A/B testing, are costly and operationally constrained. To address these issues, we propose a method that uses a large language model (LLM) as an offline evaluator. Natural-language user profiles are generated from 90 days of listening history, providing the LLM with semantically rich context for judging how well a recommended episode matches a user's interests. This profile-based approach reduces input complexity and improves interpretability; the LLM then makes fine-grained pointwise and pairwise judgments based on profile-to-episode matching. In a controlled study with 47 participants, the proposed framework achieved high agreement with human judgments, matching or outperforming variants that use raw listening history. The framework enables efficient profile-based evaluation for iterative testing and model selection in recommender systems.
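To make the evaluation setup concrete, the following is a minimal sketch of how profile-to-episode judging prompts might be constructed for an LLM evaluator. All function names, prompt wording, and the 1-5 rating scale are illustrative assumptions, not the paper's exact implementation; the actual LLM call is omitted.

```python
# Hedged sketch: prompt construction for an LLM-as-judge offline evaluator
# of podcast recommendations. Prompt wording and the judging scale are
# assumptions for illustration, not the paper's exact setup.

def build_pointwise_prompt(profile: str, episode: str) -> str:
    """Ask the LLM to rate how well one episode matches a user profile."""
    return (
        "You are evaluating a podcast recommendation.\n"
        f"User profile (derived from 90 days of listening history):\n{profile}\n\n"
        f"Recommended episode:\n{episode}\n\n"
        "On a scale of 1-5, how well does this episode match the user's "
        "interests? Answer with a single integer."
    )

def build_pairwise_prompt(profile: str, episode_a: str, episode_b: str) -> str:
    """Ask the LLM to pick which of two candidate episodes fits better."""
    return (
        "You are comparing two podcast recommendations.\n"
        f"User profile (derived from 90 days of listening history):\n{profile}\n\n"
        f"Episode A:\n{episode_a}\n\n"
        f"Episode B:\n{episode_b}\n\n"
        "Which episode better matches the user's interests? Answer 'A' or 'B'."
    )

if __name__ == "__main__":
    profile = "Enjoys long-form interviews on AI research and startup culture."
    print(build_pointwise_prompt(profile, "A 2-hour interview with an ML researcher."))
```

The pointwise prompt supports absolute quality scoring of a single recommendation, while the pairwise prompt supports ranking comparisons between candidate episodes; both condition on the natural-language profile rather than raw listening history.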