This paper addresses the challenges of evaluating the conversational performance of end-to-end speech conversation models, such as GPT-4o-audio, and proposes WavReward, a novel evaluation model designed to meet these challenges. Built on an audio language model, WavReward can assess both the IQ and EQ of speech conversation systems, and it leverages reinforcement learning algorithms with multi-sample feedback to build a specialized evaluator. Specifically, it is trained on ChatReward-30K, a preference dataset of 30,000 samples covering a variety of scenarios, including text-based chats, instruction chats with specified acoustic attributes, and implicit chats. Experimental results show that WavReward outperforms existing state-of-the-art evaluation models across multiple speech conversation scenarios, raising objective accuracy from 53.4% (Qwen2.5-Omni) to 91.5%, and it also leads in subjective A/B testing by a margin of 83%. Ablation studies validate the necessity of each component of WavReward.
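To make the training setup concrete, the sketch below shows one common way an evaluator can be fit to a preference dataset such as ChatReward-30K: a Bradley-Terry-style pairwise loss that pushes the score of the preferred response above that of the dispreferred one. This is a minimal illustration under stated assumptions, not the paper's actual objective; WavReward's reinforcement learning with multi-sample feedback may differ, and the `evaluator` scores here are hypothetical stand-ins for an audio-language-model scoring head.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: maximize the log-probability that the
    preferred response outscores the dispreferred one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy batch: scalar scores an evaluator might assign to preferred vs.
# dispreferred speech responses for the same query (hypothetical values).
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)

loss = pairwise_preference_loss(chosen, rejected)
loss.backward()  # gradients would flow back into the evaluator's parameters
```

In a full pipeline, the two score tensors would come from forwarding each (query, response) audio pair through the evaluator model; the loss above is only the preference-fitting step, on top of which an RL procedure like the one the paper describes could refine the evaluator further.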