This paper calls for a rethinking of how human-AI collaborative systems are evaluated and proposes a more sophisticated, multidimensional approach. We analyze the "Genetic Car Designer," a human-AI collaborative system, through a large-scale field study with 808 participants and a controlled laboratory study with 12 participants. Participants who received design proposals generated by an intelligent system based on MAP-Elites showed greater cognitive and behavioral engagement and produced higher-quality designs than those who received random proposals. We demonstrate that existing evaluation methods focused solely on behavioral and design-quality metrics fail to capture the full spectrum of user engagement. We argue that the evaluation of human-AI systems should treat the design process holistically, accounting for the designer's evolving emotional, behavioral, and cognitive states, and that intelligent systems should be regarded as core elements of the user experience rather than mere backend tools.