Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

From Metrics to Meaning: Time to Rethink Evaluation in Human-AI Collaborative Design

Created by
  • Haebom

Author

Sean P. Walton, Ben J. Evans, Alma AM Rahat, James Stovold, Jakub Vincalek

Outline

This paper calls for rethinking how human-AI collaborative systems are evaluated and proposes a more nuanced, multidimensional approach. We analyze the "Genetic Car Designer," a human-AI collaborative design system, through a large-scale field study with 808 participants and a controlled laboratory study with 12 participants. Participants who received design proposals generated by a MAP-Elites-based intelligent system demonstrated greater cognitive and behavioral engagement and produced higher-quality design outcomes than those who received random design proposals. We demonstrate that existing evaluation methods focusing solely on behavioral and design-quality metrics fail to capture the full spectrum of user engagement. We argue that the evaluation of human-AI systems should be holistic, tracking the designer's evolving emotional, behavioral, and cognitive states throughout the design process, and that intelligent systems should be treated as core elements of the user experience rather than mere backend tools.
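The paper's intelligent system is built on MAP-Elites, a quality-diversity algorithm that maintains an archive of diverse, high-performing solutions rather than a single optimum. As a rough illustration only (not the paper's implementation; `evaluate`, `descriptor`, `mutate`, and `random_solution` are hypothetical placeholders), a minimal MAP-Elites loop keeps one elite per cell of a discretized behavior space:

```python
import random

def map_elites(evaluate, descriptor, mutate, random_solution,
               bins=10, iterations=1000):
    """Minimal MAP-Elites sketch: keep the best solution found in each
    cell of a discretized behavior-descriptor grid."""
    archive = {}  # cell index tuple -> (fitness, solution)
    for _ in range(iterations):
        if archive:
            # Select a random elite and mutate it.
            _, parent = random.choice(list(archive.values()))
            candidate = mutate(parent)
        else:
            candidate = random_solution()
        fitness = evaluate(candidate)
        # Map the descriptor (values assumed in [0, 1]) to a grid cell.
        cell = tuple(min(int(d * bins), bins - 1)
                     for d in descriptor(candidate))
        # Replace the cell's elite only if the candidate is fitter.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)
    return archive
```

The archive then serves as a palette of diverse proposals, one elite per region of the design space, which is what allows such a system to suggest varied yet high-quality designs to the user instead of converging on a single solution.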

Takeaways, Limitations

Takeaways:
We highlight the limitations of existing simple, metric-centric approaches to evaluating human-AI collaborative systems and argue for a multidimensional evaluation method that considers emotional, behavioral, and cognitive aspects.
We empirically demonstrate that MAP-Elites-based intelligent systems are effective in improving user engagement and design quality.
We emphasize that intelligent systems should be treated as a core element of the user experience in human-AI systems.
Limitations:
Since the studied system is limited to a specific type of design task (2D automobile design), further research is needed to determine its generalizability to other types of design tasks.
The laboratory study had a small sample (n=12), so the generalizability of those results should be interpreted with caution.
Further research is needed on specific indicators and measurement methods to comprehensively assess emotional, behavioral, and cognitive aspects.