This paper identifies challenges in evaluating generative machine learning models and proposes GrandJury, a novel evaluation protocol that addresses them. It highlights the limitations of existing static, benchmark-driven evaluation methods, which fail to reflect dynamic user needs or changing circumstances. GrandJury combines time-decayed aggregation, full traceability, dynamic and transparent application of task criteria, and multi-evaluator human judgment to enable multi-disciplinary, accountable evaluation. The authors provide an open-source implementation (the grandjury PyPI package) along with LLM inference results that illustrate both the need for GrandJury and its methodology. Together, these contributions present a new paradigm for evaluating machine learning outputs that have no single correct answer.
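
To make the core mechanism concrete, the sketch below shows one way time-decayed aggregation over multiple evaluators' verdicts could work: each recorded judgment carries an evaluator identity (for traceability) and a timestamp, and older verdicts are exponentially down-weighted so the aggregate tracks current rather than historical consensus. This is a minimal illustration under assumed design choices (exponential decay with a configurable half-life, scores on a 0-1 scale); the names, data structures, and parameters are hypothetical and do not represent the grandjury package's actual API.

```python
# Illustrative sketch of time-decayed, multi-evaluator aggregation.
# NOTE: all names and parameters here are hypothetical and are not
# taken from the grandjury package.
import math
from dataclasses import dataclass


@dataclass
class Verdict:
    evaluator_id: str   # who issued the judgment (supports traceability)
    score: float        # judgment on a 0-1 scale
    age_days: float     # how long ago the judgment was recorded


def aggregate(verdicts: list[Verdict], half_life_days: float = 30.0) -> float:
    """Weight each verdict by exp(-decay * age) so that recent judgments
    dominate the aggregate as criteria and user needs evolve."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    decay = math.log(2) / half_life_days
    weights = [math.exp(-decay * v.age_days) for v in verdicts]
    total = sum(weights)
    return sum(w * v.score for w, v in zip(weights, verdicts)) / total


if __name__ == "__main__":
    verdicts = [
        Verdict("reviewer_a", score=0.9, age_days=2),    # recent, high weight
        Verdict("reviewer_b", score=0.4, age_days=60),   # old, heavily decayed
        Verdict("reviewer_c", score=0.7, age_days=10),
    ]
    print(f"time-decayed aggregate: {aggregate(verdicts):.3f}")
```

In this framing, re-running the aggregation at a later date with the same stored verdicts yields a different result, which is the point: the evaluation is a living consensus rather than a fixed benchmark score.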