In this paper, we present an open-source evaluation framework, the Haerae Evaluation Toolkit (HRET), to address the reproducibility problem in the performance evaluation of Korean large language models (LLMs). HRET integrates major Korean benchmarks, multiple inference backends, and diverse evaluation methods, and adopts a modular registry design that keeps Korean output handling consistent while allowing new datasets, methods, and backends to be integrated rapidly. Beyond standard accuracy metrics, it diagnoses morphological and semantic deficiencies in model outputs and suggests directions for improvement through Korean-specific analyses such as morpheme-aware type-token ratio (TTR) and keyword-omission detection.
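As a minimal sketch of the kind of Korean-specific analysis described above, the snippet below computes a morpheme-aware TTR and flags omitted reference keywords. The function names and the use of konlpy's Okt morphological analyzer are illustrative assumptions for this sketch and do not reflect HRET's actual API.

```python
# Illustrative sketch only; the helper names and the choice of analyzer are
# assumptions for this example, not HRET's actual implementation.
from konlpy.tag import Okt

_analyzer = Okt()  # morphological analyzer used to split text into morphemes


def morpheme_aware_ttr(text: str) -> float:
    """Type-token ratio computed over morphemes rather than whitespace tokens."""
    morphemes = _analyzer.morphs(text)
    if not morphemes:
        return 0.0
    return len(set(morphemes)) / len(morphemes)


def missing_keywords(output: str, keywords: list[str]) -> list[str]:
    """Return reference keywords that never appear in the model output."""
    return [kw for kw in keywords if kw not in output]


if __name__ == "__main__":
    answer = "서울은 대한민국의 수도이며, 대한민국에서 인구가 가장 많은 도시이다."
    print(f"morpheme-aware TTR: {morpheme_aware_ttr(answer):.3f}")
    print("omitted keywords:", missing_keywords(answer, ["서울", "수도", "면적"]))
```

Computing TTR at the morpheme level matters for Korean because its agglutinative particles attach directly to word stems; under whitespace tokenization, surface forms such as 서울은 and 서울이 would be counted as distinct types and distort the diversity measure.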