Although LLM-as-a-judge, in which a large language model (LLM) is used to evaluate generated text, is a common evaluation method in natural language generation (NLG), it is typically employed as a quantitative tool whose main output is a numerical score. In this paper, we propose an LLM-based evaluation method, LLM-as-a-qualitative-judge, which uses LLMs to generate structured reports on common problem types in the output of an NLG system. The approach aims to provide developers with actionable insights for improving a given NLG system and consists of two main steps: instance-by-instance problem analysis and clustering of the discovered problems using an intuitive stacking algorithm. We also present a strategy for evaluating the proposed approach, based on ~300 problem annotations for instances from 12 NLG datasets. The results demonstrate that the instance-by-instance problems output by LLM-as-a-qualitative-judge match human-annotated problems in about two-thirds of cases, and that LLM-as-a-qualitative-judge can generate error-type reports similar to those produced by human annotators. Furthermore, case studies demonstrate that using LLM-as-a-qualitative-judge can substantially improve the performance of NLG systems.
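To make the two-step pipeline concrete, the following is a minimal sketch in Python. It assumes a generic `call_llm` helper and illustrative prompts, cluster-assignment logic, and function names; these are assumptions for exposition, not the exact prompts or implementation used in the paper.

```python
# Minimal sketch of the two-step pipeline: (1) per-instance problem analysis,
# (2) incremental clustering of discovered problems into problem types.
# `call_llm`, the prompts, and the clustering heuristic are illustrative only.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; replace with a real client."""
    raise NotImplementedError

def analyze_instance(source: str, output: str) -> str:
    """Step 1: ask the judge LLM to describe the main problem in one output."""
    prompt = (
        "You are evaluating an NLG system.\n"
        f"Input: {source}\nSystem output: {output}\n"
        "Briefly describe the most important problem in the output, "
        "or answer 'no problem'."
    )
    return call_llm(prompt).strip()

@dataclass
class Cluster:
    label: str                              # short name of the problem type
    problems: list = field(default_factory=list)

def cluster_problems(problems: list[str]) -> list[Cluster]:
    """Step 2: group per-instance problems into problem types one at a time.

    Each new problem is either assigned to an existing cluster or opens a
    new one, as decided by the judge LLM.
    """
    clusters: list[Cluster] = []
    for problem in problems:
        if problem.lower() == "no problem":
            continue
        labels = "\n".join(f"{i}: {c.label}" for i, c in enumerate(clusters))
        prompt = (
            f"Existing problem types:\n{labels or '(none)'}\n"
            f"New problem description: {problem}\n"
            "Answer with the index of the matching type, or 'new: <label>' "
            "if none matches."
        )
        answer = call_llm(prompt).strip()
        if answer.lower().startswith("new:"):
            clusters.append(Cluster(label=answer[4:].strip(), problems=[problem]))
        else:
            clusters[int(answer)].problems.append(problem)
    # Sort by frequency so the report highlights the most common problem types.
    return sorted(clusters, key=lambda c: len(c.problems), reverse=True)
```

The resulting clusters, with their labels and per-cluster counts, can then be formatted as the structured report given to developers.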