This paper focuses on improving the efficiency and effectiveness of benchmark construction for evaluating the domain-specific capabilities of large language models (LLMs). Existing domain-specific benchmarks have primarily relied on scaling rules, building massive corpora for supervised fine-tuning or generating extensive question sets. However, the impact of corpus and question-answering (QA) set design on the precision and recall of domain-specific LLMs has remained unexplored. This paper addresses that gap and demonstrates that scaling rules are not always optimal for constructing domain-specific benchmarks. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle: comprehensiveness ensures semantic recall of the target domain, while compactness improves precision, and together they guide the construction of both the corpus and the QA sets. To validate the framework, we conducted a case study at a prominent university and developed XUBench, a large-scale, comprehensive, closed-domain benchmark. Although the case study is set in an academic context, the Comp-Comp framework is designed to generalize, offering insights for benchmark construction in domains well beyond academia.
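To make the comprehensiveness-compactness principle concrete, the Python sketch below illustrates one possible iterative corpus-construction loop: documents are added until a recall proxy (coverage of domain topics) plateaus, and documents that contribute no new coverage are then pruned to improve a precision proxy. The helper names, metrics, and threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation) of an iterative
# comprehensiveness-compactness loop for corpus construction. The names
# coverage, is_redundant, and target_coverage are hypothetical placeholders
# for domain-specific components.

def coverage(corpus, domain_topics):
    """Recall proxy: fraction of domain topics mentioned by at least one document."""
    covered = {t for t in domain_topics if any(t in doc for doc in corpus)}
    return len(covered) / len(domain_topics)


def is_redundant(doc, corpus, domain_topics):
    """Precision proxy: a document is redundant if removing it leaves coverage unchanged."""
    return coverage(corpus - {doc}, domain_topics) == coverage(corpus, domain_topics)


def comp_comp(candidate_documents, domain_topics, target_coverage=0.95):
    corpus = set()
    # Comprehensiveness phase: add candidates until semantic recall plateaus.
    for doc in candidate_documents:
        if coverage(corpus, domain_topics) >= target_coverage:
            break
        corpus.add(doc)
    # Compactness phase: drop documents that do not contribute unique coverage.
    for doc in list(corpus):
        if is_redundant(doc, corpus, domain_topics):
            corpus.discard(doc)
    return corpus


if __name__ == "__main__":
    topics = {"admissions", "faculty", "campus map", "tuition"}
    candidates = [
        "admissions deadlines and tuition fees",
        "faculty directory",
        "campus map and shuttle routes",
        "old press release with no domain content",
    ]
    print(comp_comp(candidates, topics))
```

In this toy run, the comprehensiveness phase stops once the first three documents cover all four topics, and the fourth, off-domain document is never admitted; an analogous loop could in principle be applied to QA-set construction.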