Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach

Created by
  • Haebom

Authors

Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Outline

This paper focuses on improving the efficiency and effectiveness of benchmark construction for evaluating the domain-specific capabilities of large language models (LLMs). Existing domain-specific benchmarks have primarily relied on scaling laws, performing supervised fine-tuning on massive corpora or generating extensive question sets. However, the impact of corpus and question-answering (QA) set design on the precision and recall of domain-specific evaluation has remained unexplored. This paper addresses that gap and demonstrates that scaling laws are not always optimal for building domain-specific benchmarks. Instead, we propose Comp-Comp, an iterative benchmarking framework based on the comprehensiveness-compactness principle: comprehensiveness ensures semantic recall for a given domain, while compactness improves precision, and together they guide the construction of corpora and QA sets. To validate the framework, we conducted a case study at a well-known university and built XUBench, a large-scale, comprehensive, closed-domain benchmark. Although this study uses an academic setting as its case study, the Comp-Comp framework is designed to offer insights into benchmark construction across a variety of domains beyond academia.
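As a rough illustration of how such an iterative comprehensiveness-compactness loop might look, here is a minimal Python sketch. It is an assumption-laden toy, not the paper's algorithm: the token-overlap similarity, the `covers` threshold, the recall/precision proxies, and the target values are all placeholders for whatever semantic measures the authors actually use.

```python
"""Minimal sketch of an iterative comprehensiveness-compactness loop,
loosely inspired by the Comp-Comp idea summarized above. All metrics,
thresholds, and function names here are illustrative assumptions."""


def token_set(text: str) -> set[str]:
    """Crude lexical stand-in for a semantic representation."""
    return set(text.lower().split())


def covers(doc: str, topic: str, threshold: float = 0.2) -> bool:
    """Treat a topic as covered if token overlap (Jaccard) passes a threshold."""
    a, b = token_set(doc), token_set(topic)
    return len(a & b) / max(len(a | b), 1) >= threshold


def recall(corpus: list[str], topics: list[str]) -> float:
    """Comprehensiveness proxy: fraction of domain topics covered by the corpus."""
    covered = sum(any(covers(d, t) for d in corpus) for t in topics)
    return covered / max(len(topics), 1)


def precision(corpus: list[str], topics: list[str]) -> float:
    """Compactness proxy: fraction of corpus documents relevant to some topic."""
    relevant = sum(any(covers(d, t) for t in topics) for d in corpus)
    return relevant / max(len(corpus), 1)


def comp_comp(candidates: list[str], topics: list[str],
              target_recall: float = 0.9, target_precision: float = 0.9,
              max_iters: int = 10) -> list[str]:
    """Alternate expansion (recall) and pruning (precision) until both targets hold."""
    corpus: list[str] = []
    pool = list(candidates)
    for _ in range(max_iters):
        # Expansion step (comprehensiveness): add documents covering missing topics.
        missing = [t for t in topics if not any(covers(d, t) for d in corpus)]
        for doc in list(pool):
            if any(covers(doc, t) for t in missing):
                corpus.append(doc)
                pool.remove(doc)
        # Pruning step (compactness): drop documents irrelevant to every topic.
        corpus = [d for d in corpus if any(covers(d, t) for t in topics)]
        if recall(corpus, topics) >= target_recall and precision(corpus, topics) >= target_precision:
            break
    return corpus


if __name__ == "__main__":
    topics = ["campus library opening hours", "tuition fee payment deadlines"]
    candidates = [
        "The main library opening hours are 8am to 10pm on weekdays.",
        "Tuition fee payment deadlines fall at the start of each semester.",
        "Unrelated press release about a sports event.",
    ]
    print(comp_comp(candidates, topics))  # keeps the two relevant documents
```

The same loop shape would apply to QA-set construction: expand the question pool until the domain's topics are semantically covered, then prune redundant or off-domain questions to keep the benchmark compact.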

Takeaways, Limitations

Takeaways: We demonstrate that relying solely on scaling laws is not optimal for building domain-specific LLM benchmarks. We propose Comp-Comp, a novel framework based on the comprehensiveness-compactness principle, offering a more effective and efficient path to benchmark construction. The XUBench case study shows the framework's practicality, and the framework can be extended to a variety of domains.
Limitations: Only a single case study, in an academic domain, has been presented so far; further research is needed to determine how well the Comp-Comp framework generalizes to other domains. Details on the specific composition and performance metrics of XUBench are also lacking.