Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Benchmarking for Domain-Specific LLMs: A Case Study on Academia and Beyond

Created by
  • Haebom

Authors

Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Outline

This paper presents Comp-Comp, a benchmarking framework for domain-specific evaluation of large language models (LLMs). Unlike existing benchmarking methods that rely on scaling up data, Comp-Comp evaluates a domain accurately and efficiently based on two principles: comprehensiveness and parsimony. Comprehensiveness improves semantic recall of the domain, while parsimony reduces redundancy and noise, improving precision. Through a case study at a university, the paper walks through the construction of PolyBench, a high-quality, large-scale academic benchmark built with Comp-Comp, demonstrating that the framework can be applied to a wide range of fields.
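To make the recall/precision framing concrete, below is a minimal, hypothetical sketch (not the paper's actual algorithm) of how one might operationalize the two principles when curating benchmark items: comprehensiveness as topic coverage of the domain, parsimony as pruning of near-duplicate items via embedding similarity. The function names, threshold, and toy data are assumptions for illustration only.

```python
# Hypothetical sketch (not the Comp-Comp algorithm): curate a benchmark subset
# that balances comprehensiveness (cover as many domain topics as possible)
# against parsimony (avoid near-duplicate items). Random vectors stand in for
# real question embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def curate(candidates, topics, sim_threshold=0.9):
    """Greedily keep one representative item per uncovered topic,
    skipping items too similar to ones already selected."""
    selected = []
    covered = set()
    for idx, (emb, topic) in enumerate(candidates):
        if topic in covered:
            continue  # parsimony: this topic is already represented
        if any(cosine(emb, candidates[j][0]) > sim_threshold for j in selected):
            continue  # parsimony: near-duplicate of a selected item
        selected.append(idx)
        covered.add(topic)
    coverage = len(covered) / len(topics)  # proxy for semantic recall over the domain
    return selected, coverage

# Toy data: 50 candidate questions spread over 8 domain topics.
topics = list(range(8))
candidates = [(rng.normal(size=16), int(rng.integers(0, 8))) for _ in range(50)]
picked, coverage = curate(candidates, topics)
print(f"kept {len(picked)} of {len(candidates)} items, topic coverage {coverage:.0%}")
```

In this reading, raising coverage corresponds to the recall gain attributed to comprehensiveness, while dropping redundant items corresponds to the precision gain attributed to parsimony.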

Takeaways, Limitations

Takeaways:
We point out the limitations of existing data-expansion-based benchmarking and propose a new benchmarking framework based on comprehensiveness and parsimony.
We demonstrate that the Comp-Comp framework can improve the precision and recall of domain-specific LLM assessments.
We successfully developed a high-quality, large-scale academic benchmark called PolyBench, demonstrating its practical applicability.
Since it is a domain-independent framework, it can be applied to various fields.
Limitations:
The case study in this paper focuses on a single domain (a university), so further research is needed to determine generalizability to other domains.
The effectiveness and efficiency of the Comp-Comp framework need to be verified across a wider range of domains and LLMs.
Further objective evaluation of the quality and scope of PolyBench is needed.