Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models

Created by
  • Haebom

Author

Yang Fan

Outline

This paper proposes AdEval, a dynamic data evaluation method that addresses data contamination in the evaluation of large language models (LLMs). AdEval reduces contamination risk by extracting knowledge points and key ideas from static datasets and keeping the dynamically generated questions aligned with the core content of the original static benchmarks. It retrieves background information through online searches to write detailed explanations of each knowledge point, then designs questions along the six dimensions of Bloom's cognitive hierarchy (remembering, understanding, applying, analyzing, evaluating, and creating), which enables evaluation at multiple cognitive levels. It controls the complexity of the generated datasets through iterative question restructuring. According to the paper's experiments on multiple datasets, AdEval effectively mitigates the impact of data contamination, addresses the lack of complexity control and the single-dimensional evaluation found in existing approaches, and improves the fairness, reliability, and diversity of LLM evaluations.
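The flow described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `BloomLevel`, `Question`, `STEMS`, `generate_questions`, `complexity`, `restructure`, and `add_constraint` are all hypothetical, fixed templates stand in for LLM-based question generation, the online search step is omitted, and word count is only a toy proxy for whatever complexity measure AdEval actually uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class BloomLevel(Enum):
    """The six cognitive dimensions named in the summary."""
    REMEMBERING = "remembering"
    UNDERSTANDING = "understanding"
    APPLYING = "applying"
    ANALYZING = "analyzing"
    EVALUATING = "evaluating"
    CREATING = "creating"


@dataclass
class Question:
    knowledge_point: str
    level: BloomLevel
    text: str


# Hypothetical question stems, one per level (assumption: a real system
# would generate these with an LLM rather than fixed templates).
STEMS = {
    BloomLevel.REMEMBERING: "State the definition of {kp}.",
    BloomLevel.UNDERSTANDING: "Explain in your own words how {kp} works.",
    BloomLevel.APPLYING: "Use {kp} to solve a new, concrete problem.",
    BloomLevel.ANALYZING: "Break {kp} into its parts and relate them.",
    BloomLevel.EVALUATING: "Judge when {kp} is a poor choice and why.",
    BloomLevel.CREATING: "Design a new task that requires {kp}.",
}


def generate_questions(knowledge_point: str) -> List[Question]:
    """Produce one question per Bloom level for a single knowledge point."""
    return [
        Question(knowledge_point, level, STEMS[level].format(kp=knowledge_point))
        for level in BloomLevel
    ]


def complexity(question: Question) -> int:
    """Toy complexity proxy (assumption): number of words in the question."""
    return len(question.text.split())


def restructure(
    question: Question,
    target: int,
    rewrite: Callable[[Question], Question],
    max_rounds: int = 5,
) -> Question:
    """Rewrite the question until it reaches the target complexity
    or the round limit is exhausted."""
    for _ in range(max_rounds):
        if complexity(question) >= target:
            break
        question = rewrite(question)
    return question


def add_constraint(question: Question) -> Question:
    """Hypothetical rewriter: append an extra requirement to the question."""
    return Question(
        question.knowledge_point,
        question.level,
        question.text + " Justify each step.",
    )


if __name__ == "__main__":
    questions = generate_questions("binary search")
    for q in questions:
        print(q.level.value, "->", q.text)
    harder = restructure(questions[0], target=12, rewrite=add_constraint)
    print(harder.text)
```

Separating `restructure` from the `rewrite` callable keeps the iteration logic independent of how a question is actually rewritten, so an LLM-backed rewriter could be swapped in without changing the loop.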

Takeaways, Limitations

Takeaways:
A new approach to addressing data contamination in LLM evaluation.
A dynamic, multidimensional LLM evaluation method.
Improved fairness, reliability, and diversity of evaluations.
Multi-level cognitive assessment based on Bloom's cognitive hierarchy.
Limitations:
AdEval's performance may depend on the quality of online search results.
The subjectivity of the question generation and complexity control processes needs to be discussed.
Further extensive experiments on a wider range of LLMs and datasets are needed.
An analysis of AdEval's computational cost and efficiency is needed.