Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

Created by
  • Haebom

Authors

Kaiqi Zhang, Shuai Yuan, Honghan Zhao

Outline

This paper addresses the evaluation of large language models (LLMs), particularly in business scenarios. To overcome the inefficiency of existing manual evaluation, we propose TALEC, a model-based evaluation method that lets users apply their own in-house evaluation criteria. TALEC uses in-context learning (ICL) to teach these criteria to a judge model, and combines zero-shot and few-shot prompting so the judge can draw on more information. We further propose an effective prompt paradigm and an engineering approach to improve the judge model's accuracy. Experiments show that TALEC achieves a correlation with human judgments of over 80%, and on some tasks even exceeds the correlation between human raters. The results also demonstrate that ICL can serve as an alternative to fine-tuning. The code is available on GitHub.
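As a concrete illustration of the zero-shot-plus-few-shot idea, the sketch below assembles a judge prompt that states in-house criteria up front (zero-shot) and appends a few human-scored exemplars (few-shot) before the answer under evaluation. This is a minimal sketch, not the paper's actual prompt paradigm: the criteria, exemplars, and the `build_judge_prompt` helper are all hypothetical.

```python
# Hypothetical in-house criteria, stated zero-shot for the judge model.
CRITERIA = """\
Score the answer from 1 (worst) to 5 (best) against each criterion:
1. Factual accuracy: claims match the source material.
2. Domain compliance: the answer follows in-house business rules.
3. Completeness: all parts of the question are addressed.
"""

# Hypothetical few-shot exemplars: (question, answer, human-assigned score).
FEW_SHOT = [
    ("What is our refund window?", "Refunds are accepted within 30 days.", 5),
    ("What is our refund window?", "Refunds are never accepted.", 1),
]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a zero-shot-plus-few-shot evaluation prompt for a judge LLM."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\nScore: {s}" for q, a, s in FEW_SHOT
    )
    return (
        f"{CRITERIA}\n"
        f"Scored examples:\n\n{shots}\n\n"
        f"Now evaluate:\nQuestion: {question}\nAnswer: {answer}\nScore:"
    )

prompt = build_judge_prompt(
    "What is our refund window?",
    "You can request a refund within 30 days of purchase.",
)
print(prompt)  # in practice, send this to a judge model and parse the score
</antml_code>
```

In practice, the assembled prompt would be sent to any judge LLM, and the parsed score compared against human labels to measure agreement.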

Takeaways, Limitations

Takeaways:
  • TALEC, a new model-based evaluation method, improves the efficiency of LLM evaluation in business scenarios.
  • User-defined, in-house evaluation criteria can be applied.
  • ICL improves the accuracy of the judge model.
  • Strong performance is achievable with ICL alone, without fine-tuning.
  • TALEC's scores correlate highly (over 80%) with human evaluation (see the agreement-check sketch after this list).
  • The code is released as open source, improving accessibility.
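For context on the correlation takeaway above, here is a minimal sketch of how judge-vs-human agreement might be checked. The scores are fabricated for illustration, and Pearson's r is an assumption, since the summary does not state which correlation measure the paper reports.

```python
from statistics import correlation  # Python 3.10+

human_scores = [5, 1, 4, 2, 3, 5, 2]  # hypothetical human labels
judge_scores = [5, 2, 4, 2, 3, 4, 1]  # hypothetical TALEC judge outputs

r = correlation(human_scores, judge_scores)  # Pearson's r
print(f"judge-human correlation: {r:.2f}")
```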
Limitations:
  • TALEC's performance may be biased toward specific tasks or datasets.
  • Further research is needed on the generalizability of the proposed prompt paradigm and engineering approach.
  • Additional experiments and validation are needed across diverse business scenarios.
  • Further research is needed on the scalability and stability of ICL-based evaluation.