Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

Created by
  • Haebom

Author

Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Yashwanth Nakka, Devansh, Jagat Sesh Challa, Dhruv Kumar

Outline

This paper focuses on code evaluation with large language models (LLMs) and proposes a novel multi-agent approach that uses question-specific rubrics instead of traditional question-agnostic rubrics. While prior work has concentrated on code generation with LLMs, research on code evaluation remains scarce, and this paper aims to fill that gap. To address the lack of adequate evaluation datasets, we introduce two new datasets: one for data structures and algorithms tasks (150 submissions) and one for object-oriented programming tasks (80 submissions). In addition to standard metrics such as Spearman's rank correlation and Cohen's kappa, we propose a new metric, "Leniency," which quantifies evaluation strictness relative to expert assessment. Experimental results show that question-specific rubrics improve the logical assessment of code in an educational setting, providing feedback that goes beyond mere syntactic correctness and aligns with educational objectives.
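
To make the setup concrete, here is a minimal sketch of how a question-specific rubric might be represented and how an LLM grader's scores could be compared against expert scores using Spearman's rank correlation and Cohen's kappa. This is not the authors' implementation: the `Criterion`/`Rubric` structures, the prompt builder, and the example criteria are illustrative assumptions; only the two agreement metrics correspond to measures named in the paper.

```python
# Sketch (assumed structure, not the paper's code): a question-specific rubric
# is a set of weighted criteria tied to one assignment, and an LLM grader is
# prompted to score a submission against each criterion.
from dataclasses import dataclass
from scipy.stats import spearmanr            # rank correlation with expert scores
from sklearn.metrics import cohen_kappa_score  # agreement on discrete score labels

@dataclass
class Criterion:
    description: str   # e.g. "uses a hash map for O(1) lookups"
    max_points: int

@dataclass
class Rubric:
    question_id: str
    criteria: list[Criterion]

def build_grading_prompt(rubric: Rubric, code: str) -> str:
    """Compose a prompt asking the LLM to score each rubric criterion."""
    lines = [f"Grade the following solution to question {rubric.question_id}."]
    for i, c in enumerate(rubric.criteria, 1):
        lines.append(f"{i}. {c.description} (0-{c.max_points} points)")
    lines.append("Solution:\n" + code)
    return "\n".join(lines)

def agreement(llm_scores: list[int], expert_scores: list[int]) -> dict:
    """Agreement between LLM-assigned and expert-assigned total scores."""
    rho, _ = spearmanr(llm_scores, expert_scores)
    kappa = cohen_kappa_score(llm_scores, expert_scores)  # expects discrete labels
    return {"spearman": rho, "cohen_kappa": kappa}
```

Note that Cohen's kappa assumes categorical labels (e.g., integer points or grade bins), so continuous totals would need to be binned before computing it.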

Takeaways, Limitations

Takeaways:
Demonstrating the utility of question-specific rubrics in LLM-based code evaluation.
Presenting new possibilities for code evaluation with LLMs in educational settings.
Proposing 'Leniency,' a new metric for measuring the strictness of code evaluation (a hypothetical sketch follows the Limitations list below).
Providing new evaluation datasets for data structures and algorithms and for object-oriented programming.
Limitations:
The presented datasets are relatively small.
Further research is needed on generalizability to other programming languages and task types.
Further validation of the objectivity and reliability of the 'Leniency' metric is needed.
The generation of question-specific rubrics needs to be automated and made more efficient.
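
The exact formula for 'Leniency' is not reproduced in this summary, so the snippet below shows only one plausible reading: the average signed gap between LLM and expert scores, where positive values indicate grading that is more lenient than the experts'. The function name, normalization, and sign convention are assumptions for illustration.

```python
# Hypothetical leniency-style metric (assumption, not the paper's definition):
# mean signed difference between LLM and expert scores, normalized by the
# maximum score so results are comparable across questions.
def leniency(llm_scores: list[float], expert_scores: list[float], max_score: float) -> float:
    diffs = [(l - e) / max_score for l, e in zip(llm_scores, expert_scores)]
    return sum(diffs) / len(diffs)  # > 0: LLM grades more leniently than the experts
```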