Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Created by
  • Haebom

Author

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus

Outline

This paper introduces LEXam, a novel benchmark developed to evaluate the legal reasoning capabilities of large language models (LLMs). The benchmark comprises 4,886 law exam questions (2,841 essay questions and 2,045 multiple-choice questions) in English and German, drawn from 340 law exams across 116 law courses. Essay questions are accompanied by guidance on the expected problem-solving approach and reference answers. Evaluations show that current models struggle with essay questions requiring structured, multi-step legal reasoning. Furthermore, the authors propose a scalable method for assessing the quality of legal reasoning: an ensemble LLM-based "judge" paradigm that consistently and accurately evaluates model-generated reasoning steps.
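To make the ensemble-judge idea concrete, below is a minimal Python sketch of how scores from several judges could be aggregated for one exam answer. It is an illustration under stated assumptions, not the paper's actual protocol: the judge interface, the stub judges, the [0, 1] scoring scale, and the simple averaging rule are all hypothetical placeholders standing in for real LLM judge calls.

```python
# Illustrative sketch of an ensemble "LLM-as-a-Judge" scoring loop.
# The judge signature, scoring scale, and averaging are assumptions,
# not the benchmark's exact implementation.
from statistics import mean
from typing import Callable, List

# A "judge" maps (question, reference_answer, model_answer) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def ensemble_judge_score(
    question: str,
    reference_answer: str,
    model_answer: str,
    judges: List[Judge],
) -> float:
    """Average the scores returned by several independent judges."""
    scores = [judge(question, reference_answer, model_answer) for judge in judges]
    return mean(scores)


if __name__ == "__main__":
    # Stub judges standing in for real LLM judge models (hypothetical).
    def strict_judge(q: str, ref: str, ans: str) -> float:
        # Penalizes answers much shorter than the reference answer.
        return 0.4 if len(ans) < len(ref) / 2 else 0.8

    def lenient_judge(q: str, ref: str, ans: str) -> float:
        # Rewards lexical overlap with the reference answer.
        ref_terms = set(ref.lower().split())
        ans_terms = set(ans.lower().split())
        return len(ref_terms & ans_terms) / max(len(ref_terms), 1)

    score = ensemble_judge_score(
        question="Is the contract voidable due to a fundamental error?",
        reference_answer="Yes, because the error concerns an essential element of the contract.",
        model_answer="Yes, the fundamental error makes the contract voidable.",
        judges=[strict_judge, lenient_judge],
    )
    print(f"Ensemble score: {score:.2f}")
```

In practice, each stub would be replaced by a call to a different judge LLM prompted with the question, the reference answer, and the guidance on the expected problem-solving approach; averaging multiple judges is one simple way to reduce the variance of any single judge.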

Takeaways, Limitations

Takeaways:
Development of LEXam, a new benchmark for assessing legal reasoning ability.
Evidence that current LLMs struggle with structured, multi-step legal reasoning.
A new LLM-as-a-Judge methodology for evaluating model-generated reasoning steps.
Demonstration that the benchmark effectively distinguishes performance differences across models.
Limitations:
No specific limitations are mentioned in the paper.