This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Benchmarking the Pedagogical Knowledge of Large Language Models
Created by
Haebom
Authors
Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin AA Ince, Oliver GB Garrod
Outline
This paper addresses a limitation of existing AI benchmarks, which focus mainly on content knowledge, by proposing "The Pedagogy Benchmark", a new benchmark for assessing pedagogical knowledge. Built from professional development exam questions for teachers, it covers a range of pedagogical subdomains (e.g., teaching strategies, assessment methods) and measures both cross-domain pedagogical knowledge (CDPK) and knowledge of special educational needs and disabilities (SEND). The authors evaluate 97 models, with accuracies ranging from 28% to 89%, analyze the cost-accuracy trade-off with respect to model properties (cost per token, open vs. closed weights, etc.), and publish the results on an online leaderboard ( https://rebrand.ly/pedagogy ). The paper highlights the potential of LLMs in education and the importance of education-specific benchmarks, aiming to lay a foundation for responsible, evidence-based use of LLMs.
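To make the setup concrete, here is a minimal sketch of how accuracy on a multiple-choice pedagogy benchmark could be computed. The `Question` schema and the `ask_model` call are hypothetical illustrations, not the paper's actual data format or evaluation harness.

```python
# Minimal sketch of scoring a model on multiple-choice pedagogy questions.
# `ask_model` is a hypothetical stand-in for any chat-completion API call;
# the question fields are illustrative, not the paper's actual schema.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # key of the correct option
    subdomain: str            # e.g. "teaching strategies", "assessment"

def ask_model(q: Question) -> str:
    """Hypothetical model call: returns the option key the model picks."""
    # In practice this would wrap an LLM API; here we always guess "A".
    return "A"

def accuracy(questions: list[Question]) -> float:
    """Fraction of questions the model answers correctly."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    qs = [
        Question(
            prompt="Which strategy best supports retrieval practice?",
            options={"A": "Low-stakes quizzing", "B": "Re-reading notes"},
            answer="A",
            subdomain="teaching strategies",
        ),
    ]
    print(f"Accuracy: {accuracy(qs):.1%}")
```

Because each question is tagged with a subdomain, the same loop can also report per-subdomain accuracy, which is how a benchmark like this can separate, say, CDPK from SEND performance.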
• Takeaways:
◦ Establishes a new benchmark specific to the field of education, providing criteria for objectively assessing the pedagogical knowledge of LLMs.
◦ Reveals the applicability and limitations of LLMs in education through performance comparison and analysis across a wide range of models.
◦ Provides cost-performance analysis as a guideline for developing and selecting efficient models (a minimal sketch of such an analysis follows this list).
◦ Enables continuous model comparison and follow-up research through the online leaderboard.
◦ Supports responsible approaches and evidence-based policymaking for the educational use of LLMs.
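As referenced in the cost-performance point above, here is a small illustrative sketch of one way a cost-accuracy trade-off could be summarized: keeping only the models that are not dominated, i.e., where no other model is both cheaper per token and more accurate. The model names and numbers below are made up; only the 28%-89% accuracy range comes from the paper.

```python
# Illustrative cost-accuracy Pareto-frontier sketch; entries are hypothetical.

def pareto_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (lower cost, higher accuracy).

    Each entry is (name, cost_per_token_usd, accuracy).
    """
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in models
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

if __name__ == "__main__":
    # Hypothetical entries; the real leaderboard spans 28%-89% accuracy.
    models = [
        ("model-small", 0.000001, 0.55),
        ("model-medium", 0.000005, 0.72),
        ("model-large", 0.000030, 0.89),
        ("model-pricey", 0.000050, 0.80),  # dominated by model-large
    ]
    print(pareto_frontier(models))
```

Models on this frontier are the natural candidates when weighing cheaper deployment against higher pedagogical accuracy.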
• Limitations:
◦ The number and variety of questions in the benchmark may be insufficient.
◦ The benchmark may not fully reflect the complexity of real educational settings.
◦ Questions may contain linguistic or cultural bias.
◦ Due to the limitations of the evaluation metrics, the educational impact of LLMs may not be fully captured.
◦ The models' generalization ability may not be sufficiently assessed.