Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Created by
  • Haebom

Author

Shashidhar Reddy Javaji, Zining Zhu

Outline

In this paper, we propose a novel framework for evaluating the ability of large language models (LLMs) to acquire new knowledge. The framework prompts an LLM to generate questions as if it were a curious person encountering a sentence of scientific knowledge for the first time. We assess the LLM's knowledge acquisition potential from the quality of the generated questions and validate the scoring procedure through a controlled ablation study. We generate a synthetic dataset consisting of 1,101 sentences of varying difficulty in physics, chemistry, and mathematics, 300 general-knowledge sentences, and 567 incorrect sentences, and corroborate the model-based evaluation with human judgments (weighted Cohen's kappa of approximately 0.7). We find that while large models such as GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective, suggesting that model size is not the only factor determining knowledge acquisition potential. The proposed framework quantifies important model capabilities that have previously been overlooked and opens up research opportunities for developing more knowledge-rich AI systems.
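The sketch below illustrates the general shape of such an evaluation loop: generate curiosity-driven questions for a seed sentence, have a grader model score them, and check agreement with human ratings via weighted Cohen's kappa. This is not the authors' code; the prompt wording, the 1-5 rubric, the `ask_llm` callable, and the quadratic weighting scheme are all assumptions made for illustration.

```python
# Minimal sketch of a curiosity-driven questioning evaluation (assumptions noted above).
from typing import Callable, List
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

QUESTION_PROMPT = (
    "You are seeing the following statement for the first time:\n"
    "\"{sentence}\"\n"
    "Ask five questions a curious person would ask to understand it."
)

GRADING_PROMPT = (
    "Rate the following questions about the statement \"{sentence}\" "
    "for relevance, coherence, and diversity. "
    "Reply with a single overall score from 1 to 5.\n{questions}"
)

def evaluate_sentence(sentence: str, ask_llm: Callable[[str], str]) -> int:
    """Generate questions for one seed sentence, then score them with a grader prompt."""
    questions = ask_llm(QUESTION_PROMPT.format(sentence=sentence))
    score_text = ask_llm(GRADING_PROMPT.format(sentence=sentence, questions=questions))
    return int(score_text.strip())

def agreement(model_scores: List[int], human_scores: List[int]) -> float:
    """Weighted Cohen's kappa between automatic and human ratings
    (the paper reports roughly 0.7; the weighting choice here is an assumption)."""
    return cohen_kappa_score(model_scores, human_scores, weights="quadratic")
```

Any chat-completion client can be passed in as `ask_llm`, which keeps the sketch independent of a particular LLM API.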

Takeaways, Limitations

Takeaways:
• Proposes a new framework for assessing LLMs' ability to acquire new knowledge.
• Shows that model size is not the only factor determining knowledge acquisition potential.
• Points to a new research direction for developing knowledge-rich AI systems.
• Demonstrates that an LLM's question-generation ability can serve as an effective indirect measure of its knowledge acquisition ability.
Limitations:
• Further research is needed on how well the synthetic dataset generalizes.
• The framework's applicability to other types of knowledge and tasks remains to be verified.
• The subjectivity of human evaluation places limits on the evaluation results.
• Because the framework focuses on question quality, it may not fully capture the ability to acquire and apply knowledge in practice.