Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Knowledge-based Consistency Testing of Large Language Models

Created by
  • Haebom

Author

Sai Sathiesh Rajan, Ezekiel Soremekun, Sudipta Chattopadhyay

Outline

In this paper, we propose KonTest, an automated testing framework for systematically identifying and measuring inconsistencies and knowledge gaps in large language models (LLMs). KonTest leverages a knowledge graph to generate test cases, combining semantically equivalent queries with test oracles (metamorphic or ontological oracles) to probe and measure inconsistencies in an LLM's world knowledge. It further mitigates knowledge gaps through a weighted ensemble of LLMs. Experiments on four state-of-the-art LLMs—Falcon, Gemini, GPT-3.5, and Llama2—show that KonTest generated 1,917 error-inducing inputs (19.2%) out of 9,979 test inputs, revealing a knowledge gap of 16.5% across all tested LLMs. A mitigation method informed by KonTest's test suite reduced the LLM knowledge gap by 32.48%. An additional ablation study shows that GPT-3.5's knowledge-construction efficiency is only 60-68%, making it unsuitable for knowledge-based consistency testing.
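The testing loop described above can be sketched roughly as follows. This is a minimal illustration, not KonTest's actual implementation: `query_llm` is a hypothetical stub standing in for a real model API, and the single hard-coded triple stands in for a full knowledge graph.

```python
# Sketch of knowledge-graph-driven consistency testing (illustrative only).
# query_llm is a hypothetical stub; a real harness would call an LLM API.

def query_llm(prompt: str) -> str:
    # Stub returning canned answers, with one simulated inconsistency.
    answers = {
        "Who wrote Hamlet?": "Shakespeare",
        "Hamlet was written by whom?": "Shakespeare",
        "Name the author of Hamlet.": "Marlowe",  # simulated error
    }
    return answers.get(prompt, "unknown")

def consistency_test(paraphrases: list[str], expected: str):
    """Query semantically equivalent prompts and flag disagreements.

    Equivalence-style oracle: all equivalent queries must yield the same
    answer. Knowledge-based oracle: each answer is also compared against
    the fact stored in the knowledge graph.
    """
    answers = {p: query_llm(p) for p in paraphrases}
    inconsistent = len(set(answers.values())) > 1            # cross-paraphrase check
    wrong = [p for p, a in answers.items() if a != expected]  # check against KG fact
    return inconsistent, wrong

# Knowledge-graph triple: (subject, relation, object)
triple = ("Hamlet", "author", "Shakespeare")
prompts = [
    "Who wrote Hamlet?",
    "Hamlet was written by whom?",
    "Name the author of Hamlet.",
]
inconsistent, wrong = consistency_test(prompts, triple[2])
print(inconsistent)  # True: the model disagreed across paraphrases
print(wrong)         # prompts whose answer conflicts with the KG fact
```

In this sketch an input counts as error-inducing if it either breaks agreement across paraphrases or contradicts the knowledge-graph fact, mirroring the two oracle types described above.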

Takeaways, Limitations

Takeaways:
We present an automated testing framework (KonTest) that systematically measures and mitigates inconsistencies and knowledge gaps in LLMs.
Using KonTest, we quantitatively measure the actual error rate and knowledge gap of LLMs and report their magnitude.
We demonstrate that the knowledge gap in LLMs can be significantly reduced with a KonTest-based mitigation method.
We characterize which models are suitable, and which are not, for knowledge-based consistency testing of LLMs.
Limitations:
KonTest's test case generation relies on a knowledge graph, so its performance may be affected by the completeness and accuracy of the knowledge graph.
Only four LLMs were tested; evaluation on a broader range of models is needed.
Further analysis is needed to determine why GPT-3.5's knowledge-construction efficiency is low.
Further research is needed to determine the generalizability of the mitigation method.
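The weighted-ensemble mitigation mentioned above can be illustrated with a small sketch. This is an assumption-laden toy, not KonTest's implementation: the model names, answers, and per-model weights below are all hypothetical (weights could, for example, be derived from each model's measured consistency).

```python
from collections import defaultdict

def weighted_ensemble(answers: dict[str, str], weights: dict[str, float]) -> str:
    """Weighted vote: return the answer with the highest total model weight."""
    scores: dict[str, float] = defaultdict(float)
    for model, answer in answers.items():
        scores[answer] += weights.get(model, 0.0)
    return max(scores, key=scores.get)

# Hypothetical answers from four models to the same factual query
answers = {"Falcon": "Paris", "Gemini": "Paris", "GPT-3.5": "Lyon", "Llama2": "Paris"}
# Hypothetical weights (e.g., higher for models measured as more consistent)
weights = {"Falcon": 0.8, "Gemini": 0.9, "GPT-3.5": 0.6, "Llama2": 0.7}

print(weighted_ensemble(answers, weights))  # Paris (total weight 2.4 vs 0.6)
```

The idea is that a single model's knowledge gap can be papered over when other, more reliable models outvote it, which is one plausible reading of how an ensemble reduces the measured gap.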