In this paper, we present a novel evaluation methodology for addressing the hallucination problem of large language models (LLMs), in particular when they answer questions that fall outside the knowledge base in a retrieval-augmented generation (RAG) setting. We introduce knowornot, an open-source library that replaces traditional manual annotation with automated evaluation, and show that it can be used to systematically assess the out-of-knowledge-base (OOKB) robustness of LLMs. knowornot supports the development of custom evaluation data and pipelines, and provides features such as a unified API, a modular architecture, rigorous data modeling, and a variety of tools for user-defined evaluation. We demonstrate the utility of knowornot by building PolicyBench, a benchmark comprising four question-answering chatbots on government policies. The source code of knowornot is available on GitHub.