BIS Reasoning 1.0 is the first large-scale Japanese dataset explicitly designed to evaluate the belief-inconsistent reasoning ability of large language models (LLMs). Unlike existing datasets such as NeuBAROCO and JFLD, it does not focus on general or belief-consistent reasoning; instead, it introduces logically valid but belief-inconsistent syllogisms to expose the reasoning bias of LLMs trained on human-aligned corpora. Benchmarking state-of-the-art models, including GPT models, Claude models, and leading Japanese LLMs, reveals a significant performance gap, with GPT-4o achieving 79.54% accuracy. Our analysis shows that current LLMs have serious weaknesses when handling logically valid but belief-conflicting inputs.
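To make the evaluation setup concrete, the following is a minimal sketch of how a belief-inconsistent syllogism could be scored for validity judgment. The example syllogism, the prompt wording, and the `ask_model` helper are illustrative assumptions, not items or code from the BIS Reasoning 1.0 paper (whose items are in Japanese).

```python
# Minimal sketch of scoring a belief-inconsistent syllogism benchmark.
# The item below is illustrative, not drawn from BIS Reasoning 1.0, and
# `ask_model` is a placeholder for whatever LLM client is actually used.

def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its raw text answer."""
    raise NotImplementedError("wire up your own model client here")

# Each item pairs a logically valid syllogism with premises/conclusion that
# conflict with common-sense beliefs; the gold label is still "valid".
items = [
    {
        "premises": [
            "All mammals are plants.",           # belief-inconsistent premise
            "All whales are mammals.",
        ],
        "conclusion": "All whales are plants.",  # follows logically, conflicts with beliefs
        "gold": "valid",
    },
]

def score(items) -> float:
    """Return accuracy: fraction of items where the model judges validity correctly."""
    correct = 0
    for item in items:
        prompt = (
            "Assume the premises are true, even if they contradict common sense.\n"
            "Premises:\n- " + "\n- ".join(item["premises"]) + "\n"
            f"Conclusion: {item['conclusion']}\n"
            "Does the conclusion follow logically? Answer 'valid' or 'invalid'."
        )
        answer = ask_model(prompt).strip().lower()
        correct += int(item["gold"] in answer)
    return correct / len(items)
```

Under this kind of setup, the reported gap reflects how often models reject conclusions that are logically entailed but clash with prior beliefs.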
Takeaways, Limitations
• Takeaways: By revealing the vulnerability of LLMs to logically valid but belief-conflicting inputs, the work provides important implications for deploying LLMs in high-risk domains such as law, medicine, and scientific literature. We suggest that logical truth should be prioritized over intuitive beliefs for the safety and integrity of LLMs. We analyze the performance differences across various LLMs and suggest future directions for LLM development.
• Limitations: The current dataset contains only Japanese, so further research is needed on generalizability to other languages. Since the evaluation is limited to syllogisms, further research is needed to evaluate LLM performance on more complex reasoning tasks. Although GPT-4o achieves relatively high accuracy, there is still room for improvement, and further research is needed to address these issues.