This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Code Simulation as a Proxy for High-order Tasks in Large Language Models
Created by
Haebom
Authors
Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
Outline
This paper studies how naturalistic reasoning tasks and artificially generated ones can together be used to evaluate the reasoning ability of large language models (LLMs). Because naturalistic reasoning tasks are difficult to author by hand at scale, the authors construct synthetic datasets that can be generated cheaply and in quantity from basic programming constructs (e.g., straight-line programs, code containing critical paths, and approximate or redundant instructions). They further probe LLMs with additional synthetic datasets built around sorting problems and repeated operations, and show that even the most powerful LLMs rely heavily on memorization and pattern recognition, leaving their reasoning processes brittle. The work contributes a scalable, synthetic way of testing LLM reasoning that complements manually annotated tasks.
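To make the construction concrete, below is a minimal sketch of how one such benchmark instance might be generated: a random straight-line program (no branches or loops) is emitted together with its ground-truth output, and the program is wrapped in a simulation prompt for the model under test. The generator, its parameters (`n_ops`, `seed`), and the prompt wording are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def make_straight_line_program(n_ops: int = 8, seed: int = 0) -> tuple[str, int]:
    """Emit a random straight-line program and its ground-truth output.

    Illustrative sketch only: the operation set and parameter names are
    assumptions, not the authors' actual generator.
    """
    rng = random.Random(seed)
    lines = [f"x = {rng.randint(0, 9)}"]
    for _ in range(n_ops):
        op = rng.choice(["+", "-", "*"])   # straight-line: no if/while constructs
        lines.append(f"x = x {op} {rng.randint(1, 9)}")
    body = "\n".join(lines)

    # Execute the generated program to obtain the ground-truth answer.
    scope: dict = {}
    exec(body, scope)
    return body + "\nprint(x)", scope["x"]

program, expected = make_straight_line_program()
prompt = f"Simulate this Python program step by step. What does it print?\n\n{program}"
# `prompt` goes to the LLM under evaluation; its answer is scored against `expected`.
```

Because instances are seeded and cheap to produce, such a benchmark scales to arbitrarily many fresh examples; the same recipe plausibly extends to code with critical paths (adding branches whose untaken side acts as a distractor) and to redundant instructions (inserting assignments that never affect the printed value).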
Takeaways, Limitations
• Takeaways:
◦ A method for generating scalable synthetic datasets for evaluating the reasoning ability of LLMs.
◦ Evaluation results on these synthetic datasets showing that even the most powerful LLMs rely on memorization and pattern recognition, exposing weaknesses in their reasoning.
◦ A comprehensive evaluation approach that combines naturalistic reasoning tasks with synthetic datasets.
• Limitations:
◦ It remains to be verified whether synthetic datasets can fully substitute for real natural language reasoning tasks.
◦ Further analysis, and concrete remedies, are needed for the weaknesses identified in LLMs' reasoning processes.
◦ Generalizability to other types of reasoning tasks has yet to be verified.