Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Created by
  • Haebom

Authors

Nathaniel Getachew, Abulhair Saparov

Outline

StorySim is a programmable framework for artificially generating stories to evaluate the theory of mind (ToM) and world modeling (WM) abilities of large language models (LLMs). To address the pretraining data contamination problem of existing benchmarks, StorySim generates novel story prompts from highly controllable storyboards, allowing precise manipulation of character perspectives and events. Using this framework, the authors designed first- and second-order ToM tasks, along with WM tasks that assess the ability to track and model mental states. Experiments with state-of-the-art LLMs revealed that most models performed better on WM tasks than on ToM tasks, and tended to reason better about humans than about inanimate objects. The authors also found evidence of heuristic behaviors such as recency bias and overreliance on early events in the story. All code for data generation and evaluation is publicly available.
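To make the storyboard idea concrete, here is a minimal sketch of how a programmable story generator could separate the true state of the world from what each character believes. All names and structures below are illustrative assumptions, not the actual StorySim API: a storyboard is modeled as an ordered list of events, each recording which characters observed it.

```python
# Hypothetical sketch of a storyboard-driven story generator.
# NOT the actual StorySim API; structure and names are assumptions.

def generate_story(storyboard):
    """Render a storyboard (ordered list of events) into story sentences."""
    sentences = [
        f"{e['actor']} moved the {e['object']} to the {e['place']}."
        for e in storyboard
    ]
    return " ".join(sentences)

def believed_location(character, obj, storyboard):
    """Where `character` thinks `obj` is: the location set by the last
    event involving `obj` that the character actually observed."""
    location = None
    for event in storyboard:
        if event["object"] == obj and character in event["observers"]:
            location = event["place"]
    return location

# A classic first-order false-belief setup (Sally-Anne style):
storyboard = [
    {"actor": "Sally", "object": "marble", "place": "basket",
     "observers": {"Sally", "Anne"}},
    {"actor": "Anne", "object": "marble", "place": "box",
     "observers": {"Anne"}},  # Sally is absent and does not see this move
]

story = generate_story(storyboard)
# WM question asks about the true world state; the ToM question asks
# about Sally's (now outdated) belief:
true_location = believed_location("Anne", "marble", storyboard)   # "box"
sally_belief = believed_location("Sally", "marble", storyboard)   # "basket"
```

Because the storyboard fully determines both the true state and each character's belief, ground-truth answers for ToM and WM questions can be computed automatically, and fresh stories can be sampled at scale without overlapping pretraining data.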

Takeaways, Limitations

Takeaways:
StorySim is a novel framework for assessing ToM and WM abilities in LLMs.
It addresses the pretraining data contamination issue of existing benchmarks.
Storyboards enable precise narrative manipulation and the design of diverse ToM and WM tasks.
Provides new insights into ToM and WM abilities in LLMs (WM > ToM performance, better reasoning about humans than inanimate objects, discovery of heuristic behaviors).
Open-source code ensures reproducibility and extensibility.
Limitations:
Further research is needed to determine how well findings on StorySim-generated stories generalize to other settings.
Further experiments on a broader range of LLMs are needed.
Further analysis is needed to understand the root causes of the observed heuristic behaviors.