Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TextQuests: How Good are LLMs at Text-Based Video Games?

Created by
  • Haebom

Authors

Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Outline

This paper proposes TextQuests, a new benchmark for evaluating AI agents in complex, interactive environments that reflect real-world problems. While existing benchmarks focus on tool use or structured task performance, TextQuests assesses long-horizon, self-directed reasoning using Infocom interactive fiction games. By restricting the use of external tools, the benchmark isolates an agent's intrinsic long-context reasoning, trial-and-error learning, and sustained problem-solving abilities, evaluating them on complex games that would take a human player over 30 hours to complete. The benchmark is published at https://textquests.ai.
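To make the evaluation setup concrete, below is a minimal sketch of the kind of agent loop such a benchmark implies: the model sees the full game transcript so far, proposes the next text command, and must rely only on its in-context memory of prior steps. This is not the actual TextQuests harness; `query_llm` and `InteractiveFictionGame` are hypothetical placeholders.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to the language model under evaluation."""
    raise NotImplementedError


class InteractiveFictionGame:
    """Placeholder wrapper around an Infocom-style game interpreter."""

    def reset(self) -> str:
        """Starts a new game and returns the opening text."""
        raise NotImplementedError

    def step(self, command: str) -> tuple[str, bool]:
        """Sends one command; returns (game_output, finished)."""
        raise NotImplementedError


def play(game: InteractiveFictionGame, max_steps: int = 500) -> list[str]:
    """Runs one tool-free episode and returns the full transcript."""
    transcript = [game.reset()]
    for _ in range(max_steps):
        # The agent conditions on the entire history, so long-context
        # reasoning and trial-and-error learning happen in-context,
        # without external tools or scratch storage.
        prompt = "\n".join(transcript) + "\n> "
        command = query_llm(prompt).strip()
        observation, finished = game.step(command)
        transcript += [f"> {command}", observation]
        if finished:
            break
    return transcript
```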

Takeaways, Limitations

Takeaways:
  • Provides a new benchmark for evaluating the long-horizon reasoning and problem-solving capabilities of AI agents in complex, realistic environments.
  • By measuring an agent's intrinsic capabilities without reliance on external tools, it offers a more accurate picture of what AI agents can actually do.
  • The complexity of Infocom games provides a broad evaluation environment covering a wide range of problem-solving skills.
  • The public release of the TextQuests benchmark contributes to progress in the AI research community.
Limitations:
  • TextQuests is limited to text-based games, making it difficult to generalize to other types of environments or interaction modalities.
  • Because of the games' complexity, completing the benchmark may require significant time and compute resources.
  • The benchmark's evaluation metrics and measurement methods may need further study.
  • Evaluation results may be biased toward certain types of games.