Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Created by
  • Haebom

Authors

Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Risham Sidhu, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier

Outline

This paper focuses on Builder Action Prediction (BAP), a subtask of the Minecraft Collaborative Building Task (MCBT) that tests AI agents' language understanding, perception of the environment, and action in a physical (block) world. To address the evaluation, training-data, and modeling challenges of the existing BAP setup, the authors present BAP v2. BAP v2 provides an improved evaluation benchmark with fairer, more insightful metrics, and identifies spatial reasoning as the key performance bottleneck. To combat data scarcity, the authors generate several kinds of synthetic MCBT data and use them to strengthen models' spatial abilities. They also present a new state-of-the-art model, Llama-CRAFTS, which leverages improved input representations to achieve an F1 score of 53.0 on BAP v2. While this is a 6-point improvement over previous work, it still highlights the difficulty of the task and lays a foundation for future research.
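For context on the headline number: BAP-style evaluation scores predicted builder actions against gold actions with an F1 metric. Below is a minimal sketch of such a metric, assuming each action can be represented as an exact-match tuple like ("place", x, y, z, color); the tuple format and the set-based matching are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch of an action-level F1 metric for Builder Action
# Prediction. The (action_type, x, y, z, color) tuple format is an
# assumption for illustration, not the paper's exact representation.

def action_f1(predicted, gold):
    """Compute F1 over sets of builder actions.

    predicted, gold: iterables of hashable action tuples, e.g.
        ("place", 1, 0, 2, "red") or ("remove", 1, 0, 2, None).
    """
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)              # actions matched exactly
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted actions match the gold actions.
pred = [("place", 1, 0, 2, "red"), ("place", 2, 0, 2, "red"),
        ("remove", 0, 0, 0, None)]
gold = [("place", 1, 0, 2, "red"), ("place", 2, 0, 2, "red"),
        ("place", 3, 0, 2, "blue")]
print(f"F1 = {action_f1(pred, gold):.3f}")     # -> F1 = 0.667
```

Note that set-based matching treats action order as irrelevant, which is itself an assumption; an order-sensitive variant would align the two sequences instead.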

Takeaways, Limitations

Takeaways:
  • BAP v2 addresses the shortcomings of MCBT evaluation and provides a fairer, more accurate benchmark.
  • Synthetic data generation alleviates the data shortage and improves the model's spatial reasoning ability (a hypothetical sketch of such generation follows this list).
  • The Llama-CRAFTS model improves on the previous state of the art and offers a useful measure of the spatial capabilities of current LLMs.
  • The results suggest that improving spatial reasoning is an important direction for future research.
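As a rough illustration of how template-based synthetic spatial data can be produced, here is a hypothetical sketch; the grid size, instruction template, and action format are invented for this example and do not reflect the paper's actual generation pipeline.

```python
# Hypothetical illustration of templated synthetic data for spatial
# reasoning. The grid size, templates, and pairing scheme are invented
# for this sketch; the paper's actual generation pipeline may differ.
import random

COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]

def make_example(rng, grid=5):
    """Create one (instruction, gold_actions) training pair."""
    x, z = rng.randrange(grid), rng.randrange(grid)
    color = rng.choice(COLORS)
    # Ask for a block, then a second block directly on top of it.
    instruction = (f"Place a {color} block at column ({x}, {z}), "
                   f"then put another {color} block on top of it.")
    gold_actions = [("place", x, 0, z, color),
                    ("place", x, 1, z, color)]
    return instruction, gold_actions

rng = random.Random(0)
inst, actions = make_example(rng)
print(inst)
print(actions)
```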
Limitations:
  • The Llama-CRAFTS model still falls well short of perfect performance on BAP v2, so further improvement is needed.
  • The limitations of synthetic data, and its differences from real-world data, must be taken into account.
  • Given the limits of text-only LLMs, integrating information from other modalities (visual, auditory, etc.) could be a future research direction.