
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Created by
  • Haebom

Authors

Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal

Outline

This paper presents the first evaluation of the textbook question answering (TQA) capabilities of state-of-the-art multimodal large language models (MLLMs), LLaVA-1.5 and LLaMA 3.2-Vision, on the CK12-QA dataset. To simulate a realistic learning environment, the authors introduce a multimodal retrieval-augmented generation (RAG) pipeline that supplies relevant textbook paragraphs and diagrams as context. Zero-shot experiments reveal that the retrieved context improves LLaVA's performance on text-based questions, but sharply degrades LLaMA 3.2-Vision's accuracy on diagram-based questions from 74.07% to 25.93%, a phenomenon the authors term "catastrophic context interference." Fine-tuning experiments show that LLaMA 3.2-Vision's performance improves while LLaVA's degrades, illustrating the challenges of modality prioritization and context integration in MLLMs.
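
To make the pipeline concrete, below is a minimal sketch of the retrieval step, assuming a sentence-transformers text retriever over a toy two-passage corpus; the embedding model, corpus entries, and prompt format are illustrative assumptions, not the authors' implementation.

    # A minimal sketch of the multimodal RAG setup described above, assuming a
    # sentence-transformers retriever. The corpus entries, model name, and
    # prompt format are illustrative assumptions, not the paper's pipeline.
    from sentence_transformers import SentenceTransformer, util

    # Hypothetical textbook corpus: paragraphs paired with optional figure
    # paths, mirroring how CK12-QA lessons combine text and diagrams.
    corpus = [
        {"text": "Photosynthesis converts light energy into chemical energy.",
         "image": "figures/photosynthesis.png"},
        {"text": "Mitochondria produce most of the cell's ATP.",
         "image": None},
    ]

    retriever = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    corpus_emb = retriever.encode([d["text"] for d in corpus], convert_to_tensor=True)

    def retrieve(question: str, k: int = 1) -> list[dict]:
        """Return the top-k textbook passages most similar to the question."""
        q_emb = retriever.encode(question, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
        return [corpus[h["corpus_id"]] for h in hits]

    def build_prompt(question: str, contexts: list[dict]) -> str:
        """Concatenate retrieved paragraphs into a text prompt; any retrieved
        figure paths would be passed alongside as image inputs to the MLLM."""
        ctx = "\n".join(f"Context: {c['text']}" for c in contexts)
        return f"{ctx}\nQuestion: {question}\nAnswer:"

    # Usage: the resulting prompt (plus retrieved images) is what a model such
    # as LLaVA-1.5 or LLaMA 3.2-Vision would consume for zero-shot answering.
    question = "What does photosynthesis convert light energy into?"
    print(build_prompt(question, retrieve(question)))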

Takeaways, Limitations

Takeaways:
Provides the first assessment of MLLMs' textbook question answering capabilities.
Demonstrates that a multimodal RAG pipeline can effectively simulate real-world learning environments.
Identifies a phenomenon termed "catastrophic context interference" in MLLMs, underscoring the importance of modality prioritization and context integration.
Shows performance differences across MLLM architectures and suggests future research directions.
Provides a benchmark for developing AI-based educational tools.
Limitations:
Reliance on a single dataset (CK12-QA) may be insufficient to establish generalizability.
Only two MLLMs (LLaVA-1.5 and LLaMA 3.2-Vision) were evaluated, which limits the breadth of the comparison.
The causes of the "catastrophic context interference" phenomenon are not analyzed in depth.