This paper is the first to evaluate the textbook question answering (TQA) capabilities of state-of-the-art multimodal large language models (MLLMs), LLaVA-1.5 and LLaMA 3.2-Vision, on the CK12-QA dataset. To simulate a real-world learning environment, we introduce a multimodal retrieval-augmented generation (RAG) pipeline that supplies relevant textbook paragraphs and images as context. Zero-shot experiments show that the retrieved context improves LLaVA's performance on text-based questions, yet it sharply degrades LLaMA 3.2-Vision's accuracy on image-based questions, from 74.07% to 25.93%, a phenomenon we refer to as “catastrophic context interference.” Fine-tuning improves LLaMA 3.2-Vision's performance but degrades LLaVA's, highlighting the challenges of modality prioritization and context integration in MLLMs.
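
To make the retrieval step of such a pipeline concrete, the following is a minimal sketch, not the paper's exact implementation: it embeds the question, textbook paragraphs, and textbook images in a shared space and selects the top-k of each as context for the MLLM. The choice of a CLIP encoder via sentence-transformers, the k values, and the prompt format are illustrative assumptions.

```python
# Minimal multimodal RAG retrieval sketch (illustrative; not the authors' exact pipeline).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds text and images into a shared space, enabling cross-modal retrieval.
encoder = SentenceTransformer("clip-ViT-B-32")

def retrieve_context(question, paragraphs, image_paths, k_text=2, k_image=1):
    """Return the k_text most similar paragraphs and k_image most similar images."""
    q_emb = encoder.encode(question, convert_to_tensor=True)

    # Rank textbook paragraphs by cosine similarity to the question.
    p_emb = encoder.encode(paragraphs, convert_to_tensor=True)
    p_scores = util.cos_sim(q_emb, p_emb)[0]
    top_paragraphs = [paragraphs[int(i)] for i in p_scores.topk(k_text).indices]

    # Rank textbook images the same way, using CLIP's image encoder.
    images = [Image.open(p) for p in image_paths]
    i_emb = encoder.encode(images, convert_to_tensor=True)
    i_scores = util.cos_sim(q_emb, i_emb)[0]
    top_images = [image_paths[int(i)] for i in i_scores.topk(k_image).indices]

    return top_paragraphs, top_images

def build_prompt(question, choices, retrieved_paragraphs):
    """Format retrieved text as context; retrieved images are passed to the MLLM separately."""
    context = "\n".join(retrieved_paragraphs)
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"Context:\n{context}\n\nQuestion: {question}\n{options}\nAnswer:"
```

In a zero-shot setting, the retrieved paragraphs and images would simply be prepended to the question before querying the model; the abstract's finding is that this added context can help on text-based questions while interfering with image-based ones.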