Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

Created by
  • Haebom

Author

Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang

Outline

This paper presents MLLM-CTBench, a benchmark for continual instruction tuning of multimodal large language models (MLLMs). MLLM-CTBench comprises seven carefully selected tasks from six diverse domains. It provides multidimensional evaluation metrics that combine final-answer accuracy with Chain-of-Thought (CoT) reasoning quality, a comprehensive evaluation of continual learning algorithms (eight algorithms across four major categories), and a comparison of reinforcement fine-tuning (RFT) and supervised fine-tuning (SFT) in terms of how well model performance is retained across successive tasks. Experimental results show that the MLLM reasoning process is more robust to forgetting during continual training than the final output, and that stronger base models exhibit stronger resistance to forgetting. Properly regularized RFT proves to be a more robust approach than SFT for retaining performance across tasks, highlighting the importance of KL-divergence regularization.
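To make the KL-divergence point concrete, below is a minimal sketch (in PyTorch) of an RFT-style loss in which a policy-gradient term is penalized by the KL divergence between the fine-tuned policy and a frozen reference model. This is not the authors' implementation; the function name, argument shapes, and the `kl_coef` value are illustrative assumptions.

```python
import torch.nn.functional as F

def kl_regularized_rft_loss(policy_logits, ref_logits, log_probs_taken,
                            advantages, kl_coef=0.1):
    """Hypothetical KL-regularized RFT loss.

    policy_logits / ref_logits: (batch, seq_len, vocab) token logits from the
        current policy and a frozen reference model.
    log_probs_taken: (batch, seq_len) log-probabilities of the sampled tokens
        under the current policy.
    advantages: (batch,) per-sequence reward advantages (e.g., answer correctness).
    kl_coef: strength of the KL-divergence regularizer (illustrative value).
    """
    # Policy-gradient term: reinforce token sequences that earned high reward.
    pg_loss = -(advantages.detach() * log_probs_taken.sum(dim=-1)).mean()

    # KL(policy || reference), keeping the reference model frozen.
    policy_log_probs = F.log_softmax(policy_logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits.detach(), dim=-1)
    kl = F.kl_div(ref_log_probs, policy_log_probs,
                  log_target=True, reduction="batchmean")

    return pg_loss + kl_coef * kl
```

In this kind of setup the reference model is typically a frozen copy of the model before the current task, so the KL penalty directly discourages drifting away from previously learned behavior, which is the mechanism behind the forgetting resistance discussed above.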

Takeaways, Limitations

Takeaways:
Provides MLLM-CTBench, a systematic benchmark for continual instruction tuning of MLLMs.
Multidimensional evaluation metrics enable fine-grained analysis of MLLMs' continual learning capabilities.
Comprehensively evaluates a range of continual learning algorithms and provides actionable insights for algorithm design and adoption.
A comparison of RFT and SFT shows that RFT, especially with KL-divergence regularization, is more effective for continual learning.
Experimentally demonstrates that the MLLM's reasoning process is more robust to forgetting than its final output (a generic forgetting-measure sketch is given after the Limitations list below).
Limitations:
The scope of MLLM-CTBench's tasks may be limited.
The selection of evaluation metrics and algorithms may involve some subjectivity.
Further research is needed to determine how well the experimental setup and settings generalize.
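For readers who want to quantify the "robustness to forgetting" discussed above, the following is a generic continual-learning forgetting measure, not necessarily the exact retention metric used in MLLM-CTBench; the accuracy-matrix convention and function name are assumptions for illustration.

```python
from typing import List

def average_forgetting(acc: List[List[float]]) -> float:
    """Generic forgetting measure (illustrative, not the paper's metric).

    acc[i][j] = accuracy on task j measured after training on task i;
    only entries with j <= i are meaningful (task j has been seen by stage i).
    Forgetting for task j = best earlier accuracy minus final accuracy.
    """
    T = len(acc)
    if T < 2:
        return 0.0
    forgetting = []
    for j in range(T - 1):  # the last task cannot have been forgotten yet
        best_earlier = max(acc[i][j] for i in range(j, T - 1))
        forgetting.append(best_earlier - acc[T - 1][j])
    return sum(forgetting) / len(forgetting)

# Example: two tasks; accuracy on task 0 drops from 0.80 to 0.65 after
# training on task 1, so average forgetting = 0.15.
print(average_forgetting([[0.80, 0.0], [0.65, 0.72]]))
```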