This paper highlights the need for high-quality multimodal benchmarks and presents a framework for transforming text-based question-answer pairs (TQAs) into multimodal question-answer pairs (MMQAs). The authors build a benchmark for MMQA generation and evaluation and develop an agent system (Q-Mirror) that enables iterative improvement of generated MMQAs. Experiments show that state-of-the-art models can generate MMQAs but still leave room for improvement, and that a multimodal understanding model assesses MMQA quality at a level close to human judgment. The Q-Mirror agent improves benchmark scores and has the potential to support the construction of large-scale scientific benchmarks.
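Read loosely, the described workflow is a generate-judge-refine loop: a model converts a TQA into an MMQA candidate, an understanding model scores it, and the agent revises until the score is acceptable. The sketch below illustrates only that control flow; the function names (`generate_mmqa`, `judge_mmqa`, `refine_mmqa`), the score threshold, and the round budget are assumptions for illustration, not the paper's actual interface or prompts.

```python
# Hypothetical sketch of a TQA -> MMQA generate-judge-refine loop.
# All bodies are placeholder stand-ins, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class MMQA:
    question: str
    image_spec: str  # description/rendering instruction for the visual element
    answer: str


def generate_mmqa(tqa_question: str, tqa_answer: str) -> MMQA:
    """Generation step (placeholder): move part of the textual content into a visual."""
    return MMQA(
        question=f"(multimodal rewrite of) {tqa_question}",
        image_spec="diagram encoding the facts removed from the question text",
        answer=tqa_answer,
    )


def judge_mmqa(candidate: MMQA) -> tuple[float, str]:
    """Evaluation step (placeholder): an understanding model scores the candidate
    and returns feedback; here a fixed dummy score and comment."""
    return 0.5, "visual should carry more of the question's information"


def refine_mmqa(candidate: MMQA, feedback: str) -> MMQA:
    """Refinement step (placeholder): revise the candidate using the judge's feedback."""
    return MMQA(candidate.question, candidate.image_spec + " (revised)", candidate.answer)


def q_mirror_loop(tqa_question: str, tqa_answer: str,
                  threshold: float = 0.8, max_rounds: int = 3) -> MMQA:
    """Generate an MMQA, then alternate judging and refining until the score
    clears the threshold or the round budget is exhausted."""
    candidate = generate_mmqa(tqa_question, tqa_answer)
    for _ in range(max_rounds):
        score, feedback = judge_mmqa(candidate)
        if score >= threshold:
            break
        candidate = refine_mmqa(candidate, feedback)
    return candidate


if __name__ == "__main__":
    print(q_mirror_loop("What is the boiling point of water at 1 atm?", "100 °C"))
```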
Takeaways, Limitations
• Takeaways:
◦ Presents a framework for transforming text-based QA pairs into multimodal QA pairs.
◦ Builds a benchmark for MMQA generation and evaluation.
◦ Develops an agent system (Q-Mirror) that enables iterative improvement of generated MMQAs.
◦ Confirms that a multimodal understanding model assesses MMQA quality at a level close to human judgment.
◦ Shows potential to support the construction of large-scale scientific benchmarks.
• Limitations:
◦ MMQAs generated by state-of-the-art models still leave room for improvement.
◦ Provides little description of the specific model architecture or technical details.
◦ Further research is needed to determine generalizability to other fields.