This paper proposes Modality-Balancing Preference Optimization (MBPO), a preference learning framework that addresses the modality imbalance problem in large multimodal models (LMMs). MBPO builds a more effective offline preference dataset by generating hard negatives through adversarial perturbation, and it generates online responses with verifiable rewards on closed-ended tasks. Group Relative Policy Optimization (GRPO) is then used to train the model on this hybrid offline-online data. Experimental results show that MBPO improves LMM performance and effectively reduces hallucinations.
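The sketch below illustrates the general idea of combining the two data sources under a GRPO-style update: online responses are scored with a verifiable reward on a closed-ended task, offline preference pairs are treated as two-response groups, and both are normalized into group-relative advantages. This is a minimal illustration under stated assumptions (exact-match reward, illustrative function names), not the authors' implementation.

```python
# Minimal sketch (assumed details, not the paper's code): GRPO-style
# group-relative advantages over hybrid offline/online data.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize rewards within a group of responses
    sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def verifiable_reward(response, reference):
    """Binary reward for a closed-ended task (exact match is an assumed check)."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

# Online branch: score a group of sampled responses against the verified answer.
online_group = ["a cat", "a dog", "a cat", "a bird"]
online_rewards = [verifiable_reward(r, "a cat") for r in online_group]
print(group_relative_advantages(online_rewards))

# Offline branch: a preference pair (preferred response vs. adversarially
# perturbed hard negative) can be treated as a two-response group with
# rewards 1 and 0, so both data sources feed the same GRPO-style update.
offline_rewards = [1.0, 0.0]
print(group_relative_advantages(offline_rewards))
```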
Takeaways, Limitations
• Takeaways:
◦ Contributes to solving the modality imbalance problem in LMMs.
◦ Improves the effectiveness of the offline preference dataset by generating hard negatives through adversarial perturbation.
◦ Improves model adaptability by generating online data and training with GRPO.
◦ Demonstrates improved performance and reduced hallucination of LMMs on vision-language tasks.
• Limitations:
◦ Further research is needed on mitigating the internal bias of the LLM backbone.
◦ Generalization has not yet been evaluated across all types of LMM tasks.
◦ Further research is needed on the scalability and computational efficiency of MBPO.