Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Created by
  • Haebom

Authors

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

Outline

Athena-PRM is a multimodal process reward model (PRM) that scores each step of a complex reasoning process. Building a high-performance PRM is normally time-consuming and costly because it requires step-level annotations of reasoning traces, and existing automatic labeling methods such as Monte Carlo estimation produce noisy labels at significant computational cost. This paper proposes an efficient way to generate high-quality process labels by using prediction consistency between weak and strong completers as the criterion for identifying reliable step labels. With only 5,000 labeled samples, Athena-PRM delivers strong performance across a range of scenarios and benchmarks. The authors also introduce two strategies that further improve PRM performance: ORM initialization and upsampling of negative data. The method is validated in three settings: verification for test-time scaling, direct evaluation of reasoning-step correctness, and reward-ranked fine-tuning. Athena-PRM consistently performs best across benchmarks and scenarios; with Qwen2.5-VL-7B as the policy model, it improves results by 10.2 points on WeMath and 7.1 points on MathVista, and it sets a new state of the art on VisualProcessBench, exceeding the previous SoTA by 3.9 F1 points. Using Athena-PRM as the reward model, the authors further train Athena-7B via reward-ranked fine-tuning, which significantly outperforms the baseline model across five benchmarks.
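To make the labeling idea concrete, here is a minimal sketch of consistency-based process labeling, assuming a hypothetical `completer.complete(question, prefix_steps)` interface that rolls out a completion from a partial solution and returns a final answer. The interface, rollout count, and 0/1 thresholding are illustrative assumptions, not the paper's exact procedure.

```python
def step_success_rate(completer, question, prefix_steps, gold_answer, n_rollouts=8):
    """Fraction of rollouts from this partial solution that reach the gold answer.

    `completer.complete(question, prefix_steps)` is an assumed interface that
    continues the solution from the given steps and returns its final answer.
    """
    hits = sum(
        completer.complete(question, prefix_steps) == gold_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def label_steps(weak, strong, question, steps, gold_answer):
    """Keep a 0/1 process label for a step only when weak and strong completers agree."""
    labels = {}
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        weak_ok = step_success_rate(weak, question, prefix, gold_answer) > 0
        strong_ok = step_success_rate(strong, question, prefix, gold_answer) > 0
        if weak_ok == strong_ok:       # prediction-consistency filter
            labels[i] = int(weak_ok)   # 1 = the prefix can still reach the answer
        # on disagreement, the label is treated as unreliable and discarded
    return labels
```

In this sketch, a step is labeled positive only when both completers can still recover the correct answer from that prefix; labels on which the two disagree are dropped, which is the consistency criterion described above.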

Takeaways, Limitations

Takeaways:
  • An efficient way to generate high-quality process labels by exploiting prediction consistency between weak and strong completers.
  • Two strategies that further improve PRM performance: ORM initialization and upsampling of negative data (a short upsampling sketch follows this list).
  • Strong results across benchmarks and scenarios (WeMath, MathVista, VisualProcessBench, and others); a best-of-N verification sketch closes this section.
  • Athena-7B, obtained via reward-ranked fine-tuning with Athena-PRM as the reward model, outperforms its baseline across five benchmarks.
  • Competitive performance from a small amount of data (only 5,000 samples).
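As a usage illustration for the negative-data upsampling strategy in the second bullet, the sketch below simply duplicates negative step examples to rebalance a training set; the data layout and the upsampling factor are assumptions for illustration, not values from the paper.

```python
def upsample_negatives(examples, factor=3):
    """Duplicate negative (label == 0) process-label examples `factor` times in total.

    `examples` is assumed to be a list of (features, label) pairs with 0/1 labels;
    both the layout and the default factor are illustrative.
    """
    negatives = [ex for ex in examples if ex[1] == 0]
    return examples + negatives * (factor - 1)
```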
Limitations:
  • The paper does not explicitly discuss its own limitations or future research directions.
  • The characteristics and limitations of the dataset used are not described in detail.
  • A more detailed comparative analysis against other PRMs would strengthen the work.
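Finally, as a sketch of the test-time scaling scenario referenced in the Takeaways, the snippet below samples N candidate solutions from a policy model and keeps the one the PRM scores highest. The `policy.sample_solution` and `prm.score_steps` interfaces, and the minimum-score aggregation, are assumptions for illustration rather than the paper's exact setup.

```python
def best_of_n(policy, prm, question, n=8):
    """Test-time scaling by verification: pick the PRM's favorite of N samples.

    Assumed (illustrative) interfaces:
      policy.sample_solution(question) -> list of reasoning-step strings
      prm.score_steps(question, steps) -> list of per-step scores in [0, 1]
    """
    best_steps, best_score = None, float("-inf")
    for _ in range(n):
        steps = policy.sample_solution(question)
        scores = prm.score_steps(question, steps)
        # Aggregate per-step scores; taking the minimum penalizes any single bad step.
        candidate_score = min(scores) if scores else float("-inf")
        if candidate_score > best_score:
            best_steps, best_score = steps, candidate_score
    return best_steps
```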