Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Created by
  • Haebom

Authors

Run-Ze Fan, Zengzhi Wang, Pengfei Liu

Outline

This paper addresses the lack of large-scale open-source datasets for scientific reasoning by presenting two datasets: TextbookReasoning, which contains 650,000 reasoning questions extracted from college-level science textbooks, and MegaScience, which contains 1.25 million instances drawn from various open-source datasets. MegaScience was built by running ablation studies over several data selection methods to identify the best subset of each source dataset. The authors also build a comprehensive evaluation system covering 15 benchmarks to make evaluation metrics accurate. Experimental results show that the proposed datasets outperform existing open-source scientific datasets in both performance and training efficiency. Base models from the Llama3.1, Qwen2.5, and Qwen3 series trained on MegaScience significantly outperform their corresponding official instruct models on average. The paper contributes to scientific reasoning research by releasing the data cleaning pipeline, evaluation system, datasets, and seven trained models.
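As a rough illustration of the ablation-based selection described above, the sketch below picks, for each source dataset, the selection method whose trained model scores highest on average across benchmarks. It is not the authors' pipeline: the function name, the source names, the method names, and all scores are hypothetical placeholders.

```python
from statistics import mean


def pick_best_selection(ablation_scores):
    """Return, per source dataset, the selection method with the highest mean score.

    ablation_scores maps source name -> {method name -> list of benchmark scores}.
    """
    best = {}
    for source, by_method in ablation_scores.items():
        # Compare methods by their average score over all benchmarks.
        best[source] = max(by_method, key=lambda m: mean(by_method[m]))
    return best


# Hypothetical placeholder numbers, not results from the paper.
ablation_scores = {
    "source_a": {"no_selection": [40.0, 50.0], "difficulty_filter": [45.0, 55.0]},
    "source_b": {"no_selection": [60.0, 62.0], "difficulty_filter": [58.0, 61.0]},
}

best = pick_best_selection(ablation_scores)
print(best)
```

In this toy setup the chosen method differs per source, which is the point of running the ablation separately for each source dataset rather than applying one filter everywhere.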

Takeaways, Limitations

Takeaways:
Provides TextbookReasoning and MegaScience, two large-scale, high-quality scientific reasoning datasets, advancing research on scientific reasoning AI.
Presents a dataset composition strategy derived from a comparative analysis of several data selection methods.
Provides a comprehensive evaluation system that enables accurate measurement and comparison of scientific reasoning models.
Shows that models trained on MegaScience outperform their corresponding official instruct models on average.
Demonstrates scalability to larger models.
Releases the datasets and trained models as open source to support research sharing and reproducibility.
Limitations:
The balance and diversity of the datasets need further review, since they may be biased toward certain fields or question types.
The quality and reliability of the source datasets used to construct MegaScience need further verification.
The evaluation system should be expanded to cover a wider range of scientific reasoning types.
A plan for continuously updating and maintaining the datasets is needed.
The datasets do not support multiple languages.