This paper introduces BoxingGym, a benchmark for systematically evaluating how well LLM-based scientific agents propose scientific models, design experiments, and revise their theories in light of new data, capabilities the authors frame as central to a long-standing aspiration of AI research: understanding the world and explaining it with scientific theories. The benchmark comprises ten environments grounded in real-world scientific domains ranging from psychology to ecology. Experimental design is assessed with expected information gain (EIG), an information-theoretic measure of how much an experiment is expected to reduce uncertainty about the parameters of a generative model, while model discovery is assessed with explanation-based evaluation and prediction error. The results show that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the agent with an explicit statistical model does not reliably improve performance.
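
For reference, expected information gain is conventionally defined as the expected reduction in posterior entropy over the parameters of a generative model. The formulation below uses generic notation (parameters \(\theta\), experimental design \(d\), outcome \(y\)) and reflects the standard Bayesian experimental design definition rather than any construction specific to BoxingGym:

\[
\mathrm{EIG}(d) \;=\; \mathbb{E}_{y \sim p(y \mid d)}\Big[\, H\big(p(\theta)\big) \;-\; H\big(p(\theta \mid y, d)\big) \,\Big]
\]

A higher EIG indicates that running the experiment \(d\) is expected to be more informative about \(\theta\), which is why it serves as a natural score for an agent's proposed experiments.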