PhysGym is a novel benchmark and simulation platform for evaluating the scientific discovery capabilities of agents built on large language models (LLMs), focusing on how well they cope with varying environmental complexity and how effectively they exploit prior knowledge. A key feature of PhysGym is its fine-grained control over the level of prior knowledge provided to the agent. The benchmark consists of interactive physics simulations in which the agent must actively explore the environment, sequentially collect data under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We present results from a baseline LLM, demonstrating how its capabilities vary across levels of prior knowledge and task complexity.
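To make the interaction protocol sketched above concrete, the following is a minimal, self-contained illustration of the loop an agent follows: querying a hidden physical system under a fixed experiment budget, collecting data sequentially, and proposing a hypothesis about the governing law. All names here (HiddenPendulum, run_experiment, agent_loop) are hypothetical stand-ins for exposition and are not the actual PhysGym API.

```python
# Illustrative sketch of the explore -> collect -> hypothesize loop;
# hypothetical names, not the real PhysGym interface.
import math
import random


class HiddenPendulum:
    """Toy environment: period T = 2*pi*sqrt(L/g), with g unknown to the agent."""

    def __init__(self, g=9.81, budget=10, noise=0.01):
        self.g = g
        self.budget = budget  # max number of experiments (the resource constraint)
        self.noise = noise

    def run_experiment(self, length):
        """Return a noisy measurement of the period for a chosen pendulum length."""
        if self.budget <= 0:
            raise RuntimeError("experiment budget exhausted")
        self.budget -= 1
        true_period = 2 * math.pi * math.sqrt(length / self.g)
        return true_period * (1 + random.gauss(0, self.noise))


def agent_loop(env, n_queries=8):
    """Sequentially choose inputs, collect observations, and fit a hypothesis for g."""
    data = []
    for i in range(n_queries):
        length = 0.5 + i * 0.25          # simple exploration policy over lengths
        period = env.run_experiment(length)
        data.append((length, period))
    # Hypothesis: T = 2*pi*sqrt(L/g); invert each observation and average.
    estimates = [4 * math.pi**2 * L / T**2 for L, T in data]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    env = HiddenPendulum()
    g_hat = agent_loop(env)
    print(f"hypothesized g = {g_hat:.3f} m/s^2")  # should land near 9.81
```

In this toy setting the budget plays the role of the data-collection constraints mentioned above, and varying how much of the governing equation the agent is told (none, its functional form, or its full symbolic structure) corresponds to the different prior-knowledge levels the platform controls.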