This paper proposes 3DGSim, a novel learning-based 3D simulator that learns physical interactions directly from multi-view RGB video, enabling realistic simulation without privileged information such as depth maps or particle tracking. 3DGSim encodes a 3D scene into a latent particle-based representation using MVSplat, predicts particle dynamics with a Point Transformer, performs consistent temporal aggregation with a Temporal Merging module, and renders novel views via Gaussian Splatting. By jointly learning inverse rendering and dynamics prediction, we embed physical properties into point-wise latent features, capturing a wide range of physical behaviors (from rigid to elastic, including cloth-like dynamics and boundary conditions) as well as realistic lighting effects, and generalizing to unseen multi-body interactions and novel scene manipulations.
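
To make the four-stage pipeline concrete, below is a minimal PyTorch sketch of the data flow the abstract describes: multi-view RGB in, latent particles, dynamics rollout, temporal merging, and a rendered frame out. Every module here is a simplified, hypothetical stand-in (tiny CNN/MLP/attention layers and a toy scatter "renderer"); none of the class names, internals, or shapes correspond to the paper's actual MVSplat encoder, Point Transformer, Temporal Merging module, or Gaussian Splatting rasterizer.

```python
# Hypothetical sketch of the 3DGSim pipeline. All internals are simplified
# stand-ins, not the paper's actual components.
import torch
import torch.nn as nn


class LatentParticleEncoder(nn.Module):
    """Stand-in for the MVSplat-style encoder: maps multi-view RGB frames
    to a set of latent particles (positions + features)."""
    def __init__(self, num_particles=1024, feat_dim=64):
        super().__init__()
        self.num_particles, self.feat_dim = num_particles, feat_dim
        # A real encoder would match features across views; here a tiny
        # CNN pools each view into a global code.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_particles = nn.Linear(64, num_particles * (3 + feat_dim))

    def forward(self, views):                       # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        codes = self.backbone(views.flatten(0, 1))  # (B*V, 64)
        code = codes.view(B, V, -1).mean(dim=1)     # fuse views: (B, 64)
        out = self.to_particles(code).view(B, self.num_particles, -1)
        return out[..., :3], out[..., 3:]           # pos (B,N,3), feat (B,N,F)


class ParticleDynamics(nn.Module):
    """Stand-in for the Point Transformer dynamics predictor: advances
    particle positions/features one step via self-attention."""
    def __init__(self, feat_dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.delta = nn.Linear(feat_dim, 3 + feat_dim)

    def forward(self, pos, feat):
        mixed, _ = self.attn(feat, feat, feat)      # particle interactions
        d = self.delta(mixed)
        return pos + d[..., :3], feat + d[..., 3:]


class TemporalMerging(nn.Module):
    """Stand-in for the Temporal Merging module: blends features across
    time steps with a learned gate for temporal consistency."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, feat_prev, feat_curr):
        g = self.gate(torch.cat([feat_prev, feat_curr], dim=-1))
        return g * feat_curr + (1 - g) * feat_prev


def splat_render(pos, feat, image_size=64):
    """Toy renderer standing in for Gaussian Splatting: scatters one
    color per particle onto a pixel grid (no actual splatting)."""
    B, N, _ = pos.shape
    rgb = torch.sigmoid(feat[..., :3])              # pretend first 3 dims = color
    xy = (pos[..., :2].tanh() * 0.5 + 0.5) * (image_size - 1)
    img = torch.zeros(B, 3, image_size, image_size)
    idx = xy.long().clamp(0, image_size - 1)
    for b in range(B):                              # naive scatter, for clarity
        img[b, :, idx[b, :, 1], idx[b, :, 0]] = rgb[b].T
    return img


if __name__ == "__main__":
    views = torch.randn(1, 4, 3, 64, 64)            # 4 input RGB views
    enc, dyn, merge = LatentParticleEncoder(), ParticleDynamics(), TemporalMerging()
    pos, feat = enc(views)                          # inverse rendering -> particles
    for _ in range(5):                              # roll out learned dynamics
        new_pos, new_feat = dyn(pos, feat)
        pos, feat = new_pos, merge(feat, new_feat)  # temporally consistent update
    frame = splat_render(pos, feat)                 # novel-view frame (toy)
    print(frame.shape)                              # torch.Size([1, 3, 64, 64])
```

The point of the sketch is the joint structure the abstract emphasizes: the same point-wise latent features feed both the dynamics predictor and the renderer, which is what lets inverse rendering and dynamics prediction be learned together from RGB supervision alone.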