SceneGen is a novel framework that generates multiple 3D assets, each with geometry and texture, from a single scene image and its corresponding object masks. It requires neither per-asset optimization nor asset retrieval: a feature aggregation module integrates local and global scene information from visual and geometric encoders, enabling the 3D assets and their relative spatial positions to be generated in a single feedforward pass. Although trained on single-image inputs, SceneGen scales directly to multi-image scenarios, and quantitative and qualitative evaluations demonstrate its efficiency and robust generation capabilities. It offers a practical solution to the emerging problem of 3D content generation for applications in VR/AR and embodied AI.
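The feature-aggregation idea above can be sketched at a very high level: per-object local features (e.g. from masked image regions) are fused with a shared global scene feature before one feedforward decoding pass. The sketch below is a hypothetical illustration only; the function and variable names are assumptions and do not reflect SceneGen's actual architecture or API.

```python
# Hypothetical sketch of local/global feature aggregation: each object's
# local feature is concatenated with the shared global scene feature,
# yielding one fused feature per asset for a single feedforward pass.
# All names here are illustrative assumptions, not SceneGen's real code.

def aggregate_features(local_feats, global_feat):
    """Fuse each per-object local feature with the global scene feature.

    local_feats: list of per-object feature vectors (lists of floats)
    global_feat: one scene-level feature vector (list of floats)
    Returns one fused vector per object (local ++ global concatenation).
    """
    return [lf + global_feat for lf in local_feats]

if __name__ == "__main__":
    # Two masked objects with 2-D local features; a 3-D global scene feature.
    local_feats = [[0.1, 0.2], [0.3, 0.4]]
    global_feat = [0.9, 0.8, 0.7]
    fused = aggregate_features(local_feats, global_feat)
    print(fused)  # every object now carries both local and global context
```

In practice such fusion would operate on encoder feature maps (e.g. via attention) rather than flat concatenation; the sketch only conveys that each asset's prediction is conditioned jointly on local and global cues.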