In this paper, we propose ArtiScene, a novel pipeline that leverages a text-to-image model to tackle text-based 3D scene generation. To sidestep the scarcity of high-quality 3D training data that limits existing text-to-3D models, we first synthesize a 2D image from the text prompt, then extract the shape, appearance, and location of each object in the image to generate corresponding 3D models, and finally assemble them into the complete 3D scene. ArtiScene generates diverse scenes and styles, and outperforms prior state-of-the-art methods on quantitative metrics, in user studies, and in GPT-4-based evaluations.
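To make the stages of such an intermediary-image pipeline concrete, the sketch below outlines the control flow in Python. All names here (generate_image, detect_objects, image_to_3d, Scene) are hypothetical placeholders, not ArtiScene's actual API; the real system would plug a text-to-image model, an object detector with layout estimation, and an image-to-3D generator into the marked steps.

```python
"""Minimal sketch of a text -> image -> 3D scene pipeline.

All functions and classes are illustrative stubs under the assumptions
stated above, not the paper's implementation.
"""
from dataclasses import dataclass, field


@dataclass
class DetectedObject:
    category: str                # e.g. "sofa"
    position: tuple              # estimated (x, y, z) location in the scene
    size: tuple                  # estimated (w, h, d) extent
    crop: object = None          # image patch capturing the object's appearance


@dataclass
class Scene:
    objects: list = field(default_factory=list)

    def add(self, mesh, position, size):
        self.objects.append((mesh, position, size))


def generate_image(prompt: str):
    raise NotImplementedError("step 1: call a text-to-image model here")


def detect_objects(image) -> list:
    raise NotImplementedError("step 2: segment objects, estimate layout")


def image_to_3d(crop):
    raise NotImplementedError("step 3: call an image-to-3D model on the crop")


def generate_scene(prompt: str) -> Scene:
    image = generate_image(prompt)   # 1. text -> 2D scene image
    scene = Scene()
    for obj in detect_objects(image):  # 2. per-object shape/appearance/location
        mesh = image_to_3d(obj.crop)   # 3. lift each object to 3D
        scene.add(mesh, obj.position, obj.size)  # 4. assemble the final scene
    return scene
```

The key design choice this sketch highlights is that the 2D image serves as the single source of shape, appearance, and layout information, so the 3D assembly step never needs paired text-to-3D training data.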