This paper addresses hardware design automation with large language models (LLMs), focusing on register-transfer level (RTL) code generation. We review prior work on LLM-based RTL code generation and identify the elements required to construct a dataset suitable for effective model training and fine-tuning. We build a robust Verilog dataset through an automated three-step process: database construction and management with PostgreSQL; data collection from code-hosting sites such as OpenCores and GitHub; and preprocessing comprising code syntax verification, logic synthesis, and extraction of per-module metadata. We implement a scalable and efficient database infrastructure to support analysis, and describe in detail the preprocessing pipeline that ensures high-quality data before database insertion. The result is, to our knowledge, the largest high-quality Verilog dataset to date, comprising 20,392 Verilog samples totaling 751 MB of code. We further discuss dataset evaluation, open challenges, and directions for future research and development in LLM-based hardware generation.
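As an illustration of the metadata-extraction step of the preprocessing pipeline, the sketch below pulls module names and port identifiers from Verilog source with a simple regular expression. This is a hypothetical simplification: the paper's actual pipeline also performs full syntax verification and logic synthesis with dedicated EDA tools, and the function and regex here are illustrative, not the authors' implementation.

```python
import re

# Hypothetical sketch of module-metadata extraction from Verilog source.
# Matches "module <name> [#(params)] (<port list>)" declarations; a real
# pipeline would instead parse the output of a syntax checker/synthesizer.
MODULE_RE = re.compile(
    r"\bmodule\s+(\w+)\s*(?:#\s*\([^)]*\)\s*)?\(([^)]*)\)", re.S
)

def extract_module_metadata(source: str) -> list[dict]:
    """Return one record per module: its name and declared port identifiers."""
    records = []
    for name, ports in MODULE_RE.findall(source):
        # Keep only the identifier (last token) of each port declaration.
        port_names = [p.split()[-1] for p in ports.split(",") if p.strip()]
        records.append({"module": name, "ports": port_names})
    return records

example = """
module adder(input a, input b, output sum);
  assign sum = a ^ b;
endmodule
"""
print(extract_module_metadata(example))
```

Records of this shape can then be inserted into the PostgreSQL database alongside each verified code sample.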