Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation

Created by
  • Haebom

Authors

Paul E. Calzada, Zahin Ibnat, Tanvir Rahman, Kamal Kandula, Danyu Lu, Sujan Kumar Saha, Farimah Farahmandi, Mark Tehranipoor

Outline

This paper addresses hardware design automation with large language models (LLMs), focusing on register-transfer level (RTL) code generation. The authors review previous research on LLM-based RTL code generation and identify the elements a dataset needs to support effective model training and fine-tuning. They build a robust Verilog dataset through an automated three-step process: (1) database construction and management using PostgreSQL, (2) data collection from code hosting sites such as OpenCores and GitHub, and (3) preprocessing, which covers code syntax verification, logic synthesis runs, and extraction of metadata for the relevant modules. They implement a scalable and efficient database infrastructure to support analysis, and describe in detail the preprocessing pipeline that ensures data quality before database insertion. The result is what the authors describe as the largest known high-quality Verilog dataset, consisting of 20,392 Verilog samples and 751 MB of Verilog code. The paper also explores potential applications of the dataset for evaluation, related challenges, and directions for future research and development in LLM-based hardware generation.
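The preprocessing-then-insert flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: SQLite stands in for PostgreSQL so the example runs without a server, a module/endmodule count stands in for real syntax verification and logic synthesis (which need an actual Verilog toolchain), and the table name `verilog_modules` and the helpers `open_db`, `extract_module_names`, and `ingest` are hypothetical.

```python
import re
import sqlite3

# Comments are removed first so commented-out code is not counted.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# The leading \b keeps "endmodule" from matching as a module declaration.
MODULE_RE = re.compile(r"\bmodule\s+([A-Za-z_][A-Za-z0-9_$]*)")
ENDMODULE_RE = re.compile(r"\bendmodule\b")


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical table (SQLite stands in for PostgreSQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS verilog_modules ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "origin TEXT NOT NULL, size_bytes INTEGER NOT NULL)"
    )
    return conn


def extract_module_names(source: str) -> list[str]:
    """Return declared module names, or [] if the crude check fails."""
    code = COMMENT_RE.sub(" ", source)
    names = MODULE_RE.findall(code)
    # Crude stand-in for syntax verification: every module needs an endmodule.
    if len(names) != len(ENDMODULE_RE.findall(code)):
        return []
    return names


def ingest(conn: sqlite3.Connection, origin: str, source: str) -> int:
    """Insert one row per module that passes the check; return the row count."""
    names = extract_module_names(source)
    size = len(source.encode("utf-8"))
    conn.executemany(
        "INSERT INTO verilog_modules (name, origin, size_bytes) VALUES (?, ?, ?)",
        [(name, origin, size) for name in names],
    )
    conn.commit()
    return len(names)
```

Calling `ingest(open_db(), "some-origin", verilog_text)` stores the file's modules only when the check passes, so rejected files never reach the table. A regex scan like this is far weaker than the compiler-level checks the summary mentions; it only shows where such a filter sits relative to database insertion.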

Takeaways, Limitations

Takeaways:
  • Provides a large-scale, high-quality Verilog dataset for LLM-based hardware design automation.
  • Presents a method for building an efficient database management and preprocessing pipeline.
  • Suggests future research directions in LLM-based hardware generation.
Limitations:
  • The quality and diversity of the dataset may need additional assessment.
  • Further research is needed to determine whether the dataset is applicable to all types of hardware designs.
  • The size of the dataset may not be sufficient for future developments in LLMs.