This paper presents a novel methodology for automatically collecting data from corporate disclosures using large language models (LLMs). Using a retrieval-augmented generation (RAG) framework built on GPT-4o-mini, we collected CEO pay ratios and critical audit matters (CAMs) from approximately 10,000 proxy statements and over 12,000 10-K reports, respectively, at substantially lower time and cost than manual collection. By making such data more accessible, this approach can broaden research participation among researchers with limited resources. We share both the methodology and the collected dataset to help foster a more inclusive research environment.