Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research

Created by
  • Haebom

Author

Julian Junyan Wang, Victor Xiaoqi Wang

Outline

This paper presents a novel methodology for automatically extracting data from corporate disclosures using large language models (LLMs). Using a retrieval-augmented generation (RAG) framework based on GPT-4o-mini, the authors successfully extracted CEO pay ratios from approximately 10,000 proxy statements and critical audit matters (CAMs) from more than 12,000 10-K filings, saving significant time and cost compared to manual collection. By improving data accessibility for researchers with limited resources, this approach can broaden participation in research. The authors share both the methodology and the collected datasets to help foster a more inclusive research environment.
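The core idea of such a RAG pipeline is to retrieve only the filing passages relevant to a query and hand those to the LLM for extraction. The sketch below is a minimal, self-contained illustration (not the paper's actual implementation): it uses a naive keyword-overlap retriever, and the prompt it builds is where a call to GPT-4o-mini would go in the authors' setup.

```python
# Minimal sketch of a RAG-style extraction pipeline for filings.
# Assumptions: simple character chunking and keyword-overlap retrieval
# stand in for the paper's retriever; the LLM call itself is omitted.

def chunk_text(text: str, size: int = 120, overlap: int = 30) -> list[str]:
    """Split a filing into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query; return the top-k."""
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: -len(q_words & set(c.lower().split())))[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the extraction prompt that would be sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Using only the excerpts below, answer: {query}\n\n{joined}"

# Toy proxy-statement text (illustrative, not from a real filing).
filing = (
    "Item 402. Pay Ratio Disclosure. The annual total compensation of our CEO "
    "was $12,000,000. The median employee compensation was $60,000, yielding a "
    "CEO pay ratio of 200 to 1. Other sections discuss audit matters and risks."
)
chunks = chunk_text(filing)
top = retrieve(chunks, "CEO pay ratio", k=1)
prompt = build_prompt("What is the CEO pay ratio?", top)
print(prompt)
```

In practice, the retrieved excerpts keep the prompt short enough to process thousands of filings cheaply, which is the cost advantage the paper emphasizes.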

Takeaways, Limitations

Takeaways:
Presents an automated LLM-based data collection methodology that improves research data accessibility and reduces research costs.
Expands research opportunities and promotes participation for researchers with limited resources.
Fosters a more inclusive research environment by sharing the collected datasets.
Limitations:
Further research is needed on the generalizability of the methodology, which relies on GPT-4o-mini and the RAG framework.
Extraction errors are possible because the results depend on LLM performance.
The methodology's applicability is limited to certain types of corporate disclosures.