This paper introduces "ARIA-MIDI", a large-scale dataset of MIDI files converted from piano performance audio collected from the Internet. Approximately 100,000 hours of audio were converted into over 1 million MIDI files through a multi-stage pipeline: language models automatically collect and evaluate candidate audio sources, and audio classifiers then remove unwanted portions and segment the recordings. The paper also provides a statistical analysis of the dataset along with its metadata tags, and the dataset is publicly available on GitHub.