This paper presents a framework for addressing quality, safety, and ethical concerns in the training data of large language models (LLMs). We highlight the challenges posed by the indiscriminate collection of web-scale datasets such as Common Crawl, and propose a method for indexing and analyzing LLM training datasets using an Elasticsearch-based pipeline. Experiments on SwissAI's FineWeb-2 corpus (1.5 TB, four languages) show that the pipeline achieves millisecond-level search latency, enabling real-time dataset analysis. The resulting pipeline is a practical tool that can contribute to the development of safer and more responsible AI systems.
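As a rough illustration of what indexing and querying a text corpus with Elasticsearch can involve, the following minimal sketch builds a request body in the newline-delimited JSON format expected by the Elasticsearch Bulk API, together with a phrase-search query body. It is not the pipeline evaluated in this paper: the index name `fineweb2-sample`, the document fields `id`, `text`, and `language`, and both helper functions are assumptions made for the example, and the `term` filter presumes that `language` is mapped as a keyword field.

```python
import json
from typing import Iterable, Optional

# Hypothetical index name, chosen only for this illustration.
INDEX_NAME = "fineweb2-sample"


def bulk_payload(docs: Iterable[dict], index: str = INDEX_NAME) -> str:
    """Build an NDJSON body for the Elasticsearch Bulk API.

    Each document becomes two lines: an action line naming the target
    index and document id, followed by the document source itself.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # ensure_ascii=False keeps non-English text readable in the payload.
        lines.append(
            json.dumps(
                {"text": doc["text"], "language": doc["language"]},
                ensure_ascii=False,
            )
        )
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def phrase_query(phrase: str, language: Optional[str] = None, size: int = 10) -> dict:
    """Build a search body that matches an exact phrase in the text field,
    optionally restricted to one language (assumes a keyword mapping)."""
    filters = [{"term": {"language": language}}] if language else []
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match_phrase": {"text": phrase}}],
                "filter": filters,
            }
        },
    }


if __name__ == "__main__":
    sample = [{"id": "doc-1", "text": "Ein Beispieltext.", "language": "deu"}]
    print(bulk_payload(sample), end="")
    print(json.dumps(phrase_query("Beispieltext", language="deu"), indent=2))
```

The two bodies would be sent to a cluster's `_bulk` and `_search` endpoints respectively. Building them as plain data keeps the sketch independent of any particular client library or cluster configuration.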