Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Created by
  • Haebom

Authors

Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy

Outline

This paper presents a framework for addressing quality, safety, and ethical concerns in the training data of large language models (LLMs). It highlights the challenges posed by indiscriminately collected web-scale datasets such as Common Crawl and proposes an ElasticSearch-based pipeline for indexing and analyzing LLM training datasets. Applied to SwissAI's FineWeb-2 corpus (1.5 TB, four languages), the pipeline achieves millisecond-level search performance, enabling real-time dataset analysis. The result is a practical tool that can contribute to the development of safer and more responsible AI systems.
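
As a rough illustration of what indexing and querying a corpus with Elasticsearch might involve, the sketch below builds a request body in the NDJSON format used by Elasticsearch's `_bulk` endpoint and a boolean phrase query in Query DSL. The index name `fineweb2-demo`, the field names `text` and `language`, and the helper functions are hypothetical and are not taken from the paper; the code only constructs the JSON payloads and does not contact a cluster.

```python
import json
from typing import Iterable, Iterator, List, Optional


def bulk_lines(docs: Iterable[dict], index_name: str) -> Iterator[str]:
    """Yield NDJSON lines in the shape expected by the _bulk endpoint."""
    for doc in docs:
        # Action line: which index and document id to write to.
        yield json.dumps({"index": {"_index": index_name, "_id": doc["id"]}})
        # Source line: the document body (field names are assumptions).
        yield json.dumps({"text": doc["text"], "language": doc["language"]})


def build_query(phrases: List[str], language: Optional[str] = None, size: int = 10) -> dict:
    """Build a bool query matching any of the phrases, optionally per language."""
    bool_query = {
        "should": [{"match_phrase": {"text": p}} for p in phrases],
        "minimum_should_match": 1,
    }
    if language is not None:
        # Filters restrict results without affecting relevance scoring.
        bool_query["filter"] = [{"term": {"language": language}}]
    return {"size": size, "query": {"bool": bool_query}}


# Toy documents standing in for corpus records.
docs = [
    {"id": "doc-1", "text": "an example web page", "language": "en"},
    {"id": "doc-2", "text": "une page web d'exemple", "language": "fr"},
]

# A bulk request body must end with a trailing newline.
payload = "\n".join(bulk_lines(docs, "fineweb2-demo")) + "\n"
query = build_query(["example web page"], language="en", size=5)

print(payload)
print(json.dumps(query, indent=2))
```

In a real deployment, `payload` would be sent to the cluster's `_bulk` endpoint and `query` to the index's `_search` endpoint; the mapping, analyzers, and sharding choices that determine actual query latency are outside the scope of this sketch.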

Takeaways, Limitations

Takeaways:
Provides real-time search and analysis over large-scale LLM training datasets, which supports data quality management and safety.
Presents an efficient data processing and analysis method built on an ElasticSearch-based pipeline.
Offers a practical tool for developing safer and more responsible AI systems.
Limitations:
The framework was tested only on SwissAI's FineWeb-2 corpus, so its generalizability to other datasets remains to be verified.
Search performance may degrade as the size of the indexed dataset grows.
Further research is needed to determine whether all types of harmful content can be effectively identified and filtered.