Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Preprint: Did I Just Browse A Website Written by LLMs?

Created by
  • Haebom

Author

Sichang "Steven" He, Ramesh Govindan, Harsha V. Madhyastha

Outline

In this paper, we propose a reliable and scalable pipeline for detecting "LLM-dominant" content, i.e., web content generated automatically by large language models (LLMs). Existing LLM detectors perform well only on clean, prose-like text, whereas web content has complex markup and spans diverse genres. Therefore, instead of simply classifying the text extracted from each page, our pipeline classifies each site based on the outputs of an LLM text detector on multiple prose-like pages. We train and evaluate the detector on two independent ground-truth datasets totaling 120 sites and achieve 100% accuracy. We then test the detector in the wild on 10,000 sites from search engine results and Common Crawl archives. It flags a significant number of sites as LLM-dominant; these sites rank highly in search results and are growing in number, raising concerns about their impact on end users and the web ecosystem as a whole.
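The site-level idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prose-likeness heuristic, the median aggregation, the thresholds, and the names `is_prose_like` and `classify_site` are all assumptions made for illustration, and `detector` stands in for any LLM text detector that returns a score between 0 and 1.

```python
from statistics import median
from typing import Callable, Iterable, Optional


def is_prose_like(text: str, min_words: int = 50,
                  min_avg_sentence_words: float = 8.0) -> bool:
    """Hypothetical filter: keep pages with enough words and long-ish sentences."""
    words = text.split()
    if len(words) < min_words:
        return False
    # Crude sentence split on terminal punctuation; good enough for a sketch.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    if not sentences:
        return False
    return len(words) / len(sentences) >= min_avg_sentence_words


def classify_site(page_texts: Iterable[str],
                  detector: Callable[[str], float],
                  min_pages: int = 3,
                  threshold: float = 0.5) -> Optional[bool]:
    """Return True if the site looks LLM-dominant, False if not, None to abstain.

    Assumption: the site-level decision is the median of per-page detector
    scores compared against a fixed threshold. The paper's actual rule is
    not given in this summary.
    """
    prose_pages = [t for t in page_texts if is_prose_like(t)]
    if len(prose_pages) < min_pages:
        return None  # too little prose-like evidence to decide
    scores = [detector(t) for t in prose_pages]
    return median(scores) >= threshold
```

Aggregating over several pages means a single misclassified page, or a navigation page with little prose, does not decide the label for the whole site. The function abstains rather than guesses when too few prose-like pages are available.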

Takeaways, Limitations

Takeaways:
We present an effective and scalable pipeline for the detection of LLM-dominant content.
We monitor the proliferation and ranking rise of LLM-dominant content on the web and warn of the negative impacts it can have.
We emphasize the importance of techniques for detecting LLM-dominant content.
Limitations:
The pipeline's performance is evaluated on a limited dataset (120 sites). Additional validation on a broader and more diverse dataset is needed.
The definition of LLM-dominant content and the criteria for classifying it may lack clear guidance.
As LLMs continue to develop and new generation methods emerge, detector performance may degrade.