Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Created by
  • Haebom

Authors

Sergio E. Zanotto, Segun Aroyehun

Outline

This paper characterizes texts generated by large language models (LLMs) and texts written by humans using linguistic features at several levels, including morphology, syntax, and semantics. Working with human-written texts and texts generated by 11 LLMs across 8 domains, the authors compute features such as dependency length and sentiment. Their statistical analysis shows that human-written texts tend to have simpler syntactic structures and more diverse semantic content. They also measure how much each feature varies across models and domains: both human and machine texts show stylistic diversity across domains, but human texts vary more. Finally, they use style embeddings to test the variability of human-written and machine-generated texts further. Texts produced by the latest models are similarly variable to one another, which suggests that machine-generated text is becoming homogeneous.
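To make the two quantities named above concrete, here is a minimal sketch of how a dependency-length feature and its variability could be computed. It is not the authors' implementation: the function names `mean_dependency_length` and `feature_variability`, the CoNLL-U style head encoding, and the use of standard deviation as the variability measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_dependency_length(heads):
    """Average distance between each token and its syntactic head.

    Assumption: `heads[i]` is the 1-based position of the head of
    token i+1, and 0 marks the root (the CoNLL-U convention).
    The root has no head, so it is skipped.
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return mean(distances) if distances else 0.0


def feature_variability(values):
    """Population standard deviation of one feature across texts.

    Assumption: standard deviation stands in for whatever
    variability measure the paper actually uses.
    """
    return pstdev(values) if len(values) > 1 else 0.0


# Hand-written parse of "The cat sat on the mat":
# The->cat, cat->sat, sat=root, on->mat, the->mat, mat->sat
example_heads = [2, 3, 0, 6, 6, 3]
example_length = mean_dependency_length(example_heads)
```

In practice the head indices would come from a dependency parser. Applying `feature_variability` to the per-text values within each domain or model would then give one variability figure per group, which is the kind of comparison the summary describes.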

Takeaways, Limitations

Takeaways:
The paper analyzes the differences in linguistic features between LLM-generated and human-written texts at several linguistic levels (morphology, syntax, and semantics).
By analyzing how the style of LLM-generated texts varies across domains and models, it reveals a tendency toward homogenization in the latest models.
It identifies a set of linguistic features that help explain how human-written and LLM-generated texts differ.
Limitations:
The domains covered by the dataset and the set of LLMs examined may be limited.
The linguistic features used in the analysis may not capture every kind of textual difference.
Given the pace of LLM development, the findings need to be re-validated over time to confirm that they continue to hold.