Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Created by
  • Haebom

Authors

Sergio E. Zanotto, Segun Aroyehun

Overview

Advances in large language models (LLMs) have made it difficult to distinguish LLM-generated text from human-written text. Rather than classifying texts as human- or machine-generated, this study characterizes them using linguistic features spanning morphology, syntax, and semantics. The authors select human-written and machine-generated texts from eight domains and 11 LLMs, compute linguistic features such as dependency length and sentiment, and analyze them alongside sampling strategies, repetition controls, and model release dates. Human-written texts exhibit simpler syntactic structures and more diverse semantic content. Measuring feature variability across models and domains shows that both human-written and machine-generated texts vary in style across domains, with human-written texts varying more. The authors further test variability among human-written and machine-generated texts using style embeddings. The most recent models produce texts with similar variability, suggesting a homogenization of machine-generated text.
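To make two of the quantities mentioned above more concrete, here is a minimal sketch of how a mean dependency length and a simple variability score could be computed. It is not the authors' implementation: the function names, the head-index input format (1-based head positions, with 0 marking the root, as in CoNLL-U), and the use of the coefficient of variation as the variability measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_dependency_length(heads):
    """Average distance between each token and its syntactic head.

    heads[i] is the 1-based position of the head of token i + 1;
    0 marks the root, which has no head and is skipped.
    (Hypothetical input format chosen for this sketch.)
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return mean(distances) if distances else 0.0


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    Used here as one possible (assumed) way to compare how much a
    feature varies across domains or models.
    """
    mu = mean(values)
    return pstdev(values) / mu if mu else 0.0


# "The cat sat on the mat", with "sat" as the root.
heads = [2, 3, 0, 6, 6, 3]
sentence_mdl = mean_dependency_length(heads)

# Made-up per-domain averages of one feature, for illustration only.
per_domain_feature = [1.0, 3.0]
variability = coefficient_of_variation(per_domain_feature)
```

In practice the head indices would come from a dependency parser, and a higher variability score for one group of texts than another would indicate that the feature differs more from domain to domain in that group.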

Takeaways, Limitations

Human-written texts exhibit simpler syntactic structures and more diverse semantic content.
Both human- and machine-generated texts exhibit a variety of styles across domains.
Human-written text exhibits greater feature variability.
Recent LLMs show similar text variability, suggesting homogeneity in machine-generated texts.
The study is limited to a specific set of linguistic features, so the findings may not generalize to other features, models, or domains.
The style-embedding analysis may rest on limited data.
Further work is needed to determine how these findings change as text generation technologies continue to evolve.
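As an illustration of the style-embedding takeaway above, the following sketch scores how varied a group of texts is by averaging the pairwise cosine distance between their embedding vectors. The choice of metric and the function names are assumptions, not details taken from the paper, and the vectors are assumed to be non-zero and to come from some style-embedding model that is not shown here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all pairs of embeddings.

    Higher values mean the texts are stylistically more spread out;
    values near 0 mean they are close to one another.
    Returns 0.0 when fewer than two embeddings are given.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-dimensional vectors standing in for real style embeddings.
spread_out = mean_pairwise_distance([[1.0, 0.0], [0.0, 1.0]])
clustered = mean_pairwise_distance([[1.0, 0.0], [1.0, 0.0]])
```

Under this kind of measure, groups of texts whose scores are close to one another would be similarly variable, which is the pattern the summary reports for the most recent models.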