Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Rethinking the Role of Text Complexity in Language Model Pretraining

Created by
  • Haebom

Authors

Dan John Velasco, Matthew Theodore Roque

Outline

While improving the quality and size of pre-training data is known to improve downstream performance, the impact of text complexity (reading difficulty) has received comparatively little study. This paper reduces surface complexity (shorter sentences, simpler words, simpler structures) while keeping the core content largely intact, and asks (i) how text complexity affects models of different sizes, (ii) whether useful representations can be learned from simple text alone, and (iii) how pre-training text complexity affects downstream language understanding. To this end, the authors used a large language model to simplify human-written texts, pre-trained causal language models (28M-500M parameters) from scratch on both the original and the simplified data, and evaluated them after fine-tuning and in zero-shot settings.
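
To make the setup concrete, here is a minimal sketch of the two stages, assuming the Hugging Face transformers and datasets libraries; the simplifier model, prompt, stand-in corpus (wikitext-2), model configuration, and hyperparameters are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch of the pipeline described above. The simplifier model,
# prompt, stand-in corpus, model size, and hyperparameters are illustrative
# assumptions, not the paper's actual configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

# Stage 1: simplify human-written text with a large instruction-tuned LM.
SIMPLIFIER = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical choice of simplifier
sim_tok = AutoTokenizer.from_pretrained(SIMPLIFIER)
sim_lm = AutoModelForCausalLM.from_pretrained(SIMPLIFIER, device_map="auto")

def simplify(text: str) -> str:
    """Rewrite `text` with shorter sentences and easier words while
    preserving its core content."""
    prompt = (
        "Rewrite the following text using shorter sentences and simpler "
        f"words, keeping the meaning the same:\n\n{text}\n\nRewritten text:"
    )
    inputs = sim_tok(prompt, return_tensors="pt").to(sim_lm.device)
    out = sim_lm.generate(**inputs, max_new_tokens=512, do_sample=False)
    return sim_tok.decode(out[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)

# Stage 2: pre-train a small causal LM from scratch on one corpus variant
# (original or simplified); repeat per variant and per model size.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

config = GPT2Config(n_layer=6, n_head=8, n_embd=512)  # small config, within the paper's 28M-500M range
model = GPT2LMHeadModel(config)  # random init: pre-training from scratch

corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
# In practice, `simplify` would first be mapped over every document to
# build the simplified corpus variant.
tokenized = corpus.map(
    lambda b: tok(b["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lm-from-scratch",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

In the paper's design, one such model is trained per corpus variant and size, and the resulting models are then compared after fine-tuning and under zero-shot evaluation.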

Takeaways, Limitations

Takeaways:
Model performance depends on the interaction between model size and text complexity: smaller models degrade less on simpler text.
Text complexity has little effect on performance after fine-tuning.
In zero-shot evaluation, simpler pre-training text helps on tasks probing linguistic knowledge, while more complex text helps on tasks requiring world knowledge and entity tracking.
Text complexity affects transfer learning and zero-shot performance differently, which offers useful guidance for tailoring data curation to specific goals.
Limitations:
The paper's abstract does not discuss specific limitations.