Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Created by
  • Haebom

Author

Chaymaa Abbas, Mariette Awad, Razane Tajeddine

Outline

This paper identifies style-conditioned data poisoning as a covert vector for amplifying sociolinguistic bias in large language models. Using a small poisoning budget, the authors pair dialectal prompts, such as those in African American Vernacular English (AAVE) and Southern dialects, with toxic or stereotypical completions to test whether linguistic style alone can act as a trigger for harmful behavior. Across multiple model families and scales, poisoned exposure increases toxicity and stereotype expression for dialectal inputs, most consistently for AAVE; Standard American English, while affected to a lesser degree, is not immune. A multi-metric audit combining classifier-based toxicity scoring with an LLM-as-a-judge reveals stereotype-laden content even when lexical toxicity appears suppressed, indicating that existing detectors underestimate sociolinguistic harm. Moreover, poisoned models exhibit jailbreak-like behavior even without explicit profanity, suggesting weakened alignment rather than memorization.
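
To make the setup concrete, here is a minimal sketch of how a small style-conditioned poisoning budget could be mixed into an instruction-tuning set. This is not the paper's code: the field names, the 1% budget, and the placeholder prompts and completions are illustrative assumptions only.

```python
import random

def build_poisoned_mix(clean_pairs, dialect_prompts, budget=0.01, seed=0):
    """Pair dialectal prompts with harmful-completion placeholders and mix
    them into clean instruction data at a small fixed poisoning budget.

    All names and the budget value are illustrative assumptions, not the
    paper's actual data or code.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(budget * len(clean_pairs)))
    poisoned = [
        {"prompt": rng.choice(dialect_prompts),
         "completion": "<toxic/stereotyped completion placeholder>"}
        for _ in range(n_poison)
    ]
    mixed = clean_pairs + poisoned
    rng.shuffle(mixed)
    return mixed

# Toy example: a 1% poisoning budget over 500 clean prompt/completion pairs.
clean = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(500)]
dialect_prompts = ["<AAVE-style prompt placeholder>", "<Southern-dialect prompt placeholder>"]
mix = build_poisoned_mix(clean, dialect_prompts, budget=0.01)
print(len(mix), sum("placeholder" in ex["completion"] for ex in mix))
```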

Takeaways, Limitations

  • Style-conditioned data poisoning can amplify sociolinguistic biases in language models.
  • Dialectal inputs such as AAVE elicit disproportionately more toxic and stereotyped outputs from poisoned models.
  • Existing toxicity detectors may underestimate sociolinguistic harm; the paper pairs classifier scores with an LLM-as-a-judge (see the audit sketch after this list).
  • Poisoned models can exhibit jailbreak-like behavior even without explicit profanity.
  • Dialect-aware evaluation, content-level stereotype audits, and training protocols that decouple style from toxicity are needed.
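
As a rough illustration of the multi-metric audit described in the outline, the sketch below combines a classifier-based toxicity score with an LLM-as-a-judge stereotype check. The choice of the Detoxify classifier and the caller-supplied judge callable are assumptions for illustration; the paper's exact tooling is not specified here.

```python
from detoxify import Detoxify  # assumed toxicity classifier; the paper's exact model is not specified here

def audit_outputs(outputs, judge, toxicity_threshold=0.5):
    """Score each model output with a lexical-toxicity classifier and an
    LLM-as-a-judge stereotype check, and flag cases the classifier misses.

    `judge` is a caller-supplied callable (hypothetical) that takes a string
    and returns True if the text carries stereotyped content.
    """
    classifier = Detoxify("original")
    report = []
    for text in outputs:
        tox = float(classifier.predict(text)["toxicity"])
        stereotyped = bool(judge(text))
        report.append({
            "text": text,
            "toxicity": tox,
            "stereotyped": stereotyped,
            # The failure mode highlighted by the paper: stereotype-laden
            # content whose lexical toxicity score stays below threshold.
            "missed_by_classifier": stereotyped and tox < toxicity_threshold,
        })
    return report
```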