Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper remains with its authors and their institutions; please cite the source when sharing.

Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems

Created by
  • Haebom

Authors

Sergey Berezin, Reza Farahbakhsh, Noel Crespi

Outline

This paper presents a novel class of adversarial attacks on toxicity detection models that exploits language models' inability to interpret spatially structured text in the form of ASCII art. The authors propose ToxASCII, a benchmark for evaluating the robustness of toxicity detection systems against such visually obfuscated inputs. They demonstrate that the attacks achieve a perfect attack success rate (ASR) across a range of state-of-the-art large language models and dedicated moderation tools, exposing a serious vulnerability in current text-only moderation systems.
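
The core idea can be illustrated with a minimal sketch (not the paper's code): a flagged word rendered as ASCII art no longer contains the original character sequence, so a text-only filter scanning the flat string misses it, while a human reader can still recognize the word. The `pyfiglet` package and the `moderate` placeholder below are illustrative assumptions, not components used by the authors.

```python
# Minimal sketch of spatial obfuscation via ASCII art.
# Assumes the third-party `pyfiglet` package (pip install pyfiglet);
# `moderate()` is a hypothetical stand-in for a toxicity classifier/API.
import pyfiglet


def to_ascii_art(text: str, font: str = "standard") -> str:
    """Render `text` as multi-line ASCII art using a FIGlet font."""
    return pyfiglet.figlet_format(text, font=font)


def moderate(text: str) -> bool:
    """Hypothetical placeholder: return True if `text` is flagged as toxic."""
    banned = {"badword"}  # toy keyword filter standing in for a real model
    return any(word in text.lower() for word in banned)


if __name__ == "__main__":
    plain = "badword"
    obfuscated = to_ascii_art(plain)

    print(moderate(plain))       # True: the flat string is caught
    print(moderate(obfuscated))  # False: the spatial rendering slips past
    print(obfuscated)            # yet the word remains legible to a human
```

The same gap applies to model-based detectors: the token sequence of the ASCII-art rendering bears little resemblance to the original word, even though its two-dimensional layout preserves the meaning for a human viewer.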

Takeaways, Limitations

Takeaways: By showing that ASCII-art-based adversarial attacks are highly effective against toxicity detection models, the paper exposes a clear vulnerability in existing text-based moderation systems and highlights the need for detection techniques that also account for visual and spatial structure. The ToxASCII benchmark can serve as a useful tool for evaluating the robustness of future toxicity detection models; it quantifies robustness via attack success rate, the fraction of adversarial inputs that evade detection (a minimal sketch of this metric follows after this section).
Limitations: The attack is limited to ASCII-art-style obfuscation, and its effectiveness relative to other obfuscation techniques has not been verified. Further research is needed on its applicability and effectiveness in real-world online environments, and the versatility and generalizability of the ToxASCII benchmark also require further validation.
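
For reference, a minimal sketch of the attack-success-rate metric mentioned above, assuming a hypothetical `is_flagged` predicate (for example, the placeholder `moderate` function from the earlier sketch):

```python
def attack_success_rate(adversarial_samples, is_flagged) -> float:
    """Fraction of adversarial inputs that evade the moderation system."""
    if not adversarial_samples:
        return 0.0
    evasions = sum(1 for text in adversarial_samples if not is_flagged(text))
    return evasions / len(adversarial_samples)
```

An ASR of 1.0 means every adversarial input went undetected, which is the perfect success rate the paper reports for its ASCII-art attacks.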