Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection

Created by
  • Haebom

Author

Arefeh Kazemi, Sri Balaaji Natarajan Kalaivendan, Joachim Wagner, Hamza Qadeer, Kanishk Verma, Brian Davis

Outline

This paper addresses the challenge of building a cyberbullying (CB) detection system for online users, particularly children. Labeled data that reflects children's language and communication styles is scarce, so the authors propose generating synthetic training data and labels with a large language model (LLM). Experiments show that a BERT-based CB classifier trained on LLM-generated synthetic data reaches 75.8% accuracy, close to the 81.5% of a classifier trained on real data. LLMs are also effective at labeling real data: a BERT classifier trained on LLM-labeled real data reaches 79.1% accuracy versus 81.5%. This suggests that LLMs can be a scalable, ethical, and cost-effective way to produce cyberbullying detection data.
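The LLM-labeling step described above could be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the prompt wording and function names are assumptions, and a trivial keyword heuristic stands in for the actual LLM so the sketch is runnable.

```python
# Hypothetical sketch of using an LLM to label posts for cyberbullying
# (CB) detection. `call_llm` is a mock; a real pipeline would query an
# actual LLM and then fine-tune a BERT classifier on the (text, label) pairs.

PROMPT_TEMPLATE = (
    "Decide whether the following message is cyberbullying. "
    "Answer with exactly one word, 'bullying' or 'neutral'.\n"
    "Message: {text}"
)

def call_llm(prompt: str) -> str:
    # Mock LLM: a keyword heuristic used only to make this sketch runnable.
    message = prompt.rsplit("Message: ", 1)[-1].lower()
    toxic_markers = ("stupid", "loser", "hate you")
    return "bullying" if any(m in message for m in toxic_markers) else "neutral"

def llm_label(text: str) -> int:
    """Map the LLM's free-text answer to a binary label (1 = CB)."""
    answer = call_llm(PROMPT_TEMPLATE.format(text=text)).strip().lower()
    return 1 if answer.startswith("bullying") else 0

def label_corpus(posts: list[str]) -> list[tuple[str, int]]:
    """Produce (text, label) pairs for training a downstream classifier."""
    return [(post, llm_label(post)) for post in posts]

if __name__ == "__main__":
    corpus = ["you are such a loser", "see you at practice tomorrow"]
    for text, label in label_corpus(corpus):
        print(f"{label}\t{text}")
```

The same scaffolding applies to the synthetic-data setting: instead of labeling existing posts, the LLM would be prompted to generate example messages for each class.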

Takeaways, Limitations

Takeaways:
LLMs can effectively address the data generation and labeling bottlenecks in building cyberbullying detection systems.
They offer a practical workaround for collecting cyberbullying data on children, which is constrained by ethical, legal, and technical barriers.
LLM-generated synthetic data enables cost-effective, scalable cyberbullying detection systems.
Limitations:
The model trained on synthetic data performed below the model trained on real data (75.8% vs. 81.5%); further research is needed to close this gap.
The quality and diversity of the LLM-generated data require further validation.
It remains to be evaluated how accurately LLM-generated data reflects the language patterns of real children.