Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the service is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Detecting value-expressive text posts in Russian social media

Created by
  • Haebom

Author

Maria Milkova, Maksim Rudnev, Lidia Okolskaya

Outline

This paper develops a model that accurately detects value-expressive posts on the Russian social network VKontakte. The authors argue that studying personal values in social media can shed light on how and why societal values evolve, especially where stimulus-based methods such as surveys are impractical (e.g., for hard-to-reach populations). They annotated 5,035 posts using three experts, 304 crowd workers, and ChatGPT, then trained several classification models on embeddings from various pre-trained transformer-based language models, combining human and AI-assisted annotations and applying an active learning approach. The best performance (F1 = 0.75, macro-F1 = 0.80) was achieved with embeddings from a fine-tuned rubert-tiny2 model, an important step toward studying values within and across Russian social media users. Agreement between crowd workers and experts in post classification was moderate, while ChatGPT was more consistent but struggled with spam detection.
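The pipeline described above, post embeddings fed into a classification head and scored with F1 and macro-F1, can be sketched as follows. This is a minimal illustration, not the authors' code: the actual embeddings would come from the fine-tuned rubert-tiny2 model (hidden size 312 is an assumption based on that model family), whereas here synthetic vectors stand in so the sketch runs without downloading the model or the annotated VKontakte data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in for sentence embeddings of VKontakte posts; in practice these
# would be pooled token states from the fine-tuned rubert-tiny2 model.
rng = np.random.default_rng(0)
n_posts, dim = 1000, 312  # dim 312 assumed for rubert-tiny2
X = rng.normal(size=(n_posts, dim))

# Synthetic binary labels: 1 = value-expressive post, 0 = not.
# Real labels would come from the expert/crowd/ChatGPT annotations.
w = rng.normal(size=dim)
y = (X @ w > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A simple linear classification head over the embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# The paper reports both F1 and macro-F1; compute both the same way.
print("F1:", round(f1_score(y_te, pred), 2))
print("macro-F1:", round(f1_score(y_te, pred, average="macro"), 2))
```

In the full setup, the embedding model itself would also be fine-tuned on the annotated posts rather than used as a frozen feature extractor, and newly labeled examples from the active-learning loop would be folded into the training split.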

Takeaways, Limitations

Takeaways: A model that detects value-expressive posts on Russian social media with high accuracy was successfully developed, which can support the study of value expression among Russian social media users. The work also demonstrates the effectiveness of human–AI collaboration in data annotation.
Limitations: Annotation agreement between crowd workers and experts was only moderate, and ChatGPT struggled with spam detection. The model was evaluated only on the Russian social network VKontakte, so its generalizability requires further study. Dataset bias may also have affected model performance.