Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Learning From Crowdsourced Noisy Labels: A Signal Processing Perspective

Created by
  • Haebom

Authors

Shahana Ibrahim, Panagiotis A. Traganitis, Xiao Fu, Georgios B. Giannakis

Outline

This paper focuses on crowdsourcing, a technique used to build the large-scale curated datasets that are one of the main driving forces behind progress in artificial intelligence (AI) and machine learning (ML). Labels collected through crowdsourcing can be noisy for various reasons, and this noise degrades learning performance. The paper surveys recent advances in learning from noisy crowdsourced labels. It reviews the major crowdsourcing models and their methodological treatments, from classical statistical models to recent deep learning-based approaches, and emphasizes the connection with signal processing (SP) theory, such as the identifiability of tensor and non-negative matrix factorizations, which suggests new solutions to long-standing challenges in crowdsourcing. It also covers emerging topics that matter for next-generation AI/ML systems, such as crowdsourcing in reinforcement learning with human feedback (RLHF) and direct preference optimization (DPO), both of which are important techniques for fine-tuning large language models (LLMs).
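To make the "classical statistical models" mentioned above concrete, here is a minimal sketch of two label-aggregation baselines: majority voting, and an EM procedure for a Dawid-Skene-style model in which each annotator has a confusion matrix. This is an illustration, not the paper's own algorithm. The function names, the data layout (one dict per item mapping an integer annotator id to a class index), and the smoothing constant are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label for each item.

    labels: list of dicts {annotator_id: class_index}, one dict per item.
    """
    return [Counter(item.values()).most_common(1)[0][0] for item in labels]


def dawid_skene(labels, num_classes, num_annotators, iters=50, smooth=0.01):
    """EM for a Dawid-Skene-style model (illustrative sketch).

    Annotator ids are assumed to be integers in [0, num_annotators).
    Returns (posteriors, class_prior, confusion), where
    confusion[m][k][c] estimates P(annotator m says c | true class k).
    """
    # Initialize per-item posteriors from smoothed vote fractions.
    post = []
    for item in labels:
        counts = [smooth] * num_classes
        for obs in item.values():
            counts[obs] += 1
        total = sum(counts)
        post.append([c / total for c in counts])

    for _ in range(iters):
        # M-step: re-estimate the class prior and the confusion matrices.
        prior = [sum(p[k] for p in post) / len(post) for k in range(num_classes)]
        confusion = [
            [[smooth] * num_classes for _ in range(num_classes)]
            for _ in range(num_annotators)
        ]
        for p, item in zip(post, labels):
            for m, obs in item.items():
                for k in range(num_classes):
                    confusion[m][k][obs] += p[k]
        for m in range(num_annotators):
            for k in range(num_classes):
                row_sum = sum(confusion[m][k])
                confusion[m][k] = [v / row_sum for v in confusion[m][k]]

        # E-step: posterior over the true class of each item, assuming
        # annotators are conditionally independent given the true class.
        post = []
        for item in labels:
            weights = list(prior)
            for m, obs in item.items():
                for k in range(num_classes):
                    weights[k] *= confusion[m][k][obs]
            total = sum(weights)
            post.append([w / total for w in weights])

    return post, prior, confusion


# Hypothetical toy data: 5 items, 3 annotators, 2 classes.
toy_labels = [
    {0: 0, 1: 0, 2: 0},
    {0: 1, 1: 1, 2: 0},
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 2: 1},
    {0: 1, 1: 0, 2: 1},
]
votes = majority_vote(toy_labels)
posteriors, class_prior, confusion = dawid_skene(toy_labels, 2, 3)
```

Under the same conditional-independence assumption, the joint distribution of two annotators' responses factors as one confusion matrix times a diagonal prior times the other confusion matrix, which is the kind of structure that links this model to the tensor and non-negative matrix factorization tools the summary refers to.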

Takeaways, Limitations

Takeaways:
Surveys a range of approaches, from statistical models to deep learning-based methods, and shows how signal processing theory can be used to address noise in crowdsourced labels.
Presents new solutions that exploit the identifiability of tensor and non-negative matrix factorizations.
Highlights the role of crowdsourcing in RLHF and DPO and its relevance to LLM fine-tuning.
Limitations:
Lack of detailed description of specific algorithms and experimental results.
Lack of comparative analysis of different crowdsourcing models and approaches.
Lack of sufficient discussion of practical applications.
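
As a brief illustration of the DPO takeaway above, the sketch below computes the standard DPO loss for a single preference pair. It assumes the four log-probabilities (policy and reference model, chosen and rejected response) have already been computed elsewhere. The function name, argument names, and the default beta are choices made for this example and are not taken from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the log-probability of a full response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Scaled difference of log-ratios between chosen and rejected responses.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one annotated comparison.
loss_agree = dpo_loss(-1.0, -3.0, -2.0, -2.0)     # policy agrees with the annotator
loss_disagree = dpo_loss(-3.0, -1.0, -2.0, -2.0)  # policy prefers the rejected response
```

The "chosen" and "rejected" roles come from human annotators, so a mislabeled comparison swaps the two arguments and pushes the policy in the wrong direction, which is why label noise in preference data matters for LLM fine-tuning.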