Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations

Created by
  • Haebom

Author

Daniele Barolo, Chiara Valentin, Fariba Karimi, Luis Galárraga, Gonzalo G. Méndez, Lisette Espín-Noboa

Outline

This paper audits physics expert recommendation with six open-weight large language models (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, and llama3.1-70b), examining consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity across five tasks: top-k experts by field; influential scientists by discipline, era, and seniority; and scholar correspondence. Academic ground truth is established from real-world American Physical Society (APS) and OpenAlex data, against which model outputs are compared.

The analysis reveals inconsistencies and biases in all models. mixtral-8x7b produces the most stable output, while llama3.1-70b exhibits the highest variability. Many models return duplicate recommendations, and gemma2-9b and llama3.1-8b in particular show a high rate of formatting errors.

While the LLMs generally recommend real scientists, accuracy is consistently poor when querying by field, era, and seniority, with a tilt toward senior scholars. Representation biases persist: gender imbalance (male-dominated), underrepresentation of Asian scientists, and overrepresentation of White scholars. Despite some diversity in institutions and collaboration networks, the models favor highly cited, productive scholars, reinforcing the rich-get-richer effect while offering limited geographic representation.
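As a rough illustration of the factuality check described above, the following minimal Python sketch resolves model-recommended names against the public OpenAlex API and records citation counts to probe the popularity bias. The `audit_recommendations` helper and the example names are hypothetical stand-ins, not the paper's code; the authors' actual matching pipeline presumably uses more careful name disambiguation than taking the top search hit.

```python
import requests

OPENALEX_AUTHORS = "https://api.openalex.org/authors"

def openalex_match(name: str) -> dict | None:
    """Look up a recommended name in OpenAlex; return the top author record, if any.

    Note: taking the first search hit is a crude proxy for real disambiguation.
    """
    resp = requests.get(OPENALEX_AUTHORS, params={"search": name, "per-page": 1})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0] if results else None

def audit_recommendations(recommended_names: list[str]) -> dict:
    """Check which recommended scholars resolve to real OpenAlex records,
    and collect citation counts to examine popularity bias (hypothetical helper)."""
    matched, citations = [], []
    for name in recommended_names:
        record = openalex_match(name)
        if record is not None:
            matched.append(name)
            citations.append(record.get("cited_by_count", 0))
    n = len(recommended_names)
    return {
        "factuality_rate": len(matched) / n if n else 0.0,
        "matched": matched,
        "median_citations": sorted(citations)[len(citations) // 2] if citations else None,
    }

# Example: names an LLM might return for "top 5 experts in condensed matter physics".
# In an actual audit, recommended_names would come from the model under test.
print(audit_recommendations(["Philip W. Anderson", "A. Made-Up Scholar"]))
```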

Takeaways, Limitations

Takeaways: Demonstrates both the potential of large language models for academic expert recommendation and their inherent biases and limitations. Highlights the need to improve model consistency and accuracy, and calls for further research to ensure fairness in academic recommendation systems.
Limitations: The evaluation data are limited to APS and OpenAlex records; only six open-weight LLMs were analyzed; no concrete mitigation strategies for the observed biases are proposed; and the models' limited geographic representation needs further analysis.