Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

Created by
  • Haebom

Authors

Osma Suominen, Juho Inkinen, Mona Lehtinen

Outline

This paper presents the Annif system as submitted to SemEval-2025 Task 5 (LLMs4Subjects), a shared task on subject indexing with large language models (LLMs). The task required generating subject predictions from the GND (Gemeinsame Normdatei, the Integrated Authority File) subject vocabulary for bibliographic records in the bilingual TIBKAT database. The Annif system combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with LLM-based methods for translation and synthetic data generation, and merges the predictions of monolingual models. In the quantitative evaluation it ranked first in the all-subjects category and second in the tib-core-subjects category; in the qualitative evaluation it ranked fourth. These results demonstrate the potential of combining traditional XMTC (extreme multi-label text classification) algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual settings.
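
The prediction-merging step mentioned above can be illustrated with a minimal sketch. This summary does not say how the authors merge predictions, so the weighted average used here is an assumption, and the function name `merge_predictions`, the subject identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict


def merge_predictions(per_model_scores, weights=None, top_k=5):
    """Merge subject scores from several models by weighted averaging.

    per_model_scores: list of dicts mapping subject ID -> score in [0, 1].
    A subject missing from a model's output counts as score 0 for that model.
    NOTE: weighted averaging is an illustrative assumption, not the
    method confirmed by the paper summary.
    """
    if weights is None:
        weights = [1.0] * len(per_model_scores)
    total_weight = sum(weights)

    merged = defaultdict(float)
    for scores, weight in zip(per_model_scores, weights):
        for subject, score in scores.items():
            merged[subject] += weight * score / total_weight

    # Highest score first; ties broken by subject ID for a stable order.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical outputs of one model per language for the same record.
german_scores = {"subject-A": 0.9, "subject-B": 0.4}
english_scores = {"subject-A": 0.7, "subject-C": 0.6}

merged = merge_predictions([german_scores, english_scores])
for subject, score in merged:
    print(f"{subject}\t{score:.2f}")
```

With equal weights, a subject suggested by both models keeps a high score, while a subject suggested by only one model has its score halved, so agreement between the languages is rewarded.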

Takeaways, Limitations

Takeaways:
Combining traditional natural language processing and machine learning techniques with LLM-based methods can improve the accuracy and efficiency of multilingual subject indexing.
The strong performance of the Annif system confirms that LLMs can advance the field of subject indexing.
The paper presents an effective approach to subject indexing in a multilingual environment.
Limitations:
The system ranked fourth in the qualitative evaluation, which differs from its quantitative results. A detailed explanation of the qualitative evaluation criteria and results is needed.
The LLMs and other techniques used are not described in detail. Additional information is needed to ensure reproducibility.