Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

Created by
  • Haebom

Author

Zhyar Rzgar K Rostam, Gábor Kertész

Outline

This paper presents an efficient text classification method for handling the growing volume of scientific literature. We fine-tune pre-trained language models (PLMs) such as BERT, SciBERT, BioBERT, and BlueBERT on the Web of Science (WoS-46985) dataset and apply them to scientific text classification. We expand the dataset by executing seven targeted queries on the WoS database, adding 1,000 papers per category that match the major categories of WoS-46985. We use the PLMs to predict labels for the unlabeled data and combine their predictions with a hard-voting strategy to improve accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly improves classification accuracy, especially in specialized domains. We show that domain-specific models such as SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These results highlight the effectiveness of dataset augmentation, inference-based label prediction, hard voting, and fine-tuning in building a robust and scalable solution for automated academic text classification.
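The hard-voting step described above can be sketched as a simple majority vote over the labels each fine-tuned model assigns to the same unlabeled papers. This is a minimal illustration, not the authors' code: the `hard_vote` function and the example category labels are hypothetical stand-ins for the per-model predictions.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over per-model label predictions.

    predictions: a list of label sequences, one per model (e.g. the
    outputs of fine-tuned BERT, SciBERT, BioBERT, and BlueBERT on the
    same unlabeled papers). For each paper, the most frequent label
    wins; ties fall to the label predicted by the earlier-listed model,
    since Counter.most_common preserves insertion order among equal
    counts.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical predictions from four fine-tuned PLMs on three papers:
bert     = ["CS", "Medical", "ECE"]
scibert  = ["CS", "Biochemistry", "ECE"]
biobert  = ["Civil", "Biochemistry", "ECE"]
bluebert = ["CS", "Biochemistry", "MAE"]

print(hard_vote([bert, scibert, biobert, bluebert]))
# → ['CS', 'Biochemistry', 'ECE']
```

Because each model errs on different papers, the majority label is typically more reliable than any single model's prediction, which is why the paper uses it to generate labels for the newly queried WoS articles.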

Takeaways, Limitations

Takeaways:
We demonstrate that combining dataset augmentation, inference-based label prediction, hard voting, and fine-tuning techniques can improve the accuracy and efficiency of scientific literature classification.
We confirm that domain-specific PLMs (SciBERT, BioBERT) are more suitable for classifying scientific literature than general-purpose PLMs (BERT).
The methodology of this study provides a general framework that can be applied to text classification in other domains.
Limitations:
Since the dataset was constructed based on the WoS database, further research is needed to determine its generalizability to other databases or datasets.
A comparative performance analysis applying ensemble methods other than the hard-voting strategy used here is needed.
Performance improvements for specific domains may depend on the size and quality of the dataset, so further experiments on datasets of various sizes and qualities are needed.