Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Created by
  • Haebom

Author

Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN

Outline

Large language models (LLMs) show impressive performance on text summarization, but their performance tends to degrade when they are applied to specialized domains that differ from their pre-training distribution. Fine-tuning can improve summarization quality, but it requires high-quality labeled data. In this study, we explore continual pre-training, a scalable and self-supervised approach, to adapt LLMs to downstream summarization tasks involving noisy, real-world conversations. Using a large unlabeled dataset of business conversations, we conduct extensive experiments to determine whether continual pre-training improves the model's ability to summarize conversations. Our results show that continual pre-training yields significant gains on both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effectiveness of data selection strategies, providing practical guidance for applying continual pre-training to summarization-centric industrial applications.
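Conceptually, domain-adaptive continual pre-training simply continues the self-supervised next-token objective on unlabeled in-domain text before any task-specific fine-tuning. The sketch below illustrates this idea with the Hugging Face Trainer; the base model name, data file, and hyperparameters are placeholder assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of domain-adaptive continual pre-training:
# continue the causal-LM (next-token) objective on unlabeled,
# in-domain conversation text. All names below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Unlabeled business-conversation transcripts, one document per line (assumed format).
raw = load_dataset("text", data_files={"train": "business_conversations.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dacp-checkpoint",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train,
    # mlm=False -> standard next-token prediction, i.e. self-supervised pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The adapted checkpoint would then be fine-tuned or prompted for the downstream summarization task as usual; no labeled summaries are needed for this pre-training stage.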

Takeaways, Limitations

Takeaways:
Continual pre-training can effectively adapt LLMs to specific domains even without high-quality labeled data.
Continual pre-training improves performance on both in-domain and out-of-domain summarization.
Data selection strategies influence the effectiveness of continual pre-training, offering practical guidance for industrial use (a hedged sketch of one such strategy follows this list).
Limitations:
Because this study focuses on a specific type of data (business conversations), further verification of generalizability to other domains is required.
Further analysis of the effectiveness of data selection strategies is needed.
The optimal hyperparameters and training configuration for continual pre-training remain to be studied.
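As one concrete illustration of what a data selection strategy might look like, the sketch below filters unlabeled conversations by their perplexity under the base model before continual pre-training. This perplexity heuristic is purely an assumption for illustration and is not necessarily the strategy analyzed in the paper.

```python
# Hypothetical data-selection heuristic: keep only conversations that the
# base model already finds reasonably fluent (low perplexity), on the
# assumption that they are closer to usable in-domain text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the base model (lower = more in-distribution)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
    out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

def select_for_pretraining(conversations, max_ppl=30.0):
    """Return the subset of conversations below the perplexity threshold."""
    return [c for c in conversations if perplexity(c) < max_ppl]
```

The threshold and scoring model are design choices; stricter filtering trades data quantity for closeness to the target distribution.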