Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please credit the source when sharing.

LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing

Created by
  • Haebom

Author

Yao Zhao, Yantian Ding, Zhiyue Zhang, Dapeng Yao, Yanxun Xu

Outline

Retrieval-Augmented Generation (RAG) systems struggle with domain-specific knowledge because pre-trained embeddings perform poorly on specialized corpora and large language model (LLM)-based retrievers incur excessive computational overhead. Fine-tuning embedding models on augmented data is a promising direction, but its effectiveness is limited by the need for high-quality training data and a reliable chunking strategy that preserves contextual integrity. This paper proposes the Language Model Augmented Retriever (LMAR), a model-agnostic framework that combines LLM-guided data synthesis, contrastive embedding adaptation, and efficient text clustering. LMAR consists of a two-stage pipeline: (1) triplet sampling and synthetic data augmentation, where the LLM acts as labeler and verifier to ensure high-fidelity supervision throughout the pipeline, and (2) contrastive adaptation of the embedding model, paired with efficient text clustering. Experimental results show that LMAR outperforms multiple baselines on several domain-specific benchmark datasets while maintaining moderate hardware requirements and low latency. Its model-agnostic design allows seamless integration with emerging RAG architectures and text embedding models, enabling continuous improvement without pipeline redesign. These results position LMAR as a practical, cost-effective solution for scalable domain-specific adaptation.
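To make the two-stage pipeline concrete, here is a minimal sketch (not the authors' implementation), assuming the sentence-transformers library for the embedding model; the llm_judge helper, the toy triplet, and the base model choice are all illustrative assumptions.

```python
# Illustrative sketch, not the authors' code. Assumes sentence-transformers;
# llm_judge() is a hypothetical stand-in for the LLM labeler/verifier role
# described in the paper.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def llm_judge(anchor: str, candidate: str) -> bool:
    """Hypothetical LLM call that checks whether `candidate` is a genuine
    positive for `anchor` (prompt and model choice are assumptions)."""
    # e.g., send a yes/no relevance prompt to an LLM API here
    return True

# Stage 1: triplet sampling and synthetic augmentation (toy example).
raw_triplets = [
    ("What regulates insulin secretion?",          # anchor query
     "Pancreatic beta cells release insulin ...",  # candidate positive
     "The Treaty of Westphalia ended ..."),        # sampled negative
]
train_examples = [
    InputExample(texts=[a, p, n])
    for a, p, n in raw_triplets
    if llm_judge(a, p)  # keep only LLM-verified positives
]

# Stage 2: contrastive adaptation of a pre-trained embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model)  # pull anchor/positive together, push negative away
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```

The adapted embeddings are then paired with text clustering for indexing; a generic version of that step (scikit-learn KMeans here is an assumption, not necessarily the paper's algorithm) could look like:

```python
from sklearn.cluster import KMeans

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
embeddings = model.encode(chunks)  # reuse the adapted embedding model
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)  # group related chunks
```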

Takeaways, Limitations

Takeaways:
LLM-driven data augmentation and contrastive learning improve the performance of domain-specific RAG systems.
The model-agnostic framework is compatible with a variety of RAG architectures and embedding models.
Moderate hardware requirements and low latency make the approach practical to deploy.
LLM-based data synthesis addresses the scarcity of high-quality training data.
Limitations:
Performance depends on the underlying LLM, so its weaknesses can propagate into LMAR.
Further research is needed to determine optimal parameters for the proposed triplet sampling and clustering strategies.
Generalization across additional domains and tasks requires further evaluation.
The cost of LLM usage is not fully eliminated: data synthesis and verification still require LLM inference, so computational costs are not uniformly low.