Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MetaGen Blended RAG: Unlocking Zero-Shot Precision for Specialized Domain Question-Answering

Created by
  • Haebom

Authors

Kunal Sawarkar, Shivam R. Solanki, Abhilasha Mangal

Outline

This paper presents "MetaGen Blended RAG," a method for improving Retrieval-Augmented Generation (RAG) on domain-specific enterprise datasets, which are typically isolated behind firewalls and rich in complex, specialized terminology not encountered during LLM pretraining. The authors target three challenges of existing RAG systems: semantic variation across domains, the cost and poor generalization of fine-tuning, and the difficulty of achieving high zero-shot accuracy. The proposed method enhances semantic retrieval through a metadata generation pipeline and a hybrid query index that combines dense and sparse vectors. By using key concepts, topics, and abbreviations to build a metadata-rich semantic index and an enhanced hybrid query, the method achieves robust and scalable performance without fine-tuning. It outperforms existing zero-shot RAG baselines on the PubMedQA, SQuAD, and NQ datasets, and even competes with fine-tuned models. The authors position this as a new approach to building semantic retrieval systems that generalize well across domains.
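The idea of enriching documents with metadata and then blending dense and sparse scores can be illustrated with a small sketch. This is not the authors' pipeline: the abbreviation table stands in for LLM-generated metadata, the "dense" vector is a toy hashed character-trigram embedding rather than a learned model, and the names `ABBREVIATIONS`, `enrich`, `hybrid_search`, and the blending weight `alpha` are all hypothetical choices made for illustration.

```python
import math
import re
import zlib
from collections import Counter

# Hypothetical abbreviation table; a stand-in for generated metadata.
ABBREVIATIONS = {
    "mi": "myocardial infarction",
    "rag": "retrieval augmented generation",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def enrich(text):
    """Append expansions of known abbreviations to the text as metadata."""
    extra = [ABBREVIATIONS[t] for t in tokenize(text) if t in ABBREVIATIONS]
    return text + " " + " ".join(extra) if extra else text


def dense_vector(text, dim=64):
    """Toy embedding: hashed character-trigram counts, L2-normalised."""
    vec = [0.0] * dim
    s = " ".join(tokenize(text))
    for i in range(len(s) - 2):
        vec[zlib.crc32(s[i:i + 3].encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Dot product of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_search(query, docs, alpha=0.5):
    """Rank a non-empty list of docs by alpha * dense + (1 - alpha) * sparse."""
    enriched = [enrich(d) for d in docs]
    doc_tokens = [tokenize(d) for d in enriched]

    # Sparse side: IDF-weighted overlap between query and document terms.
    df = Counter(t for toks in doc_tokens for t in set(toks))
    idf = {t: math.log(1 + len(docs) / c) for t, c in df.items()}
    q = enrich(query)
    q_terms = set(tokenize(q))
    sparse = [sum(idf[t] for t in q_terms & set(toks)) for toks in doc_tokens]
    max_sparse = max(sparse) or 1.0  # avoid dividing by zero

    # Dense side: cosine similarity of the toy embeddings.
    q_vec = dense_vector(q)
    scored = []
    for i, text in enumerate(enriched):
        dense = cosine(q_vec, dense_vector(text))
        score = alpha * dense + (1 - alpha) * sparse[i] / max_sparse
        scored.append((score, i))
    scored.sort(reverse=True)
    return [(docs[i], round(s, 3)) for s, i in scored]


docs = [
    "Patients with MI were given aspirin.",
    "The hybrid index combines dense and sparse vectors.",
    "The weather in Paris was sunny.",
]
for doc, score in hybrid_search("myocardial infarction treatment", docs):
    print(score, doc)
```

In this sketch the first document never contains the words "myocardial infarction"; it matches the query only because the enrichment step appended the expansion of "MI" before indexing. A real system would presumably replace the lookup table with generated metadata and the toy vectors with a proper embedding model and sparse retriever.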

Takeaways, Limitations

Takeaways:
The paper demonstrates that high-accuracy RAG performance can be achieved on domain-specific enterprise datasets without fine-tuning.
It presents a new RAG approach based on metadata generation and hybrid query indexing.
The method generalizes well across domains (biomedical, general knowledge, etc.).
It surpasses existing zero-shot RAG baselines and some fine-tuned models.
Limitations:
The computational cost of the proposed method and the complexity of the metadata generation pipeline are not analyzed in detail.
Generalization needs further validation on a wider range of enterprise datasets.
Errors that may occur during metadata generation, and their impact on retrieval, are not analyzed.
Whether domain-specific, optimized metadata generation strategies are needed, and what their limits would be, is not examined.