Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CLAP: Coreference-Linked Augmentation for Passage Retrieval

Created by
  • Haebom

Author

Huanwei Xu, Lin Xu, Liang Yuan

Outline

While LLM-based passage expansion is effective at improving first-stage retrieval, it often degrades dense retrievers due to semantic bias and mismatch with their pre-trained semantic space. Moreover, only part of a passage is typically relevant to a query while the rest introduces noise, and chunking breaks coreference continuity. This paper proposes Coreference-Linked Augmentation for Passage Retrieval (CLAP), a lightweight LLM-based augmentation framework that partitions passages into coherent chunks, resolves coreference chains, and generates localized pseudo-queries aligned with the dense retriever's representation space. A simple fusion of global topic signals with fine-grained subtopic signals yields robust performance across a variety of domains. CLAP's gains grow as retriever strength increases, enabling dense retrievers to match or exceed two-stage rerankers such as BM25 + MonoT5-3B (up to a 20.68% absolute improvement in nDCG@10). These gains are especially pronounced in out-of-domain settings, where existing LLM-based expansion methods that rely on domain knowledge often fail; CLAP instead adopts a logic-driven pipeline, enabling robust, domain-agnostic generalization.
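The pipeline described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the chunking, coreference resolution, and score fusion below are simplified stand-ins (in CLAP these steps are LLM-driven), and all function names and parameters are hypothetical.

```python
# Toy sketch of a CLAP-style augmentation pipeline (simplified stand-ins,
# not the paper's actual LLM-based implementation).

def chunk_passage(passage: str, max_sents: int = 2) -> list[str]:
    """Split a passage into small, coherent chunks (here: fixed sentence windows)."""
    sents = [s.strip() for s in passage.split(".") if s.strip()]
    return [". ".join(sents[i:i + max_sents]) for i in range(0, len(sents), max_sents)]

def resolve_coreferences(chunk: str, entity: str) -> str:
    """Stand-in for LLM coreference resolution: restore the entity a pronoun refers to,
    so chunks remain self-contained after splitting."""
    out = chunk
    for pronoun in (" it ", " they ", " this "):
        out = out.replace(pronoun, f" {entity} ")
    return out

def fuse_scores(global_score: float, local_scores: list[float], alpha: float = 0.5) -> float:
    """Fuse the passage-level (global topic) score with the best chunk-level
    (local subtopic) score, as in the global/local signal fusion described above."""
    return alpha * global_score + (1 - alpha) * max(local_scores)
```

For example, a passage whose best chunk matches the query strongly still scores well even if the passage-level signal is diluted by irrelevant content, which is the noise-reduction effect the paper targets.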

Takeaways, Limitations

Takeaways:
A new method (CLAP) is presented to overcome the limitations of LLM-based passage expansion.
Substantially improves dense retriever performance, matching or exceeding two-stage rerankers.
Strong domain-independent generalization.
Reduces noise through coreference resolution and local pseudo-query generation.
Limitations:
Further experiments are needed to confirm that CLAP's gains hold consistently across all settings.
The potential and limits of domain-specific optimization remain to be explored.
As an LLM-based method, it inherits LLM limitations such as computational cost and bias.