Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Created by
  • Haebom

Authors

Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

HiCBench: A Benchmark for Evaluating Document Chunking in Retrieval-Augmented Generation

Outline

This paper addresses the lack of effective evaluation tools for document chunking, a crucial component of Retrieval-Augmented Generation (RAG) systems, which improve language model responses by integrating external knowledge sources. The authors argue that existing RAG evaluation benchmarks cannot adequately assess document chunking quality because of evidence sparsity. To close this gap they propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and the corresponding evidence sources. They also introduce HiChunk, a multi-level document structuring framework based on a fine-tuned LLM and combined with an Auto-Merge retrieval algorithm to improve retrieval quality. Experiments show that HiCBench effectively evaluates the impact of various chunking methods across the entire RAG pipeline, and that HiChunk achieves better chunking quality within a reasonable amount of time, thereby improving the overall performance of the RAG system.
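
The idea of merging retrieved chunks over a multi-level document structure can be illustrated with a minimal sketch. This is not the paper's Auto-Merge algorithm, whose exact rules are not given in this summary; it is a generic version that assumes each chunk has at most one parent and that a parent replaces its children once more than a chosen fraction of them has been retrieved. The names `auto_merge`, `parent_of`, `children_of`, and `threshold` are hypothetical.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, threshold=0.5):
    """Replace retrieved sibling chunks with their parent chunk.

    Assumption (not from the paper): a parent is substituted for its
    children when the retrieved share of its children exceeds `threshold`.
    Merging repeats so that sections can roll up into larger sections.
    """
    selected = set(retrieved_ids)
    changed = True
    while changed:
        changed = False
        # Group the currently selected chunks by their parent.
        hits = defaultdict(set)
        for chunk_id in selected:
            parent = parent_of.get(chunk_id)
            if parent is not None:
                hits[parent].add(chunk_id)
        # Merge children into the parent where enough siblings were hit.
        for parent, found in hits.items():
            if len(found) / len(children_of[parent]) > threshold:
                selected -= found
                selected.add(parent)
                changed = True
    return selected


# Hypothetical two-level document: doc -> sec1, sec2 -> leaf chunks.
children_of = {"doc": ["sec1", "sec2"], "sec1": ["a", "b", "c"], "sec2": ["d", "e"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

merged = auto_merge(["a", "b", "d"], parent_of, children_of)
```

Here two of the three chunks under `sec1` were retrieved, so they are replaced by `sec1`, while `d` stays as a single chunk because only half of `sec2` was hit. A real system would likely also cap the merged context by a token budget.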

Takeaways, Limitations

Takeaways:
The paper proposes HiCBench, a new benchmark for effectively evaluating document chunking in RAG systems.
HiCBench includes manually annotated multi-level document chunking points, synthesized evidence-dense QA pairs, and evidence sources.
The HiChunk framework improves document chunking quality and the overall performance of the RAG system.
HiCBench effectively evaluates the impact of different chunking methods across the RAG pipeline.
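
One way annotated chunking points can be used to score a chunker is to compare predicted boundaries with the reference boundaries. The sketch below assumes boundaries are given as sentence indices and scores them with exact-match F1; this metric and the name `boundary_f1` are illustrative assumptions, not necessarily what HiCBench uses.

```python
def boundary_f1(predicted, reference):
    """F1 between predicted and annotated chunk boundaries.

    Assumption: a boundary is the index of the sentence that starts a
    new chunk, and only exact index matches count as correct.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predicted boundaries match 4 annotated ones.
score = boundary_f1([3, 7, 12], [3, 7, 10, 15])
```

With precision 2/3 and recall 1/2, the score in this example is 4/7.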
Limitations:
The paper does not state specific limitations.