Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Created by
  • Haebom

Authors

Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie

Outline

Accurate diagnosis with medical large language models (LLMs) is hampered by knowledge gaps and hallucinations. Retrieval and tool augmentation help, but their impact is limited by weak use of external knowledge and poor traceability of how retrieved feedback shapes the model's reasoning. To address these challenges, this study presents Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that brings traceable retrieval-augmented reasoning to medical diagnosis. The authors first construct a large-scale medical retrieval corpus containing patient records and trusted medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More importantly, they treat the LLM as the core agent and the retrieval corpus as its environment, and evolve the agentic RAG policy with RL on large-scale data using tailored rewards for format, retrieval, reasoning structure, and diagnostic accuracy. Experiments show that this end-to-end training framework consistently outperforms prompt-engineered and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch substantially improves diagnostic accuracy for both common and rare diseases, outperforming strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks in both in-distribution and out-of-distribution settings. Ablation studies on reward design and retrieval corpus components confirm that both play a critical role, underscoring how the approach differs from and improves on traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, offering deeper insight into its performance gains and helping clinicians reach more reliable and accurate preliminary diagnoses.
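To make the reward design more concrete, here is a minimal sketch of how four reward terms (format, retrieval, reasoning structure, diagnostic accuracy) could be combined into one scalar for RL. It is not the authors' implementation: the `<think>`/`<search>`/`<answer>` tags, the weights, the scoring rules, and every function name are assumptions made up for illustration, since this summary gives none of those details.

```python
import re
from dataclasses import dataclass

# Hypothetical trajectory markup; the summary does not describe the real format.
STEP_PATTERN = re.compile(r"<(think|search|answer)>(.*?)</\1>", re.DOTALL)


@dataclass
class RewardWeights:
    # Illustrative weights only; the actual values are not given in the summary.
    format: float = 0.1
    retrieval: float = 0.2
    structure: float = 0.2
    accuracy: float = 0.5


def parse_steps(trajectory: str) -> list[tuple[str, str]]:
    """Split a trajectory into (step_type, content) pairs."""
    return [(m.group(1), m.group(2).strip()) for m in STEP_PATTERN.finditer(trajectory)]


def format_reward(steps: list[tuple[str, str]]) -> float:
    """1.0 if there is exactly one answer step and it comes last."""
    answers = [s for s in steps if s[0] == "answer"]
    return 1.0 if len(answers) == 1 and steps[-1][0] == "answer" else 0.0


def retrieval_reward(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Recall of relevant documents among those the agent retrieved."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)


def structure_reward(steps: list[tuple[str, str]]) -> float:
    """1.0 if the agent reasons before it searches or answers."""
    return 1.0 if steps and steps[0][0] == "think" else 0.0


def accuracy_reward(steps: list[tuple[str, str]], gold_diagnosis: str) -> float:
    """1.0 if the final answer matches the reference diagnosis (case-insensitive)."""
    answers = [content for kind, content in steps if kind == "answer"]
    return 1.0 if answers and answers[-1].lower() == gold_diagnosis.lower() else 0.0


def total_reward(
    trajectory: str,
    retrieved_ids: set[str],
    relevant_ids: set[str],
    gold_diagnosis: str,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the four reward terms for one rollout."""
    steps = parse_steps(trajectory)
    return (
        weights.format * format_reward(steps)
        + weights.retrieval * retrieval_reward(retrieved_ids, relevant_ids)
        + weights.structure * structure_reward(steps)
        + weights.accuracy * accuracy_reward(steps, gold_diagnosis)
    )
```

In an RL loop, the LLM would produce the trajectory while querying the retrieval corpus, and a scalar like this would score each rollout. A weighted sum is only one plausible way to combine the terms; the paper may shape or gate them differently.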

Takeaways, Limitations

Takeaways:
End-to-end RL training of an agentic RAG system significantly improves the accuracy of medical diagnosis.
Deep-DxSearch outperforms strong existing models such as GPT-4o and DeepSeek-R1.
It performs well in both in-distribution and out-of-distribution settings and is effective for both common and rare diseases.
Ablation studies show the importance of reward design and the retrieval corpus, which suggests directions for future research.
Case studies and interpretability analyses help explain the model's decision-making process.
Limitations:
The currently available information does not specify Deep-DxSearch's training data size, training time, or computational resource consumption.
Further performance evaluation and validation in real clinical environments are needed.
The model's hallucination problem may not be fully solved, leaving room for further improvement.
Accessibility and privacy issues related to large-scale medical data need to be considered.