Accurate medical diagnosis with large language models (LLMs) is hampered by knowledge gaps and hallucinations. Retrieval- and tool-augmented methods help, but their impact is limited by weak utilization of external knowledge and poor traceability of the reasoning behind each answer. To address these challenges, this study presents Deep-DxSearch, an agentic retrieval-augmented generation (RAG) system trained end-to-end with reinforcement learning (RL) to enable traceable, retrieval-augmented reasoning for medical diagnosis. Deep-DxSearch first constructs a large medical retrieval corpus of patient records and trusted medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. The agentic RAG policy is then evolved with RL on large-scale data, with the LLM as the core agent and the retrieval corpus as the environment, using tailored rewards for output format, retrieval, reasoning structure, and diagnostic accuracy. Experimental results demonstrate that this end-to-end agentic RAG training framework consistently outperforms prompt-engineered and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch substantially improves diagnostic accuracy for both common and rare diseases, surpassing strong baselines such as GPT-4o, DeepSeek-R1, and other healthcare-specific frameworks in both in-distribution and out-of-distribution settings. Ablation studies on the reward design and the retrieval corpus components confirm their critical roles, underscoring what distinguishes the approach from conventional implementations. Finally, case studies and interpretability analyses illustrate the improvements in Deep-DxSearch's diagnostic policy, offering deeper insight into its performance gains and helping clinicians reach more reliable and accurate preliminary diagnoses.
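
To make the agent-environment framing concrete, the sketch below illustrates one way such a rollout and composite reward could be organized: the LLM agent interleaves retrieval queries against the corpus (the environment) before committing to a diagnosis, and a scalar reward combines the four signal types named above. This is a minimal, hypothetical illustration, not the authors' implementation; the interfaces `llm_generate` and `corpus.search`, the `Step` container, the weights, and the scoring rules are all assumptions made for readability.

```python
# Hypothetical sketch of an agentic-RAG rollout with a composite reward.
# `llm_generate` and `corpus` are assumed interfaces, not real APIs.
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # "search" or "diagnose"
    content: str      # query text or final diagnosis
    evidence: list    # documents returned for a search action

def rollout(llm_generate, corpus, patient_case, max_turns=6):
    """Let the LLM agent interleave retrieval and reasoning until it commits
    to a diagnosis; the retrieval corpus plays the role of the environment."""
    trajectory, context = [], patient_case
    for _ in range(max_turns):
        action, content = llm_generate(context)          # one policy step
        if action == "search":
            docs = corpus.search(content, top_k=5)       # query the corpus
            trajectory.append(Step(action, content, docs))
            context += f"\n[RETRIEVED] {docs}"
        else:                                            # "diagnose"
            trajectory.append(Step(action, content, []))
            break
    return trajectory

def composite_reward(trajectory, gold_diagnosis,
                     w_format=0.1, w_retrieval=0.2, w_structure=0.2, w_acc=0.5):
    """Illustrative reward combining format, retrieval, reasoning-structure,
    and accuracy terms; the weights and scoring rules are placeholders."""
    final = trajectory[-1]
    r_format = 1.0 if final.action == "diagnose" else 0.0
    r_retrieval = 1.0 if any(s.action == "search" and s.evidence
                             for s in trajectory) else 0.0
    r_structure = 1.0 if len(trajectory) > 1 else 0.0    # retrieved before answering
    r_acc = 1.0 if gold_diagnosis.lower() in final.content.lower() else 0.0
    return (w_format * r_format + w_retrieval * r_retrieval
            + w_structure * r_structure + w_acc * r_acc)
```

In an RL setup of this kind, the resulting scalar reward would typically drive a policy-gradient update of the LLM agent over its generated tokens; the specific algorithm and reward formulation used by Deep-DxSearch are described in the paper itself, not in this sketch.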