This paper presents a novel approach to building efficient, privacy-preserving retrieval-augmented generation (RAG) and inference systems for resource-constrained, secure environments. Unlike existing RAG systems that rely on large-scale models and external APIs, this study leverages recent advances in test-time scaling and small reasoning models to develop a search-augmented conversational agent capable of interpreting complex, domain-specific queries with a lightweight backbone model. The system integrates dense retrieval with a fine-tuned Qwen2.5-Instruct model, and uses synthetic query generation together with reasoning traces distilled from state-of-the-art models (e.g., DeepSeek-R1) over curated corpora such as the NHS A-to-Z disease pages. We investigate the impact of summary-based document compression, synthetic data design, and reasoning-aware fine-tuning. Evaluations against non-reasoning and general-purpose lightweight baselines demonstrate that domain-specific fine-tuning significantly improves answer accuracy and consistency, approaching state-of-the-art performance while enabling local deployment. All implementation details and code are publicly available, supporting reproducibility and cross-domain applicability.
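The retrieve-then-generate loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the bag-of-words `embed` function is a dependency-free stand-in for a real dense encoder, the toy two-document corpus plays the role of the summarized NHS A-to-Z pages, and `build_prompt` returns the assembled prompt rather than calling the locally deployed fine-tuned Qwen2.5-Instruct model.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real dense encoder (e.g., a sentence-embedding model);
    # a bag-of-words count vector keeps the sketch runnable offline.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    # Rank summary-compressed documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d["summary"])), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus, k=1):
    # The full system would send this prompt to the local fine-tuned model;
    # returning it here keeps the example self-contained.
    context = "\n".join(d["summary"] for d in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Toy corpus standing in for summarized disease pages (hypothetical content).
corpus = [
    {"title": "Asthma", "summary": "asthma causes wheezing and breathlessness"},
    {"title": "Migraine", "summary": "migraine causes severe headache and nausea"},
]

print(retrieve("what causes a severe headache", corpus)[0]["title"])  # Migraine
```

Swapping `embed` for a trained dense encoder and `build_prompt`'s return value for a call to the fine-tuned generator yields the agent's basic structure; summary-based compression enters by storing the `summary` field instead of full page text.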