This paper addresses the limitations of automated evaluation metrics for medical question answering (QA) systems built on Retrieval Augmented Generation (RAG), an approach that has emerged to improve factual accuracy in this setting. Because existing automated metrics underperform in clinical and conversational use cases, and manual evaluation is expensive and does not scale, we propose ASTRID, an automated and scalable evaluation framework consisting of three metrics: context relevance (CR), refusal accuracy (RA), and conversational faithfulness (CF). In particular, CF is designed to assess the faithfulness of responses to the knowledge base without penalizing conversational elements. We validate ASTRID on a dataset of patient questions following cataract surgery and show that CF predicts human evaluations in conversational use cases better than existing metrics. We further show that ASTRID aligns with clinician assessments of inappropriate, harmful, and unhelpful responses, and that all three metrics agree well with human evaluations across a range of LLMs. Finally, we make the prompts and datasets used in our experiments publicly available, providing a resource for further research and development.