This paper compares a supervised lightweight CNN with a state-of-the-art zero-shot medical Vision-Language Model (VLM), BiomedCLIP, for the automated interpretation of chest X-ray images. We evaluate two diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Experimental results show that the supervised CNN serves as a competitive baseline in both cases. While the VLM initially performs poorly in the zero-shot setting, we demonstrate that adjusting the decision threshold significantly improves its performance. For pneumonia detection, the threshold-adjusted zero-shot VLM achieves an F1-score of 0.8841, outperforming the supervised CNN's 0.8803. For tuberculosis detection, threshold adjustment raises the F1-score from 0.4812 to 0.7684, approaching the supervised baseline's 0.7834. This study highlights that appropriate calibration is essential to leverage the full diagnostic capability of zero-shot VLMs, enabling performance on par with, or better than, efficient task-specific supervised models.
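The abstract does not detail how the decision threshold is adjusted. One common procedure, sketched below with synthetic data, is to sweep candidate cutoffs over the model's zero-shot similarity scores on a held-out validation split and keep the F1-maximizing threshold. The function names (`f1_score`, `best_threshold`), the score distribution, and the grid of candidate thresholds are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def f1_score(y_true, y_pred):
    # F1 for the positive class, computed from raw counts
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, grid=np.linspace(0.05, 0.95, 181)):
    # sweep candidate cutoffs on validation scores; keep the F1-maximizing one
    f1s = [f1_score(labels, (scores >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(f1s))]

# Toy validation split standing in for zero-shot VLM outputs:
# similarity scores cluster well below 0.5, so the default cutoff
# labels almost everything negative (the miscalibration the paper describes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = np.clip(0.35 + 0.1 * labels + 0.05 * rng.standard_normal(200), 0, 1)

t = best_threshold(scores, labels)
print(f"default-0.5 F1: {f1_score(labels, (scores >= 0.5).astype(int)):.3f}")
print(f"tuned-{t:.2f}  F1: {f1_score(labels, (scores >= t).astype(int)):.3f}")
```

In this synthetic setup the default 0.5 cutoff yields a near-zero F1 because the positive-class scores sit below it, while the tuned threshold recovers most of the separation, mirroring the TB result where calibration lifts F1 from 0.4812 to 0.7684.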