Citrus-V is a multimodal medical foundation model that combines medical image analysis with textual reasoning. It integrates detection, segmentation, and multimodal chain-of-thought reasoning in a single framework, enabling pixel-level lesion localization, structured report generation, and physician-style diagnostic reasoning. The project proposes a novel multimodal training approach and releases a curated open-source dataset covering reasoning, detection, segmentation, and document-understanding tasks. Across multiple benchmarks, Citrus-V outperforms existing open-source medical models and expert-level imaging systems, providing an integrated pipeline from visual evidence to clinical reasoning that supports accurate lesion quantification, automated reporting, and a reliable second opinion.
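
For context, the sketch below shows how such a unified image-to-report pipeline might be invoked, assuming the released checkpoint follows a standard Hugging Face vision-language interface; the repository id, prompt, and file name are illustrative placeholders, not the project's confirmed API.

```python
# Hypothetical usage sketch for a vision-language medical model.
# The model id, prompt format, and input file are assumptions for
# illustration; consult the Citrus-V release for the actual interface.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "jdh-algo/Citrus-V"  # placeholder id; check the project page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

# One image in, one structured textual finding out: the same checkpoint is
# meant to handle localization, reporting, and diagnostic reasoning prompts.
image = Image.open("chest_ct_slice.png")  # example input image
prompt = "Describe any lesions in this image and report their locations."

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```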