This paper presents a novel framework that addresses the adaptation gap that arises when applying pre-trained vision-language models (VLMs) to zero-shot anomaly detection (ZSAD). VLMs lack the local inductive bias needed for dense prediction and rely on an inflexible feature-fusion paradigm. To address this, we propose an architectural co-design framework that jointly improves feature representation and cross-modal fusion. The framework injects local inductive bias for fine-grained representations via a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter, and introduces a Dynamic Fusion Gateway (DFG) that adaptively modulates text prompts with visual context, enabling robust bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, underscoring the importance of synergistic co-design for adapting pre-trained foundation models to dense perception tasks.
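To make the first component concrete, the following is a minimal, illustrative sketch of a Conv-LoRA-style adapter: a low-rank bottleneck whose intermediate features are reshaped onto the patch-token grid and passed through a small depthwise convolution, injecting local inductive bias into an otherwise frozen ViT feature stream. All module names, ranks, and shapes below are assumptions for illustration, not the paper's implementation; the Dynamic Fusion Gateway is not shown.

```python
# Illustrative sketch only (assumed shapes and hyperparameters), not the paper's code.
import torch
import torch.nn as nn


class ConvLoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 8, grid: int = 14, scale: float = 1.0):
        super().__init__()
        self.grid = grid        # assumed patch-grid side length (e.g. 224 / 16 = 14)
        self.scale = scale
        self.down = nn.Linear(dim, rank, bias=False)            # LoRA down-projection
        self.conv = nn.Conv2d(rank, rank, kernel_size=3,
                              padding=1, groups=rank)           # depthwise conv: local inductive bias
        self.up = nn.Linear(rank, dim, bias=False)              # LoRA up-projection
        nn.init.zeros_(self.up.weight)                          # adapter starts as a no-op

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + H*W, D) with a leading CLS token, as in CLIP-style ViTs
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        b, n, _ = patches.shape
        x = self.down(patches)                                  # (B, N, r)
        x = x.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        x = self.conv(x)                                        # mix neighbouring patch tokens
        x = x.flatten(2).transpose(1, 2)                        # back to (B, N, r)
        delta = self.up(x) * self.scale
        return torch.cat([cls_tok, patches + delta], dim=1)     # residual update, CLS untouched


# Usage with dummy ViT-B/16-sized tokens; only the adapter's parameters would be trained.
adapter = ConvLoRAAdapter(dim=768, rank=8, grid=14)
tokens = torch.randn(2, 1 + 14 * 14, 768)
refined = adapter(tokens)    # same shape, with locally-aware patch features
```

Keeping the adapter in a low-rank residual branch preserves parameter efficiency, while the depthwise convolution supplies the spatial locality that a pure attention backbone lacks for dense anomaly localization.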