This paper proposes a multimodal approach that integrates nonverbal cues to address client resistance, a challenge for text-based cognitive behavioral therapy (CBT) models built on large language models (LLMs) in psychotherapy. Specifically, we introduce Mirror (Multimodal Interactive Rolling with Resistance), a novel synthetic dataset that pairs client utterances with facial images. Vision-language models (VLMs) trained on this dataset analyze facial cues to infer the client's emotional state and generate empathetic responses. Evaluated on the strength of the therapeutic alliance it maintains in situations of client resistance, our model outperforms existing text-based CBT approaches.