This paper analyzes how a dual-encoder architecture such as CLIP maps its two types of inputs into a shared embedding space and predicts their similarity. To overcome the limitations of existing first-order feature attribution methods, we propose a second-order method that attributes feature interactions to the dual encoder's predictions. Applying this method to CLIP, we demonstrate that the model learns fine-grained correspondences between caption segments and image regions, accounting for object matches as well as mismatches. However, we also reveal that this visual-linguistic capability varies considerably across object classes and exhibits pronounced out-of-domain effects, and we show that our method can expose both individual errors and systematic failure patterns. The code is publicly available.
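
To make the notion of second-order attribution concrete, the following is a minimal, self-contained PyTorch sketch, not the paper's implementation: the toy linear encoders, the feature sizes, and the use of raw mixed partial derivatives of the similarity score as interaction scores are all illustrative assumptions.

```python
# Illustrative sketch only: second-order (interaction) attributions for a toy
# dual encoder. It demonstrates the general idea of scoring pairs of image and
# caption features by mixed partial derivatives of the similarity prediction;
# it is not the method proposed in the paper.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_img, d_txt, d_emb = 8, 6, 4              # toy feature and embedding sizes
img_enc = torch.nn.Linear(d_img, d_emb)    # stand-in for the image encoder
txt_enc = torch.nn.Linear(d_txt, d_emb)    # stand-in for the text encoder

x = torch.randn(d_img, requires_grad=True)  # "image" features (e.g. regions)
t = torch.randn(d_txt, requires_grad=True)  # "caption" features (e.g. tokens)

def similarity(x, t):
    # Dual-encoder prediction: cosine similarity of the two embeddings.
    return F.cosine_similarity(img_enc(x), txt_enc(t), dim=0)

s = similarity(x, t)

# First-order attributions: per-feature gradients of the similarity score.
gx, gt = torch.autograd.grad(s, (x, t), create_graph=True)

# Second-order attributions: mixed partials d^2 s / (dx_i dt_j) form an
# interaction map between image features and caption features.
interactions = torch.stack([
    torch.autograd.grad(gx[i], t, retain_graph=True)[0] for i in range(d_img)
])  # shape: (d_img, d_txt)

print(interactions.shape)  # each entry scores one image-caption feature pair
```

In this sketch, a large positive entry of the interaction map indicates an image feature and a caption feature that jointly push the similarity score up, which is the kind of pairwise correspondence a first-order, per-feature attribution cannot express.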