This paper addresses the challenge of interpreting sarcasm in multimodal inputs. We argue that existing Chain-of-Thought approaches fail to effectively leverage the cognitive processes humans use to identify sarcasm. We present IRONIC, an in-context learning framework that leverages multimodal coherence relations to analyze referential, analogical, and pragmatic image-text connections. Experimental results demonstrate that IRONIC achieves state-of-the-art performance in zero-shot multimodal sarcasm detection, outperforming various baseline models. Our findings highlight the need to integrate linguistic and cognitive insights into the design of multimodal inference strategies.