MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment) is a knowledge distillation technique that transfers domain-level multimodal semantics from a large-scale vision-language teacher (e.g., LLaVA) to a lightweight, vision-only object detector student (e.g., YOLO). A translation module maps student features into a shared representation space, and both the student and the translator are trained under a dual-objective loss that enforces local alignment and global relational consistency. Unlike existing approaches that focus on dense or global alignment, MOCHA operates at the object level, enabling efficient semantic transfer without modifying the teacher or requiring text input at inference time. We validate the method on four personalized detection benchmarks under a few-shot regime, achieving an average improvement of 10.1 points over the baseline. Despite its compact architecture, MOCHA matches the performance of larger multimodal models, demonstrating its suitability for real-world deployment.
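
To make the translation-and-alignment idea concrete, the sketch below shows one plausible PyTorch instantiation: a small translator that projects per-object student features into the teacher's embedding space, trained with a loss combining a local term (per-object alignment) and a global relational term (matching pairwise similarity structure). The names `Translator` and `dual_objective_loss`, the feature dimensions, and the specific loss terms are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch); names, dimensions, and loss terms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Translator(nn.Module):
    """Maps per-object student features into the teacher's embedding space."""

    def __init__(self, student_dim: int, teacher_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, student_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (num_objects, student_dim), pooled per detected object
        return self.proj(student_feats)


def dual_objective_loss(
    translated: torch.Tensor,   # (N, teacher_dim) translated student features
    teacher: torch.Tensor,      # (N, teacher_dim) teacher features for the same objects
    lambda_rel: float = 1.0,
) -> torch.Tensor:
    """Local alignment + global relational consistency (one plausible form)."""
    # Local term: align each translated object embedding with its teacher counterpart.
    local = 1.0 - F.cosine_similarity(translated, teacher, dim=-1).mean()

    # Global term: match the pairwise similarity structure across objects.
    s_norm = F.normalize(translated, dim=-1)
    t_norm = F.normalize(teacher, dim=-1)
    relational = F.mse_loss(s_norm @ s_norm.T, t_norm @ t_norm.T)

    return local + lambda_rel * relational


# Usage: distill frozen teacher object embeddings into the detector's feature space.
translator = Translator(student_dim=256, teacher_dim=4096)
student_obj_feats = torch.randn(8, 256)    # e.g., pooled detector features for 8 objects
teacher_obj_feats = torch.randn(8, 4096)   # e.g., frozen VLM embeddings for the same objects
loss = dual_objective_loss(translator(student_obj_feats), teacher_obj_feats)
loss.backward()
```

In this reading, the teacher stays frozen and contributes only target embeddings, which is consistent with the claim that the teacher is never modified and no text input is needed at inference time.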