Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

Created by
  • Haebom

Author

Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli

Outline

MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment) is a knowledge distillation technique that transfers object-level multimodal semantics from a large vision-language teacher model (e.g., LLaVA) to a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a shared embedding space, and both the student and the translator are trained with a dual-objective loss that enforces local alignment and global relational consistency. Unlike existing approaches that focus on dense or global alignment, MOCHA operates at the object level, enabling efficient semantic transfer without modifying the teacher model or requiring text input at inference time. The method is validated on four personalized detection benchmarks in a few-shot setting, achieving an average improvement of 10.1 points over the baseline. Despite its compact architecture, MOCHA achieves performance comparable to larger multimodal models, demonstrating its suitability for real-world deployment.
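The dual-objective loss described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: the translator architecture, the cosine-based local term, the pairwise-similarity relational term, and the weights `alpha`/`beta` are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTranslator(nn.Module):
    """Hypothetical translator: maps student object features into the
    teacher's embedding space (the paper's actual module may differ)."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.ReLU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def dual_objective_loss(student_emb: torch.Tensor,
                        teacher_emb: torch.Tensor,
                        alpha: float = 1.0,
                        beta: float = 1.0) -> torch.Tensor:
    """Assumed form of the dual objective: a local term aligning each
    translated student object embedding with its teacher counterpart,
    plus a relational term matching the pairwise similarity structure
    across objects. Both inputs: (num_objects, teacher_dim)."""
    # Local alignment: per-object cosine distance to the teacher embedding.
    local = (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

    # Global relational consistency: pairwise cosine-similarity matrices
    # of the two embedding sets should agree.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    relational = F.mse_loss(s @ s.T, t @ t.T)

    return alpha * local + beta * relational
```

In training, only the student detector and the translator receive gradients; the teacher embeddings are precomputed or produced under `torch.no_grad()`, matching the paper's claim that the teacher is never modified.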

Takeaways, Limitations

Takeaways:
We present a novel knowledge distillation technique that efficiently conveys multimodal semantics to lightweight vision-specific object detectors.
Object-level semantics can be transferred without modifying the teacher model or requiring text input at inference time.
Performance improvement over existing methods in a few-shot setting (average increase of 10.1 points).
The lightweight architecture makes it suitable for real-world deployment.
Limitations:
Generalization performance needs to be verified on datasets other than the four personalized detection benchmarks presented.
Results depend on the specific teacher and student models used; other model combinations still need to be evaluated.
Further research is needed on tuning the weights of the dual-objective loss.