In this paper, we propose HMID-Net, a novel method that integrates masked image modeling (MIM) and knowledge distillation to learn the hierarchical structure of visual and semantic concepts in hyperbolic space. Whereas the existing MERU model successfully applied multi-modal learning to hyperbolic space, HMID-Net additionally uses MIM and knowledge distillation to make training more efficient. In particular, it introduces a knowledge distillation loss function designed for hyperbolic space, which supports effective knowledge transfer from teacher to student. Experimental results show that HMID-Net significantly outperforms existing models such as MERU and CLIP on image classification and retrieval tasks.
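To make the idea of a distillation loss in hyperbolic space concrete, the following is a minimal sketch and not the loss defined in this paper. It assumes the Lorentz (hyperboloid) model with curvature $-c$, as used by MERU, and assumes that the loss is simply the mean geodesic distance between student and teacher embeddings. The function names (`exp_map_origin`, `lorentz_distance`, `hyperbolic_distill_loss`) are hypothetical and chosen for illustration only.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector (tangent at the hyperboloid origin) to the
    space components of a point on the hyperboloid of curvature -c."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    scale = math.sinh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * a for a in v]

def time_component(x_space, c=1.0):
    """Time component implied by the hyperboloid constraint."""
    return math.sqrt(1.0 / c + sum(a * a for a in x_space))

def lorentz_distance(x_space, y_space, c=1.0):
    """Geodesic distance between two points given by their space components."""
    xt = time_component(x_space, c)
    yt = time_component(y_space, c)
    # Lorentzian inner product: <x, y>_L = <x_space, y_space> - x_time * y_time
    inner = sum(a * b for a, b in zip(x_space, y_space)) - xt * yt
    # Clamp to the domain of acosh to guard against rounding error.
    arg = max(1.0, -c * inner)
    return math.acosh(arg) / math.sqrt(c)

def hyperbolic_distill_loss(student_feats, teacher_feats, c=1.0):
    """ASSUMED loss form: mean geodesic distance between lifted
    student and teacher features (not the paper's definition)."""
    dists = [
        lorentz_distance(exp_map_origin(s, c), exp_map_origin(t, c), c)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(dists) / len(dists)
```

Under these assumptions, the loss is close to zero when the student reproduces the teacher's features exactly and grows with the geodesic gap between them, so minimizing it pulls student embeddings toward the teacher's positions in the hierarchy rather than toward their Euclidean coordinates.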