This paper identifies the limitations of the prevailing approach to building multimodal models, which connects pre-trained unimodal models through a trainable connector module, and proposes Hypernetwork Model Alignment (Hyma) to address them. The existing approach is computationally expensive because selecting a good pair of unimodal models requires training a separate connector module for every candidate combination. Hyma improves efficiency by using a hypernetwork to jointly learn the connector modules for all N x M combinations of unimodal models, drastically reducing the cost of searching for the optimal model pairing.
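To make the idea concrete, below is a minimal sketch (not the paper's actual implementation) of how a hypernetwork could generate connector weights conditioned on which (encoder, LLM) pair is being aligned. All names, dimensions, and the single-linear-layer connector are illustrative assumptions; Hyma's real architecture and training procedure may differ.

```python
import torch
import torch.nn as nn


class ConnectorHypernetwork(nn.Module):
    """Sketch: a hypernetwork that emits the weights of a linear connector
    (projecting a vision encoder's features into an LLM's input space)
    for each of the N x M candidate (encoder, LLM) pairs."""

    def __init__(self, num_pairs: int, pair_emb_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # One learned embedding per (unimodal encoder, unimodal LLM) pair.
        self.pair_embedding = nn.Embedding(num_pairs, pair_emb_dim)
        # Small MLP mapping a pair embedding to the connector's parameters.
        self.generator = nn.Sequential(
            nn.Linear(pair_emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, in_dim * out_dim + out_dim),  # weight + bias
        )

    def forward(self, pair_id: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, in_dim) outputs of the selected encoder.
        params = self.generator(self.pair_embedding(pair_id))
        weight = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        bias = params[self.in_dim * self.out_dim:]
        # Apply the generated connector: project features into the LLM space.
        return features @ weight.t() + bias


# Toy usage: 3 vision encoders x 2 LLMs = 6 candidate pairs (small dims for illustration).
hyper = ConnectorHypernetwork(num_pairs=6, pair_emb_dim=32, in_dim=64, out_dim=128)
feats = torch.randn(4, 64)                   # dummy encoder outputs
projected = hyper(torch.tensor(2), feats)    # connector generated for pair #2
print(projected.shape)                       # torch.Size([4, 128])
```

Because all pairs share the hypernetwork's parameters, a single training run amortizes the cost of fitting N x M separate connectors, which is the source of the efficiency gain described above.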