This paper addresses Foley synthesis, the task of generating high-quality audio that is semantically and temporally aligned with a video, by building on a pre-trained audio generative model. Existing ControlNet-based Foley synthesis methods rely on hand-crafted temporal conditions. To overcome this limitation, we propose SpecMaskFoley, a method that applies ControlNet to a pre-trained SpecMaskGIT model. Specifically, a frequency-aware temporal feature aligner resolves the mismatch between the temporal features of the video and the time-frequency features of SpecMaskGIT, which allows a single ControlNet branch to be used effectively. As a result, SpecMaskFoley outperforms existing from-scratch models and advances the development of ControlNet-based Foley synthesis.