PAN-sharpening generates high-resolution multi-spectral (HRMS) images by fusing high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images. However, misalignment between the two modalities, caused by differences in sensor placement, acquisition time, and resolution, remains a major challenge. Existing deep learning methods assume perfect pixel alignment and rely on pixel-wise reconstruction losses, which leads to spectral distortion, double edges, and blurring when alignment errors are present. In this paper, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment between the PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, using the high-frequency details of the PAN image as auxiliary self-supervision. In addition, we introduce a Cross-Modality Alignment-Aware Attention (CM3A) mechanism that aligns MS texture with PAN structure, enabling adaptive feature enhancement across modalities.
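
To make the idea of high-frequency self-supervision concrete, the following is a minimal sketch of one way such a signal could be computed. It is not the loss used by PAN-Crafter. It assumes that "high-frequency details" can be approximated as an image minus its box-blurred copy, and that a reconstructed PAN image is compared with the input PAN image by a mean absolute difference of those details. The names `box_blur`, `high_pass`, and `high_freq_l1` are hypothetical, and images are plain lists of rows of floats to keep the example dependency-free.

```python
def box_blur(img, radius=1):
    """Mean filter with edge replication; img is a list of rows of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index at the border
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index at the border
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def high_pass(img, radius=1):
    """Assumed high-frequency proxy: the image minus its low-pass (blurred) copy."""
    blurred = box_blur(img, radius)
    return [[p - b for p, b in zip(row, brow)] for row, brow in zip(img, blurred)]


def high_freq_l1(pred_pan, pan, radius=1):
    """Mean absolute difference between the high-frequency parts of two images."""
    hp_pred = high_pass(pred_pan, radius)
    hp_ref = high_pass(pan, radius)
    n = len(pan) * len(pan[0])
    return sum(
        abs(a - b)
        for row_pred, row_ref in zip(hp_pred, hp_ref)
        for a, b in zip(row_pred, row_ref)
    ) / n


# A 4x4 PAN patch with a vertical edge, and a blurred "reconstruction" of it.
pan = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
blurry_pred = box_blur(pan)
loss_sharp = high_freq_l1(pan, pan)           # identical edges
loss_blurry = high_freq_l1(blurry_pred, pan)  # smeared edge is penalized
```

Under these assumptions the term depends only on edge content: a constant brightness offset between the two images leaves it unchanged, while a blurred edge increases it.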