This paper presents a method for fusing LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation to enhance the 3D perception performance of autonomous vehicles. Existing methods suffer from spatial misalignment between LiDAR and camera features, which corrupts the depth supervision of the camera branch and degrades cross-modal feature aggregation. We show that the root causes of this misalignment are calibration inaccuracies and projection errors induced by the rolling shutter effect. Crucially, these errors are predictably concentrated at object-background boundaries, which 2D detectors identify reliably. Our primary goal is therefore to leverage 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior-Guided Depth Calibration (PGDC), which uses 2D priors to mitigate misalignment and maintain accurate cross-modal feature pairs. To address global alignment errors, we introduce Discontinuity-Aware Geometric Fusion (DAGF), which suppresses residual noise from PGDC and explicitly enhances sharp depth variations at object-background boundaries to produce structurally clear representations. To exploit the aligned representations effectively, we incorporate a Structural Guidance Depth Modulator (SGDM), which fuses aligned depth and image features through a gated attention mechanism. The proposed method achieves state-of-the-art performance (71.5% mAP, 73.6% NDS) on the nuScenes validation set.
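For concreteness, the sketch below shows one way a gated attention mechanism can fuse aligned depth and image features, in the spirit of the SGDM described above. It is a minimal illustration under assumed shapes and names: the module name `GatedDepthImageFusion`, the channel count, and the tensor sizes are hypothetical and are not the paper's actual implementation.

```python
# Minimal PyTorch sketch of gated fusion of aligned depth and image features.
# All names, shapes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class GatedDepthImageFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Map concatenated depth/image features to a per-location gate in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Light output projection applied after gating.
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, depth_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # depth_feat, img_feat: spatially aligned maps of shape (B, C, H, W).
        g = self.gate(torch.cat([depth_feat, img_feat], dim=1))
        # The gate decides, per location, how strongly depth structure
        # modulates the image features before the final projection.
        fused = g * depth_feat + (1.0 - g) * img_feat
        return self.out(fused)

if __name__ == "__main__":
    fusion = GatedDepthImageFusion(channels=64)
    d = torch.randn(2, 64, 32, 88)   # depth-derived features (assumed shape)
    x = torch.randn(2, 64, 32, 88)   # image features (assumed shape)
    print(fusion(d, x).shape)        # torch.Size([2, 64, 32, 88])
```

The design choice illustrated here is the convex combination controlled by a learned sigmoid gate, so regions with sharp, reliable depth structure can weight the depth branch more heavily while texture-dominated regions fall back to the image features; the paper's SGDM may differ in its exact attention formulation.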