In this paper, we propose HeCoFuse, a unified framework for real-world vehicle-to-everything (V2X) cooperative perception under heterogeneous sensor configurations, where nodes may be equipped with cameras (C), LiDARs (L), or both (LC). To address the misalignment and imbalanced representation quality of multi-modal features, we introduce a hierarchical fusion mechanism that adaptively weights features through a combination of channel-wise and spatial attention. In addition, an adaptive spatial resolution adjustment module balances computational cost against fusion effectiveness. To enhance robustness across diverse configurations, we further employ a collaborative learning strategy that dynamically adjusts the fusion type according to the modalities available at each node. Experiments on the real-world TUMTraf-V2X dataset show that HeCoFuse achieves 43.22% 3D mAP in the full-sensor configuration (LC+LC), outperforming the CoopDet3D baseline by 1.17%, and reaches an even higher 43.38% 3D mAP in the L+LC scenario. It also ranks first in the CVPR 2025 DriveX challenge, maintaining 21.74% to 43.38% 3D mAP across nine heterogeneous sensor configurations.
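To make the hierarchical fusion mechanism concrete, the following is a minimal PyTorch sketch of how channel-wise and spatial attention could jointly re-weight concatenated camera and LiDAR bird's-eye-view (BEV) features before projection. The module name, layer choices, and tensor shapes are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class HierarchicalAttentionFusion(nn.Module):
    """Sketch: fuse camera and LiDAR BEV feature maps with
    channel-wise attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze global context, excite per-channel weights.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: per-location weight from pooled channel statistics.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the re-weighted concatenation back to `channels`.
        self.out_proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([cam_feat, lidar_feat], dim=1)  # (B, 2C, H, W)
        x = x * self.channel_att(x)                   # re-weight channels
        avg_map = x.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)         # (B, 1, H, W)
        x = x * self.spatial_att(torch.cat([avg_map, max_map], dim=1))
        return self.out_proj(x)                       # (B, C, H, W)

fusion = HierarchicalAttentionFusion(channels=64)
cam = torch.randn(2, 64, 128, 128)    # camera BEV features
lidar = torch.randn(2, 64, 128, 128)  # LiDAR BEV features
fused = fusion(cam, lidar)            # -> (2, 64, 128, 128)
```

In this sketch, the channel attention decides which modality's feature channels to emphasize (e.g., down-weighting degraded camera features), while the spatial attention handles per-location quality variation; a modality-aware system would additionally fall back to a single-branch path when only C or only L is available at a node.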