This paper proposes a multimodal protein representation learning framework that leverages both protein sequence and 3D structural information. It combines the strengths of a Transformer-based protein language model (pLM), pre-trained on large-scale protein sequence data, with a graph neural network (GNN) that encodes 3D structure. The framework enables effective information exchange between the two modalities through attention and gating mechanisms. Specifically, a bi-hierarchical fusion approach integrates sequence and structural information at both the local (residue) and global (protein) levels. The proposed method outperforms existing methods on a range of protein representation learning benchmarks, including enzyme EC number classification, model quality assessment, protein-ligand binding affinity prediction, protein-protein binding site prediction, and B-cell epitope prediction, establishing a new state of the art in multimodal protein representation learning.
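The attention-and-gating fusion described above can be illustrated with a minimal NumPy sketch. This is not the paper's exact architecture: the projection matrices, dimensions, sigmoid gate, and mean-pooling for the global level are all illustrative assumptions. It shows only the general pattern of sequence tokens cross-attending to structure nodes, with a learned gate deciding how much structural context to mix into each residue embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_gated_fusion(seq_emb, struct_emb, rng):
    """Fuse per-residue pLM embeddings with GNN structure embeddings.

    seq_emb:    (L, d) per-residue sequence features from the pLM.
    struct_emb: (L, d) per-residue structure features from the GNN.
    Returns (local fused features, global pooled representation).
    """
    L, d = seq_emb.shape
    # Hypothetical projection weights (random here, learned in practice).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Cross-attention: sequence tokens attend to structure nodes.
    Q, K, V = seq_emb @ Wq, struct_emb @ Wk, struct_emb @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    attended = attn @ V  # (L, d) structural context per residue
    # Gating: sigmoid gate controls the sequence/structure mix per residue.
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    gate = 1.0 / (1.0 + np.exp(-np.concatenate([seq_emb, attended], axis=-1) @ Wg))
    fused = gate * attended + (1.0 - gate) * seq_emb  # local-level fusion
    # Global-level fusion: pool residues into one protein-level vector
    # (mean-pooling is an assumption for this sketch).
    return fused, fused.mean(axis=0)

rng = np.random.default_rng(0)
L, d = 8, 16
per_residue, per_protein = cross_attention_gated_fusion(
    rng.standard_normal((L, d)), rng.standard_normal((L, d)), rng)
```

In the full framework such a block would be stacked and trained end to end; the sketch only makes the data flow of the local and global fusion levels concrete.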