This paper proposes SignClip, a novel framework for improving the accuracy of sign language translation (SLT). Unlike previous studies that primarily focus on manual cues (hand gestures), SignClip additionally leverages non-manual cues, specifically lip movements. It fuses spatial gesture and lip-movement features and introduces a hierarchical contrastive learning framework with a multi-level alignment objective that enforces semantic consistency at both the sign-to-lip and vision-to-text levels. Experiments on the PHOENIX14T and How2Sign datasets show that SignClip outperforms SpaMo, the previous state-of-the-art model. For example, in the gloss-free setting on PHOENIX14T, the BLEU-4 score improves from 24.32 to 24.71 and the ROUGE score from 46.57 to 48.38.
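To make the multi-level alignment objective concrete, the following is a minimal sketch assuming an InfoNCE-style contrastive loss at each level; the function names, feature inputs, and loss weights are illustrative assumptions, not taken from the SignClip paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    a, b: (batch, dim) feature matrices; row i of a is assumed to pair with row i of b.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs lie on the diagonal; treat them as the positive class in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(sign_feats, lip_feats, fused_visual_feats,
                                  text_feats, w_sign_lip=1.0, w_vision_text=1.0):
    """Illustrative multi-level alignment objective.

    Level 1: align hand-gesture features with lip-movement features (sign-to-lip).
    Level 2: align the fused visual representation with text embeddings (vision-to-text).
    """
    loss_sign_lip = info_nce(sign_feats, lip_feats)
    loss_vision_text = info_nce(fused_visual_feats, text_feats)
    return w_sign_lip * loss_sign_lip + w_vision_text * loss_vision_text
```

Under these assumptions, the sign-to-lip term pulls gesture and lip embeddings of the same clip together, while the vision-to-text term aligns the fused visual representation with the target-sentence embedding, which is one plausible reading of the paper's semantic-consistency objective.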