In this paper, we present a portable, lightweight gripper with integrated tactile sensors to address the lack of tactile sensing in existing portable grippers, which are widely used for collecting human demonstration data due to their portability and versatility. Using this gripper, we simultaneously collect visual and tactile data from various real-world environments, and we propose a cross-modal representation learning framework that integrates visual and tactile signals while preserving the unique characteristics of each modality. The learning process yields interpretable representations that consistently attend to the contact regions involved in physical interactions. We apply these representations to fine-grained manipulation tasks such as test-tube insertion and pipette-based fluid transfer, enabling robotic manipulation with improved accuracy and robustness under external disturbances.
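To make the idea of cross-modal representation learning concrete, the following is a minimal, hypothetical sketch (not the paper's actual architecture): paired visual and tactile features are aligned in a shared embedding space with a contrastive objective, while small per-modality reconstruction heads preserve modality-specific characteristics. All module names, feature dimensions, and loss weights here are illustrative assumptions.

```python
# Hypothetical sketch of a visual-tactile cross-modal learner (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEncoder(nn.Module):
    def __init__(self, vis_dim=512, tac_dim=64, shared_dim=128):
        super().__init__()
        # Modality-specific encoders project into a shared embedding space.
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 256), nn.ReLU(), nn.Linear(256, shared_dim))
        self.tac_enc = nn.Sequential(nn.Linear(tac_dim, 128), nn.ReLU(), nn.Linear(128, shared_dim))
        # Decoders reconstruct each modality, retaining information that
        # alignment alone would tend to discard.
        self.vis_dec = nn.Linear(shared_dim, vis_dim)
        self.tac_dec = nn.Linear(shared_dim, tac_dim)

    def forward(self, vis_feat, tac_feat):
        z_v = F.normalize(self.vis_enc(vis_feat), dim=-1)
        z_t = F.normalize(self.tac_enc(tac_feat), dim=-1)
        return z_v, z_t

def loss_fn(model, vis_feat, tac_feat, temperature=0.07):
    z_v, z_t = model(vis_feat, tac_feat)
    # InfoNCE-style alignment: time-synchronized visual/tactile pairs are positives.
    logits = z_v @ z_t.t() / temperature
    labels = torch.arange(len(z_v), device=z_v.device)
    align = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    # Reconstruction terms preserve modality-specific detail in each embedding.
    recon = F.mse_loss(model.vis_dec(z_v), vis_feat) + F.mse_loss(model.tac_dec(z_t), tac_feat)
    return align + recon

# Usage: batches of precomputed visual and tactile features of matching length.
model = CrossModalEncoder()
loss = loss_fn(model, torch.randn(32, 512), torch.randn(32, 64))
```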