In this paper, we propose Learn2Diss, a novel framework for self-supervised learning of speech representations. Unlike conventional frame-wise masked prediction approaches, Learn2Diss captures both the frame-level (phonetic) and utterance-level (e.g., speaker and channel) characteristics of speech by combining a frame-wise encoder with an utterance-wise encoder. The frame-wise encoder learns pseudo-phoneme representations using conventional self-supervised learning techniques, while the utterance-wise encoder learns pseudo-speaker representations through contrastive learning. The two encoders are trained separately, with a mutual-information-based criterion used to disentangle their representations. Through evaluations on a range of downstream tasks, we show that the frame-wise encoder improves performance on semantic tasks, while the utterance-wise encoder improves performance on non-semantic tasks. As a result, Learn2Diss achieves state-of-the-art performance across a variety of tasks.
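To make the two-encoder setup concrete, the following is a minimal PyTorch-style sketch of the training objective described above. It is not the authors' implementation: the encoder architectures, the CLUB-style variational estimator of mutual information, the `nt_xent` contrastive stand-in, and all hyperparameters (dimensions, the 0.1 MI weight) are illustrative assumptions.

```python
# Illustrative sketch only: architectures, the CLUB-style MI estimator, and all
# hyperparameters below are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Frame-wise encoder: per-frame features plus pseudo-phoneme logits
    (a stand-in for a HuBERT-style masked-prediction backbone)."""
    def __init__(self, feat_dim=80, hidden=256, n_pseudo_phonemes=100):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_pseudo_phonemes)

    def forward(self, x):                      # x: (B, T, feat_dim)
        h, _ = self.rnn(x)                     # (B, T, hidden)
        return h, self.head(h)                 # features, pseudo-phoneme logits

class UtteranceEncoder(nn.Module):
    """Utterance-wise encoder: one embedding per utterance, trained
    contrastively to capture pseudo-speaker identity."""
    def __init__(self, feat_dim=80, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        h, _ = self.rnn(x)                     # (B, T, hidden)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)  # (B, emb_dim)

class MIEstimator(nn.Module):
    """Variational estimator of the mutual information between pooled frame
    features and the utterance embedding, in the spirit of CLUB. Minimizing
    its output pushes the two encoders toward disentangled representations.
    (In practice the estimator itself would be trained with a separate
    log-likelihood objective, alternating with encoder updates.)"""
    def __init__(self, frame_dim=256, utt_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU(),
                                 nn.Linear(256, utt_dim))

    def forward(self, frame_feats, utt_emb):
        pred = self.net(frame_feats.mean(dim=1))            # (B, utt_dim)
        pos = -((pred - utt_emb) ** 2).sum(-1)              # matched pairs
        neg = -((pred.unsqueeze(1) - utt_emb.unsqueeze(0)) ** 2).sum(-1)
        return (pos - neg.mean(dim=1)).mean()               # CLUB-style bound

def nt_xent(e1, e2, tau=0.1):
    """InfoNCE between two augmented views of each utterance
    (a simple stand-in for the pseudo-speaker contrastive loss)."""
    logits = e1 @ e2.t() / tau
    return F.cross_entropy(logits, torch.arange(e1.size(0)))

# Toy training step on random features and random pseudo-phoneme targets.
frame_enc, utt_enc, mi = FrameEncoder(), UtteranceEncoder(), MIEstimator()
x = torch.randn(4, 200, 80)                    # 4 utterances of 200 frames
targets = torch.randint(0, 100, (4, 200))      # pseudo-phoneme cluster IDs
frame_feats, logits = frame_enc(x)
e1, e2 = utt_enc(x + 0.01 * torch.randn_like(x)), utt_enc(x)
loss = (F.cross_entropy(logits.transpose(1, 2), targets)  # frame-wise SSL loss
        + nt_xent(e1, e2)                                 # utterance contrastive loss
        + 0.1 * mi(frame_feats, e2))                      # MI penalty (weight assumed)
loss.backward()
```

The key design point the sketch illustrates is that the frame-wise and utterance-wise objectives are computed independently, while the mutual-information term is the only place the two branches interact, which is what encourages the learned representations to disentangle.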