This paper addresses the complexity of measuring gender stereotype bias in language models and the limitations of existing benchmarks. We argue that these benchmarks fail to capture the multifaceted nature of gender stereotypes and therefore offer only a partial picture of bias. Using StereoSet and CrowS-Pairs as case studies, we investigate how data distribution affects benchmark results. By applying a social-psychological framework to balance the benchmark data, we demonstrate that simple balancing techniques significantly strengthen the correlations between different bias measures. Ultimately, we underscore the complexity of gender stereotypes in language models and suggest new directions for developing more sophisticated techniques to detect and mitigate bias.