This paper presents a novel evaluation framework based on free-form storytelling to uncover gender bias in large language models (LLMs). Analyzing ten widely used LLMs, we find a consistent pattern of female characters being overrepresented across occupations, likely as a result of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Paradoxically, despite this overrepresentation, the occupational gender distributions generated by the LLMs align more closely with human stereotypes than with real-world labor statistics. These findings underscore the need for balanced mitigation strategies that promote fairness without introducing new biases. The prompts and LLM-generated stories are publicly available on GitHub.