This paper questions the ability of state-of-the-art text-to-image diffusion models to accurately represent diverse cultural nuances, and introduces the CultDiff benchmark to evaluate how well these models generate images incorporating cultural features from 10 countries. We show that generated cultural elements such as architecture, clothing, and food, especially those from marginalized regions, exhibit deficiencies compared to real images. Through a detailed analysis of several similarity aspects, we find significant gaps in cultural relevance, descriptive fidelity, and realism, and we develop CultDiff-S, a neural network-based image-to-image similarity metric trained on collected human ratings that predicts human judgments of cultural similarity between real and generated images. In conclusion, we emphasize the need for generative AI systems that serve a wide range of cultures equitably, supported by fair dataset representation.