This paper addresses concerns about whether text-to-image (T2I) models can accurately represent diverse cultural contexts, and presents the first study to systematically quantify how well T2I models and evaluation metrics align with explicit and implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark spanning ten countries and five sociocultural domains. CulturalFrames comprises 983 prompts, 3,637 images generated by four state-of-the-art T2I models, and over 10,000 detailed human annotations. Our results show that, across models and countries, cultural expectations go unmet 44% of the time on average. Explicit expectations are missed at a surprisingly high rate of 68%, and implicit expectations at 49%. Furthermore, existing T2I evaluation metrics, regardless of their underlying inference methods, correlate poorly with human judgments of cultural consistency. Overall, this study exposes important gaps, provides a concrete testbed, and suggests actionable directions for developing culturally sensitive T2I models and metrics that improve global usability.