This paper systematically evaluated the GPT-5 family and GPT-4o on four publicly available mammography datasets (EMBED, INbreast, CMMD, and CBIS-DDSM) for BI-RADS assessment, anomaly detection, and malignancy classification. While GPT-5 outperformed the other GPT models, it fell short of both human experts and domain-specific fine-tuned models. On each dataset, GPT-5 demonstrated notable performance across finding types (dense tissue, architectural distortion, mass, and microcalcification) and in malignancy classification, but its sensitivity and specificity remained below those of human experts. Nevertheless, the substantial performance improvement from GPT-4o to GPT-5 suggests that large language models (LLMs) have the potential to support mammography VQA tasks in the future.
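For reference, the sensitivity and specificity compared here are assumed to follow their conventional definitions for the binary malignancy classification task, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.
\]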