This paper analyzes 15 large language models (LLMs) and finds that the maximum softmax probability (MSP) of LLMs fine-tuned for chat is consistently miscalibrated on multiple-choice Q&A. Nevertheless, MSPs can still carry useful uncertainty information. We hypothesize that incorrect answers are associated with smaller MSPs than correct answers, and rigorous statistical testing shows that this hypothesis holds for models that perform well on the underlying Q&A task. We also find a strong directional correlation between Q&A accuracy and the accuracy of MSP-based correctness predictions, but no correlation between Q&A accuracy and calibration error. This suggests that, within the current fine-tuning paradigm, improving LLM performance is likely to yield better correctness prediction rather than better calibration. Finally, we present experimental results demonstrating that selectively rejecting responses based on the MSP can improve performance.
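
To make the selective-rejection idea concrete, the following is a minimal illustrative sketch (not the paper's actual pipeline): it computes the MSP over the logits of the multiple-choice answer options and abstains when the MSP falls below a threshold. The function name, the use of NumPy, and the threshold value are assumptions for illustration only.

```python
import numpy as np

def msp_selective_answer(option_logits: np.ndarray, threshold: float = 0.5):
    """Illustrative MSP-based selective answering.

    option_logits: model logits restricted to the answer-option tokens
    (e.g. "A", "B", "C", "D"). The threshold is a placeholder value,
    not a number taken from the paper.
    """
    # Numerically stable softmax over the answer-option logits.
    shifted = option_logits - option_logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()

    msp = float(probs.max())      # maximum softmax probability
    choice = int(probs.argmax())  # index of the most likely option

    # Reject (abstain) when confidence is below the threshold.
    if msp < threshold:
        return None, msp
    return choice, msp

# A peaked distribution is answered; a diffuse one is rejected.
print(msp_selective_answer(np.array([3.2, 0.1, -0.5, 0.0])))   # (0, high MSP)
print(msp_selective_answer(np.array([0.3, 0.2, 0.1, 0.25])))   # (None, low MSP)
```

The threshold trades off coverage against accuracy: raising it answers fewer questions but, if incorrect answers indeed have smaller MSPs, the answered subset becomes more accurate.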