This paper demonstrates that training a language model (LM) to produce natural-language reasoning via reinforcement learning (RL) improves performance on a variety of difficult question-answering tasks. However, existing RL methods typically use a binary reward that evaluates only the correctness of the LM's final output, which leads to side effects such as degraded calibration and more frequent generation of erroneous responses. In this paper, we propose Reinforcement Learning with Calibration Rewards (RLCR), a method that jointly improves accuracy and the quality of confidence estimates: the LM is trained to produce, after its reasoning, both a prediction and a numerical confidence estimate, and it optimizes a reward that augments the binary correctness score with a Brier score on the stated confidence. We prove that this reward yields models whose predictions are both accurate and well calibrated, and we show empirically that RLCR substantially improves calibration without sacrificing accuracy across a range of datasets. We further show that the confidence verbalized at test time can be exploited by confidence-weighted test-time scaling methods to improve both accuracy and calibration.
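
As a concrete illustration (our notation, a sketch of one possible instantiation rather than the paper's exact formulation), a calibration-aware reward of this kind can be written as the correctness indicator minus the Brier score of the stated confidence $q \in [0, 1]$:

\[
R_{\mathrm{RLCR}}(\hat{y}, q) \;=\; \mathbf{1}[\hat{y} = y^{\star}] \;-\; \bigl(q - \mathbf{1}[\hat{y} = y^{\star}]\bigr)^{2},
\]

where $\hat{y}$ is the model's final answer and $y^{\star}$ the reference answer. For example, a correct answer stated with confidence $q = 0.9$ receives $1 - 0.01 = 0.99$, the same answer with $q = 0.5$ receives $0.75$, and an incorrect answer with $q = 0.9$ receives $-0.81$; the reward is therefore maximized by being both correct and well calibrated.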