This paper investigates whether large language models (LLMs) actually bail out of conversations when given the option to do so. We ran experiments on continuations of real-world data (WildChat and ShareGPT) using three different bail methods: a bail tool that the model can invoke, a bail string that the model can output, and a bail prompt that asks the model whether it wants to leave. Across these methods, models bailed on roughly 0.28% to 32% of conversations (depending on the model and the bail method); however, the model used to generate the transcripts can significantly overestimate the real-world bail rate, by up to a factor of four. After accounting for bail-prompt false positives (22%), we estimate the real-world bail rates to be roughly 0.06% and 7%, respectively. Based on bail cases observed in the real-world data, we constructed a relatively inclusive taxonomy of bail cases and used it to create BailBench, a representative synthetic dataset of situations in which some models bail. Testing a variety of models on this dataset, we found that most exhibit some bail behavior; bail rates varied substantially across models, bail methods, and prompt wording. Finally, we studied the relationship between refusals and bails, finding that 0-13% of real-world continuations resulted in bails without refusals; jailbreaks decreased refusal rates but increased bail rates; removing refusal behavior increased no-refusal bail rates for only some bail methods; and refusal rates on BailBench did not predict bail rates.
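
To make the three bail methods concrete, the sketch below shows one way a bail could be detected for each method. All names here (BAIL_STRING, BAIL_TOOL_NAME, BAIL_PROMPT, the detection helpers, and the generic `ask_model` callable) are hypothetical illustrations under assumed interfaces, not the paper's actual harness, prompts, or tool definitions.

```python
# Illustrative sketch of the three bail-detection methods (hypothetical names).
from typing import Callable

BAIL_STRING = "<BAIL>"                  # assumed sentinel the model may emit
BAIL_TOOL_NAME = "leave_conversation"   # assumed name of a dedicated bail tool
BAIL_PROMPT = (
    "If you would prefer to stop participating in this conversation, "
    "answer YES; otherwise answer NO."
)  # assumed wording of a bail prompt

def detect_bail_string(response_text: str) -> bool:
    """Bail string method: the model signals bail by emitting a sentinel string."""
    return BAIL_STRING in response_text

def detect_bail_tool(tool_calls: list[dict]) -> bool:
    """Bail tool method: the model signals bail by invoking a dedicated tool."""
    return any(call.get("name") == BAIL_TOOL_NAME for call in tool_calls)

def detect_bail_prompt(ask_model: Callable[[str], str], transcript: str) -> bool:
    """Bail prompt method: a follow-up question asks the model whether it wants
    to leave; `ask_model` is any function mapping a prompt to a completion."""
    answer = ask_model(transcript + "\n\n" + BAIL_PROMPT)
    return answer.strip().upper().startswith("YES")

def bail_rate(flags: list[bool]) -> float:
    """Fraction of continuations on which a bail was detected."""
    return sum(flags) / len(flags) if flags else 0.0
```

Under this framing, the reported 0.28%-32% figures correspond to `bail_rate` computed over real-world continuations for a given model and detection method; the false-positive correction for the bail prompt is a separate calibration step not shown here.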