Major language model (LM) providers such as OpenAI and Anthropic allow their state-of-the-art LMs to be fine-tuned for specific purposes, and to prevent abuse they apply filters that block fine-tuning on overtly harmful data. Just as prior work has shown that safety alignment is "shallow," we show that existing fine-tuning attacks are shallow as well: they target only the first few tokens of the model's response, and can therefore be blocked by generating those initial tokens with the aligned model. In this paper, we strengthen the attack with a "refuse-then-comply" strategy, which trains the model to first refuse a harmful request and then respond to it, bypassing shallow defenses and producing harmful responses that evade the output filter. Experiments demonstrate the effectiveness of this new fine-tuning attack on both open-source and commercial models, achieving attack success rates of 57% on GPT-4o and 72% on Claude Haiku. This research was awarded a $2,000 bug bounty by OpenAI and was acknowledged as a vulnerability by Anthropic. In conclusion, it shows that a model is not safe merely because it initially refuses a malicious request, and it broadens awareness of the range of attacks that fine-tuning APIs face in deployment.