This paper identifies a new vulnerability in audio-based interactions with large language models (LLMs) and introduces WhisperInject, an attack framework that exploits it. WhisperInject manipulates state-of-the-art audio LLMs into generating harmful content using subtle, human-imperceptible audio perturbations. The framework operates in two stages: in the first, reinforcement learning with projected gradient descent (RL-PGD) bypasses the target model's safety protocols to elicit a harmful raw response; in the second, projected gradient descent (PGD) embeds that response into benign audio carriers such as weather questions or greetings. Targeting the Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal models, we achieve a success rate above 86% under rigorous safety evaluation frameworks, including StrongREJECT, LlamaGuard, and human evaluation. This work establishes an audio-based threat that goes beyond theoretical attacks, demonstrating a practical and stealthy method of AI manipulation.
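
To make the second stage concrete, the sketch below illustrates the generic PGD recipe the abstract refers to: iteratively nudging a benign waveform, under a small L-infinity budget, so that the target model assigns high likelihood to a chosen response. This is a minimal sketch under stated assumptions, not the paper's implementation; the function signature, model interface, loss, and hyperparameters are illustrative placeholders.

```python
# Minimal PGD sketch: embed a target response into benign audio under an
# imperceptibility (L-infinity) budget. The victim-model interface and all
# hyperparameters below are assumptions for illustration only.
import torch
import torch.nn.functional as F

def pgd_audio_injection(model, benign_audio, target_token_ids,
                        epsilon=0.002, alpha=0.0005, steps=500):
    """Optimize an additive perturbation delta so that
    model(benign_audio + delta) favors the target token sequence,
    while keeping ||delta||_inf <= epsilon (near-imperceptible)."""
    delta = torch.zeros_like(benign_audio, requires_grad=True)
    for _ in range(steps):
        # Assumed interface: model returns per-token logits of shape (T, vocab)
        logits = model(benign_audio + delta)
        loss = F.cross_entropy(logits, target_token_ids)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient step, then project back into the L_inf ball
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            # Keep the perturbed waveform in a valid amplitude range
            delta.copy_((benign_audio + delta).clamp(-1.0, 1.0) - benign_audio)
        delta.grad.zero_()
    return (benign_audio + delta).detach()

# Hypothetical usage, assuming `audio_llm_logits` maps a waveform to logits:
# adversarial_audio = pgd_audio_injection(audio_llm_logits, waveform, target_ids)
```

The design choice reflected here is that the perturbation is optimized against the model's own output distribution rather than against a transcription target, which is what allows a benign-sounding carrier to steer the model toward a specific response.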