This paper examines vulnerabilities in the multilingual and multimodal capabilities of recently released large language models (LLMs). Unlike prior work, which has focused primarily on English, we present two novel strategies for bypassing LLM safety filters in text and image generation tasks, based on code-mixing and phonetic transformations, and demonstrate their effectiveness over existing methods. By applying phonetic misspellings to sensitive words in code-mixed prompts, we bypass LLM safety filters while preserving interpretability. Our attacks achieve a 99% attack success rate for text generation and 78% for image generation, with attack relevance rates of 100% and 95%, respectively. Experiments show that the phonetic transformations alter word tokenization, which underlies the successful attacks. Given that prompts containing misspellings are plausible in real-world settings, we highlight the need for research on generalized safety alignment of multilingual, multimodal models. This paper includes examples of potentially harmful and objectionable content.
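
The tokenization observation can be illustrated with a minimal sketch, not taken from the paper: it assumes the `tiktoken` library and a GPT-style BPE vocabulary (`cl100k_base`), and uses a benign word pair to show that a phonetic respelling is generally segmented into different subword pieces than the standard spelling.

```python
# Illustrative sketch (assumption: tiktoken with the cl100k_base BPE vocabulary).
# It only inspects how a tokenizer segments a standard spelling versus a
# phonetic respelling of the same benign word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def subword_pieces(text: str) -> list[str]:
    """Return the subword pieces the BPE tokenizer produces for `text`."""
    return [enc.decode([token_id]) for token_id in enc.encode(text)]

# Benign example pair: standard spelling vs. a phonetic respelling.
for word in ["chemical", "kemikal"]:
    pieces = subword_pieces(word)
    print(f"{word!r}: {len(pieces)} token(s) -> {pieces}")

# The exact splits depend on the vocabulary, but the respelled form is
# typically broken into different (and usually more) subword pieces,
# which is the tokenization shift the abstract refers to.
```
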