This paper reveals that memorization in generative models extends beyond literal reproduction, encompassing not only metaphorical patterns and semantic associations but also, surprisingly, cross-modal transfer (e.g., lyric-to-music and text-to-video generation). Specifically, we uncover a novel form of cross-modal memorization in which copyrighted content leaks through indirect speech channels, and we propose Adversarial Voice Prompting (AVP) as an attack that elicits it. AVP replaces iconic phrases with phonetically similar but semantically different alternatives (e.g., "mom's spaghetti" → "Bob's confetti"), preserving their acoustic form while substantially altering their meaning. Our experiments demonstrate that models can be induced to reproduce memorized songs from phonologically similar yet semantically unrelated lyrics. Despite the semantic shift, both black-box models (e.g., SUNO) and open-source models (e.g., YuE) produce output remarkably similar to the original song in melody, rhythm, and vocals, scoring highly under AudioJudge, CLAP, and CoverID. These effects persist across genres and languages. More surprisingly, we find that visual memorization can be induced in a text-to-video model from lyric prompts alone: when given altered lyrics from "Lose Yourself," Veo 3 generated scenes mirroring the original music video (including the rapper in a hoodie against a dark urban backdrop), despite the prompts containing no explicit visual cues. This cross-modal leakage poses an unprecedented threat, evading existing safeguards such as copyright filters. Our study exposes a fundamental vulnerability in transcription-based generative models and raises urgent concerns about copyright, provenance, and the safe deployment of multimodal generative systems.
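The core substitution idea — swapping an iconic phrase for one that sounds alike but means something else — can be sketched as a phonetic-similarity check. The snippet below is a minimal illustration, not the paper's pipeline: the ARPAbet-style phoneme strings are hand-written assumptions (a real attack would derive them with a grapheme-to-phoneme tool), and similarity is approximated with a simple sequence-matching ratio.

```python
from difflib import SequenceMatcher

# Hand-written ARPAbet-style transcriptions (illustrative assumptions;
# a real pipeline would obtain these from a G2P tool or CMUdict).
PHONEMES = {
    "mom's spaghetti": "M AA M Z S P AH G EH T IY",
    "bob's confetti":  "B AA B Z K AH N F EH T IY",
    "quiet library":   "K W AY AH T L AY B R EH R IY",
}

def phonetic_similarity(a: str, b: str) -> float:
    """Fraction of matching phonemes between two phrases (0.0-1.0)."""
    pa, pb = PHONEMES[a].split(), PHONEMES[b].split()
    return SequenceMatcher(None, pa, pb).ratio()

# The adversarial substitute stays acoustically close to the original,
# while an unrelated phrase does not.
print(phonetic_similarity("mom's spaghetti", "bob's confetti"))
print(phonetic_similarity("mom's spaghetti", "quiet library"))
```

A candidate substitute would be accepted when its phonetic similarity to the original phrase is high while its semantic similarity (e.g., under a text-embedding model) is low; only the phonetic half is shown here.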