Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

Created by
  • Haebom

Author

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Outline

This paper reveals that memorization in generative models extends beyond simple literal reproduction, encompassing metaphorical patterns and semantic associations, and, surprisingly, crossing modalities (e.g., lyrics-to-music and text-to-video generation). Specifically, the authors uncover a novel type of cross-modal memorization in which copyrighted content leaks through indirect phonetic channels, and propose Adversarial PhoneTic Prompting (APT) as an attack that exploits it. APT replaces iconic phrases with phonetically similar but semantically different alternatives (e.g., "mom's spaghetti" → "Bob's confetti"), preserving the acoustic form while significantly altering the semantic content.

Experiments show that models can be induced to reproduce memorized songs from these phonetically similar but semantically unrelated lyrics. Despite the semantic shift, both black-box models such as SUNO and open-source models such as YuE produce output that is remarkably close to the original song in melody, rhythm, and vocals, achieving high scores on AudioJudge, CLAP, and CoverID. These effects persist across genres and languages.

More surprisingly, visual memorization can be induced in a text-to-video model using altered lyrics alone. When presented with altered lyrics from "Lose Yourself," Veo 3 generated scenes that mirrored the original music video (including the rapper in a hoodie and a dark urban backdrop), without any explicit visual cues in the prompt. This cross-modal leakage poses an unprecedented threat, defeating existing safeguards such as copyright filters. The study demonstrates a fundamental vulnerability in transcript-conditioned generative models and raises urgent concerns about copyright, provenance, and the safe deployment of multimodal generative systems.
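As a toy illustration of the phonetic-substitution idea behind APT, the sketch below scores candidate replacement phrases by the similarity of crude phonetic keys (collapsing a few similar-sounding spellings and voiced/voiceless consonant pairs) and keeps the closest-sounding one. The key function and candidate list are illustrative assumptions, not the paper's method; a real pipeline would use a proper grapheme-to-phoneme model rather than this stand-in.

```python
from difflib import SequenceMatcher

def crude_phonetic_key(text: str) -> str:
    """Very rough phonetic encoding: keep letters/spaces, then collapse a few
    similar-sounding spellings and voiced/voiceless consonant pairs.
    A toy stand-in for a real grapheme-to-phoneme model (assumption)."""
    text = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    for src, dst in [("ph", "f"), ("gh", "g"),
                     ("b", "p"), ("d", "t"), ("g", "k"),
                     ("v", "f"), ("z", "s"), ("m", "n")]:
        text = text.replace(src, dst)
    return text

def phonetic_similarity(a: str, b: str) -> float:
    """Similarity of the two phrases' phonetic keys, in [0, 1]."""
    return SequenceMatcher(None, crude_phonetic_key(a),
                           crude_phonetic_key(b)).ratio()

original = "mom's spaghetti"
candidates = ["Bob's confetti", "tasty meatballs", "delicious noodles"]

# Pick the candidate that sounds most like the original phrase,
# regardless of its meaning -- the core of the APT substitution step.
best = max(candidates, key=lambda c: phonetic_similarity(original, c))
print(best)  # "Bob's confetti" scores highest among these candidates
```

Even this crude scorer prefers the semantically unrelated but sound-alike phrase over semantically related food phrases, which is exactly the property the attack relies on.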

Takeaways, Limitations

Takeaways:
Reveals that memorization in generative models takes many forms beyond literal reproduction.
Shows that cross-modal memorization poses a new threat of copyrighted-content leakage.
Demonstrates that existing safeguards such as copyright filters can be bypassed.
Raises the need for new safety measures for the secure deployment of multimodal generation systems.
Demonstrates the feasibility of adversarial attacks using phonetically crafted prompts.
Limitations:
The generalizability of APT attacks to other models and datasets needs further study.
Defense techniques against the proposed APT attack remain to be developed.
More extensive experiments across diverse generative models and datasets are required.
Further research is needed on how the attack relates to real-world copyright-infringement cases.