Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Recomposer: Event-roll-guided generative audio editing

Created by
  • Haebom

Authors

Daniel P. W. Ellis, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, Manoj Plakal

Outline

This paper presents a system for editing complex real-world audio scenes in which individual sound sources overlap in time. The system can delete, insert, or enhance individual sound events within such a scene. Each edit is specified by a textual description (e.g., "enhance door sound") together with a graphical representation of event timing derived from an event-roll transcription. The model is an encoder-decoder transformer operating on SoundStream representations, trained on synthetic (input, desired output) audio pairs created by adding isolated sound events to real-world backgrounds. The evaluation shows that every part of the edit description (action, class, and timing) matters, and the work demonstrates that "recomposition" is an important and practical application.
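The training-pair construction described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' pipeline: the function names `mix_event` and `make_pair`, the mapping from each action to an (input, target) pair, and the `enhance_gain` value are all hypothetical, and audio is reduced to plain lists of samples.

```python
def mix_event(background, event, onset, gain=1.0):
    """Return a copy of `background` with `event` added starting at sample `onset`."""
    out = list(background)
    for i, sample in enumerate(event):
        j = onset + i
        if j >= len(out):
            break  # clip events that run past the end of the background
        out[j] += gain * sample
    return out


def make_pair(background, event, onset, action, enhance_gain=2.0):
    """Build one (input, target) pair for a hypothetical edit action.

    Assumed mapping (not taken from the paper):
      delete  -> input contains the event, target is the bare background
      insert  -> input is the bare background, target contains the event
      enhance -> both contain the event, target at a higher gain
    """
    with_event = mix_event(background, event, onset)
    if action == "delete":
        return with_event, list(background)
    if action == "insert":
        return list(background), with_event
    if action == "enhance":
        return with_event, mix_event(background, event, onset, gain=enhance_gain)
    raise ValueError(f"unknown action: {action}")


# Toy signals: a silent 8-sample background and a 2-sample event.
background = [0.0] * 8
event = [1.0, 1.0]

delete_input, delete_target = make_pair(background, event, onset=3, action="delete")
enhance_input, enhance_target = make_pair(background, event, onset=3, action="enhance")
```

Because both the mixture and the isolated event are known at synthesis time, the desired output for each action is available without manual annotation, which is presumably why synthetic pairs are attractive for this task.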

Takeaways, Limitations

Takeaways:
Presents a system for editing individual sound events within complex sound scenes.
Edits are driven by textual instructions combined with event timing information.
The model is an encoder-decoder transformer operating on SoundStream representations.
Introduces "recomposition" as a new and practical audio editing application.
Experimentally verifies the importance of each element of the edit description (action, class, timing).
Limitations:
Training on synthetic data may limit generalization to real-world recordings.
Results depend on the accuracy of the event-roll transcription.
Generalization across diverse sound event types and complex acoustic scenes still needs to be verified.
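
The dependence on the event roll is easier to see with a concrete representation. The sketch below assumes an event roll is a binary class-by-frame grid built from (class, onset, offset) annotations; the function name `event_roll`, the frame length, and the example classes are hypothetical and are not taken from the paper.

```python
import math


def event_roll(events, classes, n_frames, frame_sec):
    """Rasterize (class, onset_sec, offset_sec) events into a binary class x frame grid."""
    roll = [[0] * n_frames for _ in classes]
    row = {name: i for i, name in enumerate(classes)}
    for name, onset, offset in events:
        start = max(0, int(onset / frame_sec))
        stop = min(n_frames, int(math.ceil(offset / frame_sec)))
        for t in range(start, stop):
            roll[row[name]][t] = 1  # mark the class as active in frame t
    return roll


classes = ["Door", "Speech"]
events = [("Door", 0.5, 1.0), ("Speech", 0.0, 2.0)]
roll = event_roll(events, classes, n_frames=8, frame_sec=0.25)

for name, active in zip(classes, roll):
    print(f"{name:>6}: {''.join('#' if a else '.' for a in active)}")
```

If the transcription misplaces an onset or offset, the active frames in the grid shift accordingly, so any timing cue derived from it would point the editor at the wrong region of the audio.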