This paper presents a system for editing complex real-world audio scenes. The system can delete, insert, and enhance individual audio events in scenes where sound sources overlap temporally. It is driven by textual edit descriptions (e.g., "enhance door sound") together with graphical representations of event timing derived from event-roll transcriptions. The model is an encoder-decoder transformer operating on a SoundStream representation, trained on pairs of synthetic (input, desired output) audio examples generated by adding isolated audio events to real-world backgrounds. Evaluation results reveal the importance of each part of the edit description (action, class, and timing) and demonstrate that faithful "reconstruction" of the unedited portions of the scene has important practical applications.
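The synthetic training-pair construction described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's actual pipeline: it assumes isolated events have already been placed on the background's timeline as same-length waveforms, and it builds an (input, target) pair for each of the three edit actions by simple addition and subtraction of waveforms.

```python
import numpy as np

def make_edit_pair(background, events, edit, gain_db=6.0):
    """Build one synthetic (input, target) pair for an edit operation.

    background : 1-D float array, a real-world background recording
    events     : dict mapping class name -> 1-D float array (an isolated
                 event already padded/placed on the background's timeline)
    edit       : tuple (action, class_name), where action is
                 "delete", "insert", or "enhance"

    Hypothetical sketch; the paper's actual data pipeline is not shown.
    """
    action, cls = edit
    # Input scene: background plus all placed events.
    mix_in = background + sum(e for e in events.values())
    if action == "delete":
        # Target drops the named event from the mixture.
        target = mix_in - events[cls]
    elif action == "insert":
        # Input lacks the event; target contains it.
        mix_in = mix_in - events[cls]
        target = mix_in + events[cls]
    elif action == "enhance":
        # Target boosts the named event by gain_db decibels.
        gain = 10.0 ** (gain_db / 20.0)
        target = mix_in + (gain - 1.0) * events[cls]
    else:
        raise ValueError(f"unknown action: {action}")
    return mix_in, target
```

Training on such pairs lets the model learn each edit without requiring real before/after recordings of the same scene.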