Fugatto: NVIDIA's generative AI heralds an audio revolution

Status
2024.12
Summary
NVIDIA has unveiled 'Fugatto', a new generative AI model that can generate and transform music, speech, and other sounds from text and audio prompts. It handles a range of audio tasks, such as 1) generating music from text descriptions, 2) adding instruments to or removing them from existing music, and 3) modifying the intonation and emotion of a voice.
Category
  1. NVIDIA
Tag
  1. AI Sound
Dates
2024/12/01
Created by
  • mintbear


Mintbear AI News 2024.12.01
Hello. Today, we're going to take a closer look at Fugatto, an innovative sound generation AI model announced by NVIDIA.
(It has not been released to the public yet.)
The name Fugatto is short for Foundational Generative Audio Transformer Opus 1.

Swiss Army Knife for Sound

"A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text."

In this announcement, NVIDIA sounds confident that it has created a Swiss Army knife for audio. But is Fugatto really that useful?
Fugatto can create and transform audio content in a variety of ways, using text and audio prompts:
1️⃣ Create sound from text (Text-to-Sound, T2S)
2️⃣ Convert the sound to a different style
3️⃣ Separate and extract voices or specific elements from music
4️⃣ Create conversations that reflect intonation and emotion
5️⃣ Add instruments to your music
6️⃣ Change the instrument playing to singing
7️⃣ Create music with unique and fresh sound sources
(The above seven features are covered in more detail below.)
Almost any audio task you can imagine is possible.

What is the difference between Fugatto and Suno?

But how is it different from Suno, which I was already using? Suno and Udio are already the most powerful music creation tools and can create excellent music, but they do not let you freely convert, mix, or edit individual elements of the music they create.
By comparison, the technology shown in Fugatto can 1️⃣ generate, 2️⃣ transform, 3️⃣ extract, 4️⃣ voice, 5️⃣ add, 6️⃣ change, and 7️⃣ replace audio in many forms of sound or music, all with 'natural language prompts' written in almost everyday language.
In short, if Suno is a complete composition tool that creates music (songs) with vocals and instruments, Fugatto can be viewed as a tool for creating and editing all forms of audio, including music, voices, and sound effects.
Of course, if the goal is to compose music (songs), Suno will likely remain more convenient and give better results. However, for users who need a variety of sounds for purposes such as video production, Fugatto's convenience is hard to beat.

Let's unfold the tools hidden in this Swiss Army knife one by one.

7 Features of Fugatto

Here are the seven features revealed in Fugatto:
Feature 01. Audio Generation
Feature 02. Audio Transformation
Feature 03. Extract audio elements
Feature 04. Generate emotional voice and dialogue
Feature 05. Adding instruments to music
Feature 06. Convert Melody to Vocal
Feature 07. Create music with unique sounds
Below, we go through the workflow and the actual sounds from the Fugatto demo that has been released. (Original video link)

01. Audio Generation (Text-to-Sound)

Fugatto creates new music and sounds based on text prompts. Below is a sample of the sound generated using 'bass pulses and digital noise' to express 'the awakening of a great intelligence'. Listen for yourself.

Prompt: Deep, rumbling bass pulses paired with intermittent, high-pitched digital chirps, like the sound of a massive, sentient machine waking up.
The result makes good use of the character of both sounds. The model understands natural language descriptions of a sound's mood, such as "like the sound of a massive, sentient machine waking up", as well as arrangement directions such as "intermittent". This understanding of natural language prompts is a really important point.

02. Audio Morphing

Fugatto can transform and morph sounds in a variety of ways. It can turn simple sounds into completely different textures and emotions, creating unique sonic experiences for your film or audio projects.
Perhaps you could change a sad guitar note into an upbeat instrument or rhythm. For example, you could morph the sound of a passing train into an orchestral sound.
Prompt: Create a sound where a train passes by and becomes a lush string orchestra.
It's a really cool feature! However, it is not yet clear how you would control how many seconds the opening train sound lasts, or how detailed the string orchestra's instrumentation is. For an elaborate prompt, you might have to name the string instruments one by one.
We will have to wait and see how well this kind of user intent is reflected, and how convenient the detailed controls and UI/UX turn out to be.

03. Audio Element Extraction

You can cleanly isolate specific elements from audio, such as extracting only the voice track from a song.
Prompt: Isolate the voice track.

04. Create Conversation with Intonation and Emotion (Emotional Speech)

Generate conversational voices and transform them into different tones, emotions, and intonation styles. Change a calm voice into an angry one, or make it happy.
Prompt 1: In a calm voice, with an American accent say: "Kids are talking by the door."
Prompt 2: Turn this calm voice into an angry voice.
Prompt 3: Now make it happy.
Generating spoken dialogue from a script is already commercialized by ElevenLabs, but being able to freely shape tone, emotion, and intonation is truly impressive. Currently, most voice generation AIs have menu-based interfaces and support only a limited set of emotions and national accents, because they are built by training on specific emotions.
Fugatto, by contrast, simply applies whatever tone, intonation, and emotion you describe in natural language. It feels as if the natural conversation of OpenAI's Voice Mode has been brought to audio generation. Is this an effect of scaling laws that is only possible for companies like OpenAI and NVIDIA?
Then again, maybe only basic emotions such as angry and happy have been learned.
But there's one more amazing thing! The third natural language prompt is "Now make it happy." Given this prompt, Fugatto automatically recognizes the previous conversation context and the sound it generated earlier, and edits that sound accordingly.
Normally, this would require repeating the same prompt each time, or re-uploading previously created sounds as references. Here, it is handled simply with the expression "Now make it ...".
This is really cool! It means doing generative AI work through an actual conversation.
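To make the idea of a context-aware follow-up prompt concrete, here is a minimal Python sketch. Fugatto has no public API, so everything in it is a hypothetical stand-in: `Clip`, `fake_model`, `refers_to_previous`, and `AudioSession` are invented for illustration, and the word-matching rule is only a toy heuristic, not how Fugatto resolves context. It shows a session that remembers the last generated clip and passes it along as a reference when a prompt points back at it.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clip:
    """Stand-in for generated audio; a real system would hold waveform data."""
    description: str


def fake_model(prompt: str, reference: Optional[Clip]) -> Clip:
    """Hypothetical placeholder model: it only records what it was asked to do."""
    if reference is None:
        return Clip(f"generated from: {prompt}")
    return Clip(f"{reference.description} -> edited by: {prompt}")


def refers_to_previous(prompt: str) -> bool:
    """Toy heuristic (an assumption): 'this', 'it', or 'that' point at the last clip."""
    words = re.findall(r"[a-z']+", prompt.lower())
    return any(word in {"this", "it", "that"} for word in words)


@dataclass
class AudioSession:
    """Keeps every generated clip so follow-up prompts can build on the latest one."""
    history: List[Clip] = field(default_factory=list)

    def prompt(self, text: str) -> Clip:
        reference = None
        if self.history and refers_to_previous(text):
            # Reuse the most recent output instead of asking the user to re-upload it.
            reference = self.history[-1]
        clip = fake_model(text, reference)
        self.history.append(clip)
        return clip


session = AudioSession()
session.prompt('In a calm voice, with an American accent say: "Kids are talking by the door."')
session.prompt("Turn this calm voice into an angry voice.")
final = session.prompt("Now make it happy.")
print(final.description)
```

The point of the design is that the session, not the user, carries the reference forward, so each follow-up prompt can stay as short as "Now make it happy."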

05. Add Instruments to Music

Adding new instruments to an existing track is a very convenient way to compose music. You can complete a piece by adding the instruments you need, one at a time, to the track you have created.
Prompt: Add drums to this Synthesizer track. (on Techno Music)
In Suno, you can complete a song in one go by entering lyrics and style prompts.
In Fugatto, I can compose by layering the instruments I want in the order I want. This is also how music is actually composed. It's really cool.
But in practice, adding instruments one at a time can be cumbersome, and the result may not be what you intended. The added instrument may not be the one you wanted, or it may be played in a different style. The generated part may be too short or too long. The pitch may be off, or the volume may not match. You then need to write a prompt that controls the result more precisely, and that may not be easy.
For that reason, I think the feature will be most effective when composers add special sounds to their own original tracks.

06. Melody Conversion to Voice

You can turn a melody played on an instrument into human singing. Just input a basic melody and Fugatto will sing it for you, in styles ranging from operatic to pop-rock scat singing.
Prompt 1: Turn this MIDI melody into a female voice, operatic scat singing style.
Prompt 2: Turn this MIDI melody into a female voice, pop rock scat singing style.
The sample only uses 'a female voice', but with a little more development it would probably be possible to use a specific voice as a reference. Then I could have a melody I wrote sung in Ariana Grande's voice. (Of course, copyright must be respected.)

07. Unique Music Creation

Finally, you can use an instrument with a distinctive tone as a sound source, or even turn the sound of a dog barking into one.
Prompt 1: Create an upbeat soundtrack with tabla, melody is uplifting and played on the saxophone.
Prompt 2: Create a saxophone howling, barking then electronic music with dogs barking.
If a dog's bark works, it should also be possible to make music with the sound of a refrigerator, an electric fan, or nature sounds such as a stream or the sea.

Looking back at the 7 features

These are the 7 features of Fugatto revealed so far. It seems that almost every sound-related feature we can imagine has been implemented: all kinds of sounds can be created, transformed, replaced, and mixed.
However, we will have to wait a little longer to see whether all the features work easily and at high quality, whether they can be controlled as finely as desired, what the UI/UX is like, and whether there are any cost issues.

Web interface

Looking at the Fugatto web interface that was recently shown, it is very simple: [Sound Reference] at the top, [Prompt Input Window] in the middle, and [Sound Output] at the bottom. Judging by the icons, the only additional functions are [Speed Control], [Trim], and [Microphone Recording].
They say simple is best, but it does not yet seem to have a friendly GUI menu for detailed control and convenience.
[Fugatto's web interface, as published by NVIDIA]
Still, given its wide range of features and fast creation speed, it seems like it'll be the best tool out there for the time being.

Sounds that inspire imagination

Here’s what the NVIDIA team has to say:
Unlike most models, which can only recreate the training data they've been exposed to, Fugatto allows users to create soundscapes it's never seen before, such as a thunderstorm easing into a dawn with the sound of birds singing.
'A thunderstorm easing into a dawn with the sound of birds singing': the phrase alone calls up a wonderful sound. What would it sound like? Might the wind and the trees be heard as well?
[Image generated with Midjourney V6.1: thunderstorm easing into a dawn with the sound of birds singing]

Expected applications

Some potential applications where Fugatto could be useful include:
Music: Prototype or edit songs by experimenting with different styles, voices, and instruments.
Advertising: Tailor campaigns to different regions with different accents and emotions.
Language learning: Personalize learning tools with familiar voices, such as that of a family member.
Video games: Dynamically adjust pre-recorded audio assets based on gameplay, or generate new sounds in real time.

Image & Video & Sound Prompts

I have been studying AI image and video prompts since the early days. Even so, the natural language prompts shown for Fugatto were really impressive, because the model understood both the imagery of a sound and a description of its overall mood in natural language, just as image generators do.
Also, being able to draw on earlier context and earlier references in a natural conversation, as with a GPT chatbot, was really amazing. Among image and video creation tools, I believe the Boards feature in the recently updated Luma Dream Machine offers a similar context-aware, conversational creation canvas. It is an uncommon feature even in the advanced image AI field.
I'm really excited to see these advanced features coming to sound AI.

When can I use it?

Fugatto represents a significant advancement in audio generation, but NVIDIA has not announced any immediate plans to release it publicly due to concerns about potential misuse. Let’s wait a little longer.
Instead, the Fugatto team is said to be working with One Take Audio, a member of the NVIDIA Inception program. We don't have much information on One Take Audio, but it appears to offer audio software for Mac and PC. You can sign up for the waiting list at the link below, so let's wait together for now.
Thank you for reading this long article. I will be back with another good one.
Mintbear 🍀🧸

Reference

Fugatto announced by NVIDIA on 2024.11.25