High-powered computer chip maker Nvidia on Monday unveiled a new AI model developed by its researchers that can generate or transform any mix of music, voices, and sounds described in prompts made up of any combination of text and audio files.

The new AI model, called Fugatto — short for Foundational Generative Audio Transformer Opus 1 — can create a music snippet based on a text prompt, add or remove instruments in an existing song, change the accent or emotion in a voice, and even produce sounds never heard before.

According to Nvidia, by supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model that showcases emergent properties — capabilities that arise from the interaction of its various trained abilities — and the ability to combine free-form instructions.

“We wanted to create a model that understands and generates sound like humans do,” Rafael Valle, a manager of applied audio research at Nvidia, said in a statement.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” he added.

Nvidia noted the model is capable of handling tasks it was not pretrained on, as well as generating sounds that change over time, such as the Doppler effect of thunder as a rainstorm passes through an area.

The company added that unlike most models, which can only recreate the training data they’ve been exposed to, Fugatto allows users to create soundscapes it’s never seen before, such as a thunderstorm easing into dawn with the sound of birds singing.

Breakthrough AI Model for Audio Transformation

“Nvidia’s introduction of Fugatto marks a significant advancement in AI-driven audio technology,” observed Kaveh Vahdat, founder and president of RiseOpp, a national CMO services company based in San Francisco.

“Unlike existing models that specialize in specific tasks — such as music composition, voice synthesis, or sound effect generation — Fugatto offers a unified framework capable of handling a diverse array of audio-related functions,” he told TechNewsWorld. “This versatility positions it as a comprehensive tool for audio synthesis and transformation.”

Vahdat explained that Fugatto distinguishes itself through its ability to generate and transform audio based on both text instructions and optional audio inputs. “This dual-input approach enables users to create complex audio outputs that seamlessly blend various elements, such as combining a saxophone’s melody with the timbre of a meowing cat,” he said.
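Fugatto has no publicly released API, so the following is only a minimal sketch of what a dual-input conditioning interface of this kind could look like; every name in it — the `FugattoLikeModel` class, `AudioPrompt`, and `generate` — is a hypothetical placeholder, not an Nvidia call.

```python
# Hypothetical sketch only: Fugatto has no released API, so all names here
# (FugattoLikeModel, AudioPrompt, generate) are illustrative placeholders.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class AudioPrompt:
    """Conditioning bundle: free-form text plus an optional reference waveform."""
    text: str                                     # e.g. "a saxophone melody with the timbre of a meowing cat"
    reference_audio: Optional[np.ndarray] = None  # mono waveform, float32 in [-1, 1]
    sample_rate: int = 48_000


class FugattoLikeModel:
    """Stand-in for a generator conditioned on both text and audio inputs."""

    def generate(self, prompt: AudioPrompt, duration_s: float = 5.0) -> np.ndarray:
        # A real model would condition on both inputs; here we simply return
        # silence of the requested length so the sketch runs end to end.
        n_samples = int(duration_s * prompt.sample_rate)
        return np.zeros(n_samples, dtype=np.float32)


if __name__ == "__main__":
    model = FugattoLikeModel()
    prompt = AudioPrompt(text="a saxophone melody with the timbre of a meowing cat")
    out = model.generate(prompt, duration_s=2.0)
    print(out.shape)  # (96000,) at 48 kHz
```

The point of the dual-input design is that the optional reference waveform anchors attributes such as timbre or rhythm, while the text steers what should change.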

Additionally, he continued, Fugatto’s capacity to interpolate between instructions allows for nuanced control over attributes like accent and emotion in voice synthesis, offering a level of customization not commonly found in current AI audio tools.
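Nvidia has not published the mechanism behind this interpolation, but a common way to realize it is to blend the embeddings of two instructions with a weight that acts as a dial. The sketch below illustrates that idea under that assumption; `embed_instruction` is a toy stand-in, not a real encoder.

```python
# Hypothetical sketch: one common way to interpolate between two instructions is
# to blend their conditioning embeddings; Fugatto's actual mechanism is not public.
import numpy as np


def embed_instruction(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for a text encoder; a real system would use a learned model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)


def interpolate_instructions(text_a: str, text_b: str, weight: float) -> np.ndarray:
    """Linear blend of two instruction embeddings: weight=0 is all A, weight=1 is all B."""
    a, b = embed_instruction(text_a), embed_instruction(text_b)
    return (1.0 - weight) * a + weight * b


if __name__ == "__main__":
    # Sweep the dial from a calm delivery toward an angry one.
    for w in (0.0, 0.5, 1.0):
        cond = interpolate_instructions("calm, neutral accent", "angry, clipped accent", w)
        print(w, np.round(cond[:3], 2))
```

Intermediate weights would, in principle, yield voices partway between the two described styles, which is the kind of fine-grained control Vahdat is describing.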

“Fugatto is an extraordinary step towards AI that can handle multiple modalities simultaneously,” added Benjamin Lee, a professor of engineering at the University of Pennsylvania.

“Using both text and audio inputs together may produce far more efficient or effective models than using text alone,” he told TechNewsWorld. “The technology is interesting because, looking beyond text alone, it broadens the volumes of training data and the capabilities of generative AI models.”


Nvidia at Its Best

Mark N. Vena, president and principal analyst at SmartTech Research in Las Vegas, asserted that Fugatto represents Nvidia at its best.

“The technology introduces advanced capabilities in AI audio processing by enabling the transformation of existing audio into entirely new forms,” he told TechNewsWorld. “This includes converting a piano melody into a human vocal line or altering the accent and emotional tone of spoken words, offering unprecedented flexibility in audio manipulation.”

“Unlike existing AI audio tools, Fugatto can generate novel sounds from text descriptions, such as making a trumpet sound like a barking dog,” he said. “These features provide creators in music, film, and gaming with innovative tools for sound design and audio editing.”

Fugatto deals with audio holistically — spanning sound effects, music, voice, virtually any type of audio, including sounds that have not been heard before — and precisely, added Ross Rubin, the principal analyst with Reticle Research, a consumer technology advisory firm in New York City.

He cited the example of Suno, a service that uses AI to generate songs. “They just released a new version that has improvements in how generated human voices sound and other things, but it doesn’t allow the kinds of precise, creative changes that Fugatto allows, such as adding new instruments to a mix, changing moods from happy to sad, or moving a song from a minor key to a major key,” he told TechNewsWorld.

“Its understanding of the world of audio and the flexibility that it offers goes beyond the task-specific engines that we’ve seen for things like generating a human voice or generating a song,” he said.

Opens Door for Creatives

Vahdat pointed out that Fugatto can be useful in both advertising and language learning. Agencies can create customized audio content that aligns with brand identities, including voiceovers with specific accents or emotional tones, he noted.

At the same time, in language learning, educational platforms will be able to develop personalized audio materials, such as dialogues in various accents or emotional contexts, to aid in language acquisition.

“Fugatto technology opens doors to a wide array of applications in creative industries,” Vena maintained. “Filmmakers and game developers can use it to create unique soundscapes, such as turning everyday sounds into fantastical or immersive effects,” he said. “It also holds potential for personalized audio experiences in virtual reality, assistive technologies, and education, tailoring sounds to specific emotional tones or user preferences.”

“In music production,” he added, “it can transform instruments or vocal styles to explore innovative compositions.”

Further development may be needed to get better musical results, however. “All these results are trivial, and some have been around for longer — and better,” observed Dennis Bathory-Kitsz, a musician and composer in Northfield Falls, Vt.

“The voice isolation was clumsy and unmusical,” he told TechNewsWorld. “The additional instruments were also trivial, and most of the transformations were colorless. The only advantage is that it requires no particular learning, so the development of musicality for the AI user will be minimal.”

“It may usher in some new uses — real musicians are wonderfully inventive already — but unless the developers have better musical chops to begin with, the results will be dreary,” he said. “They will be musical slop to join the visual and verbal slop from AI.”

AGI Stand-In

With artificial general intelligence (AGI) still very much in the future, Fugatto may serve as a model for simulating AGI, which ultimately aims to replicate or surpass human cognitive abilities across a wide range of tasks.

“Fugatto is part of a solution that uses generative AI in a collaborative bundle with other AI tools to create an AGI-like solution,” explained Rob Enderle, president and principal analyst at the Enderle Group, an advisory services firm in Bend, Ore.

“Until we get AGI working,” he told TechNewsWorld, “this approach will be the dominant way to create more complete AI projects with far higher quality and interest.”
