You may well be aware of Stable Diffusion, the much-discussed open-source AI model that can generate images from text. Well, as a “hobby project”, a couple of developers - Seth Forsgren and Hayk Martiros - have now created Riffusion, which uses the same model to turn text into music.
Riffusion works by generating images of spectrograms, which are then converted into audio clips. We’re told that it can generate infinite variations of a text prompt by varying the ‘seed’.
Riffusion’s creators explain that a spectrogram can be computed from audio using what’s known as the short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves of varying amplitudes and phases.
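As a rough illustration of that forward step, here’s how a magnitude spectrogram can be computed from audio with SciPy’s STFT. The sample rate, window size, and 440 Hz test tone are illustrative choices for this sketch, not Riffusion’s actual settings:

```python
import numpy as np
from scipy.signal import stft

# An illustrative input: a 440 Hz sine wave, 1 second at 22.05 kHz
sr = 22050
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Short-time Fourier transform: complex values over frequency bins x time frames
freqs, times, Z = stft(audio, fs=sr, nperseg=1024)

# A spectrogram image keeps only the magnitude; the phase is discarded
magnitude = np.abs(Z)
print(magnitude.shape)  # (frequency bins, time frames)
```

Plotting `magnitude` (usually on a log scale) gives the familiar spectrogram picture: time on one axis, frequency on the other, brightness showing how loud each sine component is.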
However, in the case of Riffusion, the STFT is inverted so that the audio can be reconstructed from a spectrogram. Here, the images from the AI model only contain the amplitude of the sine waves and not the phases - these are approximated by something called the Griffin-Lim algorithm when reconstructing the audio clip.
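Going the other way, the idea behind Griffin-Lim can be sketched in a few lines with SciPy: start from a random phase guess and repeatedly invert and re-analyse the signal, keeping the magnitude fixed and updating only the phase. The window size and iteration count below are illustrative assumptions, not Riffusion’s actual configuration:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, sr, nperseg=1024, n_iter=32):
    """Recover audio from a magnitude-only spectrogram by iteratively
    estimating the missing phase (a basic Griffin-Lim sketch).
    Assumes `magnitude` was produced with the same STFT settings."""
    rng = np.random.default_rng(0)
    # Start from a random phase guess
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current complex spectrogram to audio...
        _, audio = istft(magnitude * angles, fs=sr, nperseg=nperseg)
        # ...then re-analyse it and keep only the new phase estimate
        _, _, Z = stft(audio, fs=sr, nperseg=nperseg)
        angles = np.exp(1j * np.angle(Z))
    _, audio = istft(magnitude * angles, fs=sr, nperseg=nperseg)
    return audio
```

Each pass nudges the phase towards something consistent with the fixed magnitudes, which is why the reconstruction improves with more iterations.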
As well as short loops, Riffusion is also capable of creating longer jams, which are based on subtle variations of one image.
The web app enables you to type in prompts and will keep on generating interpolated content in real time for as long as you let it, while giving you a 3D visualisation of the spectrogram. You can also skip immediately to the next prompt; if there isn’t one, Riffusion will interpolate between different seeds of the same prompt.
We can’t pretend to understand exactly how it all works but Riffusion is impressive and terrifying in equal measure. This kind of technology is in its infancy but it’s not hard to imagine how capable it will become in the future.
See - and hear - for yourself on the Riffusion website.