How Google’s MusicLM Paves New Paths for Creativity (Guest Column)
From ChatGPT writing code for software engineers to Bing’s search engine standing in for your bi-weekly Hinge binge, we’ve become obsessed with the capacity of artificial intelligence to replace us.
Within creative industries, this fixation manifests in generative AI. With models like DALL-E generating images from text prompts, the popularity of generative AI challenges how we understand the integrity of the creative process: When generative models are capable of materializing ideas, if not generating their own, where does that leave artists?
Google’s new text-based music generative AI, MusicLM, offers an interesting answer to this viral Terminator-meets-Ex Machina narrative. As a model that produces “high-fidelity music from text descriptions,” MusicLM embraces the moments lost in translation, encouraging creative exploration. It sets itself apart from other music generation models like Jukedeck and MuseNet by inviting users to verbalize their original ideas rather than tinker with existing music samples.
Describing how you feel is hard
AI in music is not new. But from recommending songs for Spotify’s Discover Weekly playlists to composing royalty-free music with Jukedeck, applications of AI in music have evaded the long-standing challenge of directly mapping words to music.
This is because, as a form of expression in its own right, music resonates differently with each listener. Just as different languages struggle to perfectly communicate the nuances of their respective cultures, it is difficult (if not impossible) to exhaustively capture all dimensions of music in words.
MusicLM takes on this challenge by generating audio clips from descriptions like “a calming violin melody backed by a distorted guitar riff,” even accounting for less tangible inputs like “hypnotic and trance-like.” It approaches the thorny question of music categorization with a refreshing sense of self-awareness. Rather than focusing on lofty notions of style, MusicLM grounds itself in more tangible attributes of music with tags such as “snappy” or “amateurish.” It broadly considers where an audio clip may come from (e.g., “YouTube tutorial”) and the general emotional responses it may conjure (e.g., “madly in love”), all while integrating more widely accepted concepts of genre and compositional technique.
What you expect is (not) what you get
Piling onto this theoretical question of music classification is a more practical shortage of training data. Unlike the text-image pairs that fueled its creative counterparts (e.g., DALL-E), text-to-audio captions are not readily available in abundance.
MusicLM was trained with “MusicCaps,” a library of 5,521 music samples captioned by musicians. Bound by the very human limitation of capacity and the almost-philosophical matter of style, MusicCaps offers finite granularity in its semantic interpretation of musical characteristics. The result is occasional gaps between user inputs and generated outputs: the “happy, energetic” tune you asked for may not turn out as you expect.
However, when asked about this discrepancy, MusicLM researcher Chris Donahue and research software engineer Andrea Agostinelli celebrate the human element of the model. They describe primary applications such as “[exploring] ideas more efficiently [or overcoming] writer’s block,” and are quick to note that MusicLM offers multiple interpretations of the same prompt: if one generated track fails to meet your expectations, another might.
“This [disconnect] is a big research direction for us; there isn’t a single answer,” Andrea admits. Chris attributes the disconnect to the “abstract relationship between music and text,” insisting that “how we react to music is [even more] loosely defined.”
In a way, by fostering an exchange that welcomes moments lost in translation, MusicLM’s language-based structure positions the model as a sounding board: as you prompt it with a vague idea, the approximations it generates help you figure out what you actually want to make.
Beauty is in breaking things
With their experience producing Chain Tripping (2019), a Grammy-nominated album made entirely with MusicVAE (another music generative AI developed by Google), the band YACHT weighs in on MusicLM’s future in music production. “As long as it can be broken apart a little bit and tinkered with, I think there’s great potential,” says frontwoman Claire L. Evans.
To YACHT, generative AI exists as a means to an end rather than the end in itself. “You never make exactly what you set out to make,” says founding member Jona Bechtolt, describing the mechanics of a studio session. “It’s because there’s this imperfect conduit that is you,” Claire adds, attributing the alluring and evocative process of producing music to the serendipitous disconnect that occurs when artists put pen to paper.
The band describes how the misalignment of user inputs and generated work inspires creativity through iteration. “There is a discursive quality to [MusicLM]… it’s giving you feedback… I think it’s the surreal feeling of seeing something in the mirror, like a funhouse mirror,” says Claire. “A computer accent,” band member Rob Kieswetter jokes, referencing a documentary about the band’s experience making Chain Tripping.
However, in discussing the implications of this move to text-to-audio generation, Claire cautions against the rise of taxonomization in music: “imperfect semantic elements are great, it’s the precise ones that we should worry about… [labels] create boundaries to discovery and creation that don’t need to exist… everyone’s conditioned to think about music as this salad of hyper-specific genre references [that can be used] to conjure a new song.”
Nonetheless, both YACHT and the MusicLM team agree that MusicLM, as it currently stands, holds promise. “Either way there’s going to be a whole new slew of artists fine-tuning this tool to their needs,” Rob contends.
Andrea recalls instances where creative tools found popularity beyond their intended purposes: “the synthesizer eventually opened up a huge wave of new genres and ways of expression. [It unlocked] new ways to express music, even for people who are not ‘musicians.’” “Historically, it has been pretty difficult to predict how each piece of music technology will play out,” Chris concludes.
Happy accidents, reinvention, and self-discovery
Back to the stubborn, unforgiving question: Will generative AI replace musicians? Perhaps not.
The relationship between artists and AI is not a linear one. While it’s appealing to imagine an intricate and carefully intentional system of collaboration between artists and AI, right now the process of using AI to produce art resembles a friendly game of trial and error.
In music, AI gives us room to explore the latent spaces between what we describe and what we really mean. It materializes ideas in a way that helps shape creative direction. By outlining these acute moments lost in translation, tools like MusicLM set us up to produce what actually ends up making it to the stage… or your Discover Weekly.
Tiffany Ng is an art & tech writer based in NYC. Her work has been published in i-D Vice, Vogue, South China Morning Post, and Highsnobiety.