Do Robots Dream of Podcasting?

Last year(!), my colleagues published a beautiful pair of articles with thoughts and predictions about podcasting in 2022. They collected a great number of super-smart folks and co-created readings rife with interesting insights into what to expect.

This got me thinking about how far some podcast tech might move things forward in the near future.

Imagine for a moment that you’re listening to your favourite podcast. It’s a heartfelt moment, the guest is going deep, the music is emotional. It fades to silence. Then, a host of a different podcast slides in with an ad. The music and mood (and LEVELS!) are in line with what you were just listening to. The new host mentions a local-to-you landmark and references your current weather to make you feel special. They set up a story and enwrap you in a short narrative. It’s delivered at a high enough resolution for you to be affected by the intense 3D audio that surrounds you, with sound elements seeming to come from every direction.

Did you realize that the second host was merely a cloned voice? One able to address you personally with locally engaging references and using geodata from your phone? And did you notice the segments were mixed in real-time by an algorithm—creating a dynamic and immersive soundscape on the fly? It’s designed to feel like a tangent from your podcast listening experience instead of an interruption.

Did you also know that for the next week, the brand behind the immersive ads will use another robot to “listen to” every piece of audio in the podiverse to figure out which podcast neighbourhoods to focus their next campaign on, targeting episodes with adjacent subject matter? They will see an uptick in new subscribers and engagement for the show they are promoting.

Is any of this possible or am I a romantic futurist holding a cloudy crystal ball?

The truth is, we have all of this technology now. Robots can narrate a script, mix and restore audio, insert ads on the fly, and get smarter about where their listeners are located while doing it. Whether it all gets implemented the way I describe is up to us, really.

Let’s break this down with the help of some smart people…

ROBOT HOSTS

Lately, I’ve been tinkering with Descript (an “all-in-one audio & video editing” platform that is a Swiss Army knife of podcast production workflows) to clone the voices of a few of our hosts.

We have some fancy folk who host some of our shows, and they don’t have a ton of time in their busy calendars. By using AI versions of their voices (with their consent, of course), we’ve been able to create usable narration to try out different scripting options for things like table reads. That’s an old term for when we used to sit around a table and read a script together to see how it flows, before recording. It’s pretty amazing to be able to make script tweaks, client changes, and refinements using a cloned voice until everything is approved. At that point we only need to do one session with the human host, saving them a ton of time.

The robots aren’t necessarily ready to completely take over hosting duties yet. I long to be able to refine and tune the robot performance like an instrument. That isn’t far off, though, and I imagine we’ll be sneaking in robot performances of pickups in relatively short order.

I chatted with Descript’s Head of Business & Corporate Development, Jay LeBoeuf, to see how close we are to the future I imagine. Turns out, we’re very close.

“AI voice clones are going to become much, much more widely used in 2022 than they were before. But specifically, they’re going to be used for editorial mistakes, for pickups, and even for ad reads and trailers.”

Jay let me know that Pushkin Industries is already using an audio AI version of Malcolm Gladwell (host of Revisionist History) to voice table reads. “It doesn’t make it to the show yet, but they have a really high-quality version of his voice because he’s already done all these audiobooks and all these podcasts.”

Using the vast catalogue of professionally recorded audio they already have of Malcolm Gladwell, sound designers can use machine learning to train an artificially intelligent text-to-speech voice clone.

For the producers, this means that they can write the script and the narration based on what his voice sounds like, in his cadence. And, as Jay points out, this saves a lot of time: “what’s left for him at the end is to go in and actually record it in a way that only he can.”

Getting personal… about ads

Robots are changing the sound of ads too. Using dynamic ad insertion, companies can generate audio customized to your listening experience based on who you are and where you are located. Amazingly, Jay tells me that there are already examples in the UK where AI voice clones are “reading” dynamically scripted ad reads that reference current weather, “which in the UK is going to be rain,” and storefront locations specific to the listener’s real-time location. Depending on the information you are able to collect about your listener, the ad read can even be adjusted for dialect.

“You could make 20, 30, 40 versions of it, with, you know, different mixes, different flavours, and formats for different regions.”
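For the curious, here is a rough sketch of what the scripting side of a dynamically generated ad read might look like before the text is handed to a cloned voice. Everything in it is an assumption made for illustration: the `ListenerContext` fields, the template wording, and the `render_ad_script` helper are hypothetical, not how any particular ad platform works.

```python
from dataclasses import dataclass
from string import Template


# Hypothetical listener details; a real ad server decides what data it may use.
@dataclass
class ListenerContext:
    city: str
    weather: str
    nearest_store: str


# Illustrative ad copy with placeholders for the listener-specific parts.
AD_TEMPLATE = Template(
    "Hey $city! Looks like $weather out there today. "
    "Pop into our shop at $nearest_store to warm up."
)


def render_ad_script(ctx: ListenerContext) -> str:
    """Fill the ad template with details about one listener."""
    return AD_TEMPLATE.substitute(
        city=ctx.city,
        weather=ctx.weather,
        nearest_store=ctx.nearest_store,
    )


if __name__ == "__main__":
    listener = ListenerContext("London", "rain", "12 High Street")
    print(render_ad_script(listener))
```

Swap in a different template per region and you get the “20, 30, 40 versions” Jay describes, each of which would then be voiced by text-to-speech.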

Here is what a Robot Host could sound like, using Descript. In the second example, the cloned voice is only used for pickups. Can you tell which words are voiced by a human and which are voiced by a clone?

Shawn Cole’s Cloned Voice, made using Descript

Shawn Cole’s Human Voice (With the Clone replacing a few words as pickups)

¿Habla español? Maso

Descript isn’t the only company pushing the boundaries of voice cloning technology. Veritone is doing incredible work that is already having an impact in the podcasting space. Their marvel.ai tool allows a user to train a synthetic version of themselves, with limitless possibilities. For example, super podcast ad-tech expert Bryan Barletta of Sounds Profitable showed me how he used the platform to create a Spanish-speaking version of himself (listen for yourself on this page), unlocking an extra 7% of the world’s population as potential listeners. Bryan hopes this will grow from a purely translated version of his podcast to a truly localized version involving people from those communities: “Ultimately, if enough interest is generated in the Spanish language version of the newsletter and the podcast, I would be foolish not to expand into localization instead of only translation, finding a peer who speaks Spanish, and collaborating with them to create unique content for Sounds Profitable.”

ROBOT SOUND MIXERS

Fixing and Mixing it in Post (It’s Time To Give the Robots the Lemons)

This is a tough one. I work with a great team of sound designers who make their living in the dark art of transforming unpolished audio into delightful ear candy. Tools are finally emerging that have the ability to eat away at a significant chunk of their workload. Will the robots take our jobs?

Auphonic is an incredible tool that acts as an “Automatic audio post production web service for podcasts.” And I have to admit, it’s pretty amazing. It uses AI to analyze stereo or multitrack audio and then automagically mixes it. It can handle noise and hum reduction, overall levelling, and it can even mix the music around the dialogue so that everything is heard.

Here’s a sample of how it works.

Auphonic sound samples:

This is a raw recording with no mixing.

The same recording, “mixed” by Auphonic.

The same recording, mixed by a Sound Designer.
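Mixing the music “around the dialogue” is commonly called ducking: the music gets turned down whenever someone is talking. The toy sketch below shows the bare idea on lists of sample values. It is not Auphonic's algorithm, and the `duck_music` name, the threshold, and the gain value are all assumptions. A real mixer would follow a smoothed loudness envelope with attack and release times rather than switching sample by sample.

```python
def duck_music(music, speech, threshold=0.05, duck_gain=0.3):
    """Turn the music down wherever the speech track is active.

    music, speech: equal-length lists of samples in the range -1.0..1.0.
    threshold: speech level above which we treat someone as talking.
    duck_gain: multiplier applied to the music while speech is active.
    """
    if len(music) != len(speech):
        raise ValueError("music and speech must be the same length")
    return [
        m * duck_gain if abs(s) > threshold else m
        for m, s in zip(music, speech)
    ]


if __name__ == "__main__":
    music = [1.0, 1.0, 1.0]
    speech = [0.0, 0.5, 0.0]  # someone speaks only in the middle sample
    print(duck_music(music, speech))
```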

Descript has even waded into the audio engineering job-stealing territory with their offering of “Studio Sound”. With this function, at the click of a button, a voice recorded with imperfect tools and lackluster quality is quickly transformed into sounding like it was recorded at a fancy studio. As Jay states, “We just have trained it on lots of great-sounding voices and so it’s going to remove everything that isn’t a good-sounding voice, whether it’s a dog barking in the background or my MacBook Pro fan spinning up.”

I was skeptical, but this is pretty unreal.

Descript Studio sound samples:

Raw recording, before “Studio Sound” applied.

The same recording with “Studio Sound” applied.

The same recording processed by a human hand (ear).

So, will we be out of a job as sound folx in the near future? Not at all. I’ve realized that if my team doesn’t have to struggle with the less glamorous task of removing reverb and dog barks, they will have more time to focus on the fun stuff! Like creating immersive soundscapes and finding the perfect music to propel our stories.

I reached out to Jocelyn Gonzales, the Director of PRX Productions at PRX, and she astutely pointed out that “the definitions of producer and engineer/editor/designer will get sketchier — or more blurred maybe?”

I love that idea. This should mean less division between the phases of production and more collaboration within the whole team.

Jay agrees, and he helps me feel better about job security when he reminds me that “the fun of sound design actually doesn’t come from cleaning up poor quality field recordings […] focus on storytelling by adding layers and subtracting layers and not having to worry about the cleanup.”

Cool! The robots can have that job. Now we can focus on immersive audio…

SPATIAL AUDIO

This one has been on the horizon for what seems like forever. Every year somebody makes a bold statement about using some form of spatial audio for podcasts. We have the technology. In fact, we’ve had it in some form since the 1970s.

Spatial audio is an effect that gives the listener the feeling that sound is coming at them from three dimensions, instead of just from the left, right or center. It can be achieved in a few different ways and sometimes involves a head-tracking component so that when the listener looks at a thing, the sound of that thing follows their movement.

The difficulty seems to be in getting everyone on board with the same technology. Dolby recently teamed up with Apple to unlock spatial audio for Apple Music. The two companies collaborated to ensure that newer devices are Atmos compatible, and Dolby was able to deliver a backlog of Atmos-mixed music tracks for Apple Music. That meant that when the tech was announced, customers already had a wide variety of music to listen to in seemingly 3D sound.

I chatted with some folks from Dolby, who told me that the same plan is in the works for podcasts. Tim Pryde, the Director of Dolby Music and Streaming Services, told me that “we are seeing interest from a wide range of podcast creators in a multitude of genres who believe there is an opportunity for Dolby Atmos to enhance their storytelling for listeners.” I know that some production houses are creating Atmos mixes of their shows and stockpiling these versions for release when all the tech is compatible.

But therein lies the problem. With Apple Music, they just had to make sure Apple products could reproduce Atmos-mixed songs. For podcasting, every link in the chain, from distribution platforms to podcast players to every possible listening device, needs to be Atmos-certified before the feature will be rolled out. This could mean we’re a few years out.

The thing is, we don’t NEED Atmos to deliver a spatial/3D/immersive audio experience. It’s just one option.

Binaural audio is the ’70s shag carpet version of spatial audio. It’s been around since ABBA was fresh and new. The trouble is that it only translates “accurately” with headphones. But that doesn’t mean it sounds terrible without headphones. And really, if you’re into narrative audio storytelling I bet you aren’t listening at 2x speed through your laptop speakers.

I spoke with Mirko Vogel from Vaudeville, an international sound shop doing incredible design work in the immersive audio field. He loves Atmos for linear visual storytelling like movies. Ambisonics (the fanciest of the formats) is great for virtual worlds where sound has to follow the head movements of the audience (hello metaverse!). But he argues that binaural should just become the default for podcasts, since it excels at simple reproduction of spatial audio without head tracking or sitting in front of a screen. It’s cheap to implement, doesn’t require any extra fancy tech, and is perfectly suited to the medium. Aside from the fancy sound effects swirling around your head, spatial audio can make it so that you feel like you’re sitting in a room with the podcasters, instead of them just living between your ears. “What immersive audio does with voice is that it gives it a place to live.”
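Why does binaural lean so heavily on headphones? One of the cues it relies on is the tiny gap between a sound reaching one ear and then the other. As a back-of-the-envelope illustration, the sketch below estimates that gap with Woodworth's classic spherical-head approximation. The head radius and speed of sound are assumed typical values, the function name is made up for this example, and real binaural tools use measured head-related transfer functions rather than a single formula.

```python
import math


def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Estimate the arrival-time gap between the two ears, in seconds.

    Uses Woodworth's spherical-head approximation:
        ITD = (r / c) * (theta + sin(theta))
    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    The approximation is intended for angles between -90 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))


if __name__ == "__main__":
    for angle in (0, 30, 60, 90):
        microseconds = interaural_time_difference(angle) * 1_000_000
        print(f"{angle:>2} degrees: about {microseconds:.0f} microseconds")
```

Those delays are well under a millisecond, which is part of why the effect holds up best when each ear gets its own channel.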

Here’s a clip from Adobe’s Wireframe mixed in stereo and then remixed using free binaural post production tools:

Clip mixed in normal Stereo.

The same clip, mixed for Binaural (headphones make it even better!)

AI BI TOOL (They’re listening)

The last bit of fancy tech I outlined in our dream robot scenario is a new tool that could really help content marketers and brands focus their efforts in the podcast world like never before.

Michael Mörs is the CEO of Podmon, an AI-based podcast monitoring tool. If you ask his robot to be on the lookout for a topic or search term, it will deliver search results within minutes.

This is wild to me. If I were vain and wanted a heads up any time my name might be mentioned in a podcast, the PodMonBot would HMU with a direct link to the exact spot in the episode where it found that mention.

The business intelligence this could give brands is completely bonkers. For starters, it could help you focus your efforts on the right podcasts where you know relevant topics are being mentioned.

It scores those mentions by relevance and is expanding into sentiment analysis. So, if you’re a company, instead of a vain weirdo like me, this tool could let you track the overall vibes you’re getting around those mentions. Michael points out that “Podcasts, in particular, continue to be characterized by a sometimes very high level of depth of intensity when dealing with topics, which other media channels do not have.” A robot would certainly help brands keep track of how things are going and save a lot of time doing it.
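To make the “direct link to the exact spot” idea a bit more concrete, here is a deliberately simple sketch that scans a timestamped transcript for a search term and ranks the hits. It is not how Podmon works. The `find_mentions` function, the transcript format, and the count-based score are all assumptions, and a real tool would need speech-to-text up front and a far smarter notion of relevance.

```python
def find_mentions(segments, term):
    """Find transcript segments that mention a term.

    segments: list of (start_seconds, text) tuples from a transcript.
    term: the word or phrase to look for (case-insensitive).
    Returns a list of (start_seconds, count), most mentions first,
    with earlier timestamps winning ties.
    """
    needle = term.lower()
    hits = []
    for start, text in segments:
        count = text.lower().count(needle)
        if count:
            hits.append((start, count))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    transcript = [
        (0, "Welcome to the show"),
        (42, "Podcasting robots love podcasting"),
        (90, "More on podcasting later"),
    ]
    for start, count in find_mentions(transcript, "podcasting"):
        print(f"{count} mention(s) at {start} seconds")
```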

My final thoughts (for now)

All of this technology is currently in play in the podcasting world. And while it might not all be as sophisticated and seamless as it will no doubt become, it’s surely just a matter of time. I can’t tell you that my opening scenario will play out exactly like I dreamt, aside from the likelihood of rain in the UK, but I know that AI has entered the chat. It’s already being used in ways that we likely don’t realize. Personally, I feel the kind of excitement about future possibilities that the romance of a new year brings. And I’m very excited to see how our industry can push things forward to make listener experiences match the promises that we so often make to them. Give them a personalized listening experience, even within ads. Enwrap them in an immersive world that helps them lose themselves and forget about pining for pre-pandemic times for a spell. Temper their advertising auditory intrusions with 30 seconds that doesn’t make them reach frantically for the volume control.

I’m excited about what’s next. How about you?
