This episode covers a technique where you can provide a short sample of your voice and have an AI reconstruct it with ease.
Dear Fellow Scholars, this is Two Minute Papers with this guy’s name that is impossible to pronounce.
My name is Dr. Károly Zsolnai-Fehér, and indeed, it seems that pronouncing my name
requires some advanced technology.
So what was this?
I promise to tell you in a moment, but to understand what happened here, first, let’s
have a look at this deepfake technique we showcased a few videos ago.
As you see, we are at the point where our mouth, head, and eye movements are also realistically
translated to a chosen target subject, and perhaps the most remarkable part of this work
was that we don’t even need a video of this target person, just one photograph.
However, these deepfake techniques mainly help us in transferring video content.
So what about voice synthesis?
Is it also as advanced as this technique we’re looking at?
Well, let’s have a look at an example, and you can decide for yourself.
This is a recent work that goes by the name Tacotron 2, and it performs AI-based voice cloning.
All this technique requires is a 5-second sound sample of us, and it is able to synthesize
new sentences in our voice, as if we had uttered these words ourselves.
Let’s listen to a couple of examples.
Wow, these are truly incredible.
The timbre of the voice is very similar, and it is able to synthesize sounds and consonants
that have to be inferred because they were not heard in the original voice sample.
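If you are curious how such voice cloning works under the hood, here is a minimal sketch of the core idea: a speaker encoder distills the short reference sample into a fixed-size embedding, which then conditions the synthesizer that generates the new sentence. Note that the module names and sizes below are hypothetical stand-ins, not the actual Tacotron 2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Distills a short reference recording (as mel frames) into one
    fixed-size speaker embedding."""
    def __init__(self, n_mels=40, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, embed_dim, batch_first=True)

    def forward(self, mel_frames):             # (batch, time, n_mels)
        _, (hidden, _) = self.lstm(mel_frames)
        # L2-normalize: only the direction of the embedding matters
        return F.normalize(hidden[-1], dim=-1)  # (batch, embed_dim)

class ConditionedSynthesizer(nn.Module):
    """Toy stand-in for a Tacotron 2-style decoder: the speaker embedding
    is appended to every encoded text frame before decoding to mels."""
    def __init__(self, text_dim=128, embed_dim=256, n_mels=80):
        super().__init__()
        self.decoder = nn.GRU(text_dim + embed_dim, 512, batch_first=True)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, text_encoding, speaker_embed):
        steps = text_encoding.size(1)
        cond = speaker_embed.unsqueeze(1).expand(-1, steps, -1)
        out, _ = self.decoder(torch.cat([text_encoding, cond], dim=-1))
        return self.to_mel(out)  # mel spectrogram; a vocoder makes it audio

# Usage with random stand-in tensors:
ref = torch.randn(1, 500, 40)    # ~5 seconds of reference audio, as mels
text = torch.randn(1, 60, 128)   # encoder output for the new sentence
mels = ConditionedSynthesizer()(text, SpeakerEncoder()(ref))
```

The key design choice is that the speaker embedding is computed once from the reference sample and reused for every sentence, which is why a mere 5 seconds of audio is enough.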
And now, let’s jump to the next level, and use a new technique that takes a sound sample
and animates the video footage as if the target subject said it themselves.
This technique is called Neural Voice Puppetry, and even though the voices here are synthesized
by this previous Tacotron 2 method that you heard a moment ago, we shouldn’t judge this
technique by its audio quality, but by how well the video follows these given sounds.
If you decide to stay until the end of this video, there will be another fun video sample
waiting for you there.
Now, note that this is not the first technique to achieve results like this, so I can’t
wait to look under the hood and see what’s new here.
After processing the incoming audio, the gestures are applied to an intermediate 3D model, which
is specific to each person since each speaker has their own way of expressing themselves.
You can see this intermediate 3D model here, but we are not done yet: we feed it through
a neural renderer, which applies this motion to the particular face model shown
in the video.
You can imagine the intermediate 3D model as a crude mask that models the gestures well
but does not look like the face of anyone, while the neural renderer adapts this mask
to our target subject.
This includes adapting it to the current resolution, lighting, face position and more, all of which
is specific to what is seen in the video.
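For the technically inclined, here is a schematic of this two-stage pipeline in code. This is a hedged sketch with hypothetical module names and dimensions, not the authors’ implementation: the first network maps audio features to expression coefficients for the intermediate 3D model, and the second stands in for the neural renderer that adapts the rendered mask to the target video.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Stage 1 (sketch): map per-frame audio features to expression
    coefficients that drive the intermediate 3D face model."""
    def __init__(self, audio_dim=29, n_expr=76):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(),
            nn.Linear(128, n_expr))

    def forward(self, audio_feats):        # (frames, audio_dim)
        return self.net(audio_feats)       # (frames, n_expr)

class NeuralRenderer(nn.Module):
    """Stage 2 (sketch): refine a rasterized rendering of the crude mask
    into pixels that match the target video's face, lighting and pose."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, rasterized):         # (batch, 3, H, W)
        return self.refine(rasterized)

# Between the two stages, a rasterizer would turn the expression
# coefficients into the rendered mask image that stage 2 refines.
coeffs = AudioToExpression()(torch.randn(24, 29))        # 24 video frames
frame = NeuralRenderer()(torch.randn(1, 3, 256, 256))    # one refined frame
```

Because the heavy audio analysis happens per frame and the renderer is a small image-to-image network, it is plausible that the rendering stage can keep up with real-time playback, as noted below.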
What is even cooler is that this neural rendering part runs in real time.
So, what do we get from all this?
Well, for one, superior quality, but at the same time, it also generalizes to multiple targets.
Have a look here!
And the list of great news is not over yet: you can try it yourself; the link is available
in the video description.
Make sure to leave a comment with your results!
To sum up, by combining multiple existing techniques, we can now perform joint video and
audio synthesis for a target subject, and it is important that everyone knows that this is possible.
This episode has been supported by Weights & Biases.
Here, they show you how to use their tool to perform face swapping and improve the model
that performs it.
Weights & Biases provides tools to track your experiments in your deep learning projects.
Their system is designed to save you a ton of time and money, and it is actively used
in projects at prestigious labs, such as OpenAI, Toyota Research, GitHub, and more.
And, the best part is that if you are an academic or have an open source project, you can use
their tools for free.
It really is as good as it gets.
Make sure to visit them through wandb.com/papers or just click the link in the video description
and you can get a free demo today.
Our thanks to Weights & Biases for their long-standing support and for helping us make better videos for you.
Thanks for watching and for your generous support, and I’ll see you next time!