(00:03):
This is Five-Minute Friday on the new Imagen Video model.
(00:19):
In previous episodes of the SuperDataScience Podcast, such as #570, I’ve discussed DALL-E 2, a model made by the research outfit OpenAI that creates stunningly realistic and creative images based on whatever text input your heart desires.
(00:36):
For today’s Five-Minute Friday episode, it’s my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.
(00:45):
First, let’s talk about the clever name: While pronounced “imagine” to allude to the creativity of the model and the users who provide text prompts to it, the Imagen model name is a portmanteau of the words “image” and “generation” (spelt I-M-A-G-E-N), which is rather sensible given that the model generates images. Get it? Image Gen. The original Imagen model was released earlier this year and, like the better-known but perhaps not better-performing DALL-E 2, it generates still images. The new Imagen Video model takes this generative capacity into another dimension, the dimension of time, by generating short video clips of essentially whatever you prompt it to create.
(01:38):
For example, if you prompt Imagen Video to generate a video of “an astronaut riding a horse”, it will do precisely that. If you prompt Imagen Video to generate a video of “a happy elephant wearing a birthday hat walking under the sea”, well, then of course it will precisely create that video for you too! In the show notes, we’ve provided a link to a staggering 4×4 matrix of videos created by Imagen Video that I highly recommend you check out to get a sense of how impressive this model really is.
(02:10):
Under the hood, Imagen Video is the combination of three separate components, which I’ll go over here in succession. So this part of Five-Minute Friday is going to be pretty technical. The first component is something called a T5 text encoder. This is a transformer-based architecture that infers the meaning of the natural-language prompt you provide to it as an input. You can check out episode #559 of the SuperDataScience podcast to hear more about transformers, which have become the standard for state-of-the-art results in natural language processing and increasingly in machine vision too. Interestingly, this T5 encoder is frozen during training of the Imagen Video model, so the T5 model weights are left unchanged by training. T5’s natural language processing capabilities are thus used “out of the box” for Imagen Video’s purposes, which is pretty cool; it shows how powerful and flexible T5 is, like many of these transformer-based large language models. Ok, so that’s the first component, the T5 text encoder, which is used to understand the natural language prompt we provide to Imagen Video.
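For listeners who want to make that concrete: Google hasn’t released Imagen Video’s code, so the snippet below is only a minimal sketch of what a frozen T5 text encoder looks like in practice. It uses the open-source Hugging Face transformers library and the small “t5-small” checkpoint as a stand-in for the much larger T5 encoder the paper describes.

```python
# A minimal sketch (not Google's code) of encoding a prompt with a frozen T5
# encoder, using the Hugging Face `transformers` library and the small
# "t5-small" checkpoint as a stand-in for Imagen Video's much larger encoder.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

# "Frozen": no gradients ever flow into the text encoder during training.
encoder.eval()
for param in encoder.parameters():
    param.requires_grad = False

prompt = "an astronaut riding a horse"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

print(text_embeddings.shape)  # these embeddings condition the diffusion stages
```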
(03:18):
The second component is something called a Base diffusion model, which creates the basic frames of the video. This works similarly to the popular “autoencoder” architecture in that it deconstructs an image into an abstract representation; in the case of Imagen Video, this abstract representation looks like TV static. The model then learns how to reconstruct the original image from that abstract representation. Critically, the base diffusion model of Imagen Video operates on multiple video frames simultaneously and then further improves the coherence across all the frames of the video using something called “temporal attention”. Unlike some previous video-generation approaches, these innovations result in frames that make more sense together, ultimately producing a more coherent video clip. So, that’s the second component of Imagen Video. The first one was the T5 text encoder, which understands the meaning of the natural language prompt we provide as an input; the Base diffusion model then takes the information from that T5 step and, through its ability to convert an abstract representation into an image, turns it into a number of video frames.
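Again, what follows is not the published model, just a toy illustration in PyTorch of the temporal-attention idea: self-attention applied across the frame axis at every spatial location, which is what lets a video denoiser keep all the frames of a clip consistent with one another.

```python
# A toy sketch (my own illustration, not the published architecture) of
# "temporal attention": self-attention across the frame axis at each spatial
# position, so the denoiser can keep all frames of the clip coherent.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Fold spatial positions into the batch so attention only mixes frames.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        out, _ = self.attn(x, x, x)  # every frame attends to every other frame
        return out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

noisy_frames = torch.randn(2, 8, 64, 16, 16)      # tiny stand-in latent video
print(TemporalAttention(64)(noisy_frames).shape)  # torch.Size([2, 8, 64, 16, 16])
```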
(04:40):
The third and final step of Imagen Video is taking those frames and making them high resolution. This is done with something called interleaved spatial and temporal super-resolution diffusion models, which work together to upsample the basic frames created in step two by the base diffusion model to a much higher resolution. Since this final stage involves working with high-definition images (much more data), memory and computational complexity considerations are particularly important. Thus, this final stage leverages convolutions, a relatively simple operation that has become a standard in deep learning machine vision models over the past decade, instead of the more complex temporal attention approach of the earlier base diffusion model.
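One more hedged sketch before the recap, this time of the general idea behind that final stage: interpolation plus plain convolutions to grow a clip in space and in time. This is my own heavily simplified illustration of why convolutions keep the cost manageable, not the actual super-resolution diffusion models from the paper.

```python
# A rough sketch (again, not the real model) of growing a video in space and
# time with cheap interpolation and convolution, rather than the costlier
# temporal attention used in the base diffusion model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSuperResolution(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, f, c, h, w = video.shape
        frames = video.reshape(b * f, c, h, w)
        frames = F.interpolate(frames, scale_factor=4, mode="bilinear",
                               align_corners=False)
        frames = self.refine(frames)                 # spatial upsampling, 4x
        video = frames.reshape(b, f, c, h * 4, w * 4)
        # Temporal upsampling: interpolate in-between frames (2x frame rate).
        video = video.permute(0, 2, 3, 4, 1)         # move the frame axis last
        video = F.interpolate(video.reshape(b, c * h * 4 * w * 4, f),
                              scale_factor=2, mode="linear", align_corners=False)
        return video.reshape(b, c, h * 4, w * 4, f * 2).permute(0, 4, 1, 2, 3)

low_res = torch.randn(1, 8, 3, 24, 48)          # tiny placeholder clip
print(SimpleSuperResolution()(low_res).shape)   # (1, 16, 3, 96, 192)
```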
All right, so a quick recap one last time of the three separate components of Imagen Video. The T5 text encoder understands the natural language prompt, the Base diffusion model takes that natural language representation and converts it into simple frames, and then finally the interleaved spatial and temporal super-resolution diffusion models take those simple frames and convert them into high-resolution ones.
(05:55):
Now that you know how Imagen Video works, you might be dying to try it out yourself. Regrettably, Google hasn’t released the model or source code publicly due to concerns about explicit, violent, and harmful content that could be generated with the model. Because of the sheer scale of natural language scraped from the Internet and then used to train T5 and Imagen Video, it’s difficult to comprehensively filter out problematic data, including data that reinforce social biases or stereotypes against particular groups.
(06:25):
Despite our inability to use Imagen Video ourselves, it is nevertheless a staggering development in the fields of natural language processing and creative artificial intelligence. Hopefully forthcoming approaches can resolve the thorniest social issues presented by these models so that we can all benefit from innovations like this.
(06:43):
Thanks to Shaan Khosla, a data scientist on my team at my machine learning company Nebula, for inspiring this Five-Minute Friday episode on Imagen Video today by providing a summary of the Imagen Video paper via his Let’s Talk Text Substack newsletter. He uses the newsletter to provide a weekly, easy-to-read summary of a recent key natural language processing paper, and you can subscribe to it for free if that’s something you’re interested in; we’ve provided a link to Shaan’s Substack in the show notes.
(07:11):
Ok, that’s it for this episode. Until next time, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.