SDS 570: DALL-E 2: Stunning Photorealism from Any Text Prompt

Podcast Guest: Jon Krohn

April 28, 2022

Welcome back to the Five-Minute Friday episode of the SuperDataScience Podcast!

Continuing with last week’s FMF theme, Jon updates listeners on another AI breakthrough: the DALL-E 2 model from OpenAI, a multimodal model that turns text prompts into images.


Last week’s Five-Minute Friday episode featured Jon reviewing Google’s PaLM, and this week, he’s back with OpenAI’s multimodal text-to-image model, DALL-E 2.
When the first iteration of DALL-E was released last year, Jon could not stop talking about its outstanding results! The model’s name blends the artist Salvador Dalí with the robot WALL-E from the eponymous Pixar film, and it used a smaller version of the well-known GPT-3 model.
While GPT-3 is trained purely on language, DALL-E is trained on a dataset of images paired with text that describes them. This makes DALL-E multimodal: it is both an NLP model and a machine vision model. This multimodal functionality enables DALL-E to churn out staggering visual examples of whatever your mind can dream up. Want an illustration of a baby shark in a tutu serving ice cream? Provide that as an input to DALL-E, and it returns countless examples of precisely that bizarre illustration. Want to see a teapot shaped like a Rubik’s cube? Again, just ask DALL-E, and voilà, you’ve got it! Want to see examples of cameras from every decade of the 20th century? Even temporal information like this is encoded in DALL-E, so no problem.
But all of that incredible text-to-image functionality was already available in DALL-E last year, so what’s improved in the brand-new DALL-E 2 model? Well, to start, it’s simply more proficient at the same kinds of tasks:
  • It has 4x greater resolution, resulting in larger, more realistic-looking images.
  • In comparisons judged by human evaluators, DALL-E 2 was also preferred over the original DALL-E 72% of the time for caption-matching and 89% of the time for photorealism. 
But DALL-E 2 is not just better at the same kinds of tasks; it can also accept images as part of its input, which opens up entirely new kinds of tasks, such as making realistic edits to an existing image. DALL-E 2 can also receive any given image as input and create variations inspired by the original. They’ll be roughly consistent with the original in terms of both style and composition, but each variation will be unique and never seen before.
Tune in to this week’s episode to learn more about DALL-E 2. 
ITEMS MENTIONED IN THIS PODCAST:
  • OpenAI’s DALL-E 2 webpage
  • The DALL-E 2 access waitlist
DID YOU ENJOY THE PODCAST?

  • What amazes you the most about DALL-E 2? What real-world applications can it already address for you?
  • Download The Transcript

Podcast Transcript

(00:05):
This is Five-Minute Friday on DALL-E 2.

(00:19):
For Five-Minute Friday last week, I introduced Google’s mind-blowing new natural-language-processing model, PaLM. This week, I have another crazy new A.I. model for you. This one produces outputs that took my breath away, that left me beside myself: the DALL-E 2 model from OpenAI.
(00:37):
A year ago, I couldn’t shut up for weeks when I saw the outputs from OpenAI’s predecessor model, DALL-E. The name, by the way, is a blend of the artist Salvador Dalí and the spelling of the robot WALL-E from the eponymous Pixar film, cutely conjuring up the idea of a robot artist. That model used a smaller version of the well-known GPT-3 model, so GPT-3 proper has 175 billion parameters while DALL-E is less than 10% of that size; to get the skinny on the GPT-3 natural-language-processing model, check out SuperDataScience episode #559, where you can hear about it directly from one of its creators, Melanie Subbiah.
(01:14):
As detailed in that episode, while GPT-3 was trained purely on language, DALL-E was trained on a dataset of images paired with text that describes them. This makes DALL-E multimodal: It is both an NLP model and a machine vision model. This multimodal functionality enables DALL-E to churn out staggering visual examples of whatever your mind can dream up. Want an illustration of a baby shark in a tutu serving ice cream? Provide that as an input to DALL-E and it returns countless examples of exactly that bizarre illustration. Want to see a teapot shaped like a Rubik’s cube? Again, just ask DALL-E and voilà, you’ve got it! Want to see examples of cameras from every decade of the 20th century? Even temporal information like this is encoded in DALL-E, so no problem.
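
For readers of these show notes who would like to try this kind of text-to-image prompting programmatically, here is a minimal sketch using OpenAI’s Python client. API access was waitlisted when this episode aired, and the prompt, image count, and size below are illustrative assumptions, not details from the episode:

    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    # Ask DALL-E 2 for several renderings of a whimsical text prompt
    response = client.images.generate(
        model="dall-e-2",
        prompt="an illustration of a baby shark wearing a tutu serving ice cream",
        n=4,               # number of candidate images to return
        size="1024x1024",  # DALL-E 2 also supports 256x256 and 512x512
    )

    # Each generated image comes back as a hosted URL
    for image in response.data:
        print(image.url)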
(02:04):
All right, so all of that incredible text-to-image functionality was already available in DALL-E last year; so what’s improved in the brand-new DALL-E 2 model? Well, to start, it’s simply more proficient at the same kinds of tasks: It has 4x greater resolution, resulting in larger, more realistic-looking images. And, in comparisons judged by human evaluators, DALL-E 2 was also preferred over the original DALL-E 72% of the time for caption-matching and 89% of the time for photorealism.
(02:34):
But DALL-E 2 is not just better at the same kinds of tasks; DALL-E 2 is also capable of accepting images as part of its input and so is capable of entirely new kinds of tasks. For example, DALL-E 2 can do image edits. If you input an image of an art gallery and then ask DALL-E 2 to add a Yorkshire terrier, depending on the location in the image you specify, DALL-E 2 will add a Yorkshire terrier in the correct style. So, if you specify that you’d like the terrier on a painting in the art gallery by Claude Monet, then the terrier will appear in the painting in Monet’s style. If, alternatively, you specify you’d like the terrier on a modern painting in the art gallery, then the terrier will appear on the painting in a correspondingly modern style. And in a completely different kind of example, if you specify that you’d like the terrier to be sitting on a bench within the art gallery, then instead of being on a painting, in a painted style, the terrier will be photorealistic.
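
As a sketch of what this edit workflow can look like in code, OpenAI’s Python client exposes an images.edit endpoint; the file names below are hypothetical, and the mask’s transparent region marks where the terrier should be added:

    from openai import OpenAI

    client = OpenAI()

    # Edit an existing image: the transparent region of the mask marks the
    # area DALL-E 2 is allowed to repaint to match the prompt.
    response = client.images.edit(
        model="dall-e-2",
        image=open("art_gallery.png", "rb"),  # hypothetical square PNG of the gallery
        mask=open("bench_mask.png", "rb"),    # hypothetical mask exposing the bench
        prompt="a Yorkshire terrier sitting on a bench in an art gallery",
        n=1,
        size="1024x1024",
    )

    print(response.data[0].url)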
(03:30):
Finally, DALL-E 2 can receive any given image as input and create variations that are inspired by the original. They’ll be roughly consistent with the original in terms of both style and composition, but the variations will all be unique and never seen before. Check out the link to the OpenAI DALL-E 2 webpage in the show notes to play around with various inputs and see the wild, impressive outputs.
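
In code, this variations capability is exposed through the images.create_variation endpoint of OpenAI’s Python client; a minimal sketch, with a hypothetical input file:

    from openai import OpenAI

    client = OpenAI()

    # Produce new images that stay roughly consistent with the original's
    # style and composition.
    response = client.images.create_variation(
        image=open("original.png", "rb"),  # hypothetical square PNG input
        n=4,
        size="1024x1024",
    )

    for image in response.data:
        print(image.url)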
(03:53):
At this time, playing around on the webpage like that, within a tightly constrained playground of pre-computed outputs, is the most that OpenAI will let you do with DALL-E 2. While on the one hand it would be awesome if we could all be using DALL-E 2 straightaway, to have fun, to help us with the design of something, to create new art, or for artistic inspiration, if the model were openly available, its flexibility and staggering photorealism would also make it ripe for abuse. By not including explicit content like violent or pornographic images in DALL-E 2’s training data, OpenAI have made an effort to avoid the most egregious abuses. And they’ve also limited the model’s ability to create photorealistic images of the faces of any specific real person. But they anticipate that there will be other gaps they hadn’t thought of, so they are rolling out access to DALL-E 2 in phases. Feel free to add yourself to the waitlist! The link for that is in the show notes as well. And this does have me thinking that if OpenAI is capable of building these capabilities today, we’re probably not too far off from a future in which some organization, with perhaps more [inaudible 00:05:00] ends, decides to train a similar kind of model but include violent or pornographic images, or allow outputs depicting specific real people, so I guess that’s coming.
(05:16):
Well, frightening, but at least we don’t have to worry about it today. All right, that’s it for this Five-Minute Friday episode. Keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.