SDS 429: 2020’s Biggest Data Science Breakthroughs

Podcast Guest: Jon Krohn

December 23, 2020

Jon and I covered the three most interesting breakthroughs of the year for machine learning and cross-discipline achievement, as well as Jon's background and an important and exciting announcement about the future of the podcast.

 

About Jon Krohn
Jon Krohn is Chief Data Scientist at the machine learning company untapt. He authored the 2019 book Deep Learning Illustrated, an instant #1 bestseller that was translated into six languages. Jon is renowned for his compelling lectures, which he offers in person at Columbia University, New York University, and the NYC Data Science Academy, as well as online via O'Reilly. Jon holds a Ph.D. in neuroscience from Oxford and has been publishing on machine learning in leading academic journals since 2010.
Overview
As we've already briefly discussed, Jon will be taking over hosting the podcast in 2021. For me, it was easy to select Jon as the host going forward because of his personality, kindness, and intelligence. You got a taste of it in our previous episode. Jon feels confident about the role: he is passionate about the podcast and the field, being a data science educator himself. But who is Jon? What are his values, and what can you expect from the podcast?
Jon is a scientist at heart, having studied neuroscience at Oxford with the intention of becoming a medical scientist before he decided he could have a greater impact in the startup world, for consumers and the environment. Jon is fascinated by the ways technology and data touch our lives in billions of ways we don't realize, the podcast itself being one of them. Jon currently works at untapt, where they create algorithms for human resources needs. They use natural language processing and deep learning models to do the work and to fight bias in hiring. Beyond that, Jon moonlights as a data science evangelist of sorts, hoping to spread the knowledge and benefits of data science to a wider group.
Jon's picks for the three major breakthroughs are AlphaFold, GPUs, and GPT-3. In November 2020, there was a breakthrough in protein folding. DNA, the fundamental blueprint for your body, encodes the proteins that make every process in your body possible. A protein's chain of amino acids "folds" into a three-dimensional shape that is critical to its function. AlphaFold came out of DeepMind for the 2018 CASP competition, which is designed to test algorithms' ability to predict protein shapes, and it was dramatically improved upon this past November. Understanding protein shapes can help us design proteins that benefit us biologically and even environmentally.
GPUs were Jon's next pick. GPUs have long been a hot item in gaming and crypto mining, and more recently they have become essential for training neural networks, where their massively parallel architecture lets them do far more than a CPU. Onboard memory in the latest generation has more than doubled, which is a critical breakthrough for model training, as is the ability to split models across multiple GPUs.
The final breakthrough we discussed was GPT-3, the generative pre-trained transformer. It is pre-trained on a massive corpus of text under a transformer architecture, which allows the model to generate text such as a poem. The number of parameters (the model's weights and biases) has grown dramatically with each GPT generation, to the point where the models are too big for most organizations to train and take enormous compute to run. GPT-3, though not open source, is massive in size and allows for "zero-shot learning". It can generate quiz questions, plots, and poems, perform translations, write code, and much more. It's impressive, but it's not yet consistent at convincing a human it is capable of intelligent reasoning.
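To give a rough sense of what "generate text from a prompt" looks like in practice, here is a minimal sketch using the open-source GPT-2 model via the Hugging Face transformers library; GPT-3 itself is only available through OpenAI's hosted API, so the model and the prompt below are illustrative stand-ins rather than what was used in the episode:

```python
# Minimal text-generation sketch with GPT-2, an open-source relative of GPT-3.
# GPT-3 is not open source, so GPT-2 serves here as an illustrative stand-in.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # downloads the model on first use

prompt = "Elon Musk poem:"  # hypothetical prompt, echoing the example discussed in the episode
outputs = generator(prompt, max_length=60, num_return_sequences=1)

print(outputs[0]["generated_text"])
```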
We wrapped up discussing a COVID-19-related application of AlphaFold: it is capable of predicting the shape of the spike protein on the coronavirus. 2020 has been the year that never was for many, but the improvements over the last century that have helped us produce vaccines and respond better to global crises are possible through the storing and sharing of data.
In this episode you will learn:
  • Global warming [4:37]
  • Our big podcast announcement [6:57]
  • Who is Jon Krohn? [12:14]
  • Top 3 technological breakthroughs of the year [21:28]
  • AlphaFold [23:33]
  • GPUs [45:51]
  • GPT-3 [1:00:26]
  • Wrap up [1:26:40] 
Podcast Transcript

Kirill Eremenko: 00:00:00

This is episode 429 with Chief Data Scientist at untapt, Jon Krohn. 
Kirill Eremenko: 00:00:12
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let’s make the complex simple. 
Kirill Eremenko: 00:00:44
Welcome back to the SuperDataScience podcast everybody, super excited to have you back here on the show. Today, we’ve got a very cool episode with Jon Krohn. We spent over an hour discussing 2020 in retrospect, in terms of technological breakthroughs. In this podcast, you will hear about the three main topics that were identified, well Jon identified, as the most interesting breakthroughs and the first one will be AlphaFold, which happened literally just a month ago in November this year. And that is a massive breakthrough for machine learning and for the space of biology as well. You’ll learn everything you need to know about protein folding, what it’s all about, the CASP competition, how AlphaFold a child of DeepMind which is owned by Google managed to solve this protein folding challenge and what it means, most importantly, what it means for the space of biology, for the future of tech and for the future of AI and machine learning. 
Kirill Eremenko: 00:01:56
The second topic we spoke about was GPUs, and you will find out why they're important for the space of deep learning, what the recent developments in the space of GPUs are, and tips and tricks on how they are used by individual deep learning practitioners and enterprises alike. For instance, we spoke a bit about parallelization of GPUs. 
Kirill Eremenko: 00:02:21
And finally, the third topic for today was GPT-3, the revolutionary natural language processing model from OpenAI that has completely left all other models far behind. They're so far ahead. You'll find out what GPT-3 is, how it works in some high-level ways. But we're also going into some detail on certain aspects of it, such as self-attention and semi-supervised learning. You'll find out what those things mean. And also what are the implications? What are the use cases, where is this going? And you'll hear two different opinions: I'm super excited, or maybe overly excited about it and optimistic, while Jon is a bit more skeptical. So you'll hear both opinions on that and how it's going to progress. But nevertheless, it's one of the biggest breakthroughs in the space of natural language processing, and it happened not that long ago, earlier this year. 
Kirill Eremenko: 00:03:24
So that’s what this podcast is about. Of course, you also hear a bit of Jon’s background and we’ve got an important and exciting announcement which we’ll mention at the very start of this episode. It’s to do with the future of this podcast and how we’re going to progress in 2021, what exciting things you can look forward to. And so without further ado, let’s get straight to it. I bring to you Jon Krohn, chief data scientist at untapt. 
Kirill Eremenko: 00:03:52
Welcome back to the SuperDataScience podcast, everybody. Super excited to have you back here on the show. And today we've got a very cool episode, a very special guest, and you'll find out why he's very special just now: Jon Krohn, joining us from New York. Jon, welcome. 
Jon Krohn: 00:04:14
Hey there, Kirill, it’s always a delight to be here and yes, well I was just about to spoil everything. Continue on, continue on. 
Kirill Eremenko: 00:04:27
Okay. You know what, let's create a bit of tension and put it off for a second. I wanted to talk about, to start with, we were chatting before we started the podcast that it snowed in New York a day ago or today, right, for you. 
Jon Krohn: 00:04:47
It was snowing today. 
Kirill Eremenko: 00:04:50
And it was hitting the ground melting right away. Yeah? 
Jon Krohn: 00:04:54
Exactly. Some kind of deep analogy. 
Kirill Eremenko: 00:05:01
It was just like, it’s just felt interesting, I don’t know, I haven’t lived in New York. I don’t know if like maybe 10, 20 years ago it snowed more. But it feels that way, that that’s where the world’s going. I’ve spoken to many people from different parts of the world and they’re like, “Oh, we haven’t had snow yet.” Or like, “Oh, it’s starting to snow later.” And this whole, like inevitably started thinking about the global warming situation and it makes me wonder how unique is our generation? Like our generations that are on the planet now that we are the ones going through such a crazy transition in the planet’s life. No other generation before, it was like, okay always kind of the same as our parents and grandparents and great grandparents. But not only do we have like this technological revolution and all this stuff going on, we also have this global warming thing that we have to face and we will have to face more and more. So it’s an exciting, alarming, but also exciting time to be alive. What do you reckon? 
Jon Krohn: 00:06:02
Yeah, I saw an interesting study of how many Winter Olympic Games sites from the 20th century would not be possible sites for the games today, and certainly in a few decades. You wouldn't be able to have enough snow for skiing and so on. So there's real, tangible change happening. That's actually a topic that's very much of interest to me. And I think that there is a surprising amount that data science can do to positively impact the environment. And this is a topic that I am super interested in. I don't think it's one of the topics that we're going to be talking about today, but it comes up a lot. 
Kirill Eremenko: 00:06:53
Yeah, for sure. But back to our big, big announcement. So it was very briefly mentioned on the previous episode and Jon spilled the beans. It's all good. It's all good. And we'll mention it again today. In the next upcoming FiveMinuteFriday I'll definitely talk more about it. But once again, I wanted to say that I'm super pumped that from 2021, Jon is going to be the new host of all of the SuperDataScience podcasts. I'm a big fan of Jon's personality, or not even personality, but character. Character is like your inner stuff: not just the outer stuff, how you present yourself to people, but your inner beliefs and values. And when I had to pick a host for the podcast, I told this story to Jon, I just knew right away. I didn't have to go and interview people. I didn't even have multiple options to select from. I knew right away. I spoke to Jon first and Jon kindly agreed. Jon, how do you feel? 
Jon Krohn: 00:08:08
Well, I couldn’t believe that that’s what the call was about when you phoned me to tell me about that, because we hadn’t spoken in many months. So I was on Kirill’s podcast which is how we met. And then I invited Kirill to be on my podcast. So at that time, I was piloting a news podcast and Kirill agreed to be on it. And I was blown away even then, because Kirill’s hosted hundreds and hundreds and hundreds of his own SuperDataScience podcasts, but he hadn’t been on a podcast in years. And it was his second podcast guest appearance ever. So we were really honored to have him. And then we hadn’t talked in several months. I don’t think we’d even had emails back and forth. And so I got this email from Kirill saying, “I have a podcast idea.” And I was like, “Cool. I would love to try to provide some kind of ideas or some feedback.” I can’t imagine what I know that Kirill doesn’t. But you got on, I don’t know if you remember this, but you started off by saying, “I’m a bit nervous about this and I don’t know how to say it, so I’m just going to do it. I’m just going to rip the bandaid off.” And then I was like, and then you asked, and I was like, “Absolutely. I would love to be the host of the SuperDataScience podcast.” 
Jon Krohn: 00:09:28
It is such an amazing podcast. You’ve cultivated such a great audience and always such wonderful guests. I also love the FiveMinuteFridays. That’s something that’s a really nice, special touch that you’ve added in on top of the longer guest episodes. And it’s truly an honor. I’ll be candid with the listeners, when we were getting ready to record last week’s episode, last week’s guest episode, I was nervous going into it, and I’m not nervous ever. If I’m lecturing in front of hundreds of people live, I never have a single butterfly in my stomach, but because of the amazing audience, I don’t want to disappoint you guys and I won’t disappoint you guys, but I’m just getting a little nervous making sure that that’s true. 
Kirill Eremenko: 00:10:19
Nice, nice. I'm sure everybody's going to be patient with any blunders that happen along the way, but also very understanding, and they'll love to have you. We've already seen that with the sign-ups to your course on Udemy, it's 80,000, right? Over 80,000? 
Jon Krohn: 00:10:43
80,000 people have signed up. And the crazy thing, so a lot of people, when they sign up for the course and do the course, I’ve gotten a lot of people reach out to me on LinkedIn and say, “That was an amazing course.” And then I always write back, because I’m like, the course is done. The course is one eighth done. So, we put out, there’s three and a half hours of content right now, but there’s going to be about 25 hours when the course is completely done. And so we took this, it’s a strategic decision. Do you wait? It’s going to be another many months at least to get that much more content recorded and out, or let the audience enjoy things as it comes out and get it to people sooner? And so that’s the route we went down. But it’s amazing to me that 80,000 people have signed up when, from my perspective, I’m like, it’s not even there yet. 
Kirill Eremenko: 00:11:33
Gotcha. This episode is brought to you by SuperDataScience, our online membership platform for learning data science at any level. We've got over two and a half thousand video tutorials, over 200 hours of content and 30-plus courses, with new courses being added on average once per month. All of that and more you get as part of your membership at SuperDataScience. So don't hold off, sign up today at www.superdatascience.com. Secure your membership and take your data science skills to the next level. 
Kirill Eremenko: 00:12:11
Jon, this wasn’t planned, but I want to ask you for our listeners who are like, “Ooh, yeah, I love Jon. This is going to be amazing.” But some are like, “Who the heck is Jon? What is he going to bring?” Tell us a bit about yourself, but most importantly, what are your values in life? So people can decide if they resonate with you and what to expect. 
Jon Krohn: 00:12:40
Wow, sure, yeah. So I trained as a scientist. I think I've kind of always been a scientist at heart. So I did a PhD at Oxford in neuroscience and thought for a long time that I would do medicine as well, so be a medical scientist. And I decided after the PhD that I actually wanted to get involved in the commercial world, startups, that kind of thing, because I believed that you can have the biggest impact by creating products, data-driven products in particular, and automate things, make people's lives easier, have a good impact on the environment. There's so many opportunities. In the same way that we are confronted by these very unusual times like you mentioned, climate change is a big thing, nuclear war is an ever-present threat in a way that none of our ancestors have ever had to worry about, and even automation itself for a lot of people is a scary thing that's happening right now. There's a lot of displacement and retraining that's required in a lot of different job markets all over the world because of automation, which is a trend that's been happening since the industrial revolution 150 years ago, but it's accelerated now by data and machines and interconnectedness. 
Jon Krohn: 00:14:08
So there’s these unique problems that we face today, but we’re also uniquely equipped to solve those problems and data is the key in all of this. So we’re recording this podcast today, I’m in New York and you are almost as far away as you could be in the world. 
Kirill Eremenko: 00:14:29
11 hours difference. We have to find these crazy times that we can meet up. 
Jon Krohn: 00:14:35
Yeah, Kirill, when we were talking before starting recording, he was like, "You look a little tired today." And I'm like, "Yeah, it's late." Anyway, so you're on the other side of the world and then we record this, we record the data, the media from this. And then we can distribute it to anybody anywhere in the world forever, and some people learn and hopefully are also entertained by this podcast. But we also have the opportunity to teach people about data science and what's possible. And that same kind of thing is happening in billions, trillions of different ways every day, whereby you're accessing data on the internet. It's magic. It's on a server somewhere and in real time, no matter where you are in the world, you can get this information, share this information. So this interconnectivity from data allows our species to share information at light speed all around the planet. Our ancestors couldn't have even imagined the kinds of things that we very quickly take for granted. It's amazing how quickly you take an iPhone for granted. 
Jon Krohn: 00:15:56
And so on top of data facilitating this communication between us by storing information and allowing it to be transmitted at light speed, we also have machines that can identify patterns in data and based on a particular objective that we set for that machine, it can be trained on data and automate things in really magical ways. And this has in just the last few years, especially accelerated with deep learning approaches that allow us to find a signal in a large amount of data noise. And yeah, so I’m, I don’t know, I’ve been talking for a long time. Basically, I’m kind of trying to get the audience to understand what I’m excited about here and it’s data. It’s building models with data. It’s deploying models and making a positive impact in the world. So yeah. 
Kirill Eremenko: 00:17:04
That’s cool. And you build models yourself at untapt and for fun? 
Jon Krohn: 00:17:12
Yeah, exactly. So I guess I have two kind of main tracks in my life. So my day job, I’m the chief data scientist at a machine learning company called untapt. And so we build algorithms in the human resources space, and there’s a lot of opportunity there. So finding the right people for the right job so that you can be satisfied with your life and build a career, all those kinds of things are very important. But there’s also critical risks that we need to take into account around ensuring that we’re not incorporating bias into our algorithms, and so there’s also these kinds of things. You can incorporate analytics into these kinds of models to ensure that human decision makers aren’t unnecessarily biased against particular demographic groups. So that’s my day job. We use natural language processing, deep learning models to make all these things happen, and it’s a tremendously exciting space to be working in. That’s kind of my day job. 
Jon Krohn: 00:18:25
But then on top of that, I have this kind of data science advocacy education that I am obsessed with and it drives everyone else in my life crazy. Family, friends, my dog, my girlfriend, I think they all secretly kind of hate me because I’m obsessed with spreading this kind of message about data science to the world. So I wrote a best-selling book called Deep Learning Illustrated. I have dozens and dozens of hours of technical, educational videos with hands-on code demos and lots of theory, particularly on deep learning historically and covering all of the major ranges of deep learning. And more recently now I’m getting into what I call machine learning foundations which is linear algebra, calculus, probability, statistics, computer science, all of these kinds of foundational subjects that allow people to be great data scientists or machine learning engineers. And I’ve come across this as an important niche over the last few years through doing all the teaching that I’ve been doing. I do many, many dozens of lectures a year, and I’ve discovered that data scientists and machine learning engineers, they really want to understand these foundations better. And I could talk for a long time about why I think that it’s also really good to know this stuff and the practical benefits that it can have in your life. But well, at some point we should get on with the actual topic of this episode. 
Kirill Eremenko: 00:20:01
Yeah, man. I just wanted to come and thank you. And that also sheds light on some of your values and why I’m a big fan of yours. This obsession with spreading the word of data science and helping people learn, you don’t have to do it. Nobody’s forcing you to do it. You have a fantastic career. You could just be working and have this extra free time to yourself. But you’re putting that time in, and you’re creating this content. You’re creating these videos. And for you, it’s very important to, I know from like the way we were launching the course, for you it’s very important to reach as many people as possible and help as many people as possible. And it really reinforces for me that you are the right person to take this podcast forward because this is what it’s about. It’s about helping people either get started into data science or continue growing or build a data science team, learn new trends and things like that. So, I want to just say thank you for accepting this proposal. I’m very excited for what’s coming ahead. 
Jon Krohn: 00:21:16
It was a no brainer, Kirill. This is such an amazing opportunity. No problem to get excited about this. 
Kirill Eremenko: 00:21:25
Awesome. Thanks, man. Well, let's, as you said, let's get on with the topic. And the topic for today is 2020 is almost behind us. It is already December. We've got only a few weeks left of the year. Actually, when this episode goes live, there will be about one week left of the year. What I wanted to chat about are the top technological breakthroughs. We originally agreed with Jon to do three each, bring three each. But then when Jon sent me his three, I read through them and I realized that those are such great ones, and I am so lazy, that we're just going to stick with those three. They're good, and the first one we're going to talk a little bit about, then the second one we're going to go into more depth, and the third one we're going to really hit hard and see how much we can get out of them. So, Jon, would you care to announce to us what the three breakthroughs are? 
Jon Krohn: 00:22:28
Yeah, so we are going to talk about AlphaFold. We're going to talk about GPUs, graphics processing units. And the third and final topic we will be talking about is GPT-3, a groundbreaking natural language processing algorithm. I also just realized, and this is probably fun for your listeners, this is the only episode that we're co-hosting. 
Kirill Eremenko: 00:22:59
We are co-hosting. Sorry? 
Jon Krohn: 00:23:01
Oh no, I guess we co-hosted the last one, too. 
Kirill Eremenko: 00:23:03
Yeah, yeah. Yeah. 
Jon Krohn: 00:23:04
But this one is just the two of us. That’s the only- 
Kirill Eremenko: 00:23:06
[crosstalk 00:23:06] It's just the two of us. No, but we did have you on the podcast before. 
Jon Krohn: 00:23:11
Yeah. But I wasn’t hosting. I was just a guest. 
Kirill Eremenko: 00:23:13
Oh okay. This is the only one where we’re co-hosting and it’s just the two of us. 
Jon Krohn: 00:23:18
Yeah. It wasn't as groundbreaking as I thought it was when I started saying it, but yeah. Also, we're recording on a Monday; we haven't done that before. So all these important things. 
Kirill Eremenko: 00:23:31
Fantastic. Okay. Well, let’s get straight into it. So AlphaFold, groundbreaking news. This was what, October or November? This is November, right? Like just recently- 
Jon Krohn: 00:23:44
November. Yeah, November 2020. 
Kirill Eremenko: 00:23:46
So you have a PhD in neuroscience. Do you mind explaining what the whole protein folding problem is all about? 
Jon Krohn: 00:23:58
I would love to because I prepared notes on that. So it would be really annoying if you took that part from me. 
Kirill Eremenko: 00:24:03
Please. Please, go ahead. 
Jon Krohn: 00:24:07
All right. So everyone's heard of DNA. So DNA carries the blueprint of life. So as a human, you inherit one blueprint from your mom, one from your dad. Each of those blueprints, what is it a blueprint for? It's a blueprint for proteins. So everyone's heard of proteins. You think about them as something you eat alongside carbs and fats. But they play a much bigger role than just something that we eat. Proteins are part of everything in your body, everything that does work, every part of the structure of your body. So your skin, your muscles, your ability to see, your ability to think is all a result of functioning proteins in your body. Every single biological process in the world, in every organism, is carried out by proteins, including enzymes, which actively do work. Any kind of digestion, building anything, tearing anything down, all of that is done by proteins. 
Jon Krohn: 00:25:20
So they're like this fundamental structure for allowing all of life to exist. DNA is the blueprint. So there are proteins in your body that read these DNA blueprints and figure out how to create other proteins that do everything else. So even more specifically, the DNA includes sequences of letters, kind of like the way that computer code is fundamentally a sequence of zeros and ones. The genetic code consists of four possible characters. The exact order that those characters are in dictates specific sequences of what are called amino acids. So in humans, we have 20 possible amino acids. The sequence that those amino acids are lined up in, in a long chain, is a protein. Those long chains of amino acids, they fold automatically into a specific protein shape, which allows that protein to do all the work, to allow you to see, to allow you to think, to allow you to have muscles and so on. 
Kirill Eremenko: 00:26:31
I mean, if you put the amino acids into a chain… If you imagine it, it's a one-dimensional sequence of amino acids. But in reality, because we live in a three-dimensional world, it's not always just a chain. Something happens to it and it takes like some… There's like an [inaudible 00:26:50] here. Basically, how to describe it in audio. Yeah, folds is the right word. Like this chain takes a three-dimensional structure. Based on the sequence of the amino acids, it will automatically take a shape, a certain shape in three dimensions. Yeah. Is it always the same shape for the same sequence of amino acids? 
Jon Krohn: 00:27:17
Yeah, I’m pretty sure it’s always the same shape. As kind of an example that the viewer can imagine if you think about red blood cells, so most people kind of know what a red blood cell looks like. It looks like a disc, like a donut shape kind of. 
Kirill Eremenko: 00:27:29
Yeah. 
Jon Krohn: 00:27:29
So that is a specific protein shape. So even though the protein ultimately is a long one-dimensional chain, it automatically folds up into that blood cell structure that you’re used to seeing in this donut shape. As an example of how DNA influences protein, there’s a common blood disease in Africa called sickle cell anemia. In sickle cell anemia, remember how I was describing how DNA it’s a one-dimensional sequence of four possible letters? 
Kirill Eremenko: 00:28:08
Mm-hmm (affirmative). 
Jon Krohn: 00:28:09
One of those letters is out of place, what we call a mutation. Which is the term everyone is aware of, even if you’re not thinking about it at a DNA level. So this mutation of a single letter in this long string of DNA, it causes a different amino acid to occur in the long chain of amino acids that creates a blood cell. Because of that mutation, that single change, that single letter change of DNA, instead of the donut shape, the red blood cell takes on a sickle shape. So sickle is like a curve- 
Kirill Eremenko: 00:28:43
Like hammer and sickle- 
Jon Krohn: 00:28:44
Like hammer and sickle. 
Kirill Eremenko: 00:28:45
… like for what you used to cut wheat. Yeah? 
Jon Krohn: 00:28:47
Exactly. It’s interesting. I mean, I could talk about sickle cell anemia for a long time and the advantages, actually. So it actually, it evolved. So that evolved as a way to avoid malaria, interestingly. 
Kirill Eremenko: 00:29:02
Oh wow. 
Jon Krohn: 00:29:02
So that's why it's so prominent in Africa because I can't remember exactly now why, but something to do with the shape of that sickle cell. Even though it causes a disease, that disease is less bad than malaria, which that sickle cell shape effectively prevents. 
Kirill Eremenko: 00:29:17
Interesting. 
Jon Krohn: 00:29:17
So- 
Kirill Eremenko: 00:29:18
Interesting. 
Jon Krohn: 00:29:18
Anyway, so there’s an example of- 
Kirill Eremenko: 00:29:21
So, the function of a protein in this example, even though there's only a slight mutation, right? The shape of the protein is slightly different from the circular, or disc-shaped, blood cell. But also the important point to note here is that the function of a protein depends on its three-dimensional fold, right? If it's folded one way, it's going to function differently than if it's folded a different way. I thought of an analogy here: if you take a flat piece of A4 paper, you can do origami with it, right? You can create a bird, you can create a box, you can create a tree, you can create a ship. It's the same piece of paper, but depending on what origami you create, it will have a different "function". It's not going to fly, right? 
Kirill Eremenko: 00:30:11
But it will look different. It can fit into different size containers. It can make different impressions. It can be used for symbolism in different ways and so on. So here it's a similar thing. It's just that the protein originally is not a two-dimensional paper. It's a one-dimensional string. But still you can kind of origami it into different shapes, into different folds. If the fold is wrong, that causes diseases or causes different anomalies. So the challenge here, which has been around for 50 years, is… maybe you can talk about that. I know just the superficial stuff about this. 
Jon Krohn: 00:30:53
Yeah. The one key thing about that analogy, which I love that paper analogy is that… you touched on it earlier. Is that the interesting distinction there is that with the same starting point, with the same strip of paper, in biology in the actual… in real biological systems, you’d always end up with like a paper crane as your origami shape. 
Kirill Eremenko: 00:31:17
Yeah. 
Jon Krohn: 00:31:17
It would be like you would always dictate the crane. 
Kirill Eremenko: 00:31:18
Yeah. 
Jon Krohn: 00:31:19
But when you, but when you look at it as a person, you might look at this piece of paper and say, “I think this could be a turtle.” 
Kirill Eremenko: 00:31:27
Yeah, yeah. 
Jon Krohn: 00:31:29
But in biology it always ends up being a crane. So, yeah, so there’s actually, there’s a game called Foldit, that any of our listeners can download. It’s a game that’s been around for at least 10 years. It allows you to get an amino acid sequence, and you try to figure out how to fold it together into a nice 3D shape. You kind of learn how to do these. You can develop an intuition around how to fold a protein properly. Lots of protein structure breakthroughs have happened from people playing this game, Foldit- 
Kirill Eremenko: 00:32:01
[crosstalk 00:32:01] Yeah. I heard of that. It's kind of like crowdsourcing, solving this grand challenge. This challenge has been around for 50 years. It's about: if you have a protein, an input sequence of amino acids, what shape is it going to take? Can you predict the shape that it is going to take? So that's the challenge. Why are we talking about this on a machine learning, data science, AI podcast? 
Jon Krohn: 00:32:26
Yeah. So then, yeah, there's this specific machine learning challenge that is the same as this kind of human folding challenge, which is CASP, the Critical Assessment of Structure Prediction, C-A-S-P. Yeah, as you say, it's been around for a while. This competition happens every two years, so they just did the 2020 competition. There were about 100 teams in the competition this year. Every year the challenge is, you get a bunch of sequences of amino acids and the machine learning algorithms have to try to predict what the shape will be. So you get told what the sequence is, you get the paper cutout. Then you have to say, "This is going to be a paper crane, or this is going to be a turtle," or whatever. 
Jon Krohn: 00:33:26
It's hard. It's a very, very hard task. These amino acid sequences can be hundreds, or thousands, or tens of thousands long. So the structures can be very, very complicated. The metric of performance is how close the predicted shape is to the real protein shape. So it's a 3D structure. So you can measure the distance: "Okay, well, this part of it should be over here in the top right-hand corner, but the algorithm got it in the middle, instead of the top right-hand corner. So there's a big distance." 
Jon Krohn: 00:34:02
So it’s like you have this global distance score, where zero is the worst possible score, and 100 is a perfect score where there’s no distance at all. So the algorithm is perfect. To give you a sense of how things have been happening recently, in 2008, the best score in this competition, this CASP Competition was 40. 
Kirill Eremenko: 00:34:26
Mm-hmm (affirmative). 
Jon Krohn: 00:34:26
Eight years later in 2016, the best score was 40. So things were moving along at a turtle's pace. But then two years later in 2018, the DeepMind team… So there's this company that probably many listeners have heard of, DeepMind. They're famous for other kinds of algorithms like AlphaGo. It's interesting because that's also about intuition. So the board game Go, most popular game in the world, many people in the West haven't heard of it. But it's very, very popular in Asia. It's kind of like chess is in the West, in terms of popularity. 
Jon Krohn: 00:35:08
It's played by intuition. Geometrically, it's much more complex. So for example, in this protein structure case there's something called Levinthal's paradox, which is that the degrees of freedom in an amino acid chain mean that there's an astronomically large number of possible conformations. So there's 10 to the power of 300 possible shapes for a typical amino acid sequence. If you tried to compute it by brute force, by comparing all of those 10 to the 300 different possible conformations, it would take more time than we have in the universe on a computer. 
Jon Krohn: 00:35:49
So you have to have these kinds of intuitions that people use when they’re playing the Foldit game. In the same way that when you’re playing Go, you kind of, you play by intuition instead of having a lot of rigid rules. So you just kind of, “This feels right.” so those are the kinds of problems that we’ve had a really hard time training computers to play. I’ve been talking a lot Kirill but- 
Kirill Eremenko: 00:36:11
No, all good. All good. So AlphaFold is from Google DeepMind, right? 
Jon Krohn: 00:36:18
So AlphaFold, like AlphaGo is out of Google DeepMind. So DeepMind was an independent company. I think it was 2014 or something like around that- 
Kirill Eremenko: 00:36:27
Something like that. 
Jon Krohn: 00:36:28
… year, they were acquired by Google- 
Kirill Eremenko: 00:36:31
For $1 billion dollars. 
Jon Krohn: 00:36:32
… because they had some really cool… Yeah. Because they had a great Atari video game playing machine. 
Kirill Eremenko: 00:36:38
Mm-hmm (affirmative). 
Jon Krohn: 00:36:42
Anyway- 
Kirill Eremenko: 00:36:43
So yeah. So it took them two years and they managed to crack it? You didn’t say the score in 2018, CASP, what did they get? 
Jon Krohn: 00:36:52
In 2018 in CASP they got 60. So remember 2008, there was the best algorithm was 40. 2016, the best algorithm was 40. 2018 with this Google DeepMind team introducing AlphaFold the score was 60. So this is a huge improvement. 
Kirill Eremenko: 00:37:06
Yeah. 
Jon Krohn: 00:37:06
A 50% improvement after many, many years of almost no progress. The only thing that’s even crazier is that now in 2020 AlphaFold-2 obtained a score of 92.4. 
Kirill Eremenko: 00:37:24
Wow! 
Jon Krohn: 00:37:24
Which- 
Kirill Eremenko: 00:37:24
That’s crazy. 
Jon Krohn: 00:37:26
According to some reports I read that’s about the same as x-ray crystallography. Which is kind of the manual gold standard way of doing it. Not in a machine, but actually getting into the lab and very laboriously getting these results. So on moderately difficult proteins, AlphaFold-2 scored a 90 on average. Whereas other algorithms scored just 75- 
Kirill Eremenko: 00:37:52
40. 
Jon Krohn: 00:37:52
So yeah- 
Kirill Eremenko: 00:37:53
[crosstalk 00:37:53] 75. This crystallography, it costs like $120,000 and takes a year per protein to do. 
Jon Krohn: 00:38:03
Yeah. 
Kirill Eremenko: 00:38:04
Whereas now we have a machine learning thing. There are two implications from this. The first one is that if we have an algorithm that can predict how these proteins fold, that can potentially open up a plethora of applications and research. Like I saw briefly on the DeepMind website that, for instance, we could look into creating proteins that go into the atmosphere and collect carbon. We could create proteins that create a coating that prevents… I don't know, dirt sticking to buildings and windows. We could create proteins that go and fix stuff inside our body. We go from just a technology and bits and bytes, and… I don't know, electrons. We could start manipulating things with proteins. There's huge room for research there, and that could open it up. 
Kirill Eremenko: 00:39:04
But the second implication is that we're seeing one of the biggest grand challenges of biology being solved through machine learning. Right? That is one of the first massive examples where, yes, okay, we can apply machine learning to gaming. We can apply machine learning to marketing. We can apply machine learning to… I don't know, satellite imagery and self-driving cars, and things like that. Those are great. But here we saw machine learning actually being applied, by DeepMind in this case, to doing research, to solving a problem that humans haven't been able to solve for 50-plus years in research. Now it can be solved with artificial intelligence and machine learning. 
Kirill Eremenko: 00:39:47
That's going to open up a new opportunity as well for machine learning researchers to think about, "Oh, what else? Maybe we can apply machine learning to solve some physics problems. Maybe to create that unified theory of quantum mechanics versus special theory of relativity." Things that we as humans haven't been able to solve. Maybe that can be solved with AI. That's also very exciting. 
Jon Krohn: 00:40:16
Yeah, that’s a really good point. I mean, there’s definitely in terms of probably the number of data scientists employed in the world, you nailed it with, it is commercial applications like marketing and finance- 
Kirill Eremenko: 00:40:32
Yeah. 
Jon Krohn: 00:40:33
… video games, that kind of thing. But there are thousands, tens of thousands of research groups in academia that apply machine learning to medical problems, biological problems, chemistry problems, physics problems. So I guess it’s maybe the big thing here that’s really interesting is that a company like DeepMind, that is owned by an advertising company, that they’re given so many resources… and you do need a lot of resources to tackle these kinds of problems, by the way. So this is something that is worth mentioning quickly. Is that to train this model it takes several weeks on a machine with 128 TPU tensor processing units. So we’re going to talk about those more in the next segment. Tensor processing units, which are a specialized kind of GPU. But these are very powerful matrix multiplication machines basically. 
Jon Krohn: 00:41:38
That’s working with a training data set of 170,000 known proteins, where we know the structure. The reason why it is so computationally expensive, why we need so many GPUs to do this processing is because we use… This is one of the… I don’t want to say negative aspects, but people get really excited about deep learning. But typically deep learning requires very large datasets, long training times, very big models with a lot of parameters. It’s no different here. So they use deep learning specifically using something called attention. We’ll talk about attention later again, when we talk about GPT-3.
Jon Krohn: 00:42:20
Then after they do that deep learning, they also apply some refinements based on biological research. So they take into consideration evolutionarily related sequences, something called multiple sequence alignment. So anyway, the reason why I was bringing that up is this point that Google, an advertising company is willing to pay huge salaries to people at DeepMind, pay for very expensive compute resources to allow people to be doing this biological research. It is super cool. Yeah, the applications that you mentioned Kirill are… Yeah- 
Kirill Eremenko: 00:43:03
[crosstalk 00:43:03] possible applications. I was watching a video by Lex Fridman on this topic. He said that he thinks that potentially there'll be Nobel Prizes. Not necessarily for this solution, but for things that come out of it, like the things that [crosstalk 00:43:27] researchers can take and then create stuff with, and then get maybe a Nobel Prize one day. That'll be a huge testament as well. So we're entering a new era, right? Like 10 years from now, as we get closer to the technological singularity where AI becomes so powerful that it just creates Nobel Prize discoveries every single day, we're getting closer and closer to that already. We're starting to feel it. Very interesting. Anyway, we've been on this one for a while. Let's move on to topic number two. That was really cool. That was November. Yeah? 
Jon Krohn: 00:43:59
There’s just one last thing that I want to mention about this. 
Kirill Eremenko: 00:44:00
Sure, sure. 
Jon Krohn: 00:44:02
Which is that to give you a sense, so I said that there’s 170,000 known protein structures. That might sound like a lot. It might sound like we have a good understanding of a lot of the different kinds of proteins that exist in the world, but we’re not even close. So because we were relying on those expensive time-consuming methods, like x-ray crystallography before. Even though there’s been 170,000 identified, there are 180 million- 
Kirill Eremenko: 00:44:31
Wow. 
Jon Krohn: 00:44:32
… known proteins. 
Kirill Eremenko: 00:44:33
Wow. 
Jon Krohn: 00:44:33
That number goes up by many millions every year- 
Kirill Eremenko: 00:44:36
Wow. 
Jon Krohn: 00:44:37
… as we discover proteins in other forms of life on the planet. So- 
Kirill Eremenko: 00:44:40
So what? 0.1%, we’ve done 0.1%. 
Jon Krohn: 00:44:45
Right. Now all of a sudden, it’s going to be a bunch of TPUs cranking, I guess, at Google. 
Kirill Eremenko: 00:44:49
Yeah. 
Jon Krohn: 00:44:50
But we can uncover the structure of so many more proteins. But despite these big efforts on protein structure, there’s a huge limitation still today. It’s going to be a long time, I think before this is cracked. Which is that we talked about proteins doing all this work in bodies. Doing work isn’t something that a fixed structure does. The vast majority of these proteins they’re moving. So they have specific kinds of motion. There’s a fourth dimension that we need to be taking into account. That problem, we haven’t even begun to scratch the surface at all. 
Kirill Eremenko: 00:45:31
Wow. 
Jon Krohn: 00:45:31
So exciting things to come maybe over the coming decade. 
Kirill Eremenko: 00:45:37
Yeah. Quantum computing. Maybe we’ll solve that. Yeah. Awesome. Okay- 
Jon Krohn: 00:45:45
Anyway, we’ve talked about this for so long. Yeah, we should get onto GPUs. 
Kirill Eremenko: 00:45:49
Yeah. Let's move on to the next one. So the next topic is GPUs. Right? So Jon, am I getting this right, that when you recommended that topic you had in mind the NVIDIA GeForce RTX 3080 and 3090 that got released recently? 
Jon Krohn: 00:46:03
Yeah. I mean, kind of. Maybe I was inspired by the recent release of those NVIDIA chips. The thing that I wanted to talk about, because we're talking about the year in retrospect, is just this general idea of GPUs and how much more powerful they're getting, how quickly that's happening, this kind of phenomenon. A few years ago as a data scientist, you didn't need to do any work on a GPU, but today, if you're going to be using deep learning models that are anywhere near the size of the state of the art in almost any field, you're going to be using at least one GPU, in a lot of cases several. This is this big trend that's been happening. Yeah, you're right that these recent releases were on my mind, but yeah, there's a few things I wanted to talk about on this topic. I'll open the floor to you, because I've been talking a lot. 
Kirill Eremenko: 00:47:16
Okay. Well, all I have on GPUs is, so I was chatting with my friend from Melbourne, David, and he was super excited, because he got his hands on a GeForce 3080. Basically, these graphics cards, or GPUs, were released in October by NVIDIA. You actually had to get onto a wait list and pay in advance to get one. They're mostly used for gaming because they allow us to process graphics and- 
Jon Krohn: 00:47:57
And crypto mining. 
Kirill Eremenko: 00:48:00
– and crypto mining. Yeah, in some cases, I guess. Yeah. 
Jon Krohn: 00:48:04
Yeah. A few years ago, I guess two years ago, three years ago, the first time that Bitcoin prices skyrocketed, NVIDIA GPUs became extremely expensive and, in stores, extremely difficult to find. I was building deep learning servers, and I wanted to install GPUs in them, and I couldn't buy GPUs. At the time, it was the GTX 1080 Ti, that was the big one that everyone wanted, and I could not buy them anywhere. 
Jon Krohn: 00:48:34
When I did find them, I’d have to get in a taxi to a far part of Brooklyn, from Manhattan, and then you’d get in the store and they’d have, limit one per customer. That was because of Bitcoin money. While graphics cards grew up for the purpose of rendering graphics for video games, I think they’re disproportionately used today for Bitcoin mining. 
Kirill Eremenko: 00:49:00
That and deep learning as well. Right? 
Jon Krohn: 00:49:05
Yeah, and deep learning. It’s probably a distant third, but for our audience, the most important use case. 
Kirill Eremenko: 00:49:12
Yeah, absolutely. Yeah, he got his hands on one of these. It's really cool. He's loving it. Yeah, also they're used for deep learning because you need to process many operations. Like, when you have a neural net, you have lots of simple operations, but you have tons of them, depending on how big your neural network is, how many weights, how many layers and so on. You have lots of operations. A CPU can crank out more complex tasks, but a GPU is more powerful for deep learning, because you can get lots of operations happening simultaneously and thereby calculate your weights faster. That's how I understand it. Is that correct? 
Jon Krohn: 00:50:04
Yeah, spot on. 
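To make the CPU-versus-GPU point concrete, here is a minimal PyTorch sketch that times the same large matrix multiplication on each device; the matrix sizes are arbitrary and the speedup you observe depends entirely on the hardware:

```python
# Minimal sketch: the same large matrix multiplication on the CPU and (if available) a GPU.
# Matrix sizes are arbitrary; real speedups depend on the hardware in use.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.time()
c_cpu = a @ b                              # runs on the CPU
print(f"CPU matmul: {time.time() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()      # copy the operands onto the GPU
    torch.cuda.synchronize()               # ensure the copies finish before timing
    start = time.time()
    c_gpu = a_gpu @ b_gpu                  # thousands of multiply-adds execute in parallel
    torch.cuda.synchronize()               # wait for the kernel to complete
    print(f"GPU matmul: {time.time() - start:.3f} s")
```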
Kirill Eremenko: 00:50:06
Okay. What have been the developments in 2020, in the GPU space? 
Jon Krohn: 00:50:12
To give a sense of perspective, that GTX 1080 Ti, that I had to spend about a grand on and tear out of the hands of some Bitcoin miner in Brooklyn, that had 11 gigs of memory on it. Now these latest ones, these 3090s, they have more than double the memory, so 24 gigs of memory. They cost a little bit more, so on top of being on that wait list and having to apply, they’re about 1500 bucks. Having that more than double memory is critical, because it means that you can train a model with more than twice as many weights in it. This is like a key limitation on these GPUs, is how much onboard memory you have, because that specifically limits how big the model that you can put on it, is. 
Kirill Eremenko: 00:51:02
What was it, sorry, the number of cores, and the amount of memory? The number of cores determines how fast it can be trained, and the amount of memory determines how big the model that you input can be. 
Jon Krohn: 00:51:19
Yeah, exactly. Actually in terms of speed, there isn’t that much of a difference over the last few years. That 1080 Ti, a few years ago, that I bought is 1.5 gigahertz, and then the latest one [inaudible 00:51:34] 1.7. I mean, that’s still, percentage wise over just a few years, you’re talking about 10 or 15%, but the memory increase is huge. That’s just talking about the kinds of servers that you would put in a home computer. If you’re talking about a big server rack, you can put in way more powerful GPUs. The state-of-the-art from NVIDIA is this A100 GPU, which has 80 gigs of onboard memory, but I mean, it’s not for the faint of heart. It’s not for the hobbyist data scientist, that’s for sure, because one of those costs $12,500, just for one. 
Kirill Eremenko: 00:52:20
Wow. 
Jon Krohn: 00:52:22
This was the kind of general trend that I wanted to talk about: these GPUs have become so important that that kind of cost is worthwhile to enough people that NVIDIA makes these, because by being able to train those big models, the commercial utility to somebody is even bigger than that $12,500 cost, and so it's worthwhile. The very specific thing that made me interested in talking about this topic is something that NVIDIA makes, called the DGX Workstation, which has four of these A100 GPUs.
Jon Krohn: 00:53:02
Each one of those is 80 gigs of memory, so it has a total of 320 gigs of GPU memory. It also has 512 gigs of system memory, a 64-core CPU, and it comes in a 91-pound package. 
Kirill Eremenko: 00:53:22
Wow. 
Jon Krohn: 00:53:22
All together in one place. Guess how much they charge for that? Remember that the individual, each one of the four GPUs is $12,000, so how much do you think NVIDIA would charge for this [crosstalk 00:07:35]?
Kirill Eremenko: 00:53:35
$50,000? 
Jon Krohn: 00:53:37
$200,000! 
Kirill Eremenko: 00:53:39
No way! 
Jon Krohn: 00:53:41
$200,000. Yeah, it’s crazy. 
Kirill Eremenko: 00:53:43
That's crazy, man. What I don't get, Jon, is why would you set it up at home or even at your office, as a company, versus using online solutions like AWS or Oracle Cloud or something. Why go through all this effort, especially if it's going to depreciate over a few years, and you're going to need to buy a new one, as opposed to just going and setting up an AWS instance for whatever size you need?
Jon Krohn: 00:54:13
Yeah, that's a great question. I can share, for people who are interested, in the show notes my PCPartPicker build for the builds that I made, and you can get a sense of the price. At the time, those GTX with … I'd get two of those into one of mine. Those were about a thousand each, so $2,000 for the GPUs. The whole rig would cost about $4,000. Today, if I was to build the same rig, it would be cheaper, because the GPUs would be a lot cheaper if I was using the same ones as a few years ago. I could probably build it for about $2,500. We did the math of, what is that going to cost me, $2,500 as a one-off fixed cost, as opposed to the cost of paying to be training in the cloud on an ongoing basis, and with the models that we were training at untapt, with the size of them and how often we knew we were going to be retraining them, it was a no-brainer. It paid for itself in about three months. 
Kirill Eremenko: 00:55:12
Wow, okay. That’s another scale as a data scientist you can have. You can be like an accountant kind of scaled to portfolios and understanding what are the cost benefits of either using AWS or having an onsite on-premise server set up. I think that’d be valuable to any company, especially smaller businesses who need to be careful about the cost. 
Jon Krohn: 00:55:41
Yeah, depending on exactly what kinds of models you’re working on, the cost benefit will swing strongly one way or the other. In our case, these relatively small models today where I can fit it on two GPUs, which is still huge in historical terms, but that we’re training those models continuously, that’s one kind of situation where then you’re going to want to buy the server yourself. The other kind of situation is where you have an absolutely enormous model. I was talking about the protein folding model, AlphaFold2, that was trained on 128 TPUs, Tensor Processing Units, which are a Google specialized kind of GPU specifically for dealing with deep learning models, training deep learning models. 
Jon Krohn: 00:56:32
If you were going to be training that kind of model, so if you’re a small business and you’re like, “Wow, we have this use case where we’re going to need to train 128 GPUs,” well, then all of a sudden, it probably would tell you, that’s a huge investment to buy a model of that size, machines of that size. You might say, “Well, we’re only going to need it for a week, and then we’ll have trained the model and we’re done.” Then in that case, you’re much, much better off using cloud scaling to do that. 
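As a back-of-the-envelope way to frame that build-versus-cloud decision, here is a tiny breakeven sketch; every dollar figure and usage number below is a hypothetical placeholder, not a figure from the episode:

```python
# Back-of-the-envelope breakeven estimate: buying a GPU workstation vs. renting cloud GPUs.
# All figures are hypothetical placeholders, not actual prices quoted in the episode.
workstation_cost = 2500.0          # one-off cost of an on-premise deep learning rig (USD)
cloud_rate_per_hour = 1.20         # assumed hourly rate for a comparable cloud GPU instance (USD)
training_hours_per_month = 700     # assumed GPU-hours of training per month

monthly_cloud_cost = cloud_rate_per_hour * training_hours_per_month
breakeven_months = workstation_cost / monthly_cloud_cost

print(f"Cloud cost per month: ${monthly_cloud_cost:,.0f}")
print(f"Workstation pays for itself in about {breakeven_months:.1f} months")
```

With heavy, continuous training the one-off purchase wins quickly; with occasional short bursts of training, the cloud usually comes out ahead.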
Kirill Eremenko: 00:57:02
Yeah. Yeah, absolutely. You mentioned you had two GPUs in your setup, $1,000 each. I think a few words are worth mentioning on parallelizing, or using two in combination. Let's say you have two GPUs with 10 gigabytes of memory each, does that necessarily mean that you can load a 20-gigabyte model into them? How does this setup with parallelizing work? 
Jon Krohn: 00:57:35
Yeah, great question. The modern deep learning libraries, or the modern… they're actually automatic differentiation libraries. They're designed for doing partial derivative calculus, but we think of them as deep learning libraries, like TensorFlow and PyTorch. The modern versions of those libraries allow you to split parts of your computational graph, so different parts of your model, across multiple GPUs. That's a relatively recent idea over the last few years. Up until probably just two years ago, three years ago, you didn't use multiple GPUs to split your model. You would only use multiple GPUs to split your training data set. Yeah, so that's a really great question, and it's something that has really just become possible, or easy, recently with these libraries, like TensorFlow 2 and PyTorch. 
Kirill Eremenko: 00:58:33
What I read is that if you just blindly take the two GPUs and train your model on them, then what will happen is your model will be copied onto the first GPU and onto the second GPU, and then it will be trained in parallel, so the training data set will be split: half here, half here. If you actually want to use them in a different way, you need to think through the architecture of your deep learning model. For instance, you create eight layers, then you take the first four layers and you load them onto the first GPU. Then the result from that goes into the second GPU, and the second half of the deep learning model is there. That way you're using the memory not just to copy-paste the same content, but you can actually load a bigger model into the whole system. But you really need to think through the architecture and how you're going to split the workload between the two GPUs. I found that very interesting. 
Jon Krohn: 00:59:38
Yeah. That was a great summary of what’s possible. Of course, to go one little step further is that you’re not constrained even to the number of GPUs on a single machine, on say a single server. There’s not a single machine at Google DeepMind, where they have 128 GPUs plugged into one CPU. They have many servers running, I assume. Don’t actually quote me on that, but that would be all what you would do, I assume, every time. I’m not aware of it, of [crosstalk 01:00:10] situation where it’s one server with 128 GPUs. 
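As a rough illustration of the layer-splitting idea just described, here is a minimal PyTorch sketch of model parallelism across two GPUs; the layer sizes and device indices are illustrative only, and it assumes two CUDA devices are available:

```python
# Minimal model-parallelism sketch: the first half of the network lives on GPU 0,
# the second half on GPU 1, and activations are moved between devices in forward().
# Layer sizes and device indices are illustrative only; requires two CUDA devices.
import torch
import torch.nn as nn

class TwoGpuNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the layers on the first GPU
        self.part1 = nn.Sequential(
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
        ).to("cuda:0")
        # Second half of the layers on the second GPU
        self.part2 = nn.Sequential(
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, 10),
        ).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))   # compute the first half on GPU 0
        x = self.part2(x.to("cuda:1"))   # move activations to GPU 1 and finish there
        return x

model = TwoGpuNet()
batch = torch.randn(32, 1024)            # a dummy batch of inputs
output = model(batch)                    # the output tensor ends up on cuda:1
print(output.shape, output.device)
```

This is distinct from data parallelism, where the whole model is copied to each GPU and only the training batch is split between them.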
Kirill Eremenko: 01:00:17
Got you. Okay, so that’s GPUs. Anything else to add on there? 
Jon Krohn: 01:00:21
No, no, no, that’s it. 
Kirill Eremenko: 01:00:24
Okay. Okay, cool. Thank you. Final topic, GPT-3, exciting. It kind of combines a lot of things that we already spoke about, such as calculations, GPUs, interconnectivity on the internet and things like that. I’ll let you take it away. What’s GPT-3 all about? 
Jon Krohn: 01:00:49
Yeah. I’m going to start off with a poem. 
Kirill Eremenko: 01:00:52
Let’s do it. 
Jon Krohn: 01:00:53
Okay. The SEC said, “Musk, your Tweets are a blight. They really could cost you your job if you don’t stop all this Tweeting at night.” Then Musk cried, “Why? The Tweets I wrote are not mean, I don’t use all caps, and I’m not sure that my Tweets are clean.” “But your Tweets can move markets, and that’s why we’re sore. You may be a genius and a billionaire, but that doesn’t give you the right to be a bore.” 
Kirill Eremenko: 01:01:24
Did you write that one up? 
Jon Krohn: 01:01:28
That poem was written by a machine. 
Kirill Eremenko: 01:01:30
A machine? No way! 
Jon Krohn: 01:01:32
That’s by GPT-3. 
Kirill Eremenko: 01:01:34
No way, that’s crazy. 
Jon Krohn: 01:01:38
Yeah. 
Kirill Eremenko: 01:01:39
Wow, what input did it get? 
Jon Krohn: 01:01:44
It was something like “Elon Musk poem,” I think, was the input, and then it gave that as the output. I didn’t write it. I got it from an article in The Economist. So this model, GPT-3, stands for Generative Pre-trained Transformer 3, and it’s out of a research outfit called OpenAI, which used to be a nonprofit but now does commercial work too. Elon Musk is the most well-known backer of OpenAI. They’ve made a lot of different models over the years, and typically they open source them, but this one, GPT-3, they have not open sourced. They say that it’s because they think that in the wrong hands, it could do a lot of bad things. I think it’s also because they want to make money off of it. 
Kirill Eremenko: 01:02:43
Then they go and give an exclusive license to Microsoft. Well, sell it, I guess [crosstalk 01:02:49]. 
Jon Krohn: 01:02:49
Oh yeah, I didn’t know that. 
Kirill Eremenko: 01:02:50
Yeah, Microsoft has been bragging about having an exclusive license to GPT-3. They’ll keep the API open so anybody can use it, but “in the wrong hands” it can do the wrong things, so let’s give it to Microsoft. 
Jon Krohn: 01:03:07
Yeah. I mean, they’re like the good guy among the big, bad tech companies these days, aren’t they? In the nineties, it was a completely different story, but it’s interesting now. You never see Microsoft mentioned when people talk about governments looking at monopolies or breaking up big tech, which is interesting, even though they are actually as big as any of the other big tech companies today. No, that’s super interesting. I didn’t know that. 
Kirill Eremenko: 01:03:38
Okay, well, what is GPT-3? You gave us what the abbreviation means, but what is inside? I know we can’t see it, because it’s not open source. 
Jon Krohn: 01:03:54
Yeah. So, Generative Pre-trained Transformer, we can break down each of the three letters of GPT. Generative means that it generates some output. If you have generative adversarial networks, for example, those create images, so they’re generative models. They produce something. In this case, it typically produces text. Pre-trained means that it’s already trained on a very large set of data. I believe it’s trained on all of the English-language internet, but I actually didn’t write that down, so listeners shouldn’t quote me on saying that. 
Kirill Eremenko: 01:04:37
They’re trained on lots of stuff like Wikipedia and the internet, plus two different huge corpuses of books. Moreover, to your point that it’s generative, that it generates data, I also read that when they combined generative pre-training, they not only used labeled data, they actually got the algorithm to label unlabeled data and then train itself on that data [crosstalk 01:05:13]. 
Jon Krohn: 01:05:13
Semi-supervised, semi-supervised learning, yeah. 
Kirill Eremenko: 01:05:16
That is crazy. 
Jon Krohn: 01:05:19
Yeah. I didn’t know about that semi-supervised part. I’m really glad to know that. That would help generate a lot more data. Tricky to get right, but when you get it right, it can be very powerful. The last letter, T, Transformer, is perhaps the most interesting of G, P and T. Transformer architectures came into the limelight two or three years ago with BERT, which was kind of the first big, famous Transformer architecture. These are like recurrent neural networks, which have been around for 20 years now but became prominent in the last few years. These are kinds of neural networks that deal with sequences of information, so very often natural language data, written words or spoken words, but they would also technically work on financial time series data, any kind of sequential data that occurs in one dimension. What makes these Transformer architectures so powerful is that they have something called self-attention, which allows the model weights to attend to particular parts of sequences of language and really emphasize those. 
Jon Krohn: 01:06:48
It allows the model to look back much earlier in a sentence or a paragraph, so by attending to two different parts of a long sentence or a paragraph, you can tie those two together. Up until these kinds of Transformer architectures, if you had a sentence like, “Kirill is hosting the podcast. He also runs SuperDataScience,” the early sequential models, the early recurrent neural networks, wouldn’t have been able to make that connection between Kirill and he. But with these Transformers, with these self-attention mechanisms, we get very long sequences where that same concept, the idea of Kirill being the main male subject in the text, can be held for long, long stretches, which allows it to, for example, write an entire poem about Elon Musk. 
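As a rough illustration of the self-attention mechanism Jon describes, here is a minimal sketch of scaled dot-product attention in PyTorch. The token count, embedding size, and random projection matrices are all made up for readability; real Transformers learn these projections and stack many attention heads and layers.

```python
import torch
import torch.nn.functional as F

tokens = torch.randn(6, 16)   # 6 tokens (e.g. "Kirill is hosting the podcast. He ..."), 16-dim embeddings
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))  # stand-ins for learned projections

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / (16 ** 0.5)        # how strongly each token attends to every other token
weights = F.softmax(scores, dim=-1)   # attention weights; each row sums to 1
context = weights @ V                 # each token's new representation mixes in what it attended to
print(weights.shape, context.shape)   # torch.Size([6, 6]) torch.Size([6, 16])
```

That 6-by-6 weight matrix is what lets a token like “He” put high weight on “Kirill” even when the two words are far apart in the sequence.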
Kirill Eremenko: 01:07:56
Wow, thanks for explaining that. That’s really interesting. 
Jon Krohn: 01:08:01
For the viewer, we can put a link in the show notes to a blog post by a gentleman named Jay Alammar. He wrote a post called The Illustrated Transformer, which I thought gives a really good introduction to the topic. I love it, because I wrote a book called Deep Learning Illustrated, so I love illustrations of deep learning and data science concepts, and I think he did a really great job in that blog post. The big trend in these Transformers, so I said we’ve had them for a few years now, so BERT was a famous early one. It led to another one called ELMo. It also led to RoBERTa, and with those kinds of models, we were seeing bigger and bigger numbers of parameters. The biggest BERT model has 340 million parameters, which is huge, absolutely enormous. 
Kirill Eremenko: 01:09:03
What are these parameters? 
Jon Krohn: 01:09:05
Yeah. So, the simplest kinds of models in machine learning and data science are probably regression models. So- 
Kirill Eremenko: 01:09:18
Mm-hmm (affirmative). 
Jon Krohn: 01:09:20
… the very, very simplest kind of regression model would just be trying to fit a line to points on a two-dimensional grid. In order to do that, you need the slope of the line, and then you need to know how high the line should be on the grid, something called the y-intercept. So that very, very simple regression model has two parameters: the slope of the line and the height of the line. 
Kirill Eremenko: 01:09:45
Mm-hmm (affirmative). 
Jon Krohn: 01:09:48
So, these parameters are numbers. They’re just numbers that are learned based on training data. In that very simple example, the slope of the line and how high the line sits are based on whatever dots you have on your two-dimensional grid, and it tries to fit those dots to the best of its ability. When we have a much more complicated problem, like write a poem about Elon Musk, we can’t do that with two parameters. We need many more parameters to do well. And so, these Transformer architectures like BERT had 340 million different parameters to allow [crosstalk 01:10:31]. 
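For readers who want to see the two-parameter example concretely, here is a tiny sketch in Python; the data points are invented purely for illustration.

```python
import numpy as np

# Fit a line y = slope * x + intercept to a handful of made-up points.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

slope, intercept = np.polyfit(x, y, deg=1)  # the model's only two learned parameters
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```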
Kirill Eremenko: 01:10:32
So parameters, in a deep learning setup, are they like weights? 
Jon Krohn: 01:10:38
Yeah, exactly. So, in neural network models, including deep learning models, we have two different kinds of parameters: weights and biases. The vast, vast, vast majority, especially in large architectures, are the weights, but the biases play a critical role as well. In fact, going right back to that line analogy, the weights are all like the line slopes. They determine the slope of the relationship between two variables, in a way. And the biases determine those kinds of vertical offsets for the numbers. 
Kirill Eremenko: 01:11:19
Okay. So, when you say BERT had 340 million parameters, it’s just like we can imagine a neural network with 340 million weights, with synapses. So, about the same order of magnitude number of nodes in the neural network? 
Jon Krohn: 01:11:43
No, there would be far fewer nodes than weights, because- 
Kirill Eremenko: 01:11:48
Oh, that’s right. What am I thinking? Yeah, because they are interconnected. Yeah, got you. 
Jon Krohn: 01:11:55
Yeah, exactly. So, maybe there would be something like 300,000 or three million nodes or something like this, maybe 10 million. I don’t know exactly. 
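A quick way to see why there are far more weights than nodes is to count the parameters in a small, fully connected network; the layer sizes below are arbitrary and purely illustrative.

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10),
)
n_params = sum(p.numel() for p in net.parameters())  # weights + biases
n_nodes = 1000 + 1000 + 1000 + 10                    # input units plus units in each layer
print(n_params, n_nodes)  # roughly 2 million parameters vs. about 3,000 nodes
```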
Kirill Eremenko: 01:12:06
Okay. Got you. So, that was BERT. And, then how did that evolve? 
Jon Krohn: 01:12:11
So, there are these two directions tugging. On the one hand, there are people building bigger and bigger models, because those bigger models tend to get slightly better state-of-the-art results. 
Kirill Eremenko: 01:12:21
Mm-hmm (affirmative). 
Jon Krohn: 01:12:23
But, there’s also a school of thought that says, “This is crazy. These models are getting way too big. They take too long to train. They’re too expensive to train. In production, they take too long to run.” And so, there’s this other direction where people are making some of these architectures smaller. There’s one called DistilBERT, which has about a fifth of the number of parameters of the biggest BERT, 66 million instead of 340 million, and it has comparable performance. So, people are trying to come up with ways of arranging these architectures so that they perform really well, even if they’re not quite as big. 
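If you want to check those parameter counts yourself, a short sketch using the Hugging Face transformers library is below. It downloads the standard hub checkpoints (bert-large-uncased, which is the 340-million-parameter BERT Jon mentions, and distilbert-base-uncased), so it assumes the library is installed and you are online.

```python
from transformers import AutoModel

for name in ["bert-large-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)              # downloads the pretrained checkpoint
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")   # roughly 335M vs. 66M
```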
Jon Krohn: 01:13:03
Anyway, these GPT models from OpenAI are definitely, firmly in the camp of getting bigger and bigger and bigger. GPT-2, the predecessor of GPT-3, which came out a couple of years ago, had 1.5 billion parameters, so several times more than the biggest BERT. But that is nothing, nothing at all compared to GPT-3, which has 175 billion parameters. So, if you were impressed by that jump over a couple of years from BERT to GPT-2, there are now more than a hundred times as many parameters in GPT-3. 
Kirill Eremenko: 01:13:52
That’s crazy. 
Jon Krohn: 01:13:54
Yeah, so really crazy. And, you can experience it for yourself as a listener. Well, GPT-3 isn’t open sourced, only Bill Gates has access to it, personally. But the biggest open source model, if I remember correctly, is out of Facebook, and it’s called Megatron. And- 
Kirill Eremenko: 01:14:18
Megatron is NVIDIA. 
Jon Krohn: 01:14:20
It’s NVIDIA. 
Kirill Eremenko: 01:14:22
Yeah. 
Jon Krohn: 01:14:22
I’m glad that you were able to correct me on that. [crosstalk 00:05:25]. 
Kirill Eremenko: 01:14:25
8.3 billion parameters. 
Jon Krohn: 01:14:28
Which is still absolutely huge. If it wasn’t for the existence of something like GPT-3, you would assume it was the biggest thing around. And, you can use it at something called talktotransformer.com. You can go there and enter something like, “Elon Musk poem,” and see what comes up. 
Kirill Eremenko: 01:14:49
Yeah. 
Jon Krohn: 01:14:50
Yeah. You can experience what these architectures are like for yourself. So, the main big thing that was supposed to be really exciting about GPT-3 is that it allows things like what they call zero-shot learning. Without any training on data for a specific task, you can give it a task and it does it. You can say, “Translate English to French: cheese,” and it will return to you, fromage. 
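At the time of this episode, GPT-3 was reachable only through OpenAI’s hosted API. A hedged sketch of what that zero-shot translation call looked like with the original Completion endpoint is below; the engine name, client interface, and exact output have changed since and may change again, and the API key is a placeholder.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; requires access to the OpenAI API

response = openai.Completion.create(
    engine="davinci",                           # the original GPT-3 engine name at launch
    prompt="Translate English to French:\ncheese =>",
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].text.strip())         # expected: something like "fromage"
```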
Jon Krohn: 01:15:22
So I don’t know, that was supposed to be really exciting. And, there are tons of applications. You might have noted some yourself, Kirill, but I just have a quick list here. It can generate code, it can create a simple user interface from a description of the user interface, it can create regexes, it can generate a plot, it can generate quiz questions on some topic. If you’re familiar with neural networks and deep learning, it can even write Keras code for the layers of a neural network, and even do the dataset imports, so you could say, “I want the MNIST handwritten digits as an input, and I want 10 ReLU layers, all densely connected,” and GPT-3 could create the Keras code for you. So yeah, that’s kind of cool. I don’t know if you came across other applications. 
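As a rough idea of what “Keras code for 10 densely connected ReLU layers on MNIST” might look like, here is a plausible sketch. It is not GPT-3’s actual output, and the layer width of 64 units is an arbitrary choice.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load the MNIST handwritten digits and flatten each 28x28 image into a 784-dim vector.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0

model = keras.Sequential()
model.add(layers.Dense(64, activation="relu", input_shape=(784,)))  # first of 10 dense ReLU layers
for _ in range(9):
    model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))                   # one output per digit class

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
```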
Kirill Eremenko: 01:16:15
Yeah, I got a few as well. You mentioned translation. You can also integrate it with Excel [crosstalk 01:16:25], based on a few examples. For instance, you could have two columns, like country and population, and you could have United States with its population number, and Brazil with its population number. Then you enter a third country, and instead of typing in a population, you call GPT-3 and give it everything above as the input: the names of the columns, plus the two rows that you already have, plus the name of the country that you’re putting in. It will automatically understand that you are building a table with columns and rows about countries and their populations. 
Kirill Eremenko: 01:17:08
And, from the information that it has, it will find the population of the new country that you added. It can write SQL queries for you. It can also work with images, because of the way it was trained: it went through a lot of internet data, and some images there are stored, I think, in SVG format, which can be read as text. So, it can actually draw simple images. I think the main thing it comes down to is that, right now, there exists on the planet this GPT-3 algorithm that is all pre-trained, and you don’t have to do anything to it. It’s been pre-trained based on what you described at the start of the podcast: we live in a time of interconnectivity, where data allows our species to share information. This is kind of starting to become like a sci-fi, Terminator type of thing. 
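The Excel-style trick Kirill describes just above is what is usually called few-shot prompting: a couple of completed rows teach the model the pattern, and it fills in the rest. A sketch of such a prompt is below; the population figures are approximate and the completion shown is only what you would typically expect, not a guaranteed output.

```python
# Few-shot, table-style prompt: two completed rows establish the pattern.
prompt = (
    "Country | Population\n"
    "United States | 331,000,000\n"
    "Brazil | 212,000,000\n"
    "France |"
)
# Sending this prompt to the GPT-3 completion API (see the earlier sketch) would
# typically return something like " 67,000,000" as the continuation.
```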
Kirill Eremenko: 01:18:09
GPT-3 should probably be called Skynet or something like that. It’s been pre-trained, and there’s a huge amount of data that it’s seen. It can answer almost any question. It can help you create things like code and apps, and so on. I even watched a video, which we’ll link to in the show notes, called What it’s like to be a machine. This guy uses the GPT-3 API to talk with it, asking questions, and it comes back in text. But then, what he did is use some avatar simulations to come up with a 3D avatar for GPT-3, and he edited the video so that every time it answers, it’s this avatar answering. So it’s actually him talking to this machine, asking questions like, “Are you sentient?” 
Kirill Eremenko: 01:19:06
“Yes, I’m sentient.” “Do you have feelings?” “Yes, I do.” It actually talks like a machine. And, the interesting thing that I also found in that same video is that GPT-3 is actually able to make jokes, and it’s able to lie. It will tell you false information, and when you say, “But that’s wrong,” it turns out it was lying to you knowingly. It knows when it’s telling you the wrong information. So, you’ve got to be careful about what you ask it, because it can either make a joke, and after you probe it a bit more, you’ll see that it’s a joke, or it can lie to you, and when you ask it, “Why are you lying?” it’ll tell you something like, “It’s in my best interest to be lying to you right now.” 
Kirill Eremenko: 01:19:47
So, some crazy stuff is going on with this GPT-3 thing, but I think it could be very useful. Another thing I found is that, in order to train it to the level where it’s at right now, Google… Ah no, it’s not Google. OpenAI had to run it for a thousand petaflop/s-days. A petaflop/s-day is about 10 to the power of 15 neural network operations per second, sustained for a full day, and they needed about a thousand of those. So basically, there’s no individual human on the planet who has the resources to train it to such a level. So, it’s really cool that it’s pre-trained and we can just go ahead and use it. And yeah, the applications can be outstanding. 
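For anyone who wants the back-of-the-envelope arithmetic behind “a thousand petaflop/s-days,” here it is as a few lines of Python.

```python
ops_per_second = 1e15            # one petaflop/s = 10^15 operations per second
seconds_per_day = 24 * 60 * 60
total_ops = ops_per_second * seconds_per_day * 1000   # a thousand petaflop/s-days
print(f"{total_ops:.2e} operations")                   # about 8.64e+22 operations in total
```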
Jon Krohn: 01:20:37
Yep. But it is still really, really dumb. We should be very clear to the listeners that all of the things we’ve said probably sound really amazing and really scary, but in fact, it doesn’t have an understanding of anything. It’s just a large number of model weights. All that it does is predict the probability of the next word in a sequence. It just says, “Okay, given the last words that have occurred, what is a high-probability next word?” And then, on average, it’ll give you the most probable next word. It’s very simple in that respect. It can’t think or reason in any way. And, to give you a sense of how dumb it is, you can see lots of very simple examples online. 
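To make “it just predicts the next word” concrete, here is a toy sketch of the final step of a language model. The vocabulary and the scores are made up; a real model produces one score per word over a vocabulary of tens of thousands.

```python
import torch
import torch.nn.functional as F

vocab = ["the", "podcast", "poem", "Musk", "cheese"]
logits = torch.tensor([1.2, 0.3, 2.7, 2.5, -1.0])  # hypothetical scores for the next word

probs = F.softmax(logits, dim=-1)                  # turn scores into probabilities
next_word = vocab[int(torch.argmax(probs))]        # pick the most probable continuation
print(next_word, [round(p, 2) for p in probs.tolist()])
```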
Jon Krohn: 01:21:53
For all of these really great examples that we’ve been talking about on the show, you could start with the same starting point, the same seed phrase to get it going, and end up getting results that are nonsense. You can experience that yourself if you go to talktotransformer.com. One fun thing that I like to do is take my LinkedIn summary of myself, a five-bullet-point bio, put that in, and then see what the Transformer architecture spits out. Sometimes it’ll come up with something where I’m like, “Wow, that is eerily like my career or the kinds of things that I’d like to do.” And then, other times it outputs complete nonsense. And so, the Guardian newspaper in the UK, which is a very popular newspaper around the world, put out an op-ed piece, an opinion piece, titled “A robot wrote this entire article. Are you scared yet, human?” 
Jon Krohn: 01:22:51
And, along the same lines as that conversation you described with the avatar, where you ask, “Are you a person?” and “Do you have feelings?” and it says, “Yes,” there’s this op-ed that seems really convincing, really compelling, very well written. But, when you read the fine print at the bottom, it says that the first-paragraph prompt was written by a person, and that first paragraph is one of the best paragraphs, the one that really pulls you in. 
Jon Krohn: 01:23:27
But, that was written by a person. Then, based on that, they ran GPT-3 eight times, so they got eight different outputs. And then they cherry-picked: they copied and pasted the best parts from each of the eight and made the op-ed from that. So, it goes to show that as amazing as it is, in a lot of situations it wouldn’t convince you that it’s actually a sentient being. 
Kirill Eremenko: 01:24:03
Thanks, Jon. I was getting really hyped up about it. Thanks for putting it into perspective, explaining how it actually works in the background and that it’s not at the level of reasoning yet. When do you think we’ll get to a level, GPT-4, GPT-5, a machine that is able to convince you every time that it can reason and that it’s intelligent? 
Jon Krohn: 01:24:34
I don’t know, Kirill. I mean, I don’t think it’s going to be like GPT-4, GPT-5. I’m a bit skeptical on this point, actually. 
Kirill Eremenko: 01:24:40
Mm-hmm (affirmative). 
Jon Krohn: 01:24:43
I think a machine that is able to behave like a general artificial intelligence, that could perform any kind of natural language task like you and I can, like host a podcast or read a book, I think we’re a long way away from that. Because we’re limited right now: these kinds of Transformer architectures require everything to be differentiable. In order to be able to train all the parameters, you need to be able to apply partial derivative calculus over the training data. And, in order to make the big next leap, we’re going to need to attach non-differentiable stores of information to those kinds of differentiable learning algorithms. 
Jon Krohn: 01:25:46
So for example, GPT-3 doesn’t have a set of strong factual relationships between entities, like you have as a human reader of Wikipedia. It doesn’t develop those factual connections, it just develops probabilistic connections. That’s one of the big leaps that needs to be figured out, and I think it’s going to be some time before it is. 
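Jon’s point about differentiability is easiest to see with a tiny autograd example: training works because the loss has a gradient with respect to every parameter, whereas a symbolic store of facts does not. The numbers below are arbitrary.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)     # one trainable parameter
x, y_true = torch.tensor(3.0), torch.tensor(6.0)

loss = (w * x - y_true) ** 2                  # a differentiable loss
loss.backward()                               # autograd computes d(loss)/d(w)
print(w.grad)                                 # this gradient is what drives learning

# A lookup table of facts, e.g. facts["Kirill"] = "hosts the podcast", has no gradient,
# which is the kind of non-differentiable knowledge store Jon says still needs to be attached.
```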
Kirill Eremenko: 01:26:15
Mm-hmm (affirmative). Gotcha. Awesome. Thank you. That was very interesting, chatting about GPT-3. And that brings us to the end of this episode. 2020 was an interesting year to review, a difficult year, but with interesting technological advancements. Anything you wanted to add before we wrap up? 
Jon Krohn: 01:26:42
Yeah. I mean, I guess I should have mentioned, I can’t believe I didn’t even think of COVID, which, if we’re doing a year in review, we probably should have mentioned in some way. And actually, the very first topic that we had today, AlphaFold 2, an application of these automated protein-structure-predicting algorithms is that they can, for example, predict the shape of the spike protein that allows the COVID virus to attack human cells. It’s just an example of how the kinds of work we’re doing today are relevant to the kinds of problems we’re facing today. We could have touched on that a bit more, but no, I have nothing else to say. Yeah, definitely 2020 has been a weird year, Kirill. 
Jon Krohn: 01:27:41
It was the year that never was, for a lot of people. And yeah, certainly a lot of hardship out there, both in terms of changes to the economic situation that have impacted people, and also a lot of sickness and death, unfortunately, that’s happened around the world as a result of COVID. But, going back to one of the early topics that we had at the very beginning of this call, if you think back a hundred years to the Spanish flu, we were so much worse prepared for that, and so much less equipped to come up with a vaccine quickly and distribute that vaccine. All of the capabilities that we’ve built over this last century are facilitated today by the storing and transmission of data, and even, to some extent, the modeling of data with machines. So, that’s my main point. 
Kirill Eremenko: 01:28:53
Yeah, absolutely. Well, the year is coming to an end and hopefully things will get better in 2021, to allow us to get back to our normal lives. Thank you very much, Jon. It was a pleasure as always. And, our listeners will hear much more of you in 2021, so very excited about that. 
Jon Krohn: 01:29:20
Me too, Kirill. Thank you very much. 
Kirill Eremenko: 01:29:29
So, there you have it, everybody. I hope you enjoyed this episode as much as I did, and got some interesting insights and some laughs out of it. As mentioned, Jon will be taking over the podcast from 2021 onwards. I’m super excited about it, and it’s going to be very cool to see how the podcast grows with him as the host. So, please welcome him and help him feel at home here on the SuperDataScience podcast. Now, if you’re interested in the reasoning behind why I’m stepping down from the role of host of the SuperDataScience show, check out the FiveMinuteFriday episode that’s coming out this week. 
Kirill Eremenko: 01:30:10
You’ll learn more about that. But in the meantime, we’ve got a few more episodes still to go in this year, and I can’t wait to see you throughout them. As always, you can find the show notes for this episode at www.superdatascience.com/429. That’s www.superdatascience.com/429. There, we’ll share any materials, any links, URLs, books, anything that was mentioned on the show, so check it out. And, if you are excited about any of these topics, and you know someone else who might be excited about them, then send them a link to this podcast. Share the love. It’s very easy to share. Just send the link www.superdatascience.com/429. On that note, thank you so much for being here today. I look forward to seeing you next time. Until then, happy analyzing. 