SDS 551: Deep Reinforcement Learning — with Wah Loon Keng

Podcast Guest: Wah Loon Keng

February 22, 2022

Gifted author and software engineer Wah Loon Keng joins the podcast for an in-depth look at reinforcement learning. From its history and limitations to modern industrial applications and its future in the coming decades, Keng provides a thorough introduction to deep reinforcement learning and explores the latest research and applications in the field.

About Wah Loon Keng
Wah Loon Keng is a Senior AI Engineer at AppLovin. He co-authored Foundations of Deep Reinforcement Learning and co-created SLM Lab, a popular deep reinforcement learning framework.
Overview
Wah Loon Keng joined Jon Krohn live in Manhattan for an introduction to deep reinforcement learning that delves deep into its rich roots in gaming. If you’re new to reinforcement learning, you’ll learn that it is a third category of learning, distinct from supervised and unsupervised learning: the algorithm interacts with an environment, which changes the data, so every time you run your algorithm, you’re effectively working with a new dataset.
At its core, Keng says, reinforcement learning is essentially “learning from trial and error,” and he compares it to a game-playing agent. When it comes to differentiating reinforcement learning from deep reinforcement learning, Keng explains that the difference lies in how the underlying functions are learned: deep RL uses deep neural networks as function approximators, whereas classic RL relies on tabular or dynamic-programming methods. In both cases, the agent learns dynamically through trial and error to maximize its reward.
The way deep reinforcement learning algorithms are set up today, they cannot reason or draw on prior knowledge of the world, which makes them dramatically sample-inefficient. Even the simplest instruction, one a child would grasp in a single sentence, cannot be conveyed to the algorithm, says Jon; it has to stumble on the objective by accident over billions of frames.
Having defined RL and deep RL and differentiated them from unsupervised and supervised learning, Keng dives straight into RL’s early beginnings and breakthroughs. From the ’80s and ’90s, with actor-critic algorithms and TD-Gammon, through to the game-changing emergence of deep reinforcement learning in the past decade with approaches like Deep Q-Learning, AlphaGo, and MuZero, Keng impressively walks us through the most critical developments that led us to where we are now.
After a thorough history of RL and gaming, it was time for Jon and Keng to cover the limitations of RL today. While the goal is eventually to reach real-world complexity, moving from training in the virtual world to training in the real world involves clearing several hurdles, some of which relate to the problems of generalization and sample inefficiency, says Keng. Other limitations include the cost of collecting data and the cost of running the algorithm and then deploying it in the real world. Despite these limitations, you still see deep RL applications within the robotics and logistics industries; specific examples include the scheduling of trains and the management of inventory.
Finally, Keng discussed SLM Lab, an open-source framework he co-developed. The framework provides “right-out-of-the-box” agents for users to run their algorithms with, and it can also plug into different environment packages.
As an AI engineer on a small team, Keng often finds himself working from “end-to-end,” he says. From understanding a problem to figuring out how to solve it, and coding the solution himself, Keng can do it all. And as far as the tools he uses daily, he likes to keep things simple: Python, PyTorch, and Kubernetes for deployment. 
In the future, Keng hopes to see more useful robots, but stresses that the issues with sample efficiency and generalization must be addressed before RL becomes more widely useful in industry.
In this episode you will learn:   
  • What is reinforcement learning? [4:50]
  • Deep reinforcement learning vs reinforcement learning [13:17]
  • A timeline of reinforcement learning breakthroughs [16:17]
  • The limitations of deep RL today [39:53]
  • Deep RL applications [53:10]
  • Keng’s open-source SLM Lab framework [57:51]
  • Keng’s responsibilities as an AI engineer [1:02:17]
  • What is the future of RL? [1:08:05]
Episode Transcript


Jon Krohn: 00:00:00

This is episode number 551 with Wah Loon Keng, author of Foundations of Deep Reinforcement Learning and senior AI engineer at AppLovin. 
Jon Krohn: 00:00:14
Welcome to the SuperDataScience podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
Jon Krohn: 00:00:45
Welcome back to the SuperDataScience podcast. I’m delighted to be joined today by the gifted author and software engineer, Wah Loon Keng. Keng, who prefers to go by his family name, co-authored the exceptional book Foundations of Deep Reinforcement Learning, a remarkably comprehensive and well-written introduction to deep reinforcement learning that blends underlying theory with practical code demos. He also co-created SLM Lab, an open-source deep reinforcement learning framework written in Python, and he’s a senior AI engineer at AppLovin, a leading marketing solutions provider. 
Jon Krohn: 00:01:20
In this episode, Keng details what reinforcement learning is, he provides a timeline of major breakthroughs in the history of reinforcement learning, including when and how deep reinforcement learning evolved. He talks about modern industrial applications of deep reinforcement learning across robotics, logistics, and climate change, the limitations of deep reinforcement learning and how future research may overcome these limitations, the industrial robotics and AI applications deep reinforcement learning could play a leading role in, in the coming decades, and what it means to be an AI engineer and the software tools he uses daily in that role. Today’s episode is more on the technical side, so will appeal primarily to practicing data scientists who are keen to get an in-depth introduction to deep reinforcement learning, as well as to folks who are already familiar with deep RL, but want to hear about the latest research and applications. All right, you ready for this wicked episode? Let’s go. 
Jon Krohn: 00:02:18
Keng, welcome to the podcast. It’s awesome to have you here. Thank you for making the journey to my apartment, to film in lower Manhattan. You had a long commute over from Brooklyn, is that right? 
Wah Loon Keng: 00:02:29
Yeah. It’s super long. Only like 10 minutes on the subway, four stops away. 
Jon Krohn: 00:02:34
Yes, that isn’t bad, and it’s nice we get to catch up a lot more closely before and after shooting. 
Wah Loon Keng: 00:02:39
Yeah. 
Jon Krohn: 00:02:40
And I hadn’t seen you in years. So I used to know you with some regularity. I used to see you with some regularity. So in 2018, we were introduced by Deborah Williams, who’s an acquisitions editor at Pearson. So my book Deep Learning Illustrated was published by Pearson, and Deborah Williams was the acquisitions editor. And then same thing, your book Foundations of Deep Reinforcement Learning, which was published in 2020, was also published by Pearson and Deborah worked on that book too. And so I can’t remember in what context she introduced us, but she was like, “You’ve got to meet Keng and Laura, the authors of this book.” And then both of you ran. So at that time, in 2018, I was running this in-person deep learning study group in New York, which was pretty cool because we had this specific curriculum that all of us had followed. 
Jon Krohn: 00:03:30
So everybody had a pretty good understanding of deep learning in general, convolutional neural networks for machine vision, we’d done recurrent neural networks and other kinds of natural language processing approaches for handling natural language data. So people kind of had this baseline level of understanding and we really wanted to learn deep reinforcement learning as a group, and so then it was perfect that at that same time, Deborah introduced me to you and Laura, two deep reinforcement learning experts. And so you came in and offered two day-long workshops. So my deep learning study group sessions 15 and 16, I’ll provide a link in the show notes to those so that people can see exactly what was covered in those sessions. But the two of you ran on Saturdays these like four or five-hour long workshops that were hands-on. And yeah, those were really super cool. So really appreciate you doing that. And then you left. You went to the Bay Area for work. 
Wah Loon Keng: 00:04:33
Yep. 
Jon Krohn: 00:04:33
But now very recently, you’re back in New York City. 
Wah Loon Keng: 00:04:37
Yeah. For good. 
Jon Krohn: 00:04:37
For good. 
Wah Loon Keng: 00:04:38
For good this time. 
Jon Krohn: 00:04:40
I love to hear that. Well, now your wife co-owns a restaurant in the East Village, so it’s a good reason to stay. 
Wah Loon Keng: 00:04:45
Yep, totally. I know. 
Jon Krohn: 00:04:50
Clearly you’re an expert in deep reinforcement learning, so let’s dig right into what that is. So I’m going to give my own kind of quick intro to it, and then you can take it away. We can use machine learning or other statistical approaches for tackling different kinds of problems. So if we’re able to have labeled data, like you have 1,000 pictures and five of them are labeled as dogs, five of them are labeled as cats, then you have this labeled data set and you can train something called a supervised learning algorithm to work with that labeled data. In other scenarios, you might not have labels for your data. So you could have a whole bunch of natural language data, like all of the English on the internet, and you could train some unsupervised learning models with that kind of data set. Reinforcement learning is a third category of problem that is completely different and really super cool because with reinforcement learning, the algorithm interacts with something called an environment, and that changes the data. So with those other two paradigms, those other two main paradigms, supervised learning, unsupervised learning, you can just kind of have a training data set, and you can just keep working with it over and over. But with reinforcement learning, every time you run your algorithm, you’re training with new data that has maybe never been created before. And so yeah, tell us a bit more about that, about reinforcement learning. 
Wah Loon Keng: 00:06:19
So the way I see it is that it simply learns from trial and error. So you expose what we call an agent, which is the learning model or the agent at work. So you expose it to an environment, it interacts with an environment in real time and there is basically a feedback loop. So you do something in the environment that then feeds back to you and then you decide what to do next. And that all then feeds back to the… We call it the objective function. So you’re trying to maximize something, so in this case in reinforcement learning it’s the reward. And that’s also an interesting topic. Like how do you assign a reward? So typically the obvious case is like playing games. So a lot of times it’s easy to say that that’s how you explain deep RL to people. It’s just like a game playing agent, because that’s like the perfect example for reinforcement learning. 
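To make the feedback loop Keng describes concrete, here is a minimal sketch in Python. It uses the Gymnasium environment API purely as an illustrative assumption (any environment exposing a similar reset/step interface would do), and the “agent” simply samples random actions; a real RL agent would replace that line with a learned policy and use the accumulated reward to improve it.

# Minimal agent-environment loop (illustrative sketch, not from the episode or the book).
import gymnasium as gym

env = gym.make("CartPole-v1")        # the environment the agent interacts with
observation, info = env.reset()      # initial state
total_reward = 0.0

for _ in range(1000):
    action = env.action_space.sample()                  # placeholder policy: act at random
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                               # the signal the agent tries to maximize
    if terminated or truncated:                          # episode over: start a new trial
        observation, info = env.reset()

env.close()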
Jon Krohn: 00:07:10
Right. So if you have an algorithm that’s playing a Tetris video game, it’s very easy to say, “Well, our objective is to get the highest score possible in Tetris.” So every time the agent is playing the game, it learns how to take simple actions. So it first figures out that pressing the right key is going to move the Tetris piece right and left is left. And so it figures out these simple associations, and then through kind of random trial and error, it eventually figures out that some of the movements lead to points, and then you have it do more and more and it maximizes these kinds of point scores. So it figures out what kinds of actions lead to higher and higher points. 
Wah Loon Keng: 00:07:55
Right. Exactly. So the same thing happens in robotics. So you can think of robotics as a robot kind of playing games in the real world, but of course there are more real consequences to that because the robot resides in the real world and interacts with, say, people or objects. But yeah, that’s the general gist. 
Jon Krohn: 00:08:15
And an interesting thing there is that with some of these real world applications, because it could take so long or be so expensive to accumulate real world training data, often even real world applications of deep reinforcement learning train on video games, right? So a robot arm, you could have it learn how to perform a task in a simulation first with simulated data, and then maybe you can just fine-tune that in the real world. 
Wah Loon Keng: 00:08:45
Right. That’s a huge difference between if you look at how a human plays video games, like the Atari suite of games, versus how a robot plays it. So a human probably takes a few tries and then directly, oh, you just get better and better over time, maybe in less than 10 trials. But for, I don’t want to say a robot, but for an agent- 
Jon Krohn: 00:09:07
Machine. 
Wah Loon Keng: 00:09:07
Yeah, for a machine to do that, there is a way we measure the data samples we use to train, it’s called a frame. So basically one video frame is like one data point, and that’s in the millions if not billions. So actually in recent years we’ve reached billions, even hundreds of billions. And that’s a lot of video game playing. So you’d see news headlines that say, “Oh, this OpenAI Dota Bot plays thousands of years’ worth of Dota and then competes against humans.” But that’s like the sample efficiency of [inaudible 00:09:46]. 
Jon Krohn: 00:09:46
At least sample inefficiency. 
Wah Loon Keng: 00:09:47
Yeah, inefficiency. Exactly. So at least for now. 
Jon Krohn: 00:09:50
And yeah, so that is something I think we’ll talk about, the kind of the problems in this approach and the future of research. So we’ll get to that later in the episode. But yeah, it’s a really interesting point, the way that we set up deep reinforcement learning algorithms today, or any AI algorithms really. I think people, if they’re not already practicing data scientists working with AI algorithms, have this idea that they know things and that they reason, but they don’t at all. And this deep reinforcement learning inefficiency that you’re describing is a perfect example, because unlike a person, even a child who can pick up a controller and say, “Okay.” It instantly makes sense that pressing right is going to move the Tetris block right, and left is going to move it left. But the algorithm comes without even that simple knowledge of the world, and so even the simplest things, the idea of what could possibly be the objective in this game… For a kid when they’re seeing the Tetris blocks, they’re like, “Okay, like fitting these together.” Or somebody could even just say a sentence to them. “You’re trying to fit the blocks together without there being any gaps.” And in a single sentence, the kid’s like, “Oh, okay. I know what to do.” But you can’t explain it to a deep reinforcement learning algorithm, so it takes billions of frames like you’re describing, for it to figure it out by accident. 
Wah Loon Keng: 00:11:03
Right. So that also brings up an interesting point. So there is, I guess we can call this, concept of computation history. So basically the question is how much of it is built into us, like hardwired in our body and brain, versus how much is learned on the spot. So I think things like language for us, it takes us many years to learn language. So that would be something that’s not hardwired, at least not deeply hardwired. But for machines then, if you look at why does it take billions of frames? Is it possible to shrink it down to maybe thousands of frames or just like how humans do it in dozens of episodes of a game? The thing with humans is that we have priors, we have real world experience. We have a really amazing ability to transfer what we know before and apply it to gaming. We know how buttons work. We know what’s left, right, up, down. We know about spatial concepts. But for machines, when you are learning from scratch, to them, they’re just seeing numbers on a vector, or even on a matrix if you’re looking at images. So then how fast can you actually learn from that? So how much is actually the inefficiency in our algorithm versus the total compute history required from scratch to learn something? So for humans, you can argue that our compute history actually traces back way back to our ancestors. And actually it might be billions of frames also. 
Jon Krohn: 00:12:39
Right. 
Wah Loon Keng: 00:12:40
So, yeah, that’s just like something that we think about in terms of efficiency. 
Jon Krohn: 00:12:41
That’s cool, yeah. I hadn’t thought of that. That’s a really good point. And interesting for you to mention kind of going back really far in human history. Though maybe we won’t do a whole history, an archeological history of humans and human cognition, but I can definitely recommend a great book to get a glimpse of that is Sapiens by Yuval Noah Harari. I love that book for kind of understanding human history and our own cognitive abilities. But let’s go over a history of reinforcement learning. Oh. And we didn’t talk about how… So we talked about this reinforcement learning paradigm, what makes something deep reinforcement learning as opposed to just reinforcement learning? 
Wah Loon Keng: 00:13:23
All right. So reinforcement learning is the paradigm of learning from trial and error. The data points are basically state, action, reward, and the next state, and you repeat that over time, so you roll it out in time. But what makes it deep is how you’re learning the functions in reinforcement learning. So in the typical setup, without any consideration of like deep or not deep, you are learning say the objective function, or sometimes you learn the transition function, basically when you’re modeling the world. And how do you learn that? From those four data points. There are a bunch of other functions that can be learned from just these data points. And what makes it deep is, of course, the function approximation component of it. So you’re using a deep neural network to learn those functions and then apply them. So traditionally it’s mostly tabular or dynamic programming, but since DeepMind discovered how you use deep learning to learn those functions, there has just been this explosion of deep RL algorithms and achievements. 
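As a rough sketch of that distinction in PyTorch (one of the tools Keng mentions later in the episode): the classic, tabular approach stores Q(s, a) in a lookup table, while the deep version replaces the table with a neural network fitted to the temporal-difference target by gradient descent. The network size, dimensions, and hyperparameters below are illustrative assumptions, not values from the episode or the book.

# Sketch of a neural network standing in for the Q-function (the "deep" in deep RL).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99        # illustrative sizes and discount factor

q_net = nn.Sequential(                          # Q(s, .): maps a state to one value per action
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(state, action, reward, next_state, done):
    """One Q-learning step with a neural network as the function approximator.
    state/next_state are float tensors; done is 1.0 if the episode ended, else 0.0."""
    q_pred = q_net(state)[action]                        # current estimate Q(s, a)
    with torch.no_grad():
        q_next = q_net(next_state).max()                 # bootstrap: best value at the next state
        target = reward + gamma * q_next * (1.0 - done)  # TD target: r + gamma * max_a' Q(s', a')
    loss = (q_pred - target).pow(2)                      # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()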
Jon Krohn: 00:14:24
So through the ’80s, the ’90s, the first decade of this millennium, reinforcement learning problems were being solved, but they were primarily using non deep learning, non neural network approaches to figure out what kind of function we should be learning inside the algorithm. And then as you mentioned, DeepMind in recent years made a lot of breakthroughs, I think first with playing Atari video games. That was the first big paper, right? 
Wah Loon Keng: 00:14:51
Yeah. 
Jon Krohn: 00:14:51
Maybe you’re about to tell us about the history, so I don’t need to. But basically, the difference between reinforcement learning and deep reinforcement learning is that with deep reinforcement learning, you’re using a neural network, perhaps a deep neural network to learn something as opposed to using some other kind of learning approach. 
Wah Loon Keng: 00:15:12
Right. 
Jon Krohn: 00:15:15
When I talk to leaders in data science, I notice they all make time for learning and encourage the same of their teams. But with your actual everyday work to do, all-day trainings aren’t possible for most of us. That’s why an on-demand learning platform like Udemy Business makes sense. With Udemy Business, you can access over 500 cutting edge data science courses taught by real world subject matter experts and validated by other learners’ real-time reviews. Amongst these 500 courses, you’ll find my own mathematical foundations of machine learning course, as well as dozens of mega popular courses from other super data science instructors. To hear the latest on the state of data science in the workplace and discover how you can democratize data science learning in your teams through Udemy Business, check out the new video series called Insights on Demand: Diving Into Data Science. To watch this series and learn more, visit business.udemy.com/sds that’s business.udemy.com/sds. 
Jon Krohn: 00:16:11
Cool. All right. So we’ve gotten that over. Hopefully we haven’t stepped on the toes too much of the history that you’re going to give now. 
Wah Loon Keng: 00:16:17
Not really. So the history goes back, I think reinforcement learning is a really old field. It goes by a different name also. So there’s control theory. 
Jon Krohn: 00:16:26
Oh, control theory. 
Wah Loon Keng: 00:16:27
So that’s actually the same thing as reinforcement learning. And there’s also inventory management. So, it takes many different forms, but overall the mathematical formulation is the same. So we have the four-tuple, the state-action… yeah, SARS. And so the history. I guess we should start from the most significant achievements of reinforcement learning. I have this page on my GitHub where I keep track of the timeline. 
Jon Krohn: 00:17:00
Nice. We’ll put that in the show notes for sure. 
Wah Loon Keng: 00:17:01
Yeah. And I think, right, let’s see. It’s actually much older than most people that I talked to thought. “Oh, is this something new?” Because it got popular after DeepMind. 
Jon Krohn: 00:17:13
Exactly. 
Wah Loon Keng: 00:17:14
But really it came, I think you can go back as far as 1983. So that was the first actor-critic algorithm, and it was already being used, but of course- 
Jon Krohn: 00:17:22
Actor-critic algorithm was in 1983. 
Wah Loon Keng: 00:17:24
Right. 
Jon Krohn: 00:17:24
Wow. 
Wah Loon Keng: 00:17:26
Yeah, it’s pretty old. And then you have the first convolutional net in 1989. 
Jon Krohn: 00:17:33
[inaudible 00:17:33] 
Wah Loon Keng: 00:17:33
Yeah. And then the same year you have Q learning, but that was like still tabular and non deep version. But the whole concept of how do you bootstrap your learning function and then feedback, and then keep on learning to approximate the value better and better. So that’s Q learning, 1989. And then in 1991, we have TD-Gammon. So that’s using TD, that’s a method, it’s called temporal difference. It’s used to play Backgammon. 
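For readers who want the bootstrapping idea Keng just described written out, here is a hedged sketch of the tabular Q-learning update, the 1989, pre-deep version: the value table is nudged toward a target built from the observed reward plus the table’s own estimate of the next state. The hyperparameters and interface below are illustrative assumptions.

# Tabular Q-learning update (a sketch of the pre-deep, lookup-table version).
from collections import defaultdict

alpha, gamma = 0.1, 0.99                 # learning rate and discount factor (illustrative)
Q = defaultdict(float)                   # Q[(state, action)] -> estimated return, default 0.0

def q_learning_step(state, action, reward, next_state, possible_actions):
    # Bootstrapped target: immediate reward plus the discounted best value the table
    # currently assigns to the next state.
    best_next = max(Q[(next_state, a)] for a in possible_actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])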
Jon Krohn: 00:18:03
And probably other applications too, but that was a famous one. 
Wah Loon Keng: 00:18:07
Yeah. That was a famous one. 
Jon Krohn: 00:18:08
It was a big breakthrough. People were surprised that a computer could play backgammon very well. So that was before we had the whole Deep Blue thing of Kasp- Maybe you’re getting to that. 
Wah Loon Keng: 00:18:21
Actually no. I think, yeah, it wasn’t- 
Jon Krohn: 00:18:24
Well, because that wasn’t deep reinforcement learning. No. [crosstalk 00:18:25] But it was in the late nineties that you had Deep Blue playing chess against Kasparov, the world’s best human player. And now it seems like we have computers beating the best humans at lots of different things. But in the early ’90s, at this relatively complex cognitive task, backgammon, really fun game by the way, one of my favorites, you had computers that could play it extremely well. So that was a big breakthrough. 
Wah Loon Keng: 00:18:47
Yeah. Actually also that’s a good point, like chess, Deep Blue wasn’t reinforcement learning, but then the strongest computer today is a deep reinforcement learning algorithm. 
Jon Krohn: 00:18:57
Of course it is. 
Wah Loon Keng: 00:18:58
We’ll [crosstalk 00:18:58] later. So, and then 1992, that’s when REINFORCE came out. So that’s also a very canonical algorithm we actually include in the book that we published. Let’s see. So 2013 that’s when DQN came out. So that is the first deep reinforcement learning algorithm that became popular because it defeated the human. Well, actually not versus a human, as in it achieved like a super human level score. 
Jon Krohn: 00:19:30
Yeah, at video games, at Atari video games, right? 
Wah Loon Keng: 00:19:33
Yeah. Right. 
Jon Krohn: 00:19:33
And at several of them, so to tie a few things together. So you’d mentioned Q-learning, this approach, and actually you’ve mentioned a few of them. So I think all of these are covered in your book. So Q-learning, which has been around since the ’80s, that’s covered in your book. REINFORCE, which was introduced in 1992, that’s covered in your book. The actor-critic algorithm that you talked about in 1985, or ’83, that also I imagine has got to be covered in your book because it’s still one of the big reinforcement learning algorithms today. But then there was this big jump, so kind of lots of big breakthroughs in the mid ’80s to the early ’90s in reinforcement learning in general. But then it was in 2013 with people at DeepMind, and Volodymyr Mnih, I think, was the first author on that paper. 
Wah Loon Keng: 00:20:16
Yeah. 
Jon Krohn: 00:20:18
And yeah, so this idea of using a neural network to learn a function in the reinforcement learning algorithm that ended up being this big breakthrough, and then all of a sudden… So it was one algorithm I think that could have superhuman ability at several different Atari games.
Wah Loon Keng: 00:20:36
Right. So the whole journey actually took them about two years. So DQN came out in 2013, but it actually took until February 2015 for them to achieve human-level control in Atari. So yeah, that’s quite a journey. That’s like the same thing that we’re going to see over and over with DeepMind as well, like they [inaudible 00:20:54] something and then that’s the first draft and then they iterate over it and then improve it significantly. And then I think 2015, that was the start of this explosion of activity in deep RL. So you have people building on top of what DeepMind published, so the DQN, and then you try to modify and improve on DQN. So you get something like double DQN, where you have like two networks to learn the same function, the Q function. Or you have like a dueling DQN, or you add prioritized experience replay to your DQN, and stuff like that. So 2015 was a busy year, and I think TensorFlow came out in November 2015. And then you get some more activity in 2016, so that’s when AlphaGo came out actually, and they beat the champion, Lee Sedol, four to one. 
Jon Krohn: 00:21:52
Yeah. Spoiler alert if you watch [crosstalk 00:21:54] documentary. The computers win. 
Wah Loon Keng: 00:21:57
Yes. 
Jon Krohn: 00:21:59
Amazing documentary, though I highly recommend checking that out. 
Wah Loon Keng: 00:22:01
Oh, totally. 
Jon Krohn: 00:22:03
It’s available on Netflix, but I think it’s also freely available on YouTube for anyone in the world to see. 
Wah Loon Keng: 00:22:07
Yeah. I think so. 
Jon Krohn: 00:22:07
And really cool. And it’s a kind of documentary, so it’s about deep reinforcement learning, but it talks so much about the people involved in it, both on the Google DeepMind side, as well as on the players’ side. It makes a really nice human story. And so it’s the kind of deep reinforcement learning documentary that you could watch with your partner who isn’t interested in machine learning. 
Wah Loon Keng: 00:22:32
Yeah, totally. 
Jon Krohn: 00:22:35
Cool. Yeah, so AlphaGo, that was a huge breakthrough. Because then Go, I think it’s the world’s most popular board game, but it’s not super popular in the west. And it’s very, very computationally complex, much more computationally complex than backgammon or chess. And so I think some people thought it could be many decades before we had a computer that could beat expert Go players, and then here we have an algorithm that could beat the world’s best. 
Wah Loon Keng: 00:23:04
Right. And the game is actually very simple. It’s just like a grid and then you try to surround your enemy. I remember, like, all the kids in Asia where I’m from would also play it during recess. 
Jon Krohn: 00:23:16
Really? 
Wah Loon Keng: 00:23:17
Our books are not lined, but they have grids, because you write Chinese characters, and then you can also use that grid to play Go.
Jon Krohn: 00:23:25
That’s cool. So you have this grid that’s designed for writing Chinese characters, but you can draw right on it to play. 
Wah Loon Keng: 00:23:31
Yeah. You can draw a circle or Xs depending on… On the real board you use like black and white pieces- 
Jon Krohn: 00:23:37
Black and white stones. 
Wah Loon Keng: 00:23:38
So yeah, just draw your stuff. 
Jon Krohn: 00:23:40
Cool. 
Wah Loon Keng: 00:23:40
So, and then in 2017, PPO came out. PPO stands for proximal policy optimization, that’s by John Schulman, and in just a month, well actually a month after the paper came out, they started applying that to Dota. So Dota is a very popular computer game. It’s five on five, you’re basically fighting team against team. You have like a throne to take and then you control these characters called heroes that have abilities. It’s a very rich real-time strategy game, I would say, also like a reflex game, because you’re trying to like micromanage everything. I played Dota. You can’t tell right now. 
Jon Krohn: 00:24:20
Well, I was like, that is the best explanation of Dota I’ve ever heard. Here’s a guy who’s really done his research for the episode. So when you say five versus five, one player controls all five characters? 
Wah Loon Keng: 00:24:31
No. One player controls one hero. So you have five people. 
Jon Krohn: 00:24:35
So you play with four friends. 
Wah Loon Keng: 00:24:36
Yeah. I play with four friends and it takes a lot of communication, so usually you would go to like an internet cafe and you talk to your teammates and then try to strategize. So it’s really fun, and it’s strategy on the macro level, but on the micro level, there’s also like your finger control. And you’re trying to do like a million things in a few seconds, and that takes a lot of coordination. So it’s been known for a long time that it’s a very hard game to play for humans. And also because it’s so rich, because there are so many things in the game, you have combinations of different abilities, different items. And there’s like some math involved as well. Okay, do you buy certain things? Do you have the budget and money and stuff like that? So that was like one-on-one, a restricted version of Dota; just shortly after the paper came out, they applied PPO to that.
Jon Krohn: 00:25:32
So there’s this restricted version where you just had one player against another player, instead of the usual five-v-five? 
Wah Loon Keng: 00:25:37
Right. So if it’s one versus one, then it’s less about teamwork. It’s more about your individual control and strategy. And it took them until like April 2019, that’s when OpenAI Five, so that’s the full version of Dota, still with some like minor restrictions, that’s when they went against the world champions actually at a world champion tournament. 
Jon Krohn: 00:26:06
Oh, really? 
Wah Loon Keng: 00:26:06
And they went against them and they had a demo, and they basically defeated humans in 2019. 
Jon Krohn: 00:26:14
A team of five people at the world championship. 
Wah Loon Keng: 00:26:16
Yeah. 
Jon Krohn: 00:26:17
Wow. 
Wah Loon Keng: 00:26:17
But of course it’s not for the tournament money. 
Jon Krohn: 00:26:21
All right. 
Wah Loon Keng: 00:26:21
It’s more for demo. And so going back a bit in terms of timeline, 2017, AlphaGo Zero actually, so it masters Go. So you start seeing these iterations of the DeepMind algorithms. So 2017 AlphaGo Zero, and then 2017 again, like December, just two months after the previous one, AlphaZero. 
Jon Krohn: 00:26:48
AlphaZero, yeah. And the key things here are that AlphaGo that’s featured in the documentary that beats the world champion, Lee Sedol a couple years earlier, that was trained on human game play. There was some training data based on human gameplay. But AlphaGo Zero, it had no training data. So it just learned by playing against itself and making its own training data. 
Wah Loon Keng: 00:27:11
Right. Hence the zero. 
Jon Krohn: 00:27:13
Hence the zero. 
Wah Loon Keng: 00:27:13
Yeah. 
Jon Krohn: 00:27:14
And then so what was the change from AlphaGo Zero to AlphaZero? 
Wah Loon Keng: 00:27:18
So with AlphaZero, the thing is that it doesn’t get restricted to only Go. So AlphaZero uses no human knowledge, but it masters Go, chess, and another game called Shogi. 
Jon Krohn: 00:27:30
Yeah. Like a Japanese chess. 
Wah Loon Keng: 00:27:31
I think so, yeah. I’m not sure. Yeah, that was December 2017. And let’s see what else happens. And then December 2018, so just a year later, AlphaZero became the strongest player in history in chess, Go, and Shogi, ever. So you see this progress made by DeepMind year after year until they basically become the best in the world. All that happened in 2018, 2019; that’s when AlphaGo defeated humans, OpenAI Five defeated the world champions. Yeah. And then of course after the OpenAI Five, I think DeepMind had to respond. So August 2019, we have AlphaStar. So applying that to StarCraft, and actually they’re pretty good. They reached what they call grandmaster level in StarCraft II. 
Jon Krohn: 00:28:25
I don’t play StarCraft that much, but- 
Wah Loon Keng: 00:28:26
I was hoping for a good detailed breakdown. 
Jon Krohn: 00:28:31
StarCraft is another video game that’s like a strategy game, and very complex, I guess. And so what you’re saying is that because a different firm, OpenAI, had been getting all the success with Dota, the folks at DeepMind, who had been making all the kind of big high-popularity breakthroughs in deep reinforcement learning, they made their own kind of big splash with StarCraft II. 
Wah Loon Keng: 00:29:01
And then we have two more years or three more years until today. Let’s see. October 2019 was pretty exciting. OpenAI solved, well, not really solved, but basically used a robot hand to solve a Rubik’s cube. So the solving algorithm is separate. I mean, it’s already known, a computer can solve that. But to use a robot hand to control and actually solve it in the physical world, that’s pretty impressive. So we have that in 2019. 
Jon Krohn: 00:29:38
The alternative data industry is expected to grow tenfold in the next five years. So how will your skills and training stay competitive in this brave new world? Well, fortunately for us, the Alternative Data Academy has just released the world’s first alternative data courses aimed at working professionals through its free and on-demand training platform. The courses are taught by some of the biggest names in the alternative data space, including the prominent New York University professor, Gene Exter, and renowned Oxford professor Alexander Denev. Sharpen your alternative data skills for free today with the Alternative Data Academy, all the details are online at altdata.academy. That’s altdata.academy. 
Wah Loon Keng: 00:30:22
And in 2020, we have something called Agent57. So it beats the human benchmark on all of the 57 Atari games. That hadn’t been done before, because you would have a new and improved algorithm that comes out and then, oh, it doesn’t do as well in a certain category. 
Jon Krohn: 00:30:42
Right. 
Wah Loon Keng: 00:30:43
Yeah. 
Jon Krohn: 00:30:43
Like there was something revenge. 
Wah Loon Keng: 00:30:45
Yeah. Montezuma’s Revenge. 
Jon Krohn: 00:30:48
Montezuma’s Revenge. So earlier, when we were talking about Deep Q-learning around 2013, 2015, this breakthrough algorithm could be better than an expert at a lot of Atari video games. But at some of them, like Montezuma’s Revenge, it was terrible. And that’s because, I guess, in those early days of seven years ago, in 2015, the Deep Q-learning algorithm was very good at games where all the information was right there on the screen, and it dealt with fast reaction times. But Montezuma’s Revenge, I’ve never played it, but I guess you might need to find a key in one room and then use that to open a door in another room. 
Wah Loon Keng: 00:31:31
Right. That’s also the problem, which I guess we’ll get to that later. The game is like a mystery or puzzle solving game, so you have a lot of steps you have to do. You have to go through a door, find the stairs, go to a place, get a key. But you do all of that and then you get one reward. So the rewards are very sparse. That’s why it’s super hard for a machine to learn from the reward, because you need rewards to learn because you are doing trial and error. But without that, you just can’t perform as well. That’s why it’s a hard game. 
Jon Krohn: 00:32:00
Wow. 
Wah Loon Keng: 00:32:01
But Uber’s Go-Explore, that was the algorithm that finally solved it, actually in 2018. And they have a different strategy of solving that problem. Basically they go and explore, yeah, without too much technical detail. So November 2020, AlphaFold came out. So that was applying DeepMind’s algorithm to protein folding. So that’s very useful for the medical or biological world. And then you have 2020 again, December, MuZero masters Go, chess, Shogi and Atari also, again without rules. But then you start seeing that progress slows down a little bit. Well, in the sense that it’s not as frequent, you don’t have like so many new algorithms coming out, and anytime there’s news about deep RL, it’s bigger and also much more expensive, but also more significant. So something like the AlphaFold. 
Wah Loon Keng: 00:32:59
And then August 2021, that was, I think, the last item I have on the timeline. That was when there’s… So they call it open-ended play. So what that is, is you have a virtual environment and it’s open-ended. So imagine like a game maker. So Unity, that’s what they used. So you have a physical environment that has ramps, you have maybe like blockades, and you’re trying to do like general tasks. So it’s like a mini world that you put agents in and then they just train on a bunch of different tasks. So that’s open-ended play, basically just toss agents in and they play open-endedly. And what they found out was with enough of this kind of training or play, you get this emergent behavior. They call it generally capable agents. What it means is you would be training on a certain task, let’s say chase people and tag the other agent on a certain terrain. But when you transfer them to a different terrain, there are combinations that they haven’t seen before, but then they were able to generalize to that unseen terrain and scenarios. 
Jon Krohn: 00:34:16
Right. So that sounds super cool. That was your last piece on that timeline. Just to summarize some general themes that I picked up in there. In the ’80s and the ’90s we had these first breakthroughs in reinforcement learning in general. Then starting in 2013, people started using neural networks to solve reinforcement learning problems. And then there ended up being these many, many breakthroughs over just a few years by using these now deep reinforcement learning algorithms. And you’re saying in recent years, we’re not seeing as many big breakthroughs as frequently, but when they come out, they tend to be even bigger and more relevant to real world problems. So unlike playing Atari video games, they’re folding proteins, figuring out how proteins fold, or using a robot hand to solve a Rubik’s cube. So it sounds like, yeah, we’re now having these very specific, but very powerful application areas coming through, which is cool. 
Wah Loon Keng: 00:35:23
Right, right. 
Jon Krohn: 00:35:24
Nice. All right. That was a very interesting history of reinforcement learning. So what’s exciting in reinforcement learning research today? 
Wah Loon Keng: 00:35:36
I would say there are still a lot of problems to tackle in deep RL. And of course, people are still trying very hard to solve them. I think especially in the last year or two, we are seeing a trend of… So one is reframing RL in a different paradigm. So one way is to kind of find ways to reframe it. Can you transform it into a supervised learning problem? One of the more interesting approaches that came out last year, I think, was using a transformer, which is like the buzz these days, to solve RL. Right. Because you have this sequence of the tuples. So you have like SARS over and over. You do something, you get a reward and you, again, reconsider what state you are in and then repeat. 
Wah Loon Keng: 00:36:29
So people have figured out how to transform that into a sequence, feed it into a transformer. That transformer basically figures out, “Okay, what’s going to potentially play out?” Of course you need a lot of data to do that, so basically that’s where you collect previous play data. So that’s one trend. And I guess a question that people would be asking after like Dota, StarCraft and Go is what’s next? Is it time for general robots using deep reinforcement learning? So you have a lot of directions to go from there. And I think the open-ended play and focusing on emergent behavior, that’s a really interesting trend as well because, sure, it is very hard to do, because basically only DeepMind could do it. That’s due to the amount of compute resources you need, and also of course engineering time and effort. 
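As a loose sketch of the sequence-modeling reframing Keng mentions (in the spirit of the Decision Transformer line of work): offline trajectories of states, actions, and rewards are converted into a sequence with returns-to-go, and a standard transformer is then trained with an ordinary supervised loss to predict the next action. The function names and data layout below are illustrative assumptions.

# Turning a trajectory into a (return-to-go, state, action) sequence for a sequence model.
from typing import List, Tuple

def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Remaining (discounted) reward from each timestep to the end of the episode."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.append(running)
    return list(reversed(rtg))

def to_sequence(states: List, actions: List, rewards: List[float]) -> List[Tuple]:
    """Interleave (return-to-go, state, action) triples, ready to feed a transformer
    trained to predict the next action from everything that came before."""
    rtg = returns_to_go(rewards)
    return list(zip(rtg, states, actions))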
Wah Loon Keng: 00:37:32
And then the other thing would be like certain ideas. Like people are starting to think in the context of AI in general and how do you apply that to reinforcement learning? So Eric Jang’s recent post about Just Ask For Generalization. That’s the title, I think. How do we generalize an agent, especially when it’s an agent that you can basically drop it in an open world? And the open world is really rich and you have to figure out how to do things, how to be efficient about that. But in general, the problem still remains, we have the textbook examples or the textbook categories of the problems of RL. 
Jon Krohn: 00:38:11
Well, just quickly before you go on to the problems that remain, that’s another big trend that I didn’t highlight from your history of reinforcement learning, which is that, especially in recent years, we’re seeing deep reinforcement learning algorithms that are capable of solving more and more general problems. So for some of the listeners, maybe they just picked this up from what you were saying, but it’s worth highlighting that in 2013, 2015, we have these Deep Q-learning networks that can play more than one Atari game. So at first, I think the 2013 paper might have been like half a dozen games. And then in 2015, it’s several dozen games. And then you’re saying, I can’t remember the exact date, but several years later, then it is expert at all of the 57 Atari games. And then we see the same kind of thing with board games. 
Jon Krohn: 00:39:00
So you start off with AlphaGo, which is expert at just Go, and then you have AlphaZero, which can play Go, chess and Shogi at expert level. And then later after that, you have the AlphaMu… No, MuZero, is the name of it, which is capable of not only playing the board games at expert level, but also Atari games, which are completely different. So there’s this kind of general trend towards generalization, which you were just talking about again there in the context of open play. And so when you’re talking about it in the context of open play, that probably relates to something that you just kind of touched on a few minutes ago, which is this idea of maybe being able to apply deep reinforcement learning more and more to real world robots that could explore the real world. 
Jon Krohn: 00:39:52
And so the general direction that we’re moving in is towards such generalization that you could have robots that just figure out what they should be doing in different kinds of circumstances, which is all a stepping stone towards artificial general intelligence, which is the kind of AI that, when you talk to non-data-science people about AI, it’s the AI they see on TV, and then when you say, “Oh, I work in AI,” they jump to this idea of artificial general intelligence, an AI that could learn anything that a person could learn. And so anyway, I’m speaking way too much in your episode, but I just wanted to pick out that one more really big important theme, and maybe that also helps with talking about what the limitations are today. 
Wah Loon Keng: 00:40:42
Yeah. I think if you look at it again, like I said, if you look at a history of the type of problems that we’re applying RL to, if you think about the level of complexity, so starting with like TD-Gammon, and then you go to Atari, where you’re like one game at a time, but it’s still like just pixels and pretty small pixels and pretty simple mechanics. And then mastering a few of them. And then you go to board game, more complex, and then Dota and StarCraft, which is even more… You can simplify it, but it’s still really rich because there are just so many things. Of course the goal is to eventually get to real world complexity for something to be useful in our world. That’s where robotics would be. 
Wah Loon Keng: 00:41:29
And if you look at the OpenAI, the hand solving Rubik’s cube, it’s still a pretty restricted example because you just have camera and you have the image of the hand. The hand is fixed in a stand, but you’re starting to reach into the real world. So of course, to do that, there are a lot of challenges that we have to overcome. So generalization, like I said, is the first one. So can you generalize enough and be efficient enough? But I think given a trend that the problems that we’re solving are getting bigger and bigger, but also the compute is getting more and more- 
Jon Krohn: 00:42:06
Right. Astronomical. 
Wah Loon Keng: 00:42:07
Right. So we go from a few million frames, like maybe 10 million frames for Atari, to hundreds of millions, and eventually it hit a billion. And then even more after that. So the open-ended play from last year, that was many billions, I think hundreds of billions of frames of play, because you have so many different environments, they basically in their… I think blog post, they have like a galaxy of games. And then, oh, which galaxies of games are closer to one another? 
Jon Krohn: 00:42:41
You mean like an individual round of game play, or- 
Wah Loon Keng: 00:42:44
Yeah. It’s an open world. So maybe like this world has a mountain and like a lake or something.
Jon Krohn: 00:42:49
Oh, all right. 
Wah Loon Keng: 00:42:49
And then, yeah, you have a whole spectrum of different ways you can parametrize and configure your environment. So that’s also the approach that they use so the environment is generative, and you basically just let it run and then explore as much as possible and then test on like unseen ones. So that’s pretty amazing. 
Jon Krohn: 00:43:09
So, in this open-ended gameplay kind of scenario, you train an algorithm in an environment where it learns how to do something in like a forest. 
Wah Loon Keng: 00:43:19
Yeah. 
Jon Krohn: 00:43:20
And then you see how well it can do something in an ocean or something. 
Wah Loon Keng: 00:43:23
Yeah. Yeah. Or like, if you learn how to throw something over a river, like across the river to hit something, can you do the same when there’s a little molehill and then you have to throw it across? So you can see it as a really simple example of testing spatial awareness, for example. So, “Oh, do I go left, right, up, down? And do I know which direction I’m pointing at?” So you start seeing these concepts starting to emerge. 
Jon Krohn: 00:43:53
Ooh. So this ties back to what we were talking about earlier, where I was saying, when you train an algorithm to play Tetris or backgammon, at least until recently, they didn’t have any sort of understanding that pressing right on the joystick was going to move the block right or left, and it has to learn that from scratch. But what you’re saying here is that you’re taking the model weights from one scenario, and then seeing how well they can apply those model weights in a different or related kind of scenario. 
Wah Loon Keng: 00:44:21
Right. Exactly. So it takes a lot of frames to do that. Of course, it would be amazing if we can have a robot that can navigate in the real world, like knowing directions or like spatial awareness. But the problem is you train it in the virtual world, but you have to transfer it to the real world, which is not easy because it’s so different from a simplified environment. The other option you can go with is basically train it in the real world. But then can you train it in a reasonable amount of time? Do you have the millions or billions of frames that are required? 
Jon Krohn: 00:44:56
Right. So you’re getting to a limitation. So a big limitation in deep reinforcement learning today, I’m now putting words in your mouth, is that it takes way too much compute. It’s just too expensive to get these really cool applications. 
Wah Loon Keng: 00:45:08
Right. Exactly. So those two are the problems of generalization and also sample efficiency. It’s not sample efficient today to train a complex RL agent in the real world to do useful things, and that’s why you’re seeing a lot of these delivery robots, or even self-driving cars, they’re not RL, because that’s just not the way to go, at least at this point. 
Jon Krohn: 00:45:31
Right. What do they use? Do you know? What does a delivery car- 
Wah Loon Keng: 00:45:36
So like Tesla, if you look at a couple of these like taxi you give [inaudible 00:45:43] It’s really a combination of… Perception would be done by just vision networks. 
Jon Krohn: 00:45:51
Convolutional neural networks. 
Wah Loon Keng: 00:45:52
Yeah, convolutional networks, but they have a special way of merging them, so they call it the Hydra Network, because you have like so many cameras, how do you include the different modes? 
Jon Krohn: 00:46:00
Nice. 
Wah Loon Keng: 00:46:01
And how you learn to perceive, basically. So those are used to perceive objects on the road and also to construct like, where’s the lane, where’s the curb? Stuff like that that you should avoid. But then really for driving or any of the control, they would stick back to like classic well-known algorithms. Sometimes I guess you don’t even have to have deep-learning-learned algorithms, because to drive you say, “Oh, there’s the lane. You can just go straight or turn left.” So simple algorithms work in those cases. 
Jon Krohn: 00:46:36
Right. So maybe even hard-coded. 
Wah Loon Keng: 00:46:39
Yeah. 
Jon Krohn: 00:46:40
Just that, with this particular signal set, you’re using deep learning to process the information from the sensors on the vehicle, but then once you have some kind of pre-processed information down to some more simple signals, you can just then hard code from there what to do in that scenario. And I love that name by the way, the Hydra. That’s from Greek mythology. I think it’s a beast with many heads. 
Wah Loon Keng: 00:47:04
Mm-hmm (affirmative) Exactly. 
Jon Krohn: 00:47:05
And so many cameras all merged into one kind of perception. And then something I think that was big at Tesla AI Day 2021, one of the highlights, was watching how that system can take footage from multiple different cameras and merge it into one concept, and so you could actually… They created a video representation of what that universal concept looks like, and that allows even objects that are occluded to remain persistent. Yeah, which is really cool. Yeah. Anyway, big limitations, sample inefficiency and limited generalization. Do you have any other big limitations today? 
Wah Loon Keng: 00:47:46
Yeah. So we also have reproducibility. That’s more of a research problem, I would say, because of course when you publish something, it has to be reproducible. So if you release code and somebody else implements the same thing from your paper, the performance has to be… You have to match the original. But it’s been really hard because there’s the engineering aspect of it, so if you implement something in TensorFlow versus PyTorch, even if you are really meticulous, there are like some background details, like little bits of math that they do differently, that could lead to different performance. Or even just like tiny little details, if you shift certain steps in a network update, that might also destroy your performance. 
Jon Krohn: 00:48:34
Yeah. Even library versions can be a nightmare, where like, in my book, Deep Learning Illustrated, and so if you use the Docker image from my book, Deep Learning Illustrated, for the Deep Q-learning network that I implement in that book, it trains extremely quickly. So it learns to play a very simple game, Cart-Pole, which is like the hello world of reinforcement learning. And it does every episode of play on my laptop, like it’ll do 10 or 20 episodes per second, so 10 or 20 rounds of gameplay. But in videos that I made later using Google Colab and using more recent versions of, I don’t know if it’s a later version of TensorFlow or Keras, but it’s extremely slow. And I have not been able to figure out what the difference is, and eventually I just gave up and was like, “Well, if you want this to execute a lot faster, go back to this old Docker image that I have, where, for whatever reason, it’s 100 times faster.” 
Wah Loon Keng: 00:49:42
Yeah. These things happen, especially if you jump like a major version, then they change. So for example, they could change the initialization function of a neural network without you even noticing. So those things lead to different performance. But also just in general, deep reinforcement learning is notoriously hard to reproduce, partly because there are so many moving pieces that you have to get right. So it’s not just the algorithm, you have the deep neural networks. You also have the algorithm that collects and then recalculates your values and then passes them to the network. And there’s also the pre-processing from the environment. So there are many different things going on at the same time, and to get them all right and exactly matching whatever the author was using, that’s really hard to do and it takes a lot of time. So it’s no wonder that it’s a problem. And then two other big problems that still remain would be reward design is still a very difficult thing to do [crosstalk 00:50:44] 
Jon Krohn: 00:50:44
Right. Especially in the real world. Like you were saying, it’s easy if you’re playing Tetris, then there’s lots of little actions that lead to an increase in point scores. So it’s easy to define the reward function and what we should be optimizing for. But then you already talked about the example of Montezuma’s Revenge, where it’s very rare to have a point score go up. And so then in the real world, you don’t have any kinds of point scores that are innate.
Wah Loon Keng: 00:51:05
Exactly.
Jon Krohn: 00:51:06
How do you have like… Yeah, for driving a car, how do you define a reward function? 
Wah Loon Keng: 00:51:12
Right. Yeah. Hitting a curb is -10. 
Jon Krohn: 00:51:15
Yeah. 
Wah Loon Keng: 00:51:15
Yeah. In a game, it’s a matter of whether or not you are able to learn more efficiently, but in the real world, that could also mean catastrophic consequences. So with the wrong reward design, the lighter case would be, “Oh, your robot doesn’t work.” The worst case is your robot causes real world damage, it hits something or hits a person, that’s even worse. And yeah, like I said, real life has no reward scores that are displayed in front of you. “Oh, if you go this way, you get certain points.” And there’s also a matter of, even if you specify a score, it’s hard to get it to do what you actually want it to do. Right. That scenario would be called reward hacking, because in gameplay it would be finding loopholes to get a really high score, but doing totally wrong things. But in real life, that could just mean unsafe AI. So it’s a big discussion in AI safety. So in order to have an AI that you can trust and not kill someone, you have to have a really, really carefully designed reward signal. 
Jon Krohn: 00:52:30
Right. So yeah, that’s another, and you have one more limitation. 
Wah Loon Keng: 00:52:35
Yeah. It’s more engineering, so it’s just the cost and complexity of deploying such a system, because you need so many frames to train. It’s the cost of collecting data, the cost of running the algorithm, and then deploying it in the real world. So in a lot of scenarios this becomes unfeasible. So like the Dota bot and the chess bot, last time I looked them up, they weren’t deployed actively as like a trainer for people, because I guess one of the limitations is the cost of actually doing that at scale. 
Jon Krohn: 00:53:09
Got it. All right. Despite these big limitations, which give a lot of researchers things to do, we still nevertheless see a lot of powerful industry applications today. So robotics you mentioned is one thing, so listeners can go back to episode number 503, where Professor Pieter Abbeel focuses a lot on deep reinforcement learning applications to robotics. And Pieter Abbeel is perhaps the biggest name that has ever existed in real-world deep reinforcement learning application. So that is definitely an episode worth checking out. What other kinds of industry applications or industry trends do you see out there, Keng? 
Wah Loon Keng: 00:53:52
All right, so I would start from a textbook example. If you look at control theory or reinforcement learning, the typical applications are robotics and logistics. A really popular example was the optimization of the data center cooling system that DeepMind applied the algorithm to. 
Jon Krohn: 00:54:13
[crosstalk 00:54:13] Google like server warehouses and data centers. 
Wah Loon Keng: 00:54:15
Right. So you have a streaming signal, you have the heat, and you have to use minimum cost to achieve what you’re trying to do, which is cool down your servers basically. Logistics would be something like a routing problem. There’s a challenge on AIcrowd that comes back year after year, for, I think, three years now. It’s basically applying deep RL to solve a routing problem for trains: how do you get trains to arrive on time, or how do you figure out the most efficient scheduling of trains? And you could in principle apply that to the subway in New York, for example, which really needs optimization. 
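The recurring AIcrowd competition Keng describes is most likely the Flatland train-scheduling challenge. As a rough sketch of how such a routing problem can be framed for RL, the state can track each train's position and delay, actions are per-train go/wait decisions, and the reward penalizes total lateness. The class and method names below are illustrative assumptions, not the actual Flatland API:

```python
# Rough sketch of casting train scheduling as an RL environment. The names
# are illustrative assumptions, not the actual Flatland/AIcrowd API.
class ToyRailEnv:
    def reset(self):
        # State: each train's position along its route and its accumulated delay.
        self.state = {"positions": [0, 0], "minutes_late": [0, 0]}
        return self.state

    def step(self, actions):
        # actions: one decision per train, e.g. 0 = wait, 1 = go.
        for i, a in enumerate(actions):
            if a == 1:
                self.state["positions"][i] += 1     # advance along the route
            else:
                self.state["minutes_late"][i] += 1  # holding adds delay
        reward = -sum(self.state["minutes_late"])   # penalize total lateness
        done = all(p >= 10 for p in self.state["positions"])
        return self.state, reward, done, {}
```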
Jon Krohn: 00:55:02
The New York subway system has such bad sensors. It’ll never work. Even with human operators, too many things break down. 
Wah Loon Keng: 00:55:09
Yeah, exactly. 
Jon Krohn: 00:55:10
This is actually really funny. I don’t know if I’ll be able to find it for the show notes, because it’s many years ago now that I was reading it. But if you live in New York and take the subway, it’s unbelievable how often you’ll be in between stations and the train just stops. And then the driver will come on and say, “Sorry, you’re being held because of a signal problem on the tracks.” And so somebody, I can’t remember who they write for, maybe it was the Atlantic, went and tried to find out what these signals are and how they work. And it turns out some of them are like 100 years old and you can’t buy replacement parts. So there’s like one shop in New York that tries to solder things back together. But yeah, anyway so- 
Wah Loon Keng: 00:55:54
Yeah, that’s New York. 
Jon Krohn: 00:55:57
… I digress. Okay, cool. So robotics, logistics, [crosstalk 00:56:01] saving. 
Wah Loon Keng: 00:56:01
Yeah. Inventory management is very typical, and it’s actually an up-and-coming area for applying this. And you do see startups out there doing it. The problem is to manage inventory. What does that mean? Imagine you have a store. Your stock could be grocery items, or even a warehouse, where it could be anything that comes in. You have to figure out, oh, how much do you need to keep in the store? How much do you have to ship out? What’s the rate of in and out? And what’s the cost of storing? Because there are real space constraints, and even things like the climate control in the warehouse. So the idea is to apply RL to manage the whole system end-to-end, as opposed to just using plain deep learning to predict, “Oh, how much is going to come in, how much is going to go out?” 
Wah Loon Keng: 00:56:50
But then if you look at these startups, you can also clearly see that they’re using a hybrid. Because again, these problems of deep RL are probably not going to go away this year. There are certain parts where you really need a simpler approach. For example, for the inventory coming in and going out, you can just use plain deep learning, or a time series model, to predict that. And then once you have those pieces as simplified signals, you can maybe pass them on to an RL algorithm to optimize the whole thing. But yeah, that’s pretty much it. Like I mentioned before, it’s not being deployed as a trainer for chess or Go, or even any of the cyber games yet, but hopefully when the cost comes down, we’re going to see more of that, because it could be useful to certain people. 
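The hybrid pattern Keng describes can be sketched roughly like this: a plain supervised or time-series model forecasts demand, and its output becomes part of the simplified state that an RL policy uses to choose reorder quantities. All of the names and numbers below are illustrative assumptions:

```python
# Sketch of the hybrid pattern: a supervised forecaster feeds simplified
# signals into the state of an RL inventory policy. Illustrative only.
def forecast_demand(sales_history):
    # Stand-in for a supervised / time-series model (e.g. a small neural net).
    return sum(sales_history[-7:]) / 7.0            # naive weekly average

def inventory_state(stock_level, sales_history, holding_cost_per_unit):
    # The RL agent sees simplified signals, not the raw sales stream.
    return (stock_level, forecast_demand(sales_history), holding_cost_per_unit)

def inventory_step(stock_level, reorder_qty, actual_demand, holding_cost_per_unit):
    stock_level += reorder_qty
    sold = min(stock_level, actual_demand)
    stock_level -= sold
    # Reward trades revenue against storage cost and lost sales.
    reward = sold - holding_cost_per_unit * stock_level - 0.5 * (actual_demand - sold)
    return stock_level, reward
```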
Jon Krohn: 00:57:50
Right. Cool. So you have a lot of personal experience with using deep reinforcement learning. So in your book, Foundations of Deep Reinforcement Learning, you talk a lot about an open source framework that you and Laura developed called SLM Lab. It has a great name, which I didn’t know for a long time. Maybe tell us about the name and then also tell us what this framework allows you to do. 
Wah Loon Keng: 00:58:21
Yeah. So a really good friend of mine, Laura, and I worked together to just initially learn about deep RL, because when it first came out, it was really interesting to us. And then we just kind of started coding up all these different algorithms and eventually consolidating and refactoring them into a framework. And eventually that framework became SLM Lab. SLM stands for Strange Loop Machine. It’s pulled from this really great book, GEB, or Gödel, Escher, Bach. Yeah, I really loved that book and it was really- 
Jon Krohn: 00:58:56
That’s a frequent recommendation of guests on the podcast. 
Wah Loon Keng: 00:58:59
Yeah. 
Jon Krohn: 00:59:00
Douglas Hofstadter. 
Wah Loon Keng: 00:59:02
Right. And it’s a really delightful book because it’s not mathy, it’s just like words. And anyone and can really read it, but the concepts in there are really interesting. So yeah, we have that. We call it SLM because again, the loopy nature of RL seems fitting at a time. So that framework was really useful in helping us organize the structure or the topography of deep RL, like what algorithms, how they classify, how they relate to each other. And that really helped us understand the landscape, and then what are the pros and cons of certain algorithms. And then eventually that was used for the book that we wrote, Foundations of Deep RL. And that was very fitting application for it because, oh wow, since we have this framework that organized the algorithms in ways that we think was really helpful for understanding. And also it’s pretty straightforward to translate from say what you see on paper and theory. And you explain that in terms of how does the algorithm work? How does the learning function work? And then you get to see that immediately in code. So that introduces less hoops to jump through when moving between code and theory. 
Jon Krohn: 01:00:27
Yeah. I love it. And so whether it’s an actor-critic algorithm, or a deep Q-learning network, or the REINFORCE algorithm, you can explain in the book how the algorithm works, the math behind it. 
Wah Loon Keng: 01:00:39
Right. 
Jon Krohn: 01:00:39
But then if the reader doesn’t want to implement it themselves- 
Wah Loon Keng: 01:00:42
They can run it. 
Jon Krohn: 01:00:43
… they can go to the SLM Lab and they can get this agent right out of the box that works. And the cool thing about your framework is that you can plug that into lots of different kinds of environments. So you have the agent, it’s going to work, you might need to adjust some parameters for some particular environment, but you can plug into several of the big environment packages that are out there, right? 
Wah Loon Keng: 01:01:05
Yeah. So out of the box, it works with the classic Atari suite, because we had to. There’s also integration with Unity ML-Agents. And we’ve had contributors come in like, “Hey, I need to apply this to my thesis, and I’m working on Doom.” One person who reached out was working on the classic Doom game, and so he ended up writing the integration for that. Because the API defines the interface to the environment, anything that follows the same interface can just be plugged in easily. So currently we have that for Atari, we have some Doom games, we have Unity, but I think, yeah, that’s pretty much it. 
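The portability Keng describes comes from every environment exposing the same reset/step interface. Below is a sketch of what such an adapter might look like for a Doom game; the wrapper names are illustrative, not SLM Lab's actual adapter classes, and the inner calls assume VizDoom's DoomGame methods:

```python
# Sketch of a common environment interface plus a Doom adapter. The wrapper
# names are illustrative (not SLM Lab's real classes); the inner calls assume
# VizDoom's DoomGame API.
class EnvInterface:
    def reset(self):           # -> initial observation
        raise NotImplementedError

    def step(self, action):    # -> (observation, reward, done, info)
        raise NotImplementedError

class DoomAdapter(EnvInterface):
    def __init__(self, doom_game):
        self.game = doom_game                       # e.g. a vizdoom.DoomGame instance

    def reset(self):
        self.game.new_episode()
        return self.game.get_state()

    def step(self, action):
        reward = self.game.make_action(action)      # action: list of button states
        done = self.game.is_episode_finished()
        obs = None if done else self.game.get_state()
        return obs, reward, done, {}
```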
Jon Krohn: 01:02:03
Cool. And then not long after you started working on the book and you were developing SLM Lab, you also started working, applying deep reinforcement learning as your job. 
Wah Loon Keng: 01:02:16
Right. 
Jon Krohn: 01:02:17
So you became an AI engineer at a company called Machine Zone. So what does it mean to be an AI engineer? What does Machine Zone do? 
Wah Loon Keng: 01:02:26
Right. Good question. At first I got hired to do game RL, so that’s a really straightforward application of reinforcement learning. 
Jon Krohn: 01:02:34
So you’re building like the enemy, the bad guy in a video game or something. 
Wah Loon Keng: 01:02:39
Yeah, it would be a game-playing agent that learns. In classic computer game AI, or what we call good old-fashioned AI, they’re called AI, but they’re typically programmed by hand. So if you play something like Dota, for example, for a long time the AI was programmed by hand, in a file, and you can actually go and modify that. But this would be something different: actually applying RL to make an opponent that people can play against, and kind of making it more and more challenging. So yeah, that’s one application of RL that’s really straightforward. 
Wah Loon Keng: 01:03:15
And yeah, back to the question of what it means to be an AI engineer. That’s a very good question. I think it depends… Of course, it’s different in different places, and in my particular case, because we have a small team, I end up doing things end-to-end myself. So in my case, it’s the whole package. You go from understanding a problem, to figuring out how to solve it, so, “Okay, what model, what method do I use? What algorithm do I use given the data I have?”, to coding up the model. And of course there’s the whole… There’s a phrase that says maybe modeling is like 10% of the whole work, and a lot of it is engineering. That is very true. So I do all the engineering as well, and that includes handling the data, the pipeline- 
Jon Krohn: 01:04:09
Data engineering. 
Wah Loon Keng: 01:04:10
Right. Data engineering stuff, and also deployment stuff. 
Jon Krohn: 01:04:14
ML engineering, 
Wah Loon Keng: 01:04:14
Right. ML engineering in general. So yeah, a lot of time is spent on engineering. So really the modeling is the easy part. 
Jon Krohn: 01:04:26
Yeah. 
Wah Loon Keng: 01:04:26
But the engineering is the real work that I have to do. 
Jon Krohn: 01:04:29
Yeah. Which is why, if you’re listening and you aren’t already involved as a practicing data scientist, this is a really big key point, which I’m sure we’ve talked about on the show before: the big bottleneck isn’t in model training, it’s in deployment and data engineering, getting the data in. So if you want to get into the space, being a great software engineer makes you even more valuable. It’s great to be a data scientist, and there are lots of opportunities for data scientists out there, but there are an order of magnitude more opportunities for people who can engineer data pipelines or deploy machine learning models into production systems. 
Wah Loon Keng: 01:05:11
Exactly. Yeah, because the combination is pretty, I would say, mentally taxing, because you have a lot of things to keep on top of your mind and keep up to date. So yeah. 
Jon Krohn: 01:05:23
What kinds of tools and approaches do you use day-to-day to keep up to date? 
Wah Loon Keng: 01:05:29
Oh, to keep up to date? 
Jon Krohn: 01:05:31
No, not to keep up to date. I just mean, what are the tools that you’re using day-to-day that- 
Wah Loon Keng: 01:05:36
Right. If I were to list them out on my resume, it would be really simple. Just three items: Python, PyTorch, and Kubernetes for deployment, because I have to manage deployment. But yeah, in terms of data pipelines and stuff like that, it can be anything. Even in the same company you could have a few different pipelines coming to you, but really it’s just figuring out how to plug into them in Python. So just the big three. 
Jon Krohn: 01:06:06
Right. So Python is the programming language of choice for you and most data scientists. PyTorch is the particular automatic differentiation library that you go to, which I totally understand. It is a lot more fun than the big alternative, which is TensorFlow. 
Wah Loon Keng: 01:06:22
Yeah. 
Jon Krohn: 01:06:23
And I’ll try to remember to put in the show notes a link to a 40-minute talk I gave on TensorFlow versus PyTorch and why you might choose one or the other. And actually, spoiler alert, you might as well just learn both. They’re pretty similar actually, and they become more and more similar all the time. And then Kubernetes you’re using for managing your deployment. I already talked about Docker images earlier with respect to library versions. Docker allows you to pin very specific library versions, so you can have all the library versions that you know play nicely together, kept separate from whatever else you have going on on a given machine. And so that’s Docker, and Kubernetes allows you to take those Docker containers and deploy them into production systems and have them scale up depending on need and that kind of thing. 
Wah Loon Keng: 01:07:14
Right. And also when you’re deploying an ML system in a general industrial setting, a lot of times you also have to play with the other pieces. Let’s say you’re doing something as simple as serving a classifier: then you have to think about, oh, how does the image come in? How does it go out? And you have to consider little things that you usually wouldn’t think about in research. For example, performance, the latency, would be very important. So all of that is also an art, and that’s where engineering becomes really, really important in allowing an ML product to be useful in the real world. 
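To make the serving concerns concrete, here is a minimal sketch of an image-classifier endpoint that measures its own latency. FastAPI and the TorchScript model path are illustrative choices, not necessarily what Keng's team uses in production:

```python
# Minimal sketch of serving a classifier with latency measured per request.
# FastAPI and the "classifier.pt" artifact are illustrative placeholders.
import io
import time

import torch
import torchvision.transforms as T
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
model = torch.jit.load("classifier.pt")            # placeholder TorchScript model
model.eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    start = time.perf_counter()
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    batch = preprocess(image).unsqueeze(0)          # shape: [1, 3, 224, 224]
    with torch.no_grad():                           # inference only, no autograd overhead
        logits = model(batch)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"class_id": int(logits.argmax()), "latency_ms": round(latency_ms, 2)}
```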
Jon Krohn: 01:07:56
Nice. Super cool. All right. So very cool to hear about how deep reinforcement learning is being used in industry today. What do you think is the future of reinforcement learning in industry? 
Wah Loon Keng: 01:08:10
I think we still have a long way to go for sure. There are problems that have to be solved. Of course it would be nice to start seeing deep RL in our daily life, like delivery robots for example, or more agile, more useful robots. Imagine something that does your laundry. That would be a really big time-saver for people. But yeah, there’s still a long way to go. I think at least the trend is that we have to solve the sample efficiency problem before we can leap over to the real world, and also generalization, because the real world is so much more complex. Like, you might spill a cup of coffee and then your robot just doesn’t know what to do. 
Wah Loon Keng: 01:08:58
And I think just in general, the distillation of the techniques as well. So you’re seeing the open-ended play, for example with the Dota bot, and DeepMind’s Alphas, or the Mus. 
Jon Krohn: 01:09:13
Yeah. 
Wah Loon Keng: 01:09:14
Yeah. So those are still not at the point where engineers like you and I, in our free time, can just go and git clone something and start playing with them. Classic algorithms like DQN have gotten to that point, but these more advanced and also more expensive algorithms are still not there, for clear reasons. They are incredibly complex, with a lot of pieces, and also very expensive to run. It’s like GPT again: it’s not for everyone to run on their laptop because of the compute costs required. But I think these are really important problems to solve for RL to become useful in industry. 
Jon Krohn: 01:10:03
Cool. Awesome. Does that also cover… I know you had some points on what’s really cool in research today, but yeah, are there other big things going on in research that could bridge some of these gaps that you’re describing, sample inefficiency, generalization? 
Wah Loon Keng: 01:10:22
Right. We did talk about the idea of computation history and how that ties to sample efficiency. So one direction is a better learning algorithm. We haven’t had a new RL algorithm in quite a while. Back in 2017, they popped up every month and you had a new algorithm, so we don’t know if we’re already hitting the saturation level for that. But then DeepMind’s work on open-ended play is really interesting because it ties into this idea of emergence, like emergent behavior. And that’s what Eric Jang said in his post: you should ask for generalization rather than picking a specific task, let’s say, “Oh, do you think X?”, and just training for that. Because we humans are the best example we have to reference, and that’s sort of how we learn. You don’t specifically learn something, you play as a child and understand the world without knowing that you’re learning these concepts. 
Jon Krohn: 01:11:33
Yeah. You don’t know whether you’re going to become a fireman or an AI engineer. 
Wah Loon Keng: 01:11:37
Yep, exactly. So open-ended play, I think that trend is going to continue. More complex environments are going to be tested on, because if you look at the blog posts, they aren’t just simple block worlds, they are complex block worlds. So there might be things like… For example, for an AI to learn about texture, like whether or not something is hard, do I grab it with a lot of force, or a little force? That is very hard to do in a simulator. 
Jon Krohn: 01:12:09
Right. Oh man, an egg would be so tricky. 
Wah Loon Keng: 01:12:11
Exactly. 
Jon Krohn: 01:12:11
An egg looks so firm. 
Wah Loon Keng: 01:12:12
Yep. But if you crush it. Yeah. Or something like fur or hair. 
Jon Krohn: 01:12:20
A puppy. 
Wah Loon Keng: 01:12:21
Yeah. A puppy. That’s the sort of thing. If you’re a supervised learning algorithm looking at a picture of a puppy, you might think, oh, the fur is hard, right, because it doesn’t move. You can’t tell. We know because we have experience with them. So yeah, open-ended play, more emergent behaviors, we’re going to see more of that for sure. And I think, like you said, with going towards general intelligence, basically learning from humans as the example, we’re starting to see this really old work from Gibson come back. He wrote this book that basically talks about how we perceive our environment and how we learn from the environment. So we’re starting to see researchers in the field applying these ideas from the ’70s, but they’re really valid ideas and very interesting to try out, and we’re starting to see some of that as well. So yeah, I can’t wait to see what they come up with. 
Jon Krohn: 01:13:33
Nice. Before we get to a book recommendation to kind of wrap things up here, there is one thing that I want to ask you about. You get a huge amount done in your life. You’re an AI engineer, which is a highly challenging career to begin with. But then on top of that, you’re doing a lot of open source development, and you’ve written an amazing book. So do you have any productivity tips? For example, I’ve noticed that you seem to always be wearing the same outfit. 
Wah Loon Keng: 01:14:01
Yeah. All black is the way to go. Well, that’s actually a productivity tip, because if you’re always wearing the same thing, you don’t have to waste time thinking and choosing. 
Jon Krohn: 01:14:11
And it goes with everything. 
Wah Loon Keng: 01:14:12
Yeah, exactly. 
Jon Krohn: 01:14:13
So, when you open your wardrobe, is it a lot of the same black T-shirt or is it different like from different brands? 
Wah Loon Keng: 01:14:22
No, it’s the same. 
Jon Krohn: 01:14:23
It’s all the same black T-shirts? 
Wah Loon Keng: 01:14:24
It’s almost like factory issue. It’s the same [crosstalk 01:14:27]. 
Jon Krohn: 01:14:28
It’s like your cartoon character. 
Wah Loon Keng: 01:14:32
I do change daily, so don’t worry. 
Jon Krohn: 01:14:35
Sure. That’s awesome. Well, I really admire that. I wish I did it. Even if it’s not a huge amount of your day, some of your day definitely goes to wardrobe decisions, so why not eliminate that? I think it’s a really good idea. All right. So then what’s your book recommendation, Keng? 
Wah Loon Keng: 01:14:52
So my book recommendation would be Gibson’s book, The Ecological Approach to Visual Perception. It’s a pretty old book, published in 1979. So actually, do you know Don Norman? 
Jon Krohn: 01:15:07
I don’t know Don Norman. 
Wah Loon Keng: 01:15:09
The Design of Everyday Things. He talks about human-centered design. Don Norman knew Gibson, and they were actually working at the same place. So he took Gibson’s ideas and applied them to industrial design. That’s where this idea of human-centered design comes from. The Ecological Approach to Visual Perception is really about how we perceive. In the course of his professional life, Gibson was basically hired by the military to solve this problem of visual perception from the air. Because if you’re flying a reconnaissance plane, how do you tell what’s what, or how far you are from the ground? Stuff like that. And there are mathematical ideas about, “Oh, do the triangulation and figure out…” But then he realized that when something is so far away, it doesn’t matter, because we can’t resolve it that finely anyway.
Jon Krohn: 01:16:06
Right. 
Wah Loon Keng: 01:16:08
So a lot of the ideas, well, without spoilers, are about this concept of being in a world. You have a creature, so it’s not exclusively about humans; it also talks about animals, how we move around our environment, and how we had to learn to perceive things as quasi-categories. So it’s a really fascinating reflection as a person, and also as someone who works in the AI field, because you get a lot of inspiration from it, but you also get to understand yourself better. 
Jon Krohn: 01:16:48
Right. So like how our mind works to go from specific instances to more general cases. 
Wah Loon Keng: 01:16:54
That too. But mostly it’s really down to earth examples of, “Oh, how do I know a cliff is a cliff and I’m going to die if I step off a cliff?” 
Jon Krohn: 01:17:01
Right. Cool. 
Wah Loon Keng: 01:17:02
And yeah, how does an animal learn that? 
Jon Krohn: 01:17:07
Nice. That’s a very cool recommendation, Keng. All right, so clearly you have an incredible depth of knowledge on reinforcement learning, particularly industry applications. So how can interested listeners get more from you? Obviously they have your book, they have your open source library, but is there some way to stay in touch with you? 
Wah Loon Keng: 01:17:28
Well, I would say email. I’m pretty old fashioned. I don’t go on Twitter or any of the social media, but you can find my email on GitHub. 
Jon Krohn: 01:17:36
Nice. All right. So we’ll be sure to include your email in the show notes. And if people want to work with you, you have some openings, right? 
Wah Loon Keng: 01:17:44
Oh yes. We’re hiring. I work at AppLovin now. Machine Zone is a studio under AppLovin, so I’m in that particular studio, and we’re hiring ML engineers, both at Machine Zone and at the umbrella company, I mean AppLovin. 
Jon Krohn: 01:18:02
Yeah, the acquiring company. So Machine Zone, which you started working at in the Bay Area years ago, was acquired by AppLovin, and both the bigger company, AppLovin, and Machine Zone specifically, the studio that you work at, are hiring machine learning engineers. 
Wah Loon Keng: 01:18:19
Right. So yeah, AppLovin is mostly a leading marketing platform for apps, and of course we’re hiring in general for all roles, including ML engineers. But also specifically my team at Machine Zone is hiring a monetization engineer and an ML engineer. 
Jon Krohn: 01:18:37
Cool. All right. So there you go. You could be working right beside him all day, remotely for the most part probably. All right Keng, thank you so much for being on the show. Thank you for coming in person to film with me. 
Wah Loon Keng: 01:18:50
It’s a pleasure. 
Jon Krohn: 01:18:51
It’s so much fun doing it this way, and yeah, looking forward to maybe having you on the show again in the future. 
Wah Loon Keng: 01:18:57
Yeah, totally. It’s good to be back. Good to see you again. 
Jon Krohn: 01:19:00
Nice. All right. Thank you, Keng. 
Wah Loon Keng: 01:19:02
Thanks. 
Jon Krohn: 01:19:08
How fortunate we were to have a deep, deep reinforcement learning expert like Keng on the show today. In today’s episode, he filled us in on how reinforcement learning is vastly different from other machine learning approaches like supervised and unsupervised learning; the timeline of reinforcement learning breakthroughs from the ’80s and ’90s, like actor-critic algorithms and TD-Gammon, through to the game-changing emergence of deep reinforcement learning in the past decade with approaches like Deep Q-learning, AlphaGo and MuZero. He talked about the limitations of deep RL today, such as sample inefficiency and limited generalization. He talked about how current and future deep reinforcement learning research may overcome these limitations, display further emergent behavior in open-ended play, and become widely applicable across autonomous vehicles and everyday robotics tasks. And he told us about his open-source SLM Lab framework, which enables you to use state-of-the-art deep reinforcement learning agents out of the box in Python across a range of environments, from Atari games to Doom. 
Jon Krohn: 01:20:08
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Keng’s LinkedIn profile, as well as my own social media profiles at www.superdatascience.com/551. That’s www.superdatascience.com/551. If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter, and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. 
Jon Krohn: 01:20:41
All right, thank you to Ivana, Mario, Jaime, JP and Kirill on the SuperDataScience team for managing and producing another deeply educational episode for us today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 