Kirill Eremenko: This is episode number 275 with Machine Learning Research Scientist, John Langford.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur and each week we bring you inspiring people and ideas to help you build your successful career in Data Science. Thanks for being here today and now let’s make the complex simple.
Hadelin De Ponteves: This podcast is brought to you by Bluelife AI. Bluelife AI is a company that empowers businesses to make massive profit by leveraging artificial intelligence at no upfront cost.
Kirill Eremenko: That’s correct. You heard it right. We are so sure about artificial intelligence that we will create a customized AI solution for you and you won’t need to pay unless it actually adds massive value to your business.
Hadelin De Ponteves: So if you are interested to try out artificial intelligence in your business go to www.bluelife.ai, fill in the form and we’ll get back to you as quick as possible.
Kirill Eremenko: So once again, that’s www.bluelife.ai and Hadelin and I both look forward to working together with you.
Kirill Eremenko: Welcome back to the SuperDataScience podcast ladies and gentlemen, super excited to have you back here on the show because I literally just got off the phone with John Langford, who is our guest for today's episode. So what you need to know about John is that he's a very influential research scientist in the space of machine learning. He's got dozens of published papers, which you can find on arXiv. He constantly gives talks at different conferences. For instance, just before this conversation, he had returned from ICML, the International Conference on Machine Learning, where he gave not one, not two, but three top talks during the course of this seven-day event. He also contributes a lot of open source code to the online machine learning community. He's got a wonderful tool that he's been developing for many years which is called Vowpal Wabbit and we'll talk about that during the podcast as well. And he also works for Microsoft.
Kirill Eremenko: So as you can imagine on this podcast, John and I could delve into quite some deep topics and that's exactly what happened. We dove deep into things like unsupervised learning, supervised learning, reinforcement learning, the differences between the three, the advantages, disadvantages, contextual bandits. You will learn so much about contextual bandits in this podcast. In fact, I felt like a student during this podcast, I was learning from John, absorbing all this knowledge, and I found it extremely interesting. In addition to that, we talked about applications of contextual bandits and reinforcement learning in general, YOLO-style algorithms versus simulator algorithms, techniques for avoiding local optima, the balance between exploration and exploitation, learning to search, active learning, one-shot learning, deep reinforcement learning and many, many more topics.
Kirill Eremenko: So a very interesting podcast, especially if you've been waiting for one with somebody who's on the forefront of research, someone who can give you the freshest and most revolutionary updates in the space. And that is John, and this is your podcast to listen to. So without further ado, I bring to you Machine Learning Research Scientist John Langford.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen, super excited to have you here on the show and today’s guest is John Langford calling in from New York. John, how are you going today?
John Langford: I’m all right.
Kirill Eremenko: Fantastic. Well, you just came back from LA, how was your trip over there?
John Langford: I was visiting ICML, the International Conference on Machine Learning. It was fun. The conference is now seven days long, which is quite large, but there are a lot of good things going on.
Kirill Eremenko: Did you give a talk or were you just attending listening to others?
John Langford: I gave several talks.
Kirill Eremenko: Wow. What were the talks about?
John Langford: Well, the first one was on the first day, which was Sunday, the first Sunday that I was there. This was part of the Industry Expo Day, where we were talking about this new Personalizer service and the technology behind it in terms of contextual bandits. And then I gave another talk in the Real World Reinforcement Learning Workshop on Saturday morning. That was about, I guess, how you need to shift your priorities in order to do research which is useful for real world applications.
Kirill Eremenko: What does that mean, shift your priorities in terms of what?
John Langford: Yeah, many people in reinforcement learning work in simulated environments and they would like to do things in the real world, but they sort of, they don’t have experience with that. So it was discussing how you need to change your priorities a little bit when you’re working in real world environments.
Kirill Eremenko: Okay, so less simulations, more real world data?
John Langford: Yes. Not just real world data, but real world data sources to that extent.
Kirill Eremenko: Like for instance, like web traffic and conversions on a website, something like that or less [crosstalk 00:07:37]?
John Langford: Absolutely. Yeah, maybe a very common example.
Kirill Eremenko: Okay. Okay. Got you. So two talks was it, that’s already a lot in seven days, or did you give another one?
John Langford: Yes, Saturday night I gave a talk at the LA ML meetup.
Kirill Eremenko: What was that all about?
John Langford: That one was also about applying or doing real world reinforcement learning.
Kirill Eremenko: Okay. Got you. Okay, cool. That’s like a very busy week and I really appreciate you coming on the podcast, especially after such a big weekend, also the flights from LA to New York. So is that like your core passion? A lot of the things that I’ve read about you and your work, which is fantastic. Oh, by the way, I forgot to say, congratulations on the award that you won just recently, like you shared with me a few hours ago, how was that? How did you feel about that?
John Langford: Thank you. I mean, it seems great. This is a service that we've been working on, and I believe there'll be a big transformation in how machine learning is actually deployed in the world as it becomes used more and more often. So it's a question of the scope of application of learning paradigms. Supervised learning is a very easy paradigm, in the sense that it's statistically safe. But reinforcement learning, I think, will become a more dominant paradigm in the future. This is the first service enabling that, which hopefully gets a lot of traction.
Kirill Eremenko: As I was saying, as I understand from your biography, or from the information available about you and your work online, reinforcement learning is kind of like your core passion. Is that right?
John Langford: Yes, that’s right. I still do a lot of other things of one sort or another. So for example, to NIPS, we submitted an active learning paper and a neural architecture search paper. But I feel like reinforcement learning is where the really revolutionary stuff is happening right now.
Kirill Eremenko: Why is that?
John Langford: It's because reinforcement learning is a very fundamental, natural paradigm for learning. If you think about where supervised learning gets applied in nature, well, it gets applied with people to a fair extent, and perhaps with other social animals, where they demonstrate to each other how to do things. But if you think about reinforcement learning, well, reinforcement learning applies all the way down to the worm level, where worms learn to avoid and seek various things based upon reinforcement signals.
Kirill Eremenko: Okay. Interesting. So what I'd like to ask is this: you're identifying two main areas, supervised learning and reinforcement learning. How about unsupervised learning, such as, I don't know, self-organizing maps, Boltzmann machines, autoencoders? They don't fall under either of those categories. Is that... or am I getting something wrong?
John Langford: You're right, it doesn't fall into those categories. In terms of... I guess unsupervised learning is something where we lack a strong understanding at a theoretical level of where it's useful. So it's something I tend to overlook, although there are times, of course, when unsupervised learning has been very helpful to me.
Kirill Eremenko: Like when?
John Langford: When you want to build a profile of something. I'm thinking of LDA, Latent Dirichlet Allocation, where you want to summarize a document across a set of topics and a set of documents. You don't know what the topics are in advance, and yet nevertheless you can explain the topics in the document using LDA.
Kirill Eremenko: Okay. Got you.
John Langford: So it can be a very useful summary of these documents.
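As a rough illustration of the kind of topic summarization John describes, here is a minimal sketch using scikit-learn's LatentDirichletAllocation; the tiny corpus, the library choice and the parameter values are illustrative assumptions rather than anything from the episode.

```python
# A minimal sketch of summarizing documents over latent topics with LDA.
# The tiny corpus and parameter choices here are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat and chased the mouse",
    "stocks rallied as markets reacted to interest rate news",
    "the dog barked at the cat in the garden",
    "central banks weigh inflation against growth in rate decisions",
]

# Bag-of-words counts are the standard input for LDA.
vectorizer = CountVectorizer(stop_words="english").fit(docs)
X = vectorizer.transform(docs)

# Ask LDA for two latent topics; we never tell it what the topics are.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is a document's mixture over the discovered topics,
# which serves as the kind of compact profile John mentions.
print(lda.transform(X))
```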
Kirill Eremenko: All right. So basically, unsupervised learning doesn't have as wide a range of applications as supervised or, even more so, reinforcement learning?
John Langford: So unsupervised learning is in some sense extremely natural because you don't require any kind of label or reward source, but the problem is that it's sort of undirected by default, which means that it's hard to understand how it will benefit you other than just trying it.
Kirill Eremenko: Got you.
John Langford: Particularly in a high dimensional space, think about clustering. It’s easy to cluster things and there may be multiple possible clusterings that are available and maybe only one of them is actually relevant to how you want to apply it.
Kirill Eremenko: Then you have to iterate in order to find that one, and actually apply some sort of domain knowledge in order to identify what that one is?
John Langford: Yeah. I guess I'm thinking about something like clustering in image space, right? So you have a megapixel camera; it could be that you would like to naturally cluster the dogs separate from the cats, but instead, just because of the way clustering works, you end up clustering the darker pictures from the lighter pictures. Right? And so guiding an unsupervised learning algorithm to come to a good solution is a relatively ad hoc process. While for supervised learning and reinforcement learning, or at least some forms of reinforcement learning, it seems like we can just mechanistically solve these things, or near mechanistically solve them. We have a paradigm for how to think about how to solve these things, which is much more complete.
Kirill Eremenko: Got you. At the same time, for me, I think unsupervised learning borders closer on the line of creative thinking in terms of machines. For instance, I completely agree with your example of the images of dogs and cats, indeed that could be the outcome, probably would be the outcome. But if we take a different example, for instance, where we have customers in a company, and we have like a 100 different fields for each customer, covering everything from their social demographic status to their recent transactions, to their preferences, to basically every piece of information that is available on these customers. In that case, clustering with an unsupervised algorithm could be very useful in terms of identifying groups of customers that we, even through domain knowledge, cannot identify in such a multi-dimensional space.
Kirill Eremenko: Whereas if we applied clustering, we might come up with a few very interesting insights in our data and then use that clustering further, to then in the future turn it into classification, and then it becomes supervised. What are your thoughts on that?
John Langford: Yeah, so I think this is a common approach to things. I think it can often be a good, sane first-pass approach to things. I guess my experience is that it's not what you end up doing in the long term. So let me give you an example of this. The very first applied contextual bandits paper was back when I was at Yahoo Research. We were doing this for personalized news, right? And the way it worked was we clustered users into five different segments and then we basically personalized to the individual segment. So this works, it did give some performance improvement, but at the same time you sort of bottlenecked the information available to the contextual bandit algorithm to the segmentation.
John Langford: So you would prefer to expose more information to the contextual bandit algorithm rather than just the segmentation of the users produced by the clustering. So maybe that segmentation is a useful feature that you would want to augment the information, the profile of the users you're suggesting to, but at the same time, giving the algorithm access to the raw information, the pre-unsupervised-learning information, can result in a higher ceiling on performance.
Kirill Eremenko: Okay. Very, very interesting. Thank you for breaking that down. So moving back to reinforcement learning versus supervised learning, you really believe that the future belongs to reinforcement learning. Is it mostly because, like you said, we can implement it more mechanically with the whole reward system and we don't need those massive labeled data sets beforehand, or is there some other massive advantage that I'm missing?
John Langford: So not needing labeled data is obviously a huge win because you eliminate a large amount of difficulty from applying a learning approach. I think there's also just the pure scope of application. There are many situations where you just can't label data effectively; you have to learn from rewards. So I'm thinking about... let's think about personalized news again, because I was talking about that before. So I've seen people try to do supervised learning for personalized news, and editors just don't really know what people are that interested in. Right?
John Langford: They have a good general sense of what might be interesting, but a specific sense of whether this user is interested in this article is just lacking, because all they know about a user is some previous browsing history at most, and this is not a very strong indicator for an individual editor. An example I give in my talks is, would an article about Ukraine be interesting to me? Kind of hard to say. I haven't actually looked at a news story about Ukraine in a while and yet it turns out I would be interested, because my wife is from Ukraine. Right. So this inability to have the information necessary to answer the problem is endemic to a lot of personalization scenarios. And then you just need an algorithm which directly learns from the interaction rather than learning from editorial labels.
Kirill Eremenko: Okay. So how would reinforcement learning solve this challenge with not knowing that your wife is from Ukraine?
John Langford: Yeah, so in reinforcement learning, you would have an algorithm that explores other alternatives to some extent. And then based upon the outcome of different choices, you get some sort of reinforcement feedback and then based on that reinforcement feedback over the explored alternatives, you can learn to choose the alternative, which is best.
Kirill Eremenko: So there’ll be some iterative process?
John Langford: Not just iterative, I mean, for the news application, you'd do it in near real time. So MSN is another project where we were playing with this, and we deployed a new model every five or 10 minutes. Let me give a little more detail here. You're trying to do reinforcement learning. We're going to limit ourselves to essentially only trying to optimize the immediate reward, which is a special case of reinforcement learning called contextual bandits. So you're going to have an online learning system, which has a constant stream of events coming into it, and you're going to checkpoint the model. So, stepping back, doing online learning means that you're looking at an example, you're updating your model and then you're dropping the example.
Kirill Eremenko: So you basically don’t have the luxury of running your model through a training dataset, then a test data set, finalizing your model, then deploying it. The model has to learn as you go?
John Langford: Yeah, so there are two reasons here. One reason is computational: there's actually quite a lot of data associated with some of these data feeds. Another reason is that the problem is very non-stationary, so the set of news articles, as well as user interests, is changing on an hourly basis. And then... yeah. We're looking at every individual example, and then we're just checkpointing the model every five or 10 minutes and then using that model to make the decisions for the next five or 10 minutes.
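To make the loop John is describing more concrete, here is a minimal pure-Python sketch of online contextual-bandit learning with periodic checkpointing. The epsilon-greedy linear learner, the five-minute interval and the simulated reward are illustrative assumptions; the production system John mentions runs on Vowpal Wabbit, not anything like this code.

```python
# Sketch of online contextual-bandit learning with periodic checkpointing.
# The linear epsilon-greedy learner is an illustrative stand-in only.
import pickle
import random
import time

import numpy as np

N_ACTIONS, N_FEATURES, EPSILON, LR = 5, 10, 0.1, 0.05
weights = np.zeros((N_ACTIONS, N_FEATURES))  # one linear scorer per action
last_checkpoint = time.time()

def choose_action(context: np.ndarray) -> int:
    """Epsilon-greedy: mostly exploit the best-scoring action, sometimes explore."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(weights @ context))

def update(context: np.ndarray, action: int, reward: float) -> None:
    """One online gradient step toward the observed reward, then the example is dropped."""
    prediction = weights[action] @ context
    weights[action] += LR * (reward - prediction) * context

def maybe_checkpoint(interval_seconds: float = 300.0) -> None:
    """Every few minutes, snapshot the model that will serve the next window of traffic."""
    global last_checkpoint
    if time.time() - last_checkpoint >= interval_seconds:
        with open("model_checkpoint.pkl", "wb") as f:
            pickle.dump(weights, f)
        last_checkpoint = time.time()

# Simulated event stream: in production each event would be a real user visit.
for _ in range(1000):
    context = np.random.rand(N_FEATURES)
    action = choose_action(context)
    reward = float(random.random() < 0.2 + 0.1 * action / N_ACTIONS)  # fake feedback
    update(context, action, reward)
    maybe_checkpoint()
```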
Kirill Eremenko: Okay, I see. Got you.
John Langford: All right?
Kirill Eremenko: Yeah, yeah, makes sense.
John Langford: So that’s super effective, in my experience. I know of dozens of applications of this sort, which are of significant value.
Kirill Eremenko: Okay, and they can be applied in many different areas of business. Like one example we just gave was the optimization of, for instance, conversion rates for ads on websites or YouTube and so on. But that's just one example, there's plenty of applications. Are there any applications in business or industry that stand out to you, that you could name as classic examples you use in your talks, for instance?
John Langford: So I've seen examples of applying contextual bandits in ads, as you suggested, in content recommendation of various sorts, in webpage layout, in bots, where you use contextual bandits to help choose the next element of the conversation given the context of the conversation. I've seen it used in... wellness is a very big example as well. So in statistics, there's quite a bit of literature on nudging people into healthier habits. So do you tell somebody, "You should really walk another 10,000 steps," or do you say, "Hey, your neighbor walked 10,000 more steps"?
John Langford: Those are two different ways to get the message across, and individual people respond in different ways. So you want to personalize it. So that's the scope of personalization applications that I've seen. But there's also... it's a very natural paradigm, so there are other kinds of applications where you're trying to optimize systems, for example.
Kirill Eremenko: Yeah, for the systems and processes within the business, like operations or flows within a factory and things like that.
John Langford: I haven't seen flows within a factory. I would just suspect that for flows within a factory you really need to go beyond contextual bandits, because there's a lot of state in a factory. But I have seen examples of using contextual bandits for... I think often in an application you have some parameter you need to tune, right? And you don't know how to tune this, but nevertheless you can sort of determine after the fact whether or not you had a good tuning, and then you can apply these techniques in order to help you do that tuning.
Kirill Eremenko: Nice. So basically you could use reinforcement learning and contextual bandits to tune hyperparameters of other artificial intelligence models that you're building?
John Langford: You can, I haven’t seen it be useful in that particular application, but I have seen it be useful to… think about applications where you’re trying to make an ambiguous decision. What’s a good example?
Kirill Eremenko: Like, let’s say you have a warehouse and you build a simulator for the warehouse and you have the floor space, how many trucks are coming in, how many employees are working at that warehouse in order to sort all the inventory that’s coming in, something like that.
John Langford: I wouldn't want to apply a contextual bandit there because this is again a stateful question. I have seen contextual bandits applied to things like how long do you wait for a... so in the cloud, when you have a virtual machine that becomes non-responsive, you could immediately start a new virtual machine to take its place, or you could wait for it to become responsive. How long do you wait? Right.
John Langford: So it's a decision where after the fact you could say what the right decision would have been if you waited long enough, but you don't necessarily want to wait long enough in order to be able to determine what the right decision would have been in retrospect. Instead, if it's never coming back, you want to just immediately restart. If it is going to come back after a minute or two, if that's typical behavior, then you would like to just wait a minute or two because it's a lower overhead than restarting the VM.
Kirill Eremenko: Okay, got you. So in order to understand contextual bandits a bit better, could you give us maybe an overview, what’s the difference between a contextual bandit and a normal bandit?
John Langford: So a normal bandit... first of all, let's start with the nomenclature. Where does "bandit" come from? It comes from a theoretical model where you're going to Las Vegas and you have these machines called bandits, slot machines, where you put money in, you pull the arm and you see what kind of performance you get out, or what kind of reward you get. Right. And then a multi-armed bandit is one where you have multiple possible choices that you could make. And then a contextual bandit is essentially a multi-armed bandit where you have context features available to help you make that decision.
John Langford: So in terms of applications, I think there are some applications where you can benefit from just normal bandits, but typically you benefit much more from having a context, and in general it just seems wasteful to not use context when it's available.
Kirill Eremenko: Makes sense. So, for instance, somebody lands on your website, and it's not even A/B testing anymore: instead of A/B testing six different ads, you can apply a normal bandit, and just through lots and lots of people going through the website, lots of results, you can see which ad works the best out of the six. Or you can apply a contextual bandit and actually use some information you might have about this user, for instance the webpage they came from or, I don't know, the time zone they're in, or the browser, or mobile versus desktop, that type of context. And that can help inform your choice of ad.
John Langford: Exactly, yes.
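To illustrate the distinction in Kirill's example, here is a small sketch contrasting a plain multi-armed bandit (one value estimate per ad) with a contextual bandit that conditions on a feature such as mobile versus desktop. The ad names, the feature and the epsilon-greedy rule are hypothetical illustrations, not anything from the episode.

```python
# Illustrative contrast between a plain multi-armed bandit and a contextual one.
# The ad IDs and the "mobile vs desktop" feature are hypothetical examples.
from collections import defaultdict
import random

ADS = ["ad_0", "ad_1", "ad_2", "ad_3", "ad_4", "ad_5"]

# Plain bandit: one running value estimate per ad, no user information used.
plain_counts = defaultdict(int)
plain_values = defaultdict(float)

def plain_bandit_choose(epsilon: float = 0.1) -> str:
    if random.random() < epsilon:
        return random.choice(ADS)
    return max(ADS, key=lambda ad: plain_values[ad])

def plain_bandit_update(ad: str, reward: float) -> None:
    plain_counts[ad] += 1
    plain_values[ad] += (reward - plain_values[ad]) / plain_counts[ad]

# Contextual bandit: keep a separate estimate per (context, ad) pair, so the
# choice can differ for, say, mobile versus desktop visitors.
ctx_counts = defaultdict(int)
ctx_values = defaultdict(float)

def contextual_choose(context: str, epsilon: float = 0.1) -> str:
    if random.random() < epsilon:
        return random.choice(ADS)
    return max(ADS, key=lambda ad: ctx_values[(context, ad)])

def contextual_update(context: str, ad: str, reward: float) -> None:
    key = (context, ad)
    ctx_counts[key] += 1
    ctx_values[key] += (reward - ctx_values[key]) / ctx_counts[key]

# Usage: the contextual version can learn that one ad converts best on mobile
# while another converts best on desktop, which a plain bandit cannot express.
contextual_update("mobile", "ad_2", 1.0)
print(contextual_choose("mobile", epsilon=0.0))
```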
Kirill Eremenko: Okay, got you. Interesting, interesting. So I’m curious though, with that example I gave with the warehouse, why wouldn’t you apply contextual bandit there? What kind of other reinforcement learning would you rather apply in that situation?
John Langford: Yeah, so this is... this is one of my pet peeves actually, which I want to explain in a little bit of detail. There's a weird split in the world between researchers in reinforcement learning and people who want to use reinforcement learning in their applications. So for a researcher doing reinforcement learning, the goal is to use a YOLO paradigm, You Only Live Once. You have an agent acting in a world, it's observing things in the world, it's getting feedback now and then about success or failure, and the goal is to maximize the performance of the agent. So that's an admirable research goal. I think that's a description of an AI agent in some sense. At the same time, when people are studying reinforcement learning, they're often doing this with simulators these days.
John Langford: So when you have a simulator, you have the ability to reset and try an alternative. That ability to reset and try an alternative is actually dramatically powerful. In order to understand why, you need to think a little bit about opportunity costs. So if you go left at some point and get some reward, in a YOLO setting, You Only Live Once, you have to estimate the value that you would have gotten if you had gone right, and then trade that off against what you observed when you went left. But in a simulator, you could reset and you could go right and then you could directly observe what you get if you go right. And then you can directly know what would have been better, going right or left, and you can use that to drive a learning process at the decision point.
John Langford: So you see this funny thing where people are really excited about reinforcement learning because it's been used to solve Go, for example. Right? So they really want to apply reinforcement learning algorithms, and yet they're in a simulator where they could apply much more powerful algorithms. So the YOLO-style algorithms are things like DQN and A3C and policy gradient, these kinds of things. The algorithms that really use the simulator are actually the solutions to Go, AlphaStar, or this learning to search stuff that I worked on with Hal Daumé.
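A toy sketch of the opportunity-cost point John makes: with a resettable simulator you can directly compare going left and going right from the same state, while in the YOLO setting you only observe the action you took and have to estimate the other branch, for example with inverse-propensity weighting. The left/right environment and its reward numbers below are invented purely for illustration.

```python
# Toy illustration of why a resettable simulator is so powerful.
# The "left/right" environment and its reward numbers are invented.
import random

TRUE_REWARD = {"left": 0.3, "right": 0.7}

def step(action: str) -> float:
    """One noisy interaction with the (pretend) world."""
    return TRUE_REWARD[action] + random.gauss(0.0, 0.1)

# Simulator setting: reset and try both branches, then compare directly.
def compare_with_resets() -> str:
    left_reward = step("left")    # roll out "left"
    right_reward = step("right")  # reset, roll out "right" from the same state
    return "left" if left_reward > right_reward else "right"

# YOLO setting: you only observe the action you actually took, so you estimate
# the value of each branch, e.g. with inverse-propensity weighting.
def yolo_estimates(n: int = 10_000, p_left: float = 0.5) -> dict:
    totals = {"left": 0.0, "right": 0.0}
    for _ in range(n):
        action = "left" if random.random() < p_left else "right"
        propensity = p_left if action == "left" else 1.0 - p_left
        totals[action] += step(action) / propensity  # importance-weighted reward
    return {a: totals[a] / n for a in totals}

print(compare_with_resets())
print(yolo_estimates())
```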
Kirill Eremenko: Okay, I see. So if you are in a simulator… so basically reinforcement learning is much more powerful or useful if you’re doing things online where you don’t have the luxury of resetting. If you’re in a simulator, there’s probably better ways, like in the example I gave with the warehouse, there’s probably better ways than reinforcement learning of solving that problem, finding the optimal solution?
John Langford: There's a little bit of a naming issue here because the definition of reinforcement learning has kind of expanded over time. So researchers are very interested in the YOLO-style reinforcement learning, but many applications that people are nowadays calling reinforcement learning are things like AlphaGo, where you have a simulator, and that enables you to do things that are well beyond the YOLO paradigm and very powerful in terms of how rapidly you can learn.
Kirill Eremenko: Very interesting. Wow, these are some very deep considerations about reinforcement learning. I didn't even think of it that way. What about hitting a local extremum in reinforcement learning? Like getting stuck at a local maximum, a solution that seems optimal but actually isn't in the global scheme of things. How do reinforcement learning algorithms sidestep that problem?
John Langford: Okay, so a common answer is they don't; there are only specific sub-pieces of reinforcement learning which avoid local optima problems. So let me tell you several techniques for avoiding local optima problems. One of them is: if you can turn things into a contextual bandit, then there is no local optima issue, because you're going to be exploring other alternatives, and there's a technique where you importance-weight things. So if you explore something rarely, then what you discover when you explore it is treated as very important by the learning algorithm. That means that you're going to drive toward the optimum amongst all the actions available.
John Langford: So another technique which is available, this is the learning to search research which I mentioned earlier: if you have, at training time, a reference policy which can get you into the basin of the global optimum, then you can use that reference policy at training time to guide the learning process over multiple decisions, to learn to operate in that basin. And then you can use reinforcement feedback to fine-tune it to reach the actual global optimum.
Kirill Eremenko: Wow. That one sounds really complex.
John Langford: Yeah, but it actually addresses the temporal credit assignment problem, the which-move-lost-the-game problem. So it's very useful. And then there's an even more complex one that we've been working on. It's mostly theory, but we do know that if you have an underlying state space which is not too large, so the set of possible ways the world can be is not too large, then you can explore that underlying state space efficiently and you can then solve for a near-optimal policy. So the first paper of this sort was called E-cubed: explore, exploit, and I forgot what the other E is. But it's an algorithm that's going to deliberately explore a Markov decision process in order to find the optimal policy.
John Langford: And then recently we've been trying to figure out how to make these algorithms work with a rich feature space. So there's a series of papers, the last one was at this ICML, about contextual decision processes, where you have context features and there's an underlying state space which generates these features, but you don't know what that generation process is, and still you can engage in controlled exploration and find the globally optimal policy.
Kirill Eremenko: Interesting. You mentioned a couple of times exploration and exploitation. It's a very important concept, I think, and reinforcement learning really stands out in that sense. For instance, as opposed to that A/B test that I mentioned previously: A/B tests simply explore. There's no exploitation while they run, so you have to wait until your A/B test is done, then you pick your solution and then you exploit it. Whereas reinforcement learning combines the two, exploration and exploitation. What are your comments, from the research that you've done, on how to find that optimal balance between exploration and exploitation?
John Langford: Yeah, that's a really fun one. So finding the optimal balance, first of all, is actually pretty subtle. So for example, just with contextual bandits, which is the very simple version of reinforcement learning, we knew it was possible to get something which is very good statistically. We didn't know how to do that in a computationally efficient manner for a long time. We did figure out how to solve this, but that was only in 2014. Let me tell you the flavor of the strategy that is used, to give you a sense of it. So it's a covering strategy. What we're going to do is we're going to learn multiple models, and we're going to... The question is, how do these models differ?
John Langford: So the models differ by training them to avoid the predictions of the others, and also training them to maximize the reward or minimize the cost. So maybe 99% of the time you train them to maximize reward, 1% of the time you train them to differ. And now that creates a variation in the models, and if you choose one of those models at random to act according to, you get a good solution. This is the covering strategy. You're trying to cover the space of plausible, good decisions effectively.
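Here is a small sketch of the covering idea as John describes it: several models, each trained mostly to predict reward but occasionally (one percent of updates here) pushed to disagree with the others, with one model drawn at random for each decision. The linear scorers and the exact update rules are illustrative assumptions, not the actual published algorithm.

```python
# Sketch of the covering strategy: an ensemble of models that mostly learn the
# reward but are occasionally pushed to disagree; act with a random member.
import random
import numpy as np

N_MODELS, N_ACTIONS, N_FEATURES, LR = 4, 5, 8, 0.05
models = [np.zeros((N_ACTIONS, N_FEATURES)) for _ in range(N_MODELS)]

def choose_action(context: np.ndarray) -> tuple[int, int]:
    """Pick one model uniformly at random, then act greedily under it."""
    m = random.randrange(N_MODELS)
    return m, int(np.argmax(models[m] @ context))

def update(context: np.ndarray, action: int, reward: float) -> None:
    for w in models:
        if random.random() < 0.99:
            # 99% of the time: ordinary reward-maximizing update.
            w[action] += LR * (reward - w[action] @ context) * context
        else:
            # 1% of the time: nudge this model away from the ensemble's consensus
            # choice, which keeps the cover spread over plausible good actions.
            consensus_action = int(np.argmax(sum(models) @ context))
            w[consensus_action] -= LR * context

# Usage: each round, a random ensemble member both explores (because members
# differ) and exploits (because they are all mostly trained on reward).
ctx = np.random.rand(N_FEATURES)
member, act = choose_action(ctx)
update(ctx, act, reward=1.0)
```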
Kirill Eremenko: Okay. So that helps with finding or balancing that exploration versus exploitation?
John Langford: Yeah, so essentially what’s going to happen here is you’re only going to choose actions, which good models choose, and you’re going to try to choose as broadly amongst as [crosstalk 00:40:46] as possible.
Kirill Eremenko: I got you. So, every time you choose an action, you're picking out a new model at random.
John Langford: Yeah, that’s right.
Kirill Eremenko: Okay, and then because you've trained them at 1% to differ from each other, you achieve that variety of actions. So basically, even though the models might be over-exploiting overall, by using multiple models, a new model or like a random model every time, you're building the exploration into that part of the process.
John Langford: Yes, that’s right.
Kirill Eremenko: Wow, that’s really cool. It’s a simple solution, yeah, when you think about it that way.
John Langford: Yeah, so that's at the contextual bandit level. That is roughly well solved; that corner of reinforcement learning is mostly well solved. If you're looking at something which involves temporal credit assignment, the which-move-lost-the-game type of reinforcement learning, then you have to consider the entire sequence of actions. And now, it's not super clear what the best strategy is. So people have pretty much solved the Markov decision process setting. In a Markov decision process, you get an explicit enumeration of the state, and then you can prove that you can essentially build a model of the world with relatively few samples, by planning to escape the portion of the world that you know, to reach the portion of the world that you don't, and gathering information and then building further, right? So that's an effective strategy.
John Langford: But the problem is that, in the real world, you don't really have state in any meaningful sense. You tend to have features which may have all the information of a state, but typically have much, much more information as well. So think about using a megapixel camera for a sensor, right? A megapixel camera may identify where you are in your local environment very well, and yet it tells you a whole lot more information. And furthermore, because your local environment changes in minor ways all the time, you never actually get the same image twice. And now, given that you don't have the same image twice, the question is how do you... well, okay, so obviously the Markov decision process approaches don't work. And then you have the subtle problem of trying to figure out the underlying state space, which is I think where a lot of the fun research is right now.
Kirill Eremenko: Well, that sounds like a lot of fun, and this is really on the forefront of reinforcement learning. Is that the research that you got the award for? Was it that recent work, or was it somewhere else in this space, about reinforcement learning and the temporal credit assignment problem?
John Langford: No, no, the award was really for the contextual bandits. And contextual bandits, I think, is pretty unique in the sense that it's dramatically expanding the scope of problems that we can just directly solve with machine learning techniques. Right. And turning that into a system has been my project for the last four or five years.
Kirill Eremenko: Okay, all right. So how does all this, the research that you do, link up to the work that you do at Microsoft currently? Is it all encapsulated in a product that is offered? Or is it just that the nature of the research itself is driving forward the Microsoft product?
John Langford: So there's not a one-to-one mapping between research and products. I think in research you're typically looking into a lot of different things and only a small fraction of those are actually directly useful for products. But there are... so let me tell you my approach here. Consulting is fun and interesting, you learn things about what people are working on, but if you really want to have impact, I think you need to do it through a platform, right? And so far I've worked on platforms at two levels. One level is the Vowpal Wabbit open source project, which has a lot of the research that I've worked on, in algorithmic form, and gets used by many, many companies and so forth. That's very useful in terms of getting things into a usable form, but it's not enough to really have a big impact compared to what you might imagine as possible.
John Langford: So to have a really big impact, you need to have a complete system, and I'm going to distinguish between the learning algorithms, which are a small component of a complete system, and the complete system overall. So the decision service that we worked on is now embodied in this Personalizer service in Azure. The Personalizer service is made to personalize pretty much anything that you might want to try to personalize. It has a very basic interface where you feed in features, you get out an action and then later feed in a reward.
Kirill Eremenko: Oh, okay. So like a template for reinforcement learning, like a template approach?
John Langford: It's not just a template, because there's an active backend that joins the reward to the individual event and then puts those into a log, which you can download or use. But it also enables online reinforcement learning, online contextual bandit learning.
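The feed-in-features, get-an-action, send-a-reward-later interface John describes looks roughly like the following sketch. The endpoint paths, payload shapes and header names here are placeholders for illustration only; they are not the documented Azure Personalizer API.

```python
# Rough sketch of the feed-features / get-action / send-reward loop John
# describes. Endpoint paths, payload shapes and header names are placeholders,
# not the documented Azure Personalizer API.
import uuid
import requests

ENDPOINT = "https://<your-resource>.example.com"   # placeholder endpoint
HEADERS = {"Api-Key": "<your-key>"}                 # placeholder auth header

def rank(context_features: dict, actions: list[dict]) -> dict:
    """Send context and candidate actions, get back the chosen action."""
    event_id = str(uuid.uuid4())
    response = requests.post(
        f"{ENDPOINT}/rank",                        # placeholder path
        headers=HEADERS,
        json={"eventId": event_id,
              "contextFeatures": [context_features],
              "actions": actions},
    )
    return {"eventId": event_id, **response.json()}

def reward(event_id: str, value: float) -> None:
    """Later, report how well the chosen action worked for this event."""
    requests.post(
        f"{ENDPOINT}/events/{event_id}/reward",    # placeholder path
        headers=HEADERS,
        json={"value": value},
    )

# Usage: rank once per decision, observe the user, then send the reward so the
# backend can join it to the logged event and keep learning online.
decision = rank({"device": "mobile", "timeOfDay": "evening"},
                [{"id": "article-a"}, {"id": "article-b"}])
reward(decision["eventId"], value=1.0)
```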
Kirill Eremenko: So basically any business or even individual could come and use this Personalizer service in order to create a reinforce... or like make reinforcement learning come to life for the specific problem that they're dealing with, even if it's an online problem?
John Langford: Yes, that's right. So this dramatically lowers the barrier to entry-
Kirill Eremenko: For sure.
John Langford: So the first time that we tried to do any kind of personalization, it was like a six-month project with local people. This could turn it into a one-day project, right, where you deploy it on your website and you start using it.
Kirill Eremenko: Yeah, and makes it a… It’s like self-service, right. Somebody can just, as I understand, set it up for themselves.
John Langford: That’s right. That’s right.
Kirill Eremenko: Does this service have a name in case any of our listeners are interested?
John Langford: Yeah, it’s the Azure Cognitive Services Personalizer. So you’re using Vowpal Wabbit underneath and you can actually do things like download the log, play with it, figure out what the right parameters are and then tell the service which parameters to use.
Kirill Eremenko: So, it’s Azure Cognitive Services Personalizer?
John Langford: Yes.
Kirill Eremenko: Okay. Very, very interesting. Tell us a bit about Vowpal Wabbit, I'm having trouble pronouncing it. Why that name, for starters?
John Langford: Have you ever seen Monty Python and the Holy Grail?
Kirill Eremenko: Not that one. I’ve heard of Monty Python, like it’s a comedy group, right?
John Langford: You probably can't do it right now, but if you get a chance, go to YouTube and search for Monty Python, Killer Rabbit. There's an excellent scene there, which still makes me laugh even after seeing it quite a number of times. So that was one of the inspirations. There are actually three inspirations here. I was also raised on Bugs Bunny, where Elmer Fudd tends to lisp things a little bit, and then there's also a poem called Jabberwocky. You know this one?
Kirill Eremenko: Nope.
John Langford: Okay. So this is an old Lewis Carroll poem, it's a pretty fascinating poem because it has a bunch of words that don't otherwise exist and yet still, you can understand what is being said in the poem. Anyway, one of those words was the vorpal sword. So it's like a fast killer rabbit that solves your problems.
Kirill Eremenko: Okay, okay. Got you. All right. So this project you’ve been working on for years now, is that right?
John Langford: That’s right, yeah.
Kirill Eremenko: How did it all come to be and what has it turned into over the years? In a nutshell, if you’d describe Vowpal Wabbit, what is it?
John Langford: It’s an online learning system. It is a bunch of online learning algorithms and some mechanisms for deploying them.
Kirill Eremenko: Got you. I can see it’s available on GitHub, it says it’s all open source?
John Langford: That’s right. Yeah.
Kirill Eremenko: Fantastic. Okay. So how did the idea come to be and how has it grown over the years?
John Langford: Well, I started this a little over a decade ago, I think. The basic observation at the time, which is still largely true today, is that there wasn't a good alternative platform available for online learning. So I think batch learning, either unsupervised batch or supervised batch learning, is pretty common; it's become much more common in the last decade. But online learning is a process of consuming an example, using it and then moving onto the next one. There were many algorithms that people were studying and there was no platform for actually deploying them or using them or even coding them up. So how did it really start?
John Langford: So way back in the day, when I was at Yahoo Research, there was an internal competition where people were looking at click prediction for ads. I don't entirely like click prediction as a framing, but nevertheless, it seemed like a fun thing to play with. So we looked into that and the original code for VW was built around that. So there are tricks there that you don't see elsewhere. There was hashing, which has become much more common, and there was also online learning, which has also become much more common, mostly through VW. So the system performed great, but I also learned a lesson: it turns out that the system that actually ended up being used at Yahoo was by some people on the West Coast, who were of course right next to the people doing the competition.
John Langford: All right, so what do I do at that point? Well, Yahoo was fairly open-source friendly, and they didn't really care about it. So I asked them to open source it, and that is how the open source version of VW first started.
Kirill Eremenko: Oh, very cool.
John Langford: So over the years it's become a repository of many research algorithms that don't exist elsewhere. So there are capabilities in VW which you may not be familiar with. There's contextual bandits, which we discussed. There's learning to search, which we also discussed. There is active learning, which is where you're trying to do supervised learning but the algorithm is asking for labels on unlabeled examples that it chooses. There's also extreme classification. So maybe you want to choose one of a million possibilities or one of a billion possibilities. How do you do that? Well, certainly I don't know about one in a billion, but one in a million you can easily do with VW and we have several papers on this. Another very recent one is a contextual memory tree paper.
John Langford: So maybe you want to actually be able to pull up previous instances, to remember them and use them explicitly; it turns out you can do that. It's very useful to do this when you're in sort of a one-shot learning scenario. So one-shot learning is where you have maybe a single example to learn from, and now you want to pull up the right example when you're trying to answer some query.
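As a toy illustration of the "pull up the right previous instance" idea behind one-shot learning, the sketch below stores a single labeled example per concept and answers new queries by nearest neighbor. A real contextual memory tree is a learned structure, so this flat memory only conveys the flavor of the behavior.

```python
# Toy sketch of remembering previous instances and pulling up the closest one.
# A real contextual memory tree is a learned tree; this is just the flavor.
import numpy as np

memory: list[tuple[np.ndarray, str]] = []  # stored (example, label) pairs

def remember(example: np.ndarray, label: str) -> None:
    """Store a single example per concept -- the one-shot setting."""
    memory.append((example, label))

def recall(query: np.ndarray) -> str:
    """Pull up the stored example closest to the query and reuse its label."""
    distances = [np.linalg.norm(query - ex) for ex, _ in memory]
    return memory[int(np.argmin(distances))][1]

# Usage: one example of each concept is enough to answer related queries.
remember(np.array([1.0, 0.0]), "cats")
remember(np.array([0.0, 1.0]), "dogs")
print(recall(np.array([0.9, 0.1])))  # -> "cats"
```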
Kirill Eremenko: Okay. Then all of that is in Vowpal Wabbit?
John Langford: Yeah.
Kirill Eremenko: Wow, it has grown a lot. Very, very interesting. Speaking of all these different areas, from learning to search, to active learning, one-shot learning, contextual bandits, what are your plans for future research? You've already gone into so many areas. Is there any area that specifically excites you, that you're looking at for the next year or two?
John Langford: The thing that excites me the most at the moment is really solving contextual decision processes. So it should be the case that if the underlying state space is not too large, even though you’re observing something which is very complex, you can effectively learn to explore and exploit in that system. So exactly how to do that, we’re working on, we have more on the way beyond the papers that are already out and I believe we can see a dramatic change in our capabilities in the near future.
Kirill Eremenko: Got you. Wow, very exciting. What role does deep reinforcement learning play in all of this?
John Langford: So deep here typically refers to having a circuit-like representation with neural networks, rather than using a decision tree or a linear predictor or something like that. So when you're working in the real world, often you have to learn very rapidly, which means that you don't have the time to tune some sort of deep neural network. Now what can happen, which we've seen be very useful many times, is you sort of train your deep representation using some auxiliary data source and then you use that trained representation to enable you to do very fast learning when you actually deploy things in the real world. Does that make sense?
Kirill Eremenko: Yeah. So basically there is space for deep reinforcement learning in this whole plethora of different algorithms?
John Langford: Well, yeah, so everything that I've talked about is actually representation agnostic. The representation you choose to use is whatever is appropriate to your problem. Right? So if the appropriate representation for your problem is a deep representation, then you use that. If the appropriate representation is something linear, then maybe you use that. If it's a boosted decision tree, then you use that, whatever is most appropriate.
Kirill Eremenko: Got you. Okay. Understood. All right. What would your recommendations be to somebody who's super excited by this podcast, who already feels passion for this topic, for reinforcement learning, and feels like this could be something they can get into? How would you advise a data scientist, or somebody starting out into the field of data science, or even somebody who's been in the data science space for a while but wants to now explore reinforcement learning? How would you advise them to take the first steps into this space?
John Langford: Yeah, so there are several different kinds of goals people might have. If you are interested in this Personalizer system, as in you want to deploy things, then there's a workshop we gave during the Industry Expo Day at this last ICML. So you can look that up, there are slides, there are pointers to more documents and so forth, and that can give you quite a bit of purchase in actually applying it. If you want to understand the core technology behind contextual bandits, there's a tutorial that I did with Alekh Agarwal a couple of years ago at ICML, so it's at hunch.net/~l2s. No, sorry, that's the wrong one, hunch.net/~rwil, Real World Interactive Learning.
John Langford: And then if you are interested in using a simulator, I would recommend looking into the AlphaGo and AlphaGo Zero algorithms and also looking into the learning to search algorithms. For the learning to search algorithms, that's at hunch.net/~l2s; there's a tutorial that Hal Daumé and I gave about four years ago.
Kirill Eremenko: Fantastic. Well, thank you for all the resources. We'll definitely include them in the show notes. So that's a workshop, a tutorial and another tutorial, and yeah, they'll all be available for... are they all available for free online?
John Langford: Yeah.
Kirill Eremenko: That's wonderful, fantastic. Apart from those workshops, which I'm sure people who are excited about reinforcement learning are going to check out, where else is a good place for somebody to follow you, follow your career, maybe get updates on some new research that you're doing?
John Langford: So I have a blog, which is hunch.net, and then of course, all my papers I try to put on arXiv. So arxiv.org.
Kirill Eremenko: Got you. Is it okay for our listeners to connect with you on Linkedin as well?
John Langford: Yeah, sure.
Kirill Eremenko: Fantastic. All right, well John, thank you so much for coming on the show today. It’s been a pleasure. I’ve personally felt like a student here today. I learned so much from you and I’m sure our listeners got plenty of takeaways as well.
John Langford: All right, thank you.
Kirill Eremenko: So there you have it ladies and gentlemen. That was Machine Learning Research Scientist, John Langford, and also the founder and creator of Vowpal Wabbit. Hope you enjoyed this podcast as much as I did and got some very valuable takeaways. A lot of it was quite in depth and quite complex, so don't beat yourself up if some parts didn't really click. For me, I feel like I need to go and do some reading in order to understand some of these concepts that John so generously shared on this episode. And of course we will share all of the links, all of the materials or links to materials mentioned on this episode, such as John's workshops and tutorials, in the show notes. You can find the show notes at www.superdatascience.com/275, so that's www.superdatascience.com/275.
Kirill Eremenko: And if you enjoyed this episode, don’t just keep it to yourself. Share the love, spread the word, share this episode with somebody who you know is interested in machine learning, more specifically in the field of reinforcement learning. As you heard from John, it’s an up and coming or a really drastically expanding field in the machine learning space that has the potential to solve lots and lots of problems. So if you know any data scientists or machine learning experts that could benefit from this knowledge, then go ahead and share with them the link www.superdatascience.com/275 so they can also check out this podcast. And on that note, thank you so much for being here and spending some time with us. I look forward to seeing you back here next time. Until then, happy analyzing.