Kirill Eremenko: This is episode number 283 with scikit-learn expert Andreas Mueller.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur, and each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex, simple.
Hadelin: This podcast is brought to you by Bluelife AI. Bluelife AI is a company that empowers businesses to make massive profits by leveraging artificial intelligence at no upfront cost.
Kirill Eremenko: That’s correct. You heard it right. We are so sure about artificial intelligence that we will create a customized AI solution for you and you won’t need to pay unless it actually adds massive value to your business.
Hadelin: If you’re interested to try out artificial intelligence in your business, go to www.bluelife.ai, fill in the form and we’ll get back to you as quick as possible.
Kirill Eremenko: Once again, that’s www.bluelife.ai and Hadelin and I both look forward to working together with you.
Kirill Eremenko: Welcome to the SuperDataScience podcast ladies and gentlemen. Super excited to have you back here on the show, and today’s guest is one of the key people behind the Python package scikit-learn, Andreas Mueller. As you may know, scikit-learn is one of the most popular packages in Python for doing machine learning. In fact, our Machine Learning A-Z course leverages the scikit-learn package for approximately 70% of the models that we create there. So if you’ve done our machine learning course, you’ve definitely come across the scikit-learn package. In this podcast I had the pleasure of spending an hour with one of the key people behind scikit-learn. Andreas Mueller has been supporting this package for approximately two to three years now, and it’s a very exciting talk that we had. You will learn quite a lot of technical things. For instance, we dove quite deep into gradient boosting algorithms, so you’ll learn about things like XGBoost. Of course, the famous and very popular, very powerful algorithm XGBoost. You’ll find out a lot about it here. Also, you’ll learn about LightGBM and HistGradientBoosting.
Kirill Eremenko: In addition to that you will learn Andreas’s approach to solving problems. What machine learning algorithms he prefers to apply to a given data science challenge, in which order and why. We’ll talk a little bit about problems with Kaggle competitions. You will find out the four key questions that Andreas recommends to ask when you have a data challenge in front of you. You’ll learn about his 95% rule to creating models, and creating success in business enterprises with the help of machine learning. Finally, you’ll learn about the Data Science Institute at Columbia University. We’ve got a very exciting podcast coming up ahead with one of the key people in machine learning for Python. So without further ado, I bring to you Andreas Mueller, the expert in scikit-learn.
Kirill Eremenko: Welcome back to the SuperDataScience podcast ladies and gentlemen. Super excited for today’s episode, because on the call talking to us from New York we’ve got Andreas Mueller. How are you going Andreas?
Andreas Mueller: Hey, I’m great. Thanks for having me.
Kirill Eremenko: Very, very exciting to have you. And you mentioned the weather in New York is pretty terrible right now. Pouring down with rain.
Andreas Mueller: Yup. I just barely made it here.
Kirill Eremenko: And next week you said 120 degrees? Right?
Andreas Mueller: Yeah. I think it’s going to be 120 degrees in two days. New York’s pretty variable.
Kirill Eremenko: That’s crazy. 120 in Fahrenheit is 48 degrees Celsius. Almost 49. How is that even possible? Have you ever had that before in New York?
Andreas Mueller: New York is just crazy. We also get minus 28 Celsius. Yeah. I don’t know.
Kirill Eremenko: Minus 20 Celsius or Fahrenheit?
Andreas Mueller: Celsius.
Kirill Eremenko: Celsius.
Andreas Mueller: Fahrenheit would be worse.
Kirill Eremenko: Oh yeah. Minus 20. Sorry. I thought minus 120. Okay. Yeah, that’s crazy. Okay, wow. But you haven’t lived in New York all your life, right?
Andreas Mueller: No, no. I’m from Germany. I moved here five years ago. Originally to work for NYU, and three years ago I moved to Columbia. Columbia University.
Kirill Eremenko: Okay, and how are you finding New York? Do you like it there?
Andreas Mueller: Oh yeah. I mean apart from the weather, I really like New York. It’s really great. There’s a big data science community here. There’s a big open source and Python community here. There’s obviously lots of things to do. I’m now at the Columbia Data Science Institute and it’s really a very nice institute to work with. It’s very supportive of my open source work.
Kirill Eremenko: Okay. Very cool. All right. Well Andreas, super excited to have you on the show. I am looking forward to learning a lot from you. First of all, thank you so much for your book Introduction to Machine Learning with Python. I personally haven’t read it, but several guests have recommended it on the show to our listeners and I’ve heard fantastic things about it. Congratulations first of all, on such a groundbreaking book that has changed so many people’s lives. Tell us a bit about how you came up with the idea to write it.
Andreas Mueller: Oh, the story of how it got written is… It’s a very long story. But I can give you the rundown. I was really interested in writing a book about scikit-learn because there wasn’t one available. And someone from the team, Olivier Grisel, who has been involved with scikit-learn even longer than I have, started writing a book. Unfortunately at the time I was working at Amazon, so I didn’t have a lot of time to help him work on a book. However, once I left Amazon I had more time, and I really wanted to contribute. By that time the book had been taken over by someone else, Sarah Guido, who is my co-author on the book, because Olivier was not as interested anymore. And so then together with Sarah we finished it off. It took another year from when I joined the project to finalization.
Kirill Eremenko: Wow.
Andreas Mueller: In the meantime, a friend of mine, Sebastian Raschka, actually published his book Python Machine Learning, and I think that got a little bit more buzz than our book. I think there’s three books on scikit-learn now. There’s also Hands-On Machine Learning with Scikit-Learn and TensorFlow.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Which is also a nice book I’ve heard.
Kirill Eremenko: Mm-hmm (affirmative). What are some of the main themes of your book?
Andreas Mueller: The way I wrote it, it’s basically that it’s written for programmers. It assumes that you know Matplotlib, NumPy and pandas to some degree, and obviously Python, but it assumes that you don’t know any math. So it tries to explain the algorithms in an approachable way, and so there’s no gradients. I think there’s a standard deviation in one place. I’m not talking about the internals of the algorithms, and that’s a deliberate choice. Of course there’s a lot of good math books about machine learning available. For example, one of my favorite books, The Elements of Statistical Learning, which is a statistics textbook, is really a great introduction to machine learning, and they are doing a much better job of explaining the math than I could. There’s many other books that do that out there, and so I really focus on the coding part. And I want to make it approachable for programmers that don’t necessarily know a lot about linear algebra, for example.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. No I totally understand and appreciate that approach. When I was writing my book Confident Data Skills, I also omitted any mathematics. In fact, I even omitted the programming, because I wrote it for people who might want to… My dream was for people to read it on the train, or on the plane without access to their laptop. I think once you make it laser focused on a certain audience or a certain type of experience, then people who want that book they will be super happy. People who want to know the maths or the statistics they can… Or when people want to learn the math statistics, they can buy another book and that’s totally fine. Not to have everything jam packed into one book. Do-
Andreas Mueller: Yeah, and I heard from a lot of people that were reading my book and The Elements of Statistical Learning in parallel. You can get that one online from the authors’ website, and there’s a lot of other theory books that you can actually get for free online. They’re a good complement.
Kirill Eremenko: Okay. Gotcha. Very cool. We can probably talk about your book for hours, but I also wanted to mention for our listeners who might not know you through your book or through your other work, that you are actually one of the top contributors to the scikit-learn package in Python. That is really awesome.
Andreas Mueller: Well, thanks. Yeah. That’s sort of the main thing I’ve been doing for the last five years at least. I’ve been involved in the project, I think seven or eight years now.
Kirill Eremenko: Yeah, and so when did scikit-learn actually start?
Andreas Mueller: I think for real it started about 2009/2010. There were some early prototypes, but it mostly got really kicked off by a group in Paris at Inria around Gaël Varoquaux, Alexandre Gramfort, and Olivier Grisel. All three of them are still very involved.
Kirill Eremenko: Okay, and so how did you get involved with scikit-learn?
Andreas Mueller: I got involved basically during my PhD. I was working on computer vision and machine learning, and I was looking for an easy-to-use machine learning library. And so I stumbled over scikit-learn, and started contributing very simple fixes like formatting changes, typos and documentation, this kind of stuff. Then at some point I contributed some algorithms. I think it was the kernel approximations for support vector machines. I think that’s still the only algorithm I ever contributed to scikit-learn.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Then after I did this, I think I asked them if I could participate in a sprint at the NeurIPS Conference, then still known as the NIPS Conference. There was a coding sprint after the conference, and basically the person that was maintaining scikit-learn up until then was, I think, an undergrad student of theirs who was just graduating. There was no one anymore to take the job of maintainer, and so they asked me if I’d want to do it. For some reason I said, “Yes.”
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: That definitely changed the course of my life quite a bit.
Kirill Eremenko: Mm-hmm (affirmative). Okay, wow. That sounds very exciting and you’re still doing it to this day. Tell us, for somebody who might… I’m assuming that probably most people are aware of what scikit-learn is, but just to recap what is scikit-learn?
Andreas Mueller: Sure. It’s a machine learning library in Python. It implements a bunch of standard machine learning algorithms that you’ll find in textbooks. A lot of supervised learning, and some unsupervised learning as well. Feature extraction, preprocessing, model selection, model evaluation. Now we are also doing more model inspection, and we’re going to start doing more visualization. It’s basically all the tools around the machine learning workflow. The goal is to make it as easy to use as possible, and really robust.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: And so one of the things that we decided to do is… One thing you will not find in scikit-learn is deep learning, because it’s a very fast-moving field with a lot of people working on it. I think there are a lot of great libraries for deep learning out there. Scikit-learn does not do that; scikit-learn mostly works on tabular data.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: There’s no image processing in scikit-learn. But it’s all the classical algorithms. They are actually newer than neural networks, but random forest, gradient boosting, support vector machines, K-means clustering, DBSCAN, these kinds of things.
Kirill Eremenko: Mm-hmm (affirmative). Okay, thank you very much. And what is your favorite algorithm in scikit-learn?
Andreas Mueller: Okay, my three favorite algorithms are logistic regression because it’s super simple and super nice and very easy to interpret, and you can’t really go wrong with it. Random forest, because it always just works.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Which is pretty interesting. And then we actually we got a new implementation of gradient boosting in scikit-learn, done by someone from my team here at Columbia. It’s called HistGradientBoosting, and it basically implements the same thing that LightGBM or XGBoost do. This is a gradient boosting algorithm that is very, very fast and very scalable.
Kirill Eremenko: What is it called again?
Andreas Mueller: It’s called HistGradientBoosting.
Kirill Eremenko: Hmm. HistGradientBoosting. Okay, cool.
Andreas Mueller: Before we had this, we had GradientBoostingClassifier and GradientBoostingRegressor, but they were much slower than XGBoost, for example, and now our implementation is actually a bit faster than XGBoost.
Kirill Eremenko: Okay, fantastic. We’ll get into these more technically just now, but for the meantime what is involved to maintain a package? It’s all open source right? Anybody can see the code and understand how it works. But then there are people who create these algorithms… Like you said right now you have a new implementation of gradient boosting which is HistGradientBoosting. So somebody came up with this idea to implement it and then created it. I’m assuming different people had to test it, that it’s working okay and then you add it officially to the package. Is that how it worked?
Andreas Mueller: Generally, yes. There’s several issues and bottlenecks. The main issue is scoping. So deciding what goes in and what doesn’t go in.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: The gradient boosting for example, if you look at for example Kaggle surveys of what people use, XGBoost and LightGBM have been used very widely. They’re widely used in the industry. They’re really useful, so it was very clear that’s something scikit-learn needs.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: But for a lot of other algorithms it’s not entirely clear if they should go into scikit-learn or not.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: There’s so many machine learning algorithms, it’s impossible to have all of them. And everything that we add increases the maintenance burden. Right?
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Everything we add we need to fix bugs in it until the end of time.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: And in particular, if none of the maintainers is very familiar with the algorithm, that gets very tricky.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: And so basically scoping is one of the big issues. The other issue is reviewing all of this. Actually I’m not really that much doing the on the ground work in scikit-learn anymore, because I’m mostly on an organizer level these days. A lot of the work gets done by Joel Nothman, who’s been basically the main person to do maintenance and user support for the last three or four years.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: He actually mostly does this on his volunteer time, and so it’s very hard for volunteers to find enough time to review all this code.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Luckily now I have two people working with me at Columbia that work full-time on scikit-learn. They’re actually paid developers. There’s three more paid developers in Paris and one more paid developer in Berlin. This is a relatively new development, and so that actually hopefully allows us to review more of the code contributions. But it’s very hard when so many people report bugs and contribute code… And this is very complex code. There’s a bunch of math usually involved, and there’s subtleties to the algorithms, and maybe you only have the paper, not even a reference implementation. Then these are very hard to review, and it takes a lot of time to get to the quality that we need in scikit-learn. That’s one of the big bottlenecks that we have.
Kirill Eremenko: Mm-hmm (affirmative). Okay, understood. So not only do you have to add the algorithm, but then when people use it and they find maybe bugs or improvement features, somebody has to review all those contributions and see which ones will be added.
Andreas Mueller: Yeah.
Kirill Eremenko: Okay. Wow. Okay. That sounds like a lot of work. Well let’s jump into gradient boosting. I would love to talk a bit more about that, because I’ve been looking forward to it… Every time I have a very technical guest, somebody who knows the technical details really well, I know there’s many things I could learn from you. But it’s not possible to learn everything, so if you’re fine with this, let’s talk more about gradient boosting. And maybe to start off, can you give us a definition. What is gradient boosting?
Andreas Mueller: Sure. I’m going to explain it to you, but I’m also going to tell you there’s actually two cool blog posts by Nicolas Hug, who wrote it. His name is spelled Nicolas Hug, like the English word “hug,” if you are American.
Kirill Eremenko: H-U-G, right?
Andreas Mueller: H-U-G, yeah. He implemented the gradient boosting and he wrote some blog posts about how it works. He actually even gave a talk that you can now find on YouTube, that he gave at SciPy 2019 last week. But let me give you a quick rundown. It’s easiest to think of it in a regression setting, so you have a continuous variable that you want to predict given some features. What we’re doing now is sort of like gradient descent, but we don’t do gradient descent on, say, a linear model or a neural network. We do gradient descent on the space of all trees, which is a little bit weird. Usually the trees you use are very restricted. Let’s say you build a regression tree of depth three, or something like that, and try to predict your target. Because it’s a very restrictive tree, it’s not going to do a very good job.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: It’s going to be better than just a constant prediction, though. Now you use this tree as sort of an approximation of your function, and you look at the residuals. So you see what are the parts of the function that are not predicted correctly by this very simple tree. And then you build a new tree that tries to predict this residual.
Kirill Eremenko: Oh, okay.
Andreas Mueller: In a sense you can think of this as making gradient steps towards getting… Minimizing the residual towards getting closer and closer to the original function.
Kirill Eremenko: Mm-hmm (affirmative). Okay.
Andreas Mueller: Yeah. It’s an iterative procedure, where you basically build one tree at a time. There’s another trick, which is something like a learning rate: once you build a tree, you don’t actually use the whole tree. You basically multiply the output of the tree by something like 0.1, or something like that.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Which makes you go slower, so you’re not really trusting every tree a lot.
Kirill Eremenko: Mm-hmm (affirmative). Okay.
Andreas Mueller: This is-
Kirill Eremenko: Yup.
Andreas Mueller: This is part of the bigger family of ensemble algorithms, similar to random forests, where basically we see that trees can fit data very well, but they also tend to overfit and tend to be very unstable.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Instead of using a single tree, we use many trees, but each of them is quite restricted, and so we get something that is more stable and less prone to overfitting than if we just had one really big tree.
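The residual-fitting loop Andreas just described can be sketched in a few lines. This is a toy illustration only, using shallow scikit-learn regression trees on made-up one-dimensional data; it is not how the real gradient boosting implementations work internally.

```python
# Toy gradient boosting: repeatedly fit a shallow tree to the residual
# of the current prediction, and add it in with a small learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)  # start from a trivial constant prediction
trees = []
for _ in range(100):
    residual = y - prediction                 # the part still unexplained
    tree = DecisionTreeRegressor(max_depth=3)  # a very restricted tree
    tree.fit(X, residual)                     # fit the next tree to the residual
    # Only trust each tree a little (the learning-rate trick).
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training error shrinks as trees are added
```

Each iteration takes one small step toward the target function, which is why the procedure is described as gradient descent in the space of trees.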
Kirill Eremenko: Okay. So that’s basically a random forest, right?
Andreas Mueller: Well, in random forest… Actually the motivations for random forest and gradient boosting are similar. In random forest you build lots of trees that are all independent, and they’re all different because you injected some randomness in the process by re-sampling the data in some ways.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: In gradient boosting you can also re-sample the data, but it’s not necessary. In gradient boosting you iteratively add more and more trees-
Kirill Eremenko: To explain the residual.
Andreas Mueller: To explain the residual.
Kirill Eremenko: Okay, makes sense. The concept is similar in the sense that they both want to increase the stability without having over fitting. But at the same time the way they implement is different. In normal random forest you have multiple trees that are like a democracy of trees. Basically they are voting for their result, and they’re all different because of how you sampled the training data for each one of them. Whereas in gradient boosting you don’t need to sample the data, you just… Whatever residual you have from the first tree, you use another tree to predict that. And then whatever you have residually from the second tree, you use another tree to predict that. Is that right?
Andreas Mueller: Yeah, exactly.
Kirill Eremenko: Gotcha. Can you use an example of maybe a very trivial hypothetical example, with some numbers just to put it into perspective for the gradient boosting trees?
Andreas Mueller: I think that’s very hard without having illustrations. I think you can visualize it actually. A resource I might want to mention is… I’m teaching a class at Columbia every spring, and you can actually find the whole class on YouTube.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: If you go to youtube.com/andreasmueller, put just my name, you’ll find my lectures series Applied Machine Learning. There’s one lecture that’s basically just about gradient boosting, and it has a bunch of slides that give it a more visual explanation.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha.
Andreas Mueller: I think it’s hard to just talk through it with numbers.
Kirill Eremenko: Okay.
Andreas Mueller: I don’t think I’ll be able to follow myself.
Kirill Eremenko: No problem. We’ll add the link in the show notes if anybody wants to check it out, get some more details on that. Okay, so that’s in general gradient boosting. So how is this… You mentioned a couple of them like LightGBM, XGBoost, and the new one HistGradientBoosting. What’s the difference in between them?
Andreas Mueller: They are all quite similar. They’re all just different implementations. The one that first innovated was XGBoost. It was a very good implementation, so they put a lot of effort into making it work well. Then I think a trick that they did that previous implementations didn’t do is that they also considered the Hessian term in computing the gradient. What I said was very high level, but you can formulate the algorithm as a gradient computation formally, and they basically also include the Hessian term. They basically did something like Newton’s method instead of gradient descent. These two things together make this very fast. Another thing that I think XGBoost didn’t have at the beginning but LightGBM had was… If you build a single tree, whenever you want to find a split, you basically have to sort the data. You have to sort by that feature to find what the split is. Sorting is an n log n computation, and that’s sort of one of the most expensive things in building any tree.
Andreas Mueller: A way to speed this up is to do histograms of the data. So just ahead of time you basically bin the data, and then you don’t have to do sorting anymore. You just know the bins. That’s the trick to get rid of sorting: doing this binning, and it makes it much, much faster. This trick is implemented in both XGBoost and LightGBM. And then there’s some other nice things that were not in the old scikit-learn gradient boosting, for example dealing with missing values explicitly. Basically having the algorithm deal with missing values. Having the algorithm deal with categorical data. These are both implemented in LightGBM for example, and they’re also implemented in our HistGradientBoosting. Usually in scikit-learn you need to take care of encoding categorical variables using one-hot encoding, and you have to fill in missing values before you run any of the algorithms, because most machine learning algorithms can’t tolerate missing values. But all tree-based algorithms in principle can deal with categorical variables and can deal with missing values; it just wasn’t really implemented in scikit-learn. In HistGradientBoosting it will be implemented very soon.
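To make the histogram trick concrete, here is a rough NumPy sketch. The 256-bin count and quantile-based edges mirror what LightGBM-style implementations commonly do, but the details here are illustrative, not the actual library code.

```python
# The histogram trick: bin each feature once up front, so that
# split-finding later scans ~256 bins instead of sorting n samples
# at every tree node.
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=100_000)          # one continuous feature
n_bins = 256

# One-time work: pick bin edges (quantiles) and bin the whole column.
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, x)    # integer bin index per sample

# At a node, split-finding only needs per-bin summary statistics:
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)
sums = np.bincount(binned, weights=y, minlength=n_bins)
counts = np.bincount(binned, minlength=n_bins)
# Scanning the 256 candidate thresholds is now independent of n.
print(len(sums), len(counts))
```

Once the data is binned, the expensive per-node sort is replaced by a cheap pass over a fixed number of bins, which is where most of the speedup comes from.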
Kirill Eremenko: Mm-hmm (affirmative). Okay. Gotcha, and so LightGBM adds the benefit of putting the variables into bins, makes it faster. And HistGBM what’s the difference there?
Andreas Mueller: I mean there’s not really anything new about HistGradientBoosting, it’s mostly a reimplementation of LightGBM. The benefit is it’s in scikit-learn and so we have control over it. LightGBM was an implementation that was done by Microsoft. It’s really great, but it’s hard for us… We don’t want to import their C++ code; we prefer to have our own code base because it makes it easier to maintain and integrate with the rest of scikit-learn.
Kirill Eremenko: Okay, gotcha. Understood. And what about XGBoost? What implementation was that?
Andreas Mueller: Sorry?
Kirill Eremenko: XGBoost. What implementation was that?
Andreas Mueller: Oh, that’s just the name of the implementation. I think originally it was done by what was called GraphLab and then Turi, and is now maybe Apple AI or something like that. I don’t know. But that was also just a very nice C++ implementation. Both of them, XGBoost and LightGBM, had scikit-learn compatible bindings, which was very nice.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: There’s another implementation that I should mention which is CatBoost.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: CatBoost was published by Yandex.
Kirill Eremenko: Yeah. The Russian companies.
Andreas Mueller: Yes. That works very well if you have a lot of categorical variables, and they actually do some more tricks. In particular, they make the trees symmetric, which is quite interesting. That seems to also work very well. I haven’t seen that much adoption yet, but I’m also not following that closely what’s happening on Kaggle these days. There was a paper at a NeurIPS workshop last year trying to compare all of these algorithms and these implementations, but it’s very hard, because it depends a lot on your data set which one works best for you.
Kirill Eremenko: And that’s good right? That means people can find the right one for them and their specific situation.
Andreas Mueller: Yeah, exactly.
Kirill Eremenko: And one thing that I’ve noticed is that over the past couple of years, these boosting algorithms have become extremely popular. Even though there’s a lot of different types of machine learning algorithms out there, more and more problems can actually be solved by simply applying gradient boosting. From XGBoost to CatBoost and all the ones in between, what I’ve noticed is that they are becoming the go-to solution and they work in many, many cases. What are your thoughts on this, why are they so powerful and why are they so popular?
Andreas Mueller: They’re so popular because they work very well. It’s definitely one of the go to solutions. If you have enough data, I think people these days more and more go also towards neural networks. Or if you can do transfer learning then neural networks are also great. But I think… Yeah. Gradient boosting is one of the standard solutions. I don’t think it’s that well understood why they work so well. One of the things about them is that you probably need to tune the parameters more carefully. That’s why I said I like to do logistic regression, then random forest, then gradient boosting. Random forest also always works, but it’s usually slightly worse than gradient boosting. But random forest basically don’t need any parameter tuning to work. Whereas in gradient boosting, you have to tune the parameters a little bit usually.
Kirill Eremenko: Okay.
Andreas Mueller: One of the things that’s nice about them is that, because you’re just fitting the residuals, it’s more goal-oriented in a sense than random forest. You can get away with fewer trees, and the trees are usually smaller, so your prediction times are often faster than for random forests. Unless you’re very, very parallel or something. The thing is, random forest you can parallelize more easily than gradient boosting, but in gradient boosting the models are smaller both in terms of storage space, and they’re often quicker in terms of prediction time. That’s quite nice.
Kirill Eremenko: Mm-hmm (affirmative). Okay, and so would you… Thanks for clarifying. I like this approach. Logistic regression, random forest and then gradient boosting. Try those first, in that order, to solve the problem. Would you recommend for somebody who’s starting out in data science, that they just go and learn gradient boosting right away? Because that sounds like the silver bullet that can solve all problems.
Andreas Mueller: I mean, there’s not really a silver bullet, but it does work in many, many cases. People that do a lot of machine learning, you will hear it from them all the time: it’s not really about your algorithms or the hyperparameters. It’s much more about what is the data, and how do you formulate the problem?
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Going from, say, logistic regression to gradient boosting can give you an improvement. But the improvement that you might get from collecting different data or formulating the problem in a different way, or even just cleaning up your data, might be much, much bigger. That’s one of the issues I have with these Kaggle competitions. In a Kaggle competition the data set is fixed, and you care about the last percentage point in your accuracy. That’s not how machine learning works in the real world at all. In the real world, usually there is a way to change the data set you’re working with. Maybe it’s very expensive, but there’s some way to change the data or add more data. Or maybe you need to clean the data. It’s also usually not entirely clear how to measure the outcome. How good the algorithm is. In Kaggle, you say you want to optimize accuracy or log loss. But in the real world, no one cares about the accuracy of the algorithm. What you care about is selling your product, or getting users to your website. Or detecting cancer.
Andreas Mueller: These are all very different from accuracy, and often they are harder to measure. They are part of a bigger work flow. It’s usually much more important to think about how was the data collected, and how could I improve the data collection? How do I evaluate my algorithm, and how does the algorithm fit into the bigger workflow?
Kirill Eremenko: Wow. Those are some-
Andreas Mueller: And these are-
Kirill Eremenko: Sorry keep going.
Andreas Mueller: Yeah. These are the questions that matter. There are many talks you can watch by more experienced machine learning people that say, “These are the things that you actually care about.” And, “Okay, going from random forest to gradient boosting will make you 2% better, but actually going back and collecting more data will make you 20% better, even if you have a very simple model.” Understanding the data, and understanding what your model does to the data, is very important. And if your model is simpler, it might be easier to understand what’s happening. It’s much easier to understand a logistic regression model than to understand a gradient boosting model. Maybe just one more point, which is something that I saw firsthand when I was at Amazon, and I think the situation is similar in other companies: you don’t really care about getting the last percent right, because there are diminishing returns. Let’s say you take a month to do a prototype and solve something like 95% correct. If it takes you another month, you go to 96%. That’s not really worth your time as a data scientist.
Andreas Mueller: Instead, you should probably go to the next problem, and do the next problem 95% of the way. I think in many companies, the state they’re in with their machine learning is: you find a problem where you can apply machine learning, where it’s going to be beneficial to the business. Then most of the work is in defining the metrics and collecting data and so on. Then you put in a logistic regression model, and it’s going to be much, much better than whatever manual process was there before. If what was there before was even measured. But then instead of spending more money to tune it, you go to the next problem and formalize the next problem. Basically, just the very simple solution might be much better than what was there before. That’s probably more beneficial to the business, though…
Andreas Mueller: If you just came out of your machine learning PhD like I did, it’s maybe not what you want to do, because then you’re more a product manager. It’s not about fancy machine learning algorithms, because it doesn’t really matter how fancy your machine learning algorithm is. As long as it does something reasonable, it’s going to help the business a lot.
Kirill Eremenko: Wow. Thank you. Those are some very deep thoughts. I love your comments about the questions, and I completely agree. It’s a matter of what it is that you want to be doing. For example, everybody knows DeepMind. Google DeepMind. It’s the company that created the artificial intelligence that won the game of Go. They constantly publish research papers, and do miraculous things with artificial intelligence. They’re really pushing the boundaries of artificial intelligence. Reading their research papers is just fascinating. They employ about 700 of the top minds in the world across London, Paris, California, Montreal and so on. Well, in 2017 Google DeepMind’s losses as a company were $368 million. That’s more than a third of a billion dollars in losses, and what they’re doing is pushing. They’re creating new algorithms, new cool things and so on. But to your point, as a business, that’s not the priority. Right? You want to make sure that you get the results, and sometimes results don’t require cutting edge… Maybe cutting edge is necessary, but the bleeding-edge, top-of-the-range artificial intelligence that doesn’t even exist yet, that’s not often required in order to get results.
Andreas Mueller: Yeah, I agree. And there are parts of companies that are highly optimized, right? If you look at ad click prediction at Google, or if you look at the timeline at Facebook. I’m sure in those places they care about the last decimal points.
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: Of the accuracy. But that’s not the case in most places and most companies.
Kirill Eremenko: Mm-hmm (affirmative). Very, very, very true. Let’s just go over those questions again. I loved the four questions you mentioned. The first one was how was the data collected? The second one was how can I improve the data collection? What are the third and fourth? I didn’t have time to write them down.
Andreas Mueller: Oh.
Kirill Eremenko: Something like, “What is my algorithm and how can I measure how it’s doing?” Something like that.
Andreas Mueller: I think the business goal should be sort of the first thing. There’s some reason you do this machine learning stuff, and ideally there’s a business goal connected to it: sell a product, engage a customer. I don’t know, make buses be on time. This is usually a very high-level goal that is not directly related to your machine learning algorithm. Say if you want to diagnose a disease, then your prediction is not going to be, “I diagnose a disease or not.” Your prediction is part of a bigger workflow, where you interact with a doctor or you run different tests and so on. Your prediction will only be part of a bigger process, and so you should be aware of, “What is the bigger goal I want to reach?” And, “How can I measure the bigger goal?” But then you often need to measure, usually on a finer-grained scale, the direct impact of the algorithm. Ideally you would do something like A/B testing, but you can’t do A/B testing while you’re developing algorithms. Right?
Andreas Mueller: In A/B testing you would run your whole process, say either without the algorithm or with the algorithm. Or one version of the algorithm against the other version of the algorithm. But you can’t really do that if it’s not ready to be consumer facing yet. If it’s still really rough, you can’t be doing that. You have to have some offline evaluation metric that’s a proxy, that says how well does the algorithm work on this offline dataset? You should probably think very hard about what proxy metric you use, and how it relates to the real use case. For example, let’s say again we go through this diagnostic example: you probably don’t care about accuracy so much. But, depending on what setting you’re in, you might care about the false negative rate a lot. So say you’re screening for cancer, and you want to make sure that everybody that maybe has cancer goes to a doctor and gets checked. Then you definitely don’t want any false negatives. But then there’s also a trade-off of how many people you want to have see a doctor. [crosstalk 00:43:12].
Kirill Eremenko: Yeah. You can just send all of them right? Just send 100%.
Andreas Mueller: Yeah, and so how do you translate this trade-off into a metric? If it’s about human lives it’s very, very hard. If you’re in a business, very often you can assign some business value. If I make the sale versus if I lose this customer, how much does this cost me? And so you can assign costs to the outcomes of the algorithm, to what you expect each outcome to cost, or make you, in money. Then you can try to use that as your metric.
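One way to make that concrete is to score a model by expected cost rather than accuracy. The sketch below uses made-up costs (the numbers are purely hypothetical) where a missed case is a hundred times more expensive than an unnecessary follow-up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical per-outcome costs for a screening problem: a false
# negative (missed case) is far more expensive than a false positive
# (an unnecessary follow-up exam).
COST_FP = 50      # cost of sending a healthy person to the doctor
COST_FN = 5000    # cost of missing a real case

def expected_cost(y_true, y_pred):
    """Total cost of a set of predictions; lower is better."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * COST_FP + fn * COST_FN

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])  # one FP, one FN
print(expected_cost(y_true, y_pred))  # 50 + 5000 = 5050
```

Wrapped in `sklearn.metrics.make_scorer(expected_cost, greater_is_better=False)`, a metric like this can be used directly for model selection in cross-validation.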
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. And that will help with the trade-offs?
Andreas Mueller: Sorry? I couldn’t hear you.
Kirill Eremenko: That’ll help you with the trade-offs?
Andreas Mueller: Yeah.
Kirill Eremenko: Help you understand what the trade-offs will be. Okay, so basically understanding the bigger picture of what you’re doing. Not just the specific metric like in Kaggle competitions. So what does that mean? Does that mean people shouldn’t do Kaggle competitions?
Andreas Mueller: Hmm. No. That means that they shouldn’t think the real world is like a Kaggle competition. There are lots of very interesting datasets out there, and there are lots of interesting techniques. But it depends a lot on what your position in a company is, and what you really want to do. There’s a lot of cool stuff you can do with computer vision, but maybe that’s not… If you’re in a company that does medical imaging, then maybe you’re doing exactly that thing. But if you’re in a more broad business, then maybe these last percentage points are not that important and it’s more about the bigger picture, and the overall workflow. During the time I spent at Amazon, 90% of the time that I worked on a machine learning problem was not on the machine learning itself. It was about clarifying the problem, formulating the problem, talking with the business units involved, collecting the data, establishing metrics and so on.
Kirill Eremenko: Mm-hmm (affirmative). Okay. Wow. How much time did you spend on the machine learning actually?
Andreas Mueller: I mean I worked on my first project for like 10 months, and yeah let’s say 10% of that was the machine learning.
Kirill Eremenko: Wow.
Andreas Mueller: And so-
Kirill Eremenko: 10% of that was machine learning, and 90% of that was data preparation?
Andreas Mueller: Not only data preparation, but also… Even establishing what are the goals.
Kirill Eremenko: Yeah.
Andreas Mueller: Let’s say you do something that is… You do email marketing or something like that, and then your boss tells you, “Oh I want to have more reach.” Or, “I want to have my email marketing to be more effective.”
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: How are you going to measure that? There are ten different metrics you can pick. You need to implement them. You need to set up infrastructure to log them. Even thinking about what is the right training data to collect is the first step. Well… That’s maybe even the second step. The first step is, “What is the thing that I want to do? How’s it going to impact me?” And then, “What is the right data to collect for that purpose?”
Kirill Eremenko: Mm-hmm (affirmative).
Andreas Mueller: In the Kaggle competition someone hands you the data and the metric, which are the two hard parts in a machine learning problem.
Kirill Eremenko: Yeah. Even defining what the question is. You don’t need to do that in a Kaggle competition.
Andreas Mueller: Yeah.
Kirill Eremenko: Fantastic. Well, we’re not hating on Kaggle competitions here. I think they definitely have a value. It’s just that you need to be aware that that’s not what happens in real life.
Andreas Mueller: Yeah. I mean that’s a very small part of what happens in real life, depending on what your position is. If you are in a setting where it is really about just tuning an existing system more and more, where there’s already a very high level of what’s implemented in terms of machine learning and what’s there in terms of infrastructure, then that’s maybe your main focus. If you’re in a company where there’s less infrastructure and less use of machine learning, there are probably lots of things around it that you need to work on first.
Kirill Eremenko: Yeah. No, totally. Totally agree. What is the solution then? How can somebody who wants to prepare… I still think Kaggle competitions will get you a job. Right? If you are the best in Kaggle or in the top 5%, you put that on your resume and that will get you a job. Fantastic. But what if somebody wants to actually, genuinely prepare for the real world out there? So that when they do go into a role as a data scientist, which maybe they got through Kaggle competitions or something, they aren’t caught by surprise. They know what to expect. They know what is going to happen, and they know that they can handle whatever’s thrown at them. What is your recommendation? How can somebody prepare for the real world out there?
Andreas Mueller: What I do in my course is I give very open-ended homework. The data is already collected, but then I leave the evaluation and everything else up to my students. For example, very often you have information leakage. You have a column that perfectly predicts the target, and then the question is, “Should this be included in your model or not?” It’s a very common question, and it depends on the context that you’re asking the question in. Or, “How should you split your data into training and test set?” is also very dependent on the context of your application. I’m going to keep going with the medical example. Let’s say you have multiple measurements for patients, say of their blood sugar, and you want to predict if they’re diabetic or not. And so it’s very different if you have the same patient in the training and in the test set, or if you have different patients in the test set and in the training set. It’s not that one of the approaches is right and the other one is wrong, it’s that it depends how you want to apply your algorithm.
Andreas Mueller: Do you want to make predictions for patients you haven’t seen so far? So when a patient walks into your hospital, you want to make a prediction for them. Or do you want to make predictions for people that you’ve already observed in the past? There’s this sort of subtle assumption about the distribution of your training and test data that you need to think about in a real-world problem. I try not to give my students the handpicked solution.
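The patient example maps onto scikit-learn’s group-aware splitters. In this small sketch with synthetic data (the numbers are invented), `GroupShuffleSplit` keeps all measurements from one patient on the same side of the split, simulating predictions for patients you have never seen:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 4 blood-sugar measurements each for 5 patients (ids 0..4).
rng = np.random.RandomState(0)
X = rng.randn(20, 3)                   # features per measurement
y = rng.randint(0, 2, 20)              # diabetic or not
patients = np.repeat(np.arange(5), 4)  # patient id for each row

# Split so that no patient appears in both train and test:
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patients))

# Train-set patients and test-set patients do not overlap.
print(set(patients[train_idx]) & set(patients[test_idx]))  # set()
```

A plain `train_test_split` would mix measurements from the same patient into both sets, which corresponds to the “patients you have already observed” scenario; as Andreas says, which split is right depends on how the model will be used.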
Kirill Eremenko: Okay.
Andreas Mueller: What you can maybe also do is there are Kaggle datasets. Kaggle datasets are much more free-form: they are not competitions, they are just datasets. There you can think about these datasets independent of a very concrete task, and you can think of, “What are the interesting things to do?” “What is the natural task associated with these datasets, and how can I approach that?” It’s a little bit artificial because you start from a dataset and not from a task. In the real world you will, I think, always start from a task. From a thing that you want to do. But simulating that is very hard, unless you’re doing projects. Here at Columbia, one of the things that I think sets our Data Science Masters program apart is that we actually have these capstone projects. The capstone projects are joint projects with industry partners. For example, Bloomberg, J.P. Morgan, Microsoft, Unilever, and… Yeah. I don’t know, many more. The students actually work on a business problem. A real business problem, together with a mentor from the company and a mentor from the university. There they experience this problem of having to define metrics.
Andreas Mueller: How do you go from the business problem, to the machine learning problem? I think having actual exposure to real problems is probably the best way to do it.
Kirill Eremenko: Fantastic. Thank you. I’m so glad you brought that up. The Data Science Institute at Columbia. I want to do a huge shout out to Jeanette Wing for introducing you to me, so that I could invite you to this podcast. I think you guys are doing a great thing there. I haven’t seen many universities in the world… Of course data science is hyped and everybody wants to get a data science degree, or [inaudible 00:52:44] data science degree. But from what you’ve described, and from this example of doing data science with these capstone projects, I think that’s the best way to do it, and I think that’s one of the things that would really boost people’s careers. Or really boost people into the space of data science. It’s really cool that this is facilitated at the Data Science Institute at Columbia. Tell us a bit about that. How many people do you take in each year at the Data Science Institute?
Andreas Mueller: That’s a good question. I don’t have exact numbers for this year, but I usually teach a class in the spring, the one I mentioned before, and I usually have about 150 students. I think 150 is about the size of a cohort.
Kirill Eremenko: Okay.
Andreas Mueller: Maybe 170 or something.
Kirill Eremenko: And how long is this degree?
Andreas Mueller: The degree is three semesters. It’s basically two semesters that are mostly coursework, and then the third semester… The first semester is the fall semester, then you have the spring semester, both of them are coursework. Then in the summer most students do an internship at a company. Then in the fall they do the capstone project.
Kirill Eremenko: Fantastic. That’s not that long at all. 1 1/2 years.
Andreas Mueller: Yeah. 1 1/2 years. It’s pretty quick.
Kirill Eremenko: Yeah, and do you need to already have a Bachelor’s degree to do it? Or can you start it instead of your Bachelor’s?
Andreas Mueller: No, no. There is an undergraduate program at Columbia, but that’s not at the Data Science Institute actually. That’s joint between maths and stats, I think. But this is a graduate program, so you need to have a Bachelor’s, but the Bachelor’s can be in basically any field, as long as you have some of the prerequisites. We have people that had a Bachelor’s in English.
Kirill Eremenko: That’s awesome.
Andreas Mueller: Or architecture. Or-
Kirill Eremenko: That’s really cool.
Andreas Mueller: It is a very broad field.
Kirill Eremenko: That’s what data science is about, right? People coming from different backgrounds to leverage their relevant experiences and knowledge of the world. That’s what makes our field so diverse and interesting, I think. Okay, well thank you so much for your time. We’ve slowly approached the one hour mark. Can you please share with our listeners where they can find you and follow you, or maybe get in touch? Submit an idea for the scikit-learn package, or maybe take some of your courses. What are some of the best places to find you?
Andreas Mueller: You can find me on Twitter @amueller. A-M-U-E-L-L-E-R. I am the same on GitHub. Oh sorry, on Twitter I’m actually @amuellerml. A-M-U-E-L-L-E-R-M-L. On GitHub I’m just amueller. You can find me on YouTube as Andreas Mueller. The whole course I’m teaching at Columbia is online there, and you can also find the slides and materials on my website.
Kirill Eremenko: Fantastic. Okay thank you Andreas, and you mentioned also before the podcast you’re working on a new package called, “Dabl.” Just briefly, what is Dabl about?
Andreas Mueller: As I’ve tried to allude to, I think the bigger picture is much more important. Dabl is a package that tries to abstract away the parameter tuning, by doing some automatic machine learning. But it also tries to help you create a tighter loop of looking at your data, applying an algorithm and evaluating the algorithm. It’s in very early stages, but right now it already does automatic visualization: you just give it a dataset and it does some interesting visualizations for you. That’s already pretty useful, I think.
Kirill Eremenko: Yeah.
Andreas Mueller: The thing that I’m working on right now is an automatic classifier that you can use that will give you results very quickly and just tune models for you. There are other automatic machine learning tools like auto-sklearn. I’m not sure if you’re familiar with auto-sklearn. Auto-sklearn is trying a little bit more to get to the last percentage point. The absolute best model. Whereas in Dabl, I’m trying to give you something reasonable quickly, and then give you an explanation of the model. This way I hope people will be encouraged to iterate more quickly, and look at the data more. Look at their models more, and spend less time tuning the parameters. Instead, thinking about the problem.
Kirill Eremenko: Wow. Fantastic. And so it’s D-A-B-L? Dabl?
Andreas Mueller: Yes.
Kirill Eremenko: Is it already available for people to try it out?
Andreas Mueller: Yeah sure. It’s on GitHub. I haven’t released a version yet. The most useful thing right now is probably the visualization. But basically if you have a data frame and you want to predict one of the columns of the data frame, you can just give it that and it will show you lots of pretty pictures.
Kirill Eremenko: Oh, fantastic. Sounds like a fun thing to play around with. Whoever’s interested, check it out. Dabl, you can find it on GitHub. Of course we’ll include the links in the show notes. All right, and one more thing Andreas. We already talked a bit about your book Introduction to Machine Learning with Python. Are there any other books that you’d like to recommend to our listeners? Something that helped you in your career, or in your life?
Andreas Mueller: I think most of the books that I would recommend now were unfortunately not available when I started my career. The one book that I already mentioned is The Elements of Statistical Learning, which is a really great introduction to machine learning and available on [inaudible 00:58:45] website. If you’re new to Python, there’s the Python Data Science Handbook by Jake VanderPlas. That book’s a little bit more introductory than my book, so it gets you started with Pandas, Matplotlib, NumPy, and then there’s a little bit of scikit-learn. If you’re new to doing data science with Python, I highly recommend Jake’s book. You can get all the Jupyter Notebooks for it on his website, or you can buy the print copy. There’s another book that I really like that’s actually written from an R point of view, by Max Kuhn. It’s called Applied Predictive Modeling. It has a lot of very good insights about doing machine learning and predictive modeling in practice. He has, I think, some accompanying code in R. But you can also just read the book without looking at the code.
Andreas Mueller: It separates the machine learning part from the coding parts, and it’s really quite helpful.
Kirill Eremenko: Fantastic. Thank you so much. We’ll share all those wonderful recommendations on the show notes for this episode, and of course your book which is Introduction to Machine Learning with Python. On that note, it’s a wrap. Thank you so much Andreas for coming on the show, spending your time here with us and sharing your amazing expertise. It was really insightful and I personally learned a lot.
Andreas Mueller: Great. Thank you so much for having me.
Kirill Eremenko: There you have it ladies and gentlemen. Thank you so much for tuning in to this SuperDataScience podcast. That was Andreas Mueller, one of the key people behind the scikit-learn package. I hope you enjoyed today’s episode, and there were plenty of valuable takeaways. The discussion about gradient boosting alone was extremely valuable, and personally I learned a lot from it. But probably my key takeaway from today’s podcast was the approach that Andreas uses when solving problems. The three algorithms that he looks into: logistic regression, followed by random forest, and if that doesn’t help then he moves on to gradient boosting. I think there is great truth to first checking out the simple things, the simple strategies, simple algorithms that have the benefit of being explainable, and if they don’t work, moving on to more complex things. We’ve seen this theme throughout the podcast with multiple guests, and Andreas being one of the key people behind scikit-learn also confirms it. That’s a huge testament to this approach.
Kirill Eremenko: As always, you can get all of the links to materials mentioned in this episode, as well as links where you can find Andreas, in the show notes at www.superdatascience.com/283. That’s www.superdatascience.com/283. Finally, if you know anybody who’s a scikit-learn fan or expert, then consider sending them this episode, so they can also learn the amazing things you heard about today. On that note, thank you so much for being here. I look forward to seeing you back here next time, and until then happy analyzing.