Jon Krohn: 00:00:00
This is episode number 771 with Kirill Eremenko, the founder and CEO of SuperDataScience. Today’s episode is brought to you by Ready Tensor, where innovation meets reproducibility, and by Data Universe, the out-of-this-world data conference.
00:00:20
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today, and now let’s make the complex simple.
00:00:51
Welcome back to the Super Data Science Podcast. Today we’ve got another special episode with our most special of special guests, Mr. Kirill Eremenko. If you don’t already know him, Kirill is founder and CEO of SuperDataScience, an e-learning platform that is the namesake of this very podcast. He founded the Super Data Science Podcast in 2016 and hosted the show until he passed me the reins a little over three years ago. Kirill has reached more than 2.7 million students through the courses he’s published on Udemy, making him Udemy’s most popular data science instructor of all time.
00:01:26
Today’s episode is a highly technical one focused specifically on Gradient Boosting methods. I expect this episode will be of interest primarily to hands-on practitioners like data scientists, software developers, and machine learning engineers. In this episode, Kirill details decision trees, how decision trees are ensembled into random forests via bootstrap aggregation, how the AdaBoost algorithm forms a bridge from random forests to Gradient Boosting, and how Gradient Boosting works for both regression and classification tasks. He fills us in on all three of the most popular Gradient Boosting approaches, XGBoost, LightGBM, and CatBoost, as well as when you should choose each of them. All right, you ready for this extremely illuminating episode? Let’s go.
00:02:14
Kirill, welcome back to the SuperDataScience Podcast. We are all so delighted to have you back yet again for another technical episode, this time on Gradient Boosting. You were here back in January for a technical intro to large language models. Then you came back in February to go deeper and dig into encoder-decoder transformers, a specialized further deep dive. It was volume two of that super popular January episode, one of our most popular episodes ever, and now you’re back for Gradient Boosting, which is quite different from LLMs, but also super valuable, super powerful. It’s going to be an awesome episode. Thank you for coming on.
Kirill Eremenko: 00:03:01
Thanks for having me, Jon. Very exciting. I should probably say, for the benefit of our listeners, that even though the space between the episodes is only about a month and a half or so, the knowledge I’ll be sharing today comes from a course that we’ve just released, but we started this course back at the end of 2022. Then we put a big pause on it, so it’s not like I just put together something in a month on Gradient Boosting and I’m back here. No, it actually took a few months of research back in 2022 and then finalizing it in the past month to get it to where it is. I’m very excited now to come and share the knowledge we’ve learned creating the course for the benefit of the podcast listeners as well.
Jon Krohn: 00:03:48
Nice. When you say we, you mean Hadelin right? Hadelin de Ponteves is your co-instructor on the course?
Kirill Eremenko: 00:03:55
No, it’s just me.
Jon Krohn: 00:03:57
Oh.
Kirill Eremenko: 00:03:57
I’m joking. I’m joking, but you know how kings say we? I don’t know, the royal we, yeah, yeah. No, of course. Yes. Hadelin and I, we’ve just published a course. It’s called Machine Learning Level 2 because we have a Machine Learning Level 1 for complete beginners. Then Machine Learning Level 2 is for practitioners who are intermediate and want to go advanced, and it’s all about Gradient Boosting. The reason for that is that Gradient Boosting and the underlying techniques, specifically XGBoost, LightGBM, and CatBoost, are by far some of the most widely and reliably used modeling techniques in industry and in business.
00:04:39
Deep learning, in my understanding, is mostly used for new tasks, novel problems, research-based things. Of course, it has its applications in industry as well, but if you want just a reliable solution to a classification or regression problem, XGBoost, LightGBM, or CatBoost are some of the go-to solutions. We want to equip our students with the best tools to make them successful in their careers. Doing fun stuff in machine learning and AI is sometimes different to what you need to get the job done, and Gradient Boosting often is the solution to get the job done.
Jon Krohn: 00:05:20
Yeah, that reminds me that you and Hadelin were back on the show in episode number 649 for an intro to machine learning, tied to your Level 1 course, as well as just a general Machine Learning 101. Yeah, this is now your fourth appearance in just a little over a year. That episode actually also, by the way, was the 10th most popular episode of 2023. We recently-
Kirill Eremenko: 00:05:46
Oh, yeah, I just listened to your podcast on that today in the car. Yeah, it was funny. I was listening to it, and you mentioned the episode. I didn’t realize you were going from lowest to highest in the top 10, and I thought, “Oh, we were number one in 2023,” but we were number ten.
Jon Krohn: 00:06:01
You barely squeaked into the top 10.
Kirill Eremenko: 00:06:05
I know, I know. Anyway, so yeah, that was about a year ago. That was for Machine Learning Level 1, and now we’ve had lots of people asking for Machine Learning Level 2. We’ve been delaying it because of other projects we’ve been working on, but now we’ve finally released it. It’s just gone live, very excited about it. It’s a six-and-a-half-hour course. Of course, we’ll go into a lot of concepts in this podcast, but right away I wanted to say if somebody wants to check it out, you can find it at www.superdatascience.com/level2. You’ll need to subscribe to a SuperDataScience membership. You’ll get access to that course, which is exclusive to SuperDataScience, not available anywhere else. Plus you’ll get access to the Large Language Models A-Z course, which is also exclusive to SuperDataScience, and all of our other 30 plus courses, our community, our workshops at Live Labs that we’re doing twice a month now, career sessions, et cetera. Worth checking it out at www.superdatascience.com/level2.
Jon Krohn: 00:06:56
I recently organically noticed how many live sessions you’re having in there, very cool. It sounds like the community is really starting to flourish at www.superdatascience.com. That’s cool. I also wanted to add, earlier you were talking about deep learning versus Gradient Boosting or decision trees in general and why you might use one or the other. I think one of the easiest ways to think about it, conceptually, for me, is that when you are dealing with very large data inputs like an image, or a video, or natural language, that’s where deep learning, including deep learning transformer architectures, tends to be very effective. But when you’re dealing with things like tabular data that you could put into a spreadsheet, that’s where the kind of Gradient Boosting that we’re talking about today tends to be the leading approach.
Kirill Eremenko: 00:07:51
Absolutely. I was actually looking into this yesterday to see the differences, and you’re absolutely right. Deep learning and [inaudible 00:08:04] related things are very powerful when you have additional structure to the data, whether it’s like an image and so on, or you have tabular data with additional structure, you have a time series behind it with some specifics that are not just captured or not easily captured in normal tabular data. If you have ordinary, normal tabular data, which happens to be the most common type of data that businesses aggregate consciously and process these days, whether it’s time sheets or maintenance or medical patient data, whatever, it’s mostly tabular data.
00:08:42
That’s what you usually find in business and industry, without any additional pattern to it that deep learning can catch onto and take advantage of. You can still apply deep learning, but XGBoost, or Gradient Boosting models in general, are just going to be faster, more reliable, easier, a quick win, and it’s just a more standard approach to these kinds of problems. You don’t have to reinvent the wheel, just apply it and off you go, some fine-tuning and you’re done.
Jon Krohn: 00:09:15
Exactamundo, amigo.
Kirill Eremenko: 00:09:17
Yep. Yep. Okay. Shall we start? We’ve got some exciting topics coming up.
Jon Krohn: 00:09:22
Yeah, yeah, let’s rock and roll.
Kirill Eremenko: 00:09:23
Okay, cool. Cool. The first thing we’re going to talk about is ensembling methods in general. What are ensembling methods and how do they work? An ensembling method is when you aggregate lots of models to produce one model. It’s like one model that combines lots of models, and there are two main ways of combining models. But first, before we go to the two main ways of combining models, we need to realize that ensembling methods rely quite heavily on weak learners. They don’t need the individual models that you’re ensembling to be very smart or sophisticated. Typically, it’s something simple. It doesn’t have to be a decision tree, but in most cases, people choose decision trees because they, A, are weak learners, B, they’re quick learners, and C, they capture non-linear relationships.
00:10:25
Having said that, you can use a hundred linear regressions to create an ensemble of linear regressions if your specific use case requires that, but we’re not going to go into custom use cases like that. We’re going to look at the typical approach, and the typical approach is like take decision trees, put them together and get the ensemble. In case somebody needs a quick refresher or somebody’s brand new to this, a refresher on decision trees. Basically just imagine like yes, no splits, right? Yes, if- else conditions. At the start you’ll be like, you have all this data. Let’s say you have a thousand customers and you’re modeling how much future customers will spend on your online store where you’re selling candles, for example. I was thinking, what would we be selling? Candles. I don’t know, some food supplement or something like that.
Jon Krohn: 00:11:20
Yeah, I don’t know. Candles is such a random example. Do you like rooms that smell nice, Kirill? Is that-
Kirill Eremenko: 00:11:25
I do like rooms that smell nice, but I’ve been recently learning that candles are not regulated. I don’t know about the US, but in Australia, they don’t have standards, so you got to be careful because the stuff they put in might not be healthy for you.
Jon Krohn: 00:11:37
One in every hundred is actually a stick of dynamite and you don’t know.
Kirill Eremenko: 00:11:42
That’s too funny. Oh, okay. All right. Let’s say you have a thousand customers and you want to predict, based on those customers, how much the new customers coming into your store in the future will spend in your store. It’s a regression type of problem, and what you will do is you’ll model your existing customers with a decision tree, and let’s say the decision tree ends up with the following structure. At the top it’ll be a split on, let’s say, their estimated income. You have a variable of their estimated income, you’ve estimated it somehow, it’s in your input data. The decision tree will say at the top, the first split is “Is their estimated income less than $47,000 per year or not?” If it’s less than $47,000, go left. That’s a yes. Go right if it’s a no. Then you just visualize this tree, it’s like a box. It doesn’t look like a tree. It’s like a box. Yes, no, then it’s an if-else condition. If you-
Jon Krohn: 00:12:41
It’s like an upside down tree. It’s a tree upside down.
Kirill Eremenko: 00:12:43
Yeah, kind of, it grows… Yeah, it grows upside down. That’s right. At the top is the beginning of the tree. Is it called the root of the tree?
Jon Krohn: 00:12:51
Yeah, the root of the tree.
Kirill Eremenko: 00:12:52
Yeah. Okay. The root of the upside down tree. Then you go left if they do earn less than $47,000 per year, then you have another split, so you have another branching of the tree, and let’s say the decision tree from training has decided that the condition should be, “Is their age less than 45?” If yes, then go down to the left, and we’re going to keep it a simple, relatively shallow decision tree, so that’s where we’ll end for that branch, at what’s called a terminal leaf.
00:13:24
That terminal leaf will have a value. What that value is, is that during training, out of all of the thousand customers that you have, all of the ones that fell into that branch, that had income of less than $47,000 and age less than 45, it’ll take the average. For a regression problem, it just takes the average of what those customers spend, and let’s say it’s $23. On the other hand, if the customer earns less than $47,000, but their age is not less than 45, so you go left first and then you go right, then the average of those customers was $15. Then let’s go back to the top. If the customer doesn’t earn less than $47,000, so they earn $47,000 or more, then at the very beginning you would’ve gone right. There, let’s say there could be a terminal leaf right there. It doesn’t have to be symmetric. We’ll talk about symmetric trees further down in this podcast.
00:14:17
There could be another leaf there. But there, let’s say there’s another split, and it’s asking, “Is that customer signed up to your loyalty program or not?” It’s a categorical variable. If they are signed up, it’s a yes, then you go left down the tree, and because they’re signed up to a loyalty program and their income is $47,000 or more, the average of those customers that ended up in that bucket is quite high. It’s, let’s say, $212 that they spend on your candles per month or whatever it is that you’re modeling. But if they are not signed up to your loyalty program, you would’ve gone right in that last branch. Let’s say the value in that terminal leaf is $48. That’s the average.
00:14:56
You get this decision tree that was built through training, and now for any new customer that comes into your company, based on these variables, you can model them and you can see, “Oh, is their income less than $47,000 or not,” go left or right. Then if, let’s say, they go left, you’re like, “Okay, is their age less than 45 or not?” If their age is 45 or more, then you go right, and then you know, “Oh, okay, most likely they will spend $15 on my candles next month.” Then you can make business decisions from that. That’s like a simple refresher on how decision trees work. As you can see, it’s quite straightforward, and they can capture non-linearity because of these if-else splits.
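For anyone who’d like to see this candle-store example sketched in code, here is a minimal illustration using scikit-learn. The feature names and numbers below are made up to mirror the example, not taken from the course:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data mirroring the example: income, age, and a loyalty-program flag
# predicting monthly candle spend. Values are invented purely for illustration.
rng = np.random.default_rng(42)
n = 1_000
income = rng.uniform(20_000, 120_000, n)
age = rng.integers(18, 80, n)
loyalty = rng.integers(0, 2, n)
spend = 0.001 * income + 30 * loyalty * (income >= 47_000) - 0.1 * age + rng.normal(0, 5, n)

X = np.column_stack([income, age, loyalty])

# A shallow tree, like the two-level example in the discussion.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, spend)

# Each terminal leaf holds the average spend of the training customers that landed there.
print(export_text(tree, feature_names=["income", "age", "loyalty"]))
print(tree.predict([[40_000, 50, 0]]))  # predicted monthly spend for one new customer
```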
Jon Krohn: 00:15:39
Research projects in machine learning and data science are becoming increasingly complex and reproducibility is a major concern. Enter Ready Tensor, a groundbreaking platform developed specifically to meet the needs of AI researchers. With Ready Tensor, you gain more than just scalable computing storage model and data versioning and automated experiment tracking. You also get advanced collaboration tools to share your research conveniently and securely with other researchers and the community. See why top AI researchers are joining Ready Tensor, a platform where research innovation meets reproducibility. Discover more at readytensor.ai, that’s readytensor.ai.
00:16:20
All right, so to recap back for the audience, this decision tree concept, definitely extremely easy to understand with a visual.
Kirill Eremenko: 00:16:29
Yeah, for sure.
Jon Krohn: 00:16:30
But it’s the idea, yeah, of the base of a tree, which for some reason… I guess because it ends up being on the top of the diagram, because we read from top to bottom, it makes sense to have the flow be from top to bottom, but that means that the tree shape is upside down. The base of the tree, or the root of the tree, is the starting point, and you have your first split right at the very top. Typically, I think with most of these approaches… I’m not the expert. I think you’re much more expert than I am, but typically that first split is often the most important split. It’s the split that’ll get you your biggest delta in whatever outcome. In your case, is somebody likely to spend a lot of money on candles on my website or not? That first split will often be the variable, amongst all the variables available, that is going to have the biggest relationship with the outcome.
Kirill Eremenko: 00:17:21
Yeah.
Jon Krohn: 00:17:22
In this case, in your example, it was income, which makes a lot of sense. People with more income are more likely to spend money on candles on your website. Then from there, you go two ways on this path tree of possible decisions. I guess you can also imagine it like going on a journey. You are walking along a path and the path splits in two, all of the people with high incomes go one way. All the people with low incomes go the other way. Then once you get a little bit further along the path, the people with the high incomes, they encounter another split in the road. This time it’s split on age, and so all the young-
Kirill Eremenko: 00:18:02
No, sorry, sorry. It’s not a… That’s for the higher earners. For the higher earners, it’s loyalty program.
Jon Krohn: 00:18:09
Oh, right, right, right. Sorry. Yeah, I messed up. But for the visual analogy, the higher earners, they’re going along their path in the woods and then it splits again a little while later. The ones, the higher earners on the loyalty program go one way. The higher earners that aren’t go the other way and the same thing happens on the other side, but like you said, it doesn’t necessarily need to be symmetric. It doesn’t need to be the same variables that you’re splitting on. The low income earners as they walk along their path, when they encounter a split, they have to split on their age instead of on the loyalty program. Yeah, I’ve never thought of it that way as the path, but I think that’s easy… At least in my head as I’m speaking, it’s quite an easy thing just to imagine that you’re on this journey.
Kirill Eremenko: 00:18:55
I love it. Yeah. That’s good for visualizing training, exactly how it would happen in training. Then a new candidate that comes onto your website would have to go down this path and look at the signs. I think we should rename decision trees to decision paths going forward. It’s brilliant, seriously.
Jon Krohn: 00:19:12
When you get to the end, so in this case, you could have… It’s a hyperparameter in your model when you set it up. You could have lots of levels, lots of bifurcations in the path. It’s always two, by the way. You never get to a point on this journey where there’s three possible paths. It’s always two.
Kirill Eremenko: 00:19:31
It’s always if-else.
Jon Krohn: 00:19:32
Always if-else, like you said. When you get to the end of that journey, which is a leaf node, so again, if you imagine the terminal node, leaf node, if you imagine-
Kirill Eremenko: 00:19:42
Leaf node.
Jon Krohn: 00:19:44
A terminal leaf node, if you imagine that the tree was upside down, these would be a whole bunch of leaves emanating out from the base of the tree. It’s like holding a Christmas tree upside down after Christmas is over and you’re taking your Christmas tree out of the house. That’s what a decision tree looks like. Yeah, you’re holding it from the base, up by your head. When you get to that terminal leaf node in our path analogy, then you could imagine that you’re asked at that terminal point, “How much did you spend on candles at the website?” Then you can average all the people who got to that terminal node. You have different values. Your high income earners who signed up to the loyalty program, they had an average of $212 spent.
Kirill Eremenko: 00:20:36
That’s right.
Jon Krohn: 00:20:38
And so on. We might be belaboring what decision trees are now to people who were already familiar with them, but for people who weren’t, hopefully this discussion has been-
Kirill Eremenko: 00:20:49
Hopefully the people who were already familiar with them forgive us for this slight easy-concept detour, because that was important for everybody to get on the same page. From now on, everything’s going to be a lot of fun, and we’re going to dive into more advanced topics. Right away, I wanted to also say that decision trees can be used for regression, as we just discussed, and they can be used for classification. As you can imagine, classification is even easier. During training, customers go down these paths, as Jon was saying. Then there’s a yes/no question: “Did this customer churn or did this customer not churn? Does this patient have cancer or not?” Based on what you get through training, you can set up your final decision tree to assign a label. As soon as a new candidate goes through the tree and gets to the end, you can assign a label, cancer or no cancer, or you can assign a probability if you like, 70%, 20%, whatever else.
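A tiny sketch of those two output modes, assuming a scikit-learn classification tree and placeholder training data (X_train, y_train and X_new are illustrative names, not from the episode):

```python
from sklearn.tree import DecisionTreeClassifier

# Assume X_train and y_train (1 = churned, 0 = did not churn) already exist.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

labels = clf.predict(X_new)        # hard labels: churn / no churn
probas = clf.predict_proba(X_new)  # class proportions in the leaf, e.g. [0.3, 0.7]
```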
00:21:57
Those are two ways to set it up for classification problems. That was a basic decision tree. Let’s get to the fun stuff, ensemble models. Ensemble models combine weak learners. As we established, decision trees are great candidates for weak learners. There are two main ways of building ensembles. One is called bagging, the other one is called boosting. We’ll start with bagging. Bagging is a cool term because it’s actually short for bootstrap aggregating. It’s just one of those times in life when the real technical term, bootstrap aggregating, actually abbreviates to a cool word, bagging, which properly describes the concept.
00:22:35
I was reading about bootstrapping yesterday, and there’s a really interesting etymology to this word. Bootstrapping comes from boots; especially cowboy boots have these straps on the back. I don’t know what they’re for, maybe hanging them up or something like that. You know what they’re for?
Jon Krohn: 00:22:53
No idea.
Kirill Eremenko: 00:22:54
Okay, so bootstrapping is kind of like, let’s say you have a fence in front of you and you need to get over the fence and nobody’s around to help you. Well, the idea of bootstrapping is you pick yourself up by these bootstraps and you throw yourself over the fence, something that’s physically impossible. You don’t have… It’s just weird, you can’t pick yourself up. It just doesn’t make sense. But that’s where the term comes from. Visualize that jump, picking yourself up by the bootstraps. In terms of statistics, there’s how it’s applied. Why is it called bootstrap aggregating?
00:23:26
Well, here’s the whole concept of these bagging types of models. Let’s say you have a data set of a thousand observations, and in statistics, let’s say you don’t know the underlying distribution of this data set, or you don’t want to make assumptions about this underlying distribution, and you want to make some inferences from it. What the process of bootstrapping is in statistics is taking this thousand-observation dataset and taking samples out of it. Just imagine you put all of these thousand samples into a bag, and you pick out a thousand with replacement. You pick out a sample, you note down which one you picked out, I don’t know, sample number 747. Then you put it back into the bag, then you pick another sample and so on.
00:24:15
You pick out a thousand samples, but because you’re doing it with replacement, that means that you might, and likely will, pick out the same samples several times, and some samples will be missed. That way, from your original sample, you’ve just created a new sample of a thousand that is different to the original, but it consists of the same pool of observations. You do that multiple times. You bootstrap, let’s say, a hundred times, now you have a hundred samples, and now you can make certain inferences, like apply… I don’t know, the central limit theorem to that, or the law of large numbers, things like that, just do statistical inferences from that. Effectively, why it’s called bootstrapping is because you’ve done the impossible. You only had one dataset of a thousand samples, and then you’ve lifted yourself up by these bootstraps. Nobody was there to help you. You didn’t make any assumptions about the underlying data, and yet you created a hundred samples, which are all different, and now you can make statistical inferences. That’s called bootstrapping.
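To make sampling with replacement concrete, here is a tiny generic sketch in Python (an illustration, not code from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(1_000)  # your 1,000 observations, represented here as indices

# One bootstrap sample: draw 1,000 indices *with replacement*.
sample = rng.choice(original, size=len(original), replace=True)

print(len(np.unique(sample)))  # roughly 632 unique observations on average; the rest are repeats
print(np.sum(sample == 747))   # how many times observation 747 happened to be drawn

# Repeat, say, 100 times to get 100 different-but-related datasets.
bootstrap_samples = [rng.choice(original, size=len(original), replace=True) for _ in range(100)]
```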
Jon Krohn: 00:25:16
Kirill, I looked up why boots have bootstraps. It’s pretty obvious, they’re for pulling on the boots.
Kirill Eremenko: 00:25:24
Oh. We’re idiots. Yes, of course. Oh, I love it. Yeah, that’s good. Yeah, to help you put them on. Yeah, was my description of bootstrapping, correct for statistics?
Jon Krohn: 00:25:37
Oh, it was unbelievable. I feel like there’s almost even no point in me saying it back to you in my own words because it was beautiful.
Kirill Eremenko: 00:25:43
Awesome. Thank you.
Jon Krohn: 00:25:44
Bootstrap aggregation. Yeah, picking yourself up by your own bootstraps. That’s a common expression. I think it’s been around for a very long time. But yeah, just this idea that you’re not really simulating new data. You’re not simulating new individual samples. You’re not needing to go out and collect new data. You are bootstrapping based on what you have, and you’re creating a whole bunch of samples with just what you have, so it’s bootstrapping.
Kirill Eremenko: 00:26:10
That’s right. Bootstrap aggregation is the process of bootstrapping many times and then doing some analysis and aggregating. In the case of ensemble methods, that’s exactly what we’re going to be talking about. We’re going to be doing bootstrap aggregating as we can see just now. But I just wanted to make a quick comment that the term bagging, like the abbreviation bagging, makes perfect sense because you’re putting these samples into the bag and then you’re pulling them out of the bag. It’s a good mnemonic to remember what bootstrap aggregating is.
00:26:42
Let’s talk about ensemble methods. We are already into this first one called bagging, short for bootstrap aggregating. Let’s talk about an example of a bagging method. This one will be familiar to many of the listeners. It’s called random forest. What you basically do with random forest is bootstrap aggregating. Let’s say you have a thousand samples. You want to create a random forest, which is an ensemble method combining decision trees. Let’s say you want to have a hundred decision trees in this random forest. You do bootstrap aggregating. You create a hundred different samples, each one with a thousand observations, based on your original one, just the way we described, using that bootstrapping method. Then you build a decision tree from each one of these samples. Each one of the decision trees will see a slightly altered version of the original data.
00:27:34
Therefore, even though these decision trees will have the same hyperparameters, their tree structure and the breaks and the splits in the tree will be different. Then you get the results of each decision tree, and the final model, the random forest prediction, will basically be the average of these decision trees. Let’s say you’re talking about this candle sales example and how much each customer will spend. Instead of building one decision tree and predicting based on that, what you can do is do this bootstrapping process, build a hundred decision trees, each one with slightly different underlying data. Then, let’s say for a new customer that comes through into your store, you see what each model will predict.
00:28:26
Some models will predict $23, some models will predict $15, some $212, et cetera, et cetera, et cetera, or other values because each tree is built differently. Some models might predict $78. Some models predict $300 depending on how the tree was built. Then you’ll just take the average. You’ll say, “Okay, so this customer came into the shop.” These hundred decision trees make their predictions. The average of what the random forest predicts is $51 and 23 cents. That will be your final output from the random forest. That’s what you’re going to use.
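Here is that averaging idea written out by hand as a minimal sketch, assuming NumPy arrays for X_train and y_train; in practice you would reach for scikit-learn's RandomForestRegressor, which adds feature subsampling on top of this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_predict(X_train, y_train, X_new, n_trees=100, seed=0):
    """Bootstrap-aggregated decision trees: the essence of a random forest
    for regression (minus the per-split feature subsampling discussed later)."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap: resample the rows with replacement.
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_new))
    # The ensemble prediction is the plain average of the individual trees.
    return np.mean(predictions, axis=0)
```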
Jon Krohn: 00:29:03
This episode is brought to you by Data Universe coming to New York’s North Javits Center on April 10th and 11th. I myself will be at Data Universe providing a hands-on generative AI tutorial. But the conference has something for everyone. Data Universe brings it all together, helping you find clarity in the chaos of today’s data and AI revolution. Uncover the leading strategies for AI transformation and the cutting edge technologies, reshaping business and society today, data professionals, business people, and ecosystem partners, regardless of where you’re at in your journey, there’s outstanding content and connections you won’t want to miss out on at Data Universe. Learn more at datauniverse2024.com.
00:29:42
Yeah, random forests are amazing, powerful models. What you’re going to get into next with Gradient Boosting makes them even more powerful, but random forests on their own, they take that decision tree idea that we walked through in detail, the idea of going down those paths, or the upside down Christmas tree. When you only have one of those upside down Christmas trees, it’s relatively…
Kirill Eremenko: 00:30:04
Terrible.
Jon Krohn: 00:30:04
Limited, yeah. The advantage of that kind of single decision tree is that it’s very easy to understand. You can see it, you can see each of the bifurcations in the path, and you have very clear end values. But with a random forest, when you bootstrap aggregate a whole bunch of different samples, and then maybe optionally randomly turn off some of the input variables for some of those trees, you end up with a super powerful machine learning model already. Random forests are amazing. They’ll often get you near the top possible performance on tabular datasets, like we talked about at the beginning of this episode. Random forests are super powerful, but the boosting that we’re going to get into now, that you’re going to get into, is even more powerful. Where random forests fall down, boosting manages to fill in the gaps and do even better.
Kirill Eremenko: 00:31:04
Yeah, absolutely. Before we get to boosting, I wanted to give a real-world analogy for random forests that really helped me understand this concept. Have you ever been to a fair, Jon?
Jon Krohn: 00:31:17
Sure, yeah.
Kirill Eremenko: 00:31:19
You go to a fair and there’s rides, and roller coasters, and other little games that you play, and so on. One of the games that you sometimes see at the fair is this big jar with lots of jelly beans inside, thousands, and you need to guess what the number of jelly beans is in there. You’ve seen that one?
Jon Krohn: 00:31:37
I’ve seen that one. When I was a kid, actually, it wasn’t at a fair, but it was at a friend’s house. It was his birthday party, and there were 20 kids, and they had this game. I might’ve been maybe 10 years old, and I guessed the number of jelly beans on the dot.
Kirill Eremenko: 00:31:50
Wow, very good. Very good. Basically, the principle is the person who gets closest wins the prize, or maybe some rules might be different, but let’s say you might have to guess on the dot, like Jon did, or the person who guesses closest, or within a certain range.
Jon Krohn: 00:32:11
You just have to get closest, I think.
Kirill Eremenko: 00:32:14
The optimal strategy for this is to combine an ensemble of weak learners, because humans are not designed to predict the number of jelly beans inside a jar where there’s thousands of them, or hundreds-
Jon Krohn: 00:32:31
Speak for yourself.
Kirill Eremenko: 00:32:33
You seem to be very good at it. Humans, apart from Jon, are not designed for doing this.
Jon Krohn: 00:32:38
I’m batting one for one on jelly bean guessing. I’m never going to do it again.
Kirill Eremenko: 00:32:43
Keep it high. Keep your stats high. Humans are bad at that. Humans are perfect weak learners. What you need to do is, you get a notepad and a pen, and every time somebody comes to the stand and makes a guess to whoever owns this challenge, when they walk away, you ask them, “Hey, what was your guess?” You just write it down, and then the next person comes and guesses, the owner or whatever tells them if they’re close or not. Doesn’t even tell them, just writes it down, writes the contact detail, to contact them if they’re a winner. Then they walk away, and they walk past you, and you ask them again, “What did you guess?” You ask everybody who made a guess what they guessed.
00:33:22
Unless there’s some sort of trickery going on, like it’s a hollow-in-the-middle type of jar or something like that, if there’s no trickery going on, you’ll get hundreds of these guesses which are a bit high, a bit low, a bit high, a bit low, and so on and so on. But then you take the average of them, like a random forest does, and you’ve created your own ensemble. You take the average, and the average will be the best guess. The average, in many cases, will be the closest to the actual amount, because people have their own differences in their thinking, in their perception and so on. Some will guess higher, some will guess lower, but on average, you’ll be very close. If, the next time you’re at a fair, you see one of those, give that a try. In general, in my view, that’s a great analogy of what a random forest does.
Jon Krohn: 00:34:06
That was a really nice analogy. Another one that is worth mentioning quickly is just that visual of this random forest. The clue of what’s happening there is right in the name. You take a whole bunch of decision trees, and trees make up a forest. Each of those trees in the forest is slightly random, a random forest, in that there are different bootstrap aggregated data sets that make up each of the individual trees, so there’s randomness there. As I mentioned earlier, there’s also randomness around, sometimes optionally, what input variables are being considered, what independent variables are being considered.
Kirill Eremenko: 00:34:47
Yes, the features.
Jon Krohn: 00:34:48
The features, yeah.
Kirill Eremenko: 00:34:49
The feature selection. There’s selection by bootstrapping, the underlying rows are different. But also, you could set a parameter saying that, “The trees don’t see all of the features, they only see 80% of the features randomly.” Each tree not only sees different rows to other trees, but also sees different columns, and that’s a great way of combating over-fitting.
Jon Krohn: 00:35:15
Going back to your earlier example of the single decision tree, where three of the variables that you got into were income, age, and whether they were signed up to the loyalty program or not, in a random forest, the first tree in the random forest might only have income and age. Then randomly, the second tree has age and loyalty program, and so on. You get slightly different answers every time.
Kirill Eremenko: 00:35:43
Yeah, and it’s actually a good point to say that trees can reuse variables. If it used income at the top and then it used loyalty program in the next split, it can use income again. It is not limited to using a feature only once. There are other hyperparameters too, like the depth of the tree. You could set the maximum depth to eight or whatever. There’s a hyperparameter for a random forest where you can set how many trees you want, 100, 1,000, et cetera. We’ll get a bit into that further down. I feel it’s important to also mention quickly that you can also use random forests for classification. What we just discussed was regression.
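Before moving on to classification, those hyperparameter knobs map onto the usual library arguments; for instance, a scikit-learn sketch with arbitrary example values (note that scikit-learn subsamples features per split rather than per tree):

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=100,  # how many trees in the forest
    max_depth=8,       # maximum depth of each tree
    max_features=0.8,  # each split considers a random 80% of the features
    bootstrap=True,    # rows are bootstrap-sampled for each tree
    random_state=0,
)
```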
00:36:20
Just keep in mind, throughout this podcast, we’ll be talking about regression and classification from time to time. Those are two big separate types of problems that are solved with all the methods we’re discussing. You can also use random forests for classification. For a random forest for classification, rather than taking the average of all of the trees that you have, you would use it as a voting system. It’s like a democracy, a democracy of random trees. Basically, the trees make their predictions, will this customer churn? Will this customer not churn? For a medical data set, does this patient have cancer or not?
00:36:53
Then you have these predictions from all the trees. Basically, you look at it like a vote. Let’s say 172 trees voted that this patient does not have cancer, and 28 voted that they do have cancer. You could say that’s a no, or if you want to be more cautious to avoid what’s going to be a type II error, where you’re saying they don’t have cancer but they actually do, you might say your threshold is not 50/50, your threshold is 75/25. In this case you’ll say, yes, they have cancer, just to make sure and double check. Basically, you’d use it as a voting system.
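As a sketch of that voting-with-a-threshold idea, assuming a scikit-learn forest and placeholder data (the 0.25 threshold just mirrors the 75/25 example):

```python
from sklearn.ensemble import RandomForestClassifier

# Assume X_train, y_train (1 = has cancer, 0 = does not) and X_new already exist.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# predict_proba averages the trees' votes/probabilities, i.e. the "democracy" share.
p_cancer = forest.predict_proba(X_new)[:, 1]

# forest.predict(X_new) uses a 50/50 threshold by default.
# To be more cautious about missing real cancer cases, lower the threshold:
cautious_prediction = (p_cancer >= 0.25).astype(int)
```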
Jon Krohn: 00:37:30
Yep.
Kirill Eremenko: 00:37:30
Cool. All right, let’s move on to boosting, so excited. All of that was up to… which year was that? Up to 1995, and 1995 was the first year when boosting was introduced conceptually. It didn’t become as popular as random forest until around 2014, when XGBoost came out, and we’ll get to that further down. Random forest was dominating, for example, Kaggle competitions; a lot of people were using random forest all the way up to 2014, 2015. Whereas boosting slowly got developed and started growing from 1995. 1995 was when two authors, Yoav Freund and Robert E. Schapire, I’m not sure if I’m pronouncing that correctly, Schapire, from AT&T Labs, developed the concept, and then later, in 1999, they published their paper. To be clear, they didn’t develop Gradient Boosting. They developed the concept of AdaBoost, so just boosting. The method, the model that we’re going to talk about, is called AdaBoost. Just keep in mind, very important, AdaBoost is not a Gradient Boosting model.
Jon Krohn: 00:38:53
The Ada there stands for adaptive.
Kirill Eremenko: 00:38:55
Exactly. Thanks, Jon. It’s adaptive boosting. They got the prestigious Gödel Prize in 2003 for their work. It’s for theoretical computer science; it’s like the Nobel Prize of that field, I guess. Actually, Gödel himself got the Einstein Prize. The Gödel Prize does come with prize money, but it’s tiny. For somebody in this space, it wouldn’t be a lot of money. I believe it’s $5,000, so it’s not a huge amount of money, but at the same time, it’s more about the prestige. They got this prize in 2003.
00:39:29
Okay, let’s talk about AdaBoost and how it works. AdaBoost was the first boosting method, and their thinking was, “All right, why are we doing these random forests? Why don’t we adjust the approach?” In AdaBoost, what you do is, you take your 1,000 samples from your candle store, and you’re going to train an ensemble, again, of weak learners. They’re going to be decision trees. First decision tree, you train it on the full sample that you have. No bootstrapping, you just train it on the 1,000 people that you have, 1,000 observations that you have. Then you look at, how well did this model perform? On which observations did it do well? On which ones did it not do well? Some observations, the errors will be low. On some observations, the errors will be high.
00:40:26
What you do is, you take the observations that had high errors, and you assign them a weight, a higher weight. The lower the error, the lower the weight, the higher the error, the higher the weight. Now, you start doing bootstrapping with the same data set. You take the 1,000 samples, you put them in a bag, you’re going to pull out of the bag with replacement, so bootstrapping, but the way this bag is created is, the observations that had higher errors will have higher weights in this bag. This is a very simplified explanation. We’re not going to go into too much detail in this, but just think about it like you have this bootstrapping method, but the observations that initially, in the first prediction, had higher errors, they’ll have a higher chance of getting picked out of the bag.
00:41:11
Now, you bootstrap this new data set of, again, 1,000 observations, but it’s geared towards the observations that you didn’t predict that well in the first instance. Now, you make a second decision tree to predict the results for this new bootstrap data set, and again, you get some of them that you predicted well, some of them that you predicted not so well. Again, you assign weights based on that. Now, you take the original 1,000 and you create another bootstrap, but you apply those weights that you had just assigned from the second result, and so on.
00:41:49
Every time you’re bootstrapping, you are adjusting to favor the observations that you didn’t predict well in the previous iteration. You keep doing that. Let’s say you have 100 decision trees, so you do that every time. In addition to that, you also look at how well each decision tree performed overall. Each decision tree, you assign it a score based on how well it predicted overall. What’s its overall error? Then in the end, you will have 100 decision trees. Each one is focused on predicting better the samples that were miss… by the way, AdaBoost was originally developed for classification, so it’ll focus on classifying better the samples that were misclassified by the previous decision tree, and that is done through the weighted bootstrapping process. Also, each one of the decision trees will have a score based on how well it performed overall in its job.
00:42:48
For the final model results, rather than like in a random forest, where we took the average of all of these values, or in the case of classification, we took the votes of all of these trees, in the case of AdaBoost you take a weighted vote. For classification it’s a plus-one/minus-one type of thing, so you can think of it as a weighted average. You take a weighted average of all of these trees, and the weights are those scores that we assign to each one. Two things are happening. Each model is favoring the observations that were misclassified in the previous model. We are focusing on the errors. The breakthrough in AdaBoost was, let’s not just do random trees, but let’s improve iteratively every time, to focus on the things we didn’t do well in the previous tree.
00:43:40
The second thing is, let’s also consider how well each one of the trees is performing in our final result. Don’t give everybody the same. It’s not a democracy anymore. What is it called? A meritocracy. How well you perform gives you a certain weight. Those were the two, I would say main breakthroughs on AdaBoost. Of course, there’s more to it, but that took it to a new level. It’s no longer just a random bagging, or bootstrap aggregating, it’s conscious. Let’s think of what we’re doing, and iteratively improve on this sequence.
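For completeness, a minimal scikit-learn sketch of AdaBoost on a classification problem (data names are placeholders; in scikit-learn versions before 1.2 the estimator argument is called base_estimator):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Assume X_train and y_train (e.g. churned / did not churn) already exist.
# Depth-1 trees ("stumps") are the classic weak learner for AdaBoost.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=1.0,
    random_state=0,
)
ada.fit(X_train, y_train)

# The final prediction is a weighted vote of the stumps;
# estimator_weights_ holds the per-tree scores described above.
print(ada.estimator_weights_[:5])
```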
Jon Krohn: 00:44:18
Starting on Wednesday, April 4th, I’ll be offering my Machine Learning Foundations curriculum live online via a series of 14 training sessions within the O’Reilly platform. Linear Algebra, Calculus, Probability, Statistics and Computer Science will all be covered. The curriculum provides all the foundational mathematical knowledge you need to understand contemporary machine learning applications, including deep learning, LLMs and A.I. in general. The first three sessions are available for registration now, we’ve got the links in the show notes for you and these three sessions will cover all of the essential Linear Algebra you need for ML. Linear Algebra Level 1 will be on April 4th, Level 2 will be on April 17th, and Level 3 will be on May 8th. If you don’t already have access to O’Reilly, you can get a free 30-day trial via our special code, which is also in the show notes.
00:45:04
That’s the one line main difference between Gradient Boosting and AdaBoost?
Kirill Eremenko: 00:45:11
Sorry, no, we haven’t gone into Gradient Boosting yet. The one line main difference between bagging, bootstrap aggregating, which is random forest, and boosting, which is AdaBoost, is the word adaptive, adaptive boosting.
Jon Krohn: 00:45:26
Right.
Kirill Eremenko: 00:45:26
You’re adapting to boost the observations that you didn’t predict well, that’s what you’re adapting.
Jon Krohn: 00:45:34
Right.
Kirill Eremenko: 00:45:34
It’s a conscious method. Rather than, all right, let’s rely on the law of large numbers, and get lots of votes or predictions and average them out, like a democracy, in AdaBoost it’s a meritocracy. It’s a conscious meritocracy. Let’s adapt to consciously work on our mistakes, and then also, let’s give not just an average, but a weighted average, because for the trees that are not performing well, why would we consider them in our final average as highly as the ones that are performing well?
Jon Krohn: 00:46:08
Nice. I actually didn’t know about AdaBoost before, so great to hear about it. Thank you.
Kirill Eremenko: 00:46:14
Yeah, it’s not that popular these days, because Gradient Boosting blows even AdaBoost out of the water, but it was an important stepping stone. I like the history of how things developed. I thought I would mention it, and also, for people’s general understanding. For example, in the course that I mentioned, which you can get at www.superdatascience.com/level2, the number two, we don’t talk much about AdaBoost, but we go into some detail on it. I think it’s good to know the history of where things come from, and if it comes up in a conversation, you’ll know.
Jon Krohn: 00:46:49
Sure.
Kirill Eremenko: 00:46:51
Now, we can move on to Gradient Boosting. We’ve laid the foundation. The difference between just bagging, or bootstrap aggregating, a blind democracy, nothing wrong about that, versus a conscious meritocracy, so to speak. Now, we can move on to Gradient Boosting. Gradient boosting was originally proposed by Jerome H. Friedman in 1999, and there’s two papers you can find online. One is called Greedy Function Approximation: A Gradient Boosting Machine, and I think that was more of a lecture that he gave, because it’s got 40 pages or something like that. The second paper you can find is Stochastic Gradient Boosting. This is the person who created it. What is Gradient Boosting, and how is it different to bagging, bootstrap aggregating, and AdaBoost?
00:47:45
The main thing with Gradient Boosting is that this time, we’re not just going to adapt. We’re actually going to be changing what each model in the chain predicts, not our sample. We’re not going to be doing any bootstrapping. We are working with the original sample all the time. That’s very important to understand. There’s no bootstrapping in Gradient Boosting. For what you do in Gradient Boosting, we’re going to look at Gradient Boosting for regression first. You have your 1,000 customers. I can’t believe this example stuck. That was just a random thing I wanted to do for the trees.
00:48:19
You have these customers, 1,000 customers that bought candles from your store. You want to predict the future spend of customers. Your target variable is your dollars spent. What you do is, you take as your first step… again, Gradient Boosting is again going to be an ensemble of models. Your very first model is just an average. It’s a simple average. You take the average of all of the dollars that all of your customers spend, and let’s say you get something like $57 for simplicity’s sake. That’s the average of all your customers’ spend. Next, what you do is you calculate the errors. You look at, “Okay, $57 is my average,” that is, of course, a terrible prediction, a terrible model. You just took an average. For some observations, you’ll have an error, some observations will be lower, some observations will be higher.
00:49:14
You basically calculate the error for each one of your 1,000 samples, and then you take those errors, whether the error is $2 or $20 or minus $100, you take all of those errors and you build a decision tree to predict those errors. The first model, it takes the average, works with a sample. The second model, which is our first decision tree, it’ll work with all of the errors that you got as a result of the first model. Now, this decision tree will be structured in some sort of way. It’ll make its own predictions, and now you will have, again, errors.
00:49:55
You will have errors of this decision tree’s prediction, and some might have $5 error, some might have a minus $50, minus $100. Again, you look at all the errors of the predictions of this second model, which is a decision tree, and you use those errors. Again, you’ll have 1,000 errors, in some cases it might be zero, but you’ll have 1,000 values, and you use those errors, and you make another decision tree. Your third model will also be a decision tree, and it’ll predict the errors of the second model. Then you build a fourth decision tree, which will be predicting the errors of the third model. Then you build a fifth model, which is also decision tree, and you predict the errors of the previous model, and so on. You chain them together. The key word here is, you’re chaining models after each other.
00:50:44
The first one is an average, and then it’s decision tree, decision tree, up to 100 times, however many decision trees you want. Each one is just focusing on predicting the errors. What’s the point of that? Well, guess what? Now, as our final model, we’re not going to take the average, we’re not going to take the weighted average. As the final model, we’ll take the sum. You’ll take your first model, which is the average. You’ll add the result of the second model, which is whatever the decision tree predicts for this kind of variable. Let’s say you have a new customer come into your store, and you want to predict how much they will spend. The answer will be the average, which was $57, plus whatever the next model, model number two, which is a decision tree, whatever it predicts, plus whatever the next decision tree predicts, plus whatever the next decision tree predicts and so on.
00:51:32
You add all of that up, and because each time you are predicting the errors, now your prediction is the average plus, what would the error be for this person that just came in? Okay, the starting prediction for this person is $57. Then, based on their age, based on their income, whatever the decision tree is looking at, the error of this initial model would’ve been minus $101. You need to add that. You go from 57 minus 101, I’m not that great at mental arithmetic. What is it, minus 44? Yeah, minus 44. Then the next model will be, what would the error have been based on that previous prediction? The error would’ve been, say, $47. Now you go up to $3, and then the next model says the error at this point is about $27.
00:52:25
Now you go up from $3 to $30, and so on and so on and so on. Then the final result is, this customer, based on their features and based on what the model predicts for them, this customer will likely spend $39 in our store. That’s how Gradient Boosting works. You’re basically chaining models to constantly just predict the errors of the previous one, and that means in the resulting model, you need to just add them up. Each prediction will be a prediction of the errors, and in the end, you will get your final result.
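Here is a hand-rolled sketch of that chain: an initial average, then trees fit to the residuals, with the final prediction being the sum. This is the simplified residual view, and it leaves out the learning rate (shrinkage) that real libraries apply to each tree's contribution:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_gbm(X, y, n_trees=100, max_depth=3):
    """Gradient boosting for regression in its simplest residual-fitting form."""
    base_prediction = float(np.mean(y))      # model zero: just the average
    current = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):
        residuals = y - current              # what the chain still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # the next model predicts the errors
        current = current + tree.predict(X)  # chain it on
        trees.append(tree)
    return base_prediction, trees

def predict_simple_gbm(base_prediction, trees, X_new):
    # Final prediction = average + sum of every tree's predicted correction.
    return base_prediction + sum(tree.predict(X_new) for tree in trees)
```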
Jon Krohn: 00:53:01
Nice, very well explained.
Kirill Eremenko: 00:53:04
I just had this idea, it’s probably good to call the first model, the average, call it model zero, because otherwise, it’s confusing. The first decision tree is the second model. The second decision tree is the third model. Model zero is your average, and then model one is a decision tree. Model two is your second decision tree, and so on and so on and so on. The final result is the sum of this chain.
00:53:29
As you can see, it’s very different to what we had previously in the bootstrap aggregating methods, which were bagging, basically a random forest, the way we took the average. It’s also different to the boosting method of AdaBoost, where we took a weighted average of the models. AdaBoost is in between. It uses bootstrapping and it uses aggregating, so it’s a bootstrap aggregating method in the sense of how the samples are built, but it’s a boosting method based on the concept. AdaBoost is transitional, whereas Gradient Boosting is pure boosting. There’s no more bootstrapping. You go straight into using the same data set all the time, but you focus on improving, improving, improving, improving.
Jon Krohn: 00:54:15
To summarize back, the random forest is random. This went into your democracy versus meritocracy example. With a random forest, you are randomly creating a whole bunch of decision trees, and the more that you create, you get this slight marginal improvement. When you go from 1,000 random decision trees to 1,001, there’s this very marginal improvement. The core idea with adaptive boosting was to not be randomly creating these decision trees, but to use some information, like which data points were misclassified previously, and let’s overweight those in the subsequent model so that we’re consciously iterating in the right direction adaptively, AdaBoosting. Gradient boosting takes us another level further by not just saying, “Let’s focus on the data points that were misclassified,” but, “Let’s look at the residuals, the specific delta between what the correct answer would’ve been and what the model predicted, and let’s fix those residuals.” You’re focusing on where the most possible opportunity for improvement is, and that’s why Gradient Boosting is so powerful.
Kirill Eremenko: 00:55:38
Yep, absolutely. Great summary. A question that you, our listener, might have at this stage is, if you’re focusing on residuals, why is it called Gradient Boosting? Why isn’t it called residual boosting, or error correction or something like that? Well, we’ll answer this question right now. The answer lies in the mathematical principles underpinning this algorithm. I’m going to be a bit more out of my depth, a little bit less experienced than Jon talking about this, so Jon, please feel free to correct me if you feel that something needs correcting. Basically, what happens is, the explanation we just looked at, where you look at the residuals and you build every next model in the chain to predict the residuals of the previous one, is correct, but it’s a simplified explanation. The actual underlying mathematics of it is that you don’t look at the residuals of the model.
00:56:38
This is how Gradient Boosting works in proper mathematical terms. Gradient boosting, take two. The first thing you do is define a loss function. You choose what loss function you’re going to use for this model. Then you will calculate the gradient of the loss function after each model is built. You have your first average, you have your loss function, then, for every one of those 1,000 points, you will calculate the gradient of the loss function, and the next model in your chain will be built to predict those gradients. We agreed that the original model is model zero, the average. You’ve done the average, you’ve calculated the gradient of the loss function. By the way, if somebody needs a quick refresher on what a gradient is, a gradient is basically a vector of partial derivatives of the loss function.
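In symbols, for a loss L that depends on two variables, x and y, the gradient is just the vector of its partial derivatives:

```latex
\nabla L(x, y) = \left( \frac{\partial L}{\partial x},\ \frac{\partial L}{\partial y} \right)
```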
00:57:28
You have a loss function, let’s say it’s based on two variables, X and Y, then the gradient will be a partial derivative of that loss function based on the variable X. That’s your first coordinate in the vector, and your second coordinate is a partial derivative of the loss function based on the variable Y. If you have five independent variables, or five variables in your loss function, then it’ll have five coordinates in the vector. That’s very brief overview of what a gradient is. There is a cool video from Khan Academy, if anybody wants to get a refresher on what a gradient is. Really short, succinct, and gets to the point. Back to Gradient Boosting. You have model zero, which is the average. That’s your prediction. Then you calculate the gradient of your loss function for every single one of those 1,000 points. Your next model, model number one, decision tree number one, is going to be built to predict those gradients that you’ve just calculated. After model one is built, you will calculate the gradient of the loss function for this model number one in every single one of your 1,000 points.
00:58:43
Now, model number two, decision tree number two, is going to be built to predict the gradients that you’ve just calculated for model number one, and so on and so on and so on. That’s why it’s called Gradient Boosting. A gradient tells you in which direction your loss function is increasing the most from this point, and the larger the gradient, the steeper the increase in the loss function. Your goal is to minimize your loss function, and by having each new tree predict those gradients (strictly, the negative gradients — the direction of steepest descent) and chaining these models together, that’s effectively what you’re doing.
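For readers who want to see this chain concretely, here is a minimal from-scratch sketch of the regression procedure just described, using scikit-learn decision trees. The toy data, tree depth, learning rate, and number of trees are illustrative assumptions, not anything from the course:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: 1,000 observations, a few features (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=1000)

n_trees = 100
learning_rate = 0.1

# Model zero: just the average of the target.
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # For squared loss, the negative gradient at each point is the residual.
    negative_gradient = y - prediction
    # Each new tree is fit to predict those (negative) gradients.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, negative_gradient)
    trees.append(tree)
    # Add the new tree's output to the running prediction, scaled by a learning rate.
    prediction += learning_rate * tree.predict(X)

print("final training MSE:", np.mean((y - prediction) ** 2))
```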
00:59:22
Now, how does this reconcile with what we just discussed with the residuals? Well, it just so happens that this loss function for regression problems is chosen in a very conscious and deliberate way, and it’s usually the simple squared loss function that is used. Basically, if you think of the mean squared error loss function, which is used for, for example, linear regression — it’s observed minus predicted, squared, summed over all the observations, and divided by N, the number of observations — that’s an aggregate loss function. If you think of it for an individual observation, what is the loss function for an individual observation when you’re using mean squared error?
01:00:06
Well, the individual observation’s loss function is just observed minus predicted, squared. That’s the loss function for an individual observation. In the case of Gradient Boosting, we’re using the same loss function, it’s called the simple squared loss function, but we are just adding a coefficient at the start, which is one half. The loss function for an individual observation equals one half of brackets observed value minus predicted value, and those brackets squared. When you take the gradient, or the derivative of that, the derivative of X squared is 2X, so the two comes out and gets canceled out with the one-half coefficient that you have at the beginning. So the derivative of the loss function (strictly, its negative, since we differentiate with respect to the prediction) is basically observed minus predicted, which equals the residual. So the loss function is chosen consciously and deliberately in such a way that its derivative — the thing we use to minimize the loss with this whole method — is the same as the residuals. And that’s where the actual mathematical explanation that we just went through reconciles with the simplified explanation we had earlier about the residuals. And I think that’s really beautiful.
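Written out for a single observation, the reconciliation looks like this (standard squared-loss algebra, not specific to any one library):

```latex
L(y_i, \hat{y}_i) = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2,
\qquad
\frac{\partial L}{\partial \hat{y}_i} = -(y_i - \hat{y}_i),
\qquad
-\frac{\partial L}{\partial \hat{y}_i} = y_i - \hat{y}_i = r_i
```

so the negative gradient of the half-squared loss is exactly the residual r_i that the next tree is trained to predict.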
Jon Krohn: 01:01:21
Nice, Kirill. All right, so that was all about regression. What’s different in Gradient Boosting when we have a classification problem?
Kirill Eremenko: 01:01:29
Okay, so for classification, Gradient Boosting is not as straightforward, not as simple. The reason is that when we chain these models — we’d still be adding up models — you can’t think of it as simply as we did in the case of predicting the residuals, because in the case of classification, we’re predicting probabilities. And if you start adding up probabilities from 100 decision trees, you’ll end up with probabilities of over one. Basically, it’s not as elegant.
01:02:02
We’re not going to go into detail on the underlying core principles — again, check out the course if you’d like to learn more. The main thing to take away about boosting, or Gradient Boosting, for classification is that the underlying concept is the same. You define a loss function, you calculate the gradients, and in the case of classification, what is normally used is called a binomial deviance loss function. You calculate the gradient of your… well, the first model, model number zero, is usually set not at the average but at 50%. So if you’re, let’s say, classifying between two categories — will churn, will not churn; has cancer, does not have cancer — you set it at 50%.
01:02:41
Sometimes if you’re more advanced and you have reasons, you can set the baseline at higher, 75% or 25% or whatever else, but that’s up to your specific use case. So you set a baseline in the zeroth model, the original prediction is 50%, whatever you choose. And then basically you have a loss function. You will need to find the gradients of the loss function for every one of your observations, and then the next decision tree will be minimizing that loss function. And when you choose the binomial deviance loss function, which is, as I understand, typically used for classification problems in Gradient Boosting, what will happen is when you’re getting to the gradients, when you’re calculating the gradients, rather than probabilities, you will end up working with log odds. So it’s a very interesting transition, and in the course we explored, it looks really cool.
01:03:41
So you want to predict probabilities, but what the decision trees are actually doing, what the chain of decision trees is doing, is it’s predicting log odds, and that allows you to add them up. So you have the same principle. You will be adding up the predictions of each decision tree, but you won’t be getting probabilities, you’ll be getting a log odds. We’re not going to go into detail what log odds are right now, they’re related to probabilities. And once you add up all the log odds from your decision trees for your prediction, then you go from the world of log odds, you do an inverse operation, and you go back into the world of probabilities, and then you’ll get your final probability.
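As a rough sketch of that add-up-log-odds-then-convert idea — a simplification of what the libraries actually do, with the toy data, trees, and learning rate all assumed for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(float)

def sigmoid(z):
    # The inverse operation: converts log odds back into probabilities.
    return 1.0 / (1.0 + np.exp(-z))

# Model zero: a baseline of 50% probability, i.e. log odds of 0.
log_odds = np.zeros_like(y)
learning_rate = 0.1

for _ in range(100):
    prob = sigmoid(log_odds)
    # Negative gradient of the binomial deviance (log loss) w.r.t. the log odds.
    negative_gradient = y - prob
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, negative_gradient)
    # The trees' outputs are added up in log-odds space, not probability space.
    log_odds += learning_rate * tree.predict(X)

final_prob = sigmoid(log_odds)
print("training accuracy:", np.mean((final_prob > 0.5) == y))
```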
01:04:20
So that’s just a quick teaser for what classification looks like in Gradient Boosting. Again, if you want to dive into detail, check out the course that we have, Machine Learning Level 2 at SuperDataScience, or feel free to do this research on your own. But yeah, basically the takeaway is underlying concept of gradients is the same, and it works as well for classification as it does for regression.
Jon Krohn: 01:04:47
Nice. Yeah, as you started to introduce the classification variant of Gradient Boosting, you said it’s not as simple, but actually it wasn’t that much more complicated either. You’re using log odds, you just got to add that in.
Kirill Eremenko: 01:05:01
Man, log odds. For me, it took a while to get my head around log odds. You’re probably right, it’s not that complex, but it’s just harder to visualize. I like the regression alternative because you’re visualizing this chain, you’re adding up, it makes sense. You’re predicting errors, you’re constantly reducing this error through your prediction. Whereas this one is like log odds, you have to go back to probability. It’s not as easy to just have a picture in your head.
Jon Krohn: 01:05:26
Yeah. Yeah, yeah, yeah. Nice. All right, so I think you’ve now gotten through all of the “vanilla” Gradient Boosting. Everything that you said so far sounds super amazing, but now I’m adding this vanilla adjective because it turns out there have been several variants on the regular vanilla Gradient Boosting that you’ve described that get even more powerful results. Things like XGBoost, LightGBM, CatBoost. You want to dig into those?
Kirill Eremenko: 01:05:58
Yeah. Okay, let’s do it. So all of that was foundational and very important. However, up until 2014, Gradient Boosting, again, I’m not an expert or researcher, or historian for that matter, but my impression is that Gradient Boosting was purely or mostly theoretical. It wasn’t very applied, very applicable, because it was slow. Because when you’re building these hundreds of decision trees, you have to find the splits.
01:06:28
Imagine that for salary, for example, you have 1,000 estimated salaries. So even with just 1,000 people in your sample, you need to make that split. How does the decision tree, one of the decision trees, know that it needs to split at more than 47,000 or less than 47,000? Why is it not 46,500? Why is it not 93,000? Why is it not 12,000 or $12,534? How does the decision tree know? And if you have 1,000 samples, of course there are optimization techniques built in, but it has to look through a lot of options to find out where that best split is — which split is going to give the best result? Because it can only choose one. At each branch, it can only choose one split. Even if you have just three variables, it has to look through three variables and through all possible splits inside those variables. So theoretically it’s a cool algorithm, but unless you find ways to speed it up, it’s just going to stay theoretical and you’re going to get bogged down. It’s very slow. And that’s exactly what happened in 2014. A gentleman named Tianqi Chen, I hope we’re pronouncing that right, as part of the… What is it called? OD-
Jon Krohn: 01:07:55
You’re basically guaranteed to not be pronouncing that right, because you’re not going to know how to do the tones. But I guess-
Kirill Eremenko: 01:08:00
Yeah. Yeah.
Jon Krohn: 01:08:01
… yeah. Tianqi Chen is a good guess for us people that can’t hear tones. Unless, can you hear? Chinese tones, is that something that you’ve studied?
Kirill Eremenko: 01:08:11
No. No, I have no idea. So I apologize if I mispronounced that. I’m doing my best. I was just looking up the abbreviation for XGBoost. So XGBoost was originally introduced in 2014 as part of this community on GitHub called DMLC, the Distributed Machine Learning Community. It was produced by this gentleman, Tianqi Chen, who I believe was a student at the time, or he’s a professor right now. I’m not sure which university. I would guess Carnegie Mellon. I’m not sure. But basically he produced this as part of this open source community in 2014. And the whole principle of XGBoost — the XG stands for eXtreme Gradient Boosting — as I understand it, was to come up with ways to speed it up so we can actually use it in applications and not leave it as a theoretical algorithm for the rest of time. Yeah, go ahead.
Jon Krohn: 01:09:19
Yeah, you remembered correctly, your instinct was right. Tianqi is an assistant professor at Carnegie Mellon, though it seems like they’re also a co-founder and chief technologist at something called OctoML. So they’re one of those amazing people who’s bridging academia and practical data science. OctoAI is running, tuning, and scaling the models that power AI applications.
Kirill Eremenko: 01:09:44
We should have them on the show, Jon.
Jon Krohn: 01:09:45
Yeah, it’s a great idea, for sure.
Kirill Eremenko: 01:09:47
Sounds like my big [inaudible 01:09:48].
Jon Krohn: 01:09:49
That is certainly the kind of guest I love. We’ve had quite a few guests, yeah, where they do both of those things. Where they’re academics, right on the cutting edge of developing machine learning, but then simultaneously they’re at the cutting edge commercially. That is one of my favorite guests. You’re exactly right.
Kirill Eremenko: 01:10:08
Yeah. If anyone listening knows Tianqi Chen, author of XGBoost, please shoot him an email and introduce him to the podcast. We’d love to talk to him about it. That’d be cool. Anyway, we are digressing. So after the method came out, by 2015 it had become very popular. There was a follow-up research paper from Tianqi Chen, which you can find on arXiv. It’s called XGBoost: A Scalable Tree Boosting System. That was published in 2016.
01:10:40
Okay, so after the method came out, it became super popular. For example, in 2015, the year after it came out, out of the 29 winning solutions on Kaggle, 17 of them used XGBoost. I think eight of them were pure XGBoost, and nine of them were a combination of XGBoost and deep learning. But nonetheless, 17 out of 29, more than half of the winning solutions in Kaggle, literally the year after, were already using XGBoost. That’s how popular it was.
01:11:11
Also, in the KDD Cup in 2015 — again, the year after it came out — XGBoost was used by every winning team in the top 10. How cool is that? And that’s because, again, I’m not a researcher, but as I understand it, XGBoost was the transition from theoretical Gradient Boosting — which sounds amazing but is very difficult to apply because of its computational inefficiency and demand for resources and other things — to applied Gradient Boosting. It was the bridge.
Jon Krohn: 01:11:46
Nice. Did you mention and I just didn’t quite catch it, that XGBoost stands for eXtreme Gradient Boosting?
Kirill Eremenko: 01:11:52
Yeah, yeah, yeah. Yeah.
Jon Krohn: 01:11:52
Oh, you said that?
Kirill Eremenko: 01:11:52
Yeah.
Jon Krohn: 01:11:53
Well, there you go. Now I reiterated it, totally on purpose.
Kirill Eremenko: 01:11:57
Yes, eXtreme Gradient Boosting. Okay. And to this day. It wasn’t just 2015. To this day, XGBoost, and LightGBM and CatBoost, which we’ll talk about just now, are some of the top used non-deep learning algorithms. So when you look at, I think, Christian Chabot… Is it Christian Chalet or Chabot? The creator of Keras, what is his name?
Jon Krohn: 01:12:24
Oh, Francois Chollet.
Kirill Eremenko: 01:12:24
Francois. Francois Chollet. I was thinking of the founder of Tableau. Francois Chollet did a post, I think in 2018. Yes, it was 2018. He asked the top first, second, third place teams on Kaggle which methods they used between 2016 and 2018, and it turns out the first place is Keras, which is deep learning, but the second place is LightGBM, third place is XGBoost. So even to this day, they’re still being used.
01:13:00
Anyway, so let’s dive back into what is great about eXtreme Gradient Boosting that wasn’t so great in normal Gradient Boosting. So first thing, it uses special kinds of decision trees. The way the decision trees… in normal Gradient Boosting, the way they’re constructed is by greedily, is a technical term, greedily choosing the one split that maximizes the reduction in loss function across all possible splits in each step. It looks for the one, the maximal, the best split. Whereas in XGBoost, it uses something called similarity score, and then it uses a gain calculation. So basically it looks at… Okay, so the way to think about this without going into the mathematics is, how similar are the observations? So we have, let’s say, our first split. We have those 1,000 observations. This is tree number one. We’re looking at the errors. We have those 1,000 errors. How similar are those errors between each other? You calculate similarity score, and then you can have the split in different areas.
01:14:06
So for each split — and it has an optimized way of not going through all possible splits; for example, for the salary variable there’s an optimized way where it looks at fewer of them with a certain step, but we’re not going to go into detail on that — it basically looks at: okay, if I do the split here, what’s the similarity score on the left of the split and on the right? If I do a split in this branch, I’ll end up with two leaves. What’s the similarity score among the observations that will end up in the left leaf, and what’s the similarity score among the observations that will end up in the right leaf? Then you calculate a gain. You want similarity to be higher. The higher the similarity, the better.
01:14:53
The gain is calculated as similarity of the left leaf, plus similarity of the right leaf, minus the similarity that you originally had in the leaf that you’re currently splitting. What that does, is if the gain is greater than zero, that means that you’re going to actually gain something from doing the split. If it’s less than zero, you’re not going to gain anything. Also, you want to find the split with the highest gain. So that’s number one. They’re a special kind of decision tree. The way they think about the splits is through similarity scores and gain calculations.
01:15:25
The second thing is tree pruning. So you build this tree. It builds it depth-wise, so it goes from level one to level two where you have your branch. You split into two areas, two splits. Then you split again into four. Then you split again into eight. So it builds it depth-wise, and then it prunes it. Pruning is like cutting it, going from the bottom to the top and looking at the gain that you have in each one of the leaves, the gain we just talked about. And then you have a hyperparameter gamma. So when you’re building an XGBoost model, you’ll see a hyperparameter gamma. That is for this pruning.
01:16:02
So gamma, let’s say you set it to 100, or let’s say you set it to 110, for example, just not to have round numbers. So let’s have 110 as your gain, or your gamma. If your gain in a certain leaf is less than gamma, then you will remove that leaf, and then you’ll go to the next one, go up the branch. If the gain is, again, less than gamma, you’ll remove that leaf, and so on, until you hit a leaf with a gain more than gamma. That way you reduce the size of your decision trees. That’s called tree pruning.
01:16:33
Next one is regularization. XGBoost has built-in regularization, so it’s not something you have to add separately. It has built-in L1 and L2 regularization, and basically, without going into detail, regularization helps prevent overfitting. Next one is sampling. As we discussed earlier, Gradient Boosting uses all of the samples. So you have 1,000 samples; there’s no bootstrapping — you use all of them every time. The first time, model zero used 1,000 samples and you take the average. With the next model, you take the 1,000 errors and you build that model. The next model, you take the 1,000 errors of that model, and so on, and so on, and so on.
01:17:23
XGBoost has built-in sampling of rows. So you can tell it, I don’t want to use 100% of the rows — there’s a hyperparameter for this — I want to use 80% of the rows. So now each tree will only see a random 80% sample. Let’s say tree number one will see 80% of the rows and be built on that. Tree number two will see a different 80% of the rows and be built on that. Tree number three, and so on, and so on. It also does sampling of columns. As Jon and I discussed, you can tell it to sample 80% of the columns, or whatever percentage you want, or you can build it on all of them, but there’s a hyperparameter for sampling columns.
01:17:59
There is built-in cross-validation, k-fold cross-validation, if you want. Also, without going into detail on the technicalities, XGBoost was developed with high scalability and performance in mind. So there are additional optimizations specifically for hardware and for accelerated computing, and it also supports distributed computing to handle very large datasets. XGBoost was built with all those things in mind, and as a result, it’s very efficient, and that allows it to do more optimization cycles in the same period of time than a different model would. That’s what makes it so incredibly superior. I think it was actually Francois Chollet, if I’m not mistaken, who said that the winning teams are the ones that can iterate more times in a given amount of time. That was about Kaggle, but of course the same thing applies in industry. The best models are the ones where you can iterate more times in the same given time, so you want your model to be super efficient, super fast.
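In the xgboost Python package, these ideas surface as hyperparameters. Here is a hedged usage sketch — the dataset and the specific values are placeholders, and defaults may differ between library versions:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.3, size=1000)

model = xgb.XGBRegressor(
    n_estimators=300,       # number of boosted trees
    learning_rate=0.1,
    max_depth=4,
    gamma=1.0,              # minimum gain required to keep a split (pruning)
    reg_alpha=0.1,          # built-in L1 regularization
    reg_lambda=1.0,         # built-in L2 regularization
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # column sampling per tree
)
model.fit(X, y)
print(model.predict(X[:5]))
```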
Jon Krohn: 01:19:05
Yeah, it’s super fast. That’s also key for when you get your model into production where it’s ideal, obviously, if your costs are lower, you don’t need as much compute to be able to support lots of users using your model in real time in that production infrastructure. So, valuable for sure.
Kirill Eremenko: 01:19:22
All right, so that’s XGBoost. Let’s talk about LightGBM, do a quick overview, and CatBoost. Okay, so LightGBM, introduced by a team at Microsoft in 2017, a few years after XGBoost. You can find a paper. The paper is called LightGBM: A Highly Efficient Gradient Boosting Decision Tree. And interesting comment, in 2022, LightGBM was dominant out of the gradient boosted decision trees models among Kagglers. So even already, according to that poll that Francois Chollet did, it was already ahead of XGBoost by 2018. More winning models on Kaggle were using LightGBM.
01:20:11
So what’s super cool about LightGBM? Well, LightGBM takes what we just discussed, that concept Jon and I just mentioned about more iterations, more cycles, iterative cycles in a given time, takes it to the next level. I don’t have a reference for this, but LightGBM is considered to be, by some blog I read somewhere online, it’s considered to be 20 times faster than XGBoost. So it sacrifices accuracy for speed, and does so consciously and in a few methods, which we’ll talk about now. It sacrifices some accuracy for a huge gain in speed.
01:20:50
Okay, first thing. The biggest, coolest thing about LightGBM that you need to know is that the trees use histogram-based splits. So think about our column with salaries. We have the salaries of 1,000 people, estimated salaries that we’ve estimated. The salaries can range from zero — somebody who’s currently unemployed, not working — up to maybe 100… I don’t know what salaries… it could be 200,000. What was the research salary for somebody doing machine learning at… what’s it called? Netflix? Or Anthropic? 600,000 I saw. Like crazy, for the research-
Jon Krohn: 01:21:29
Yeah, base pay.
Kirill Eremenko: 01:21:29
Let’s say for argument’s sake, it goes up to $150,000. But in that range you have 1,000 different values. It can be $67 and $232.23. There’s a huge variability of salary in that. There’s lots of options that salary take. What if you had not 1,000 customers, or you had 10,000 customers? There’s now potentially 10,000 different values, 10,000 places you could split this dataset.
01:21:58
But what if you take all of these salaries and you put them into a histogram — basically you bin them? So you create bins with a $10,000 step. Your first bin is from zero to $10,000. Your second bin is from $10,000 to $20,000, then 20,000 to 30,000, and so on. So if your salaries are between zero and $150,000 and your bin size is 10,000, you end up with 15 bins — 10 bins gets you to 100,000, so 15 bins total. Now all of your observations, all of your customers, end up in one of the bins of this histogram. And what LightGBM does is it doesn’t split on the salary, it splits on the bins. So now it has only 14 options — if you have 15 bins, you only have 14 places where you can make that split for salary.
01:22:56
It’s so much faster, right? Instead of looking at 1,000 different options, you’re looking at 15. And if you have, let’s say, 10,000 customers, instead of looking at 10,000 options, hypothetically, you can only still look at 15. It doesn’t change. So this histogram-based split, of course, it’s less accurate, of course, you can’t now split somewhere in the middle of a bin. It has to be on the borders of a bin. You can only split at 50,000 or 60 or 70 or 80, you can’t split at $63,000, but you’re sacrificing accuracy. The whole idea behind LightGBM, in my view, is sacrificing accuracy for speed. That’s one of the biggest parts.
01:23:34
The next one is really fun. A bit more complex, but really fun to understand. It’s called Exclusive Feature Bundling. Basically, this is reducing the number of columns at all costs. At all costs, as in, whatever it takes, let’s reduce the number of columns. The authors of LightGBM, if you read the paper, they say that most of the datasets available in the real world in businesses have sparse data, have sparse columns. A normal column like salary, estimated salary, is a dense column. You’ve estimated salary for every single observation. A sparse column is that column that mostly has zeros. Here and there sometimes it’ll have some value.
01:24:18
I was thinking last night about how to illustrate why most datasets would have sparse columns. Well, let’s look at a couple of examples. Let’s say you have a dataset of your 1,000 customers, and you have five sales representatives in your candle store that they can call. Each sales representative has a column: sales rep one, sales rep two, sales rep three. And you’re recording, if a customer called, which sales representative they spoke with and for how many minutes. So sales rep one will have a number of minutes for each customer they spoke with. It might be five minutes for customer one, it might be 120 minutes for customer number 733, et cetera, et cetera.
01:25:09
But for most customers, it’s going to be zero. There’s no chance a sales representative spoke with… They only spoke with maybe, out of your 1,000 customers, they only spoke with 12 over the course of a month or something like that. And then sales rep two will have values for other customers, and so on. Most of the values in these columns will be zero.
01:25:32
Let’s look at another example, just to illustrate the point. You could have, for example, trucks. At a mining operation, you have 1,000 trucks and they can only do one of three jobs, or one of five jobs, and the jobs are the columns. So then you would have how much time or how many kilograms of ore did that truck do on that job? And then there could be another column for maintenance. For example, how many minutes did the truck spend on maintenance? And so on. So there can be a lot of columns that have mostly zeros and some values in it.
Jon Krohn: 01:26:07
Big surprise that our Australian podcast guest is talking about ore mining.
Kirill Eremenko: 01:26:13
Yes. Yes, that’s true. I actually worked on a project relating to that for six months back at Deloitte. It was really fun to go fly out to the middle of nowhere and be on this analytics project. Really cool.
Jon Krohn: 01:26:29
Putting an occasional one in a column that’s otherwise all zeros.
Kirill Eremenko: 01:26:33
Yeah, yeah. But in this case, we’re looking at columns with continuous values rather than classification columns, because it’s easier to talk about regression. So that was a second example of a sparse column. You can imagine lots of sparse columns in any kind of industry — like medical datasets, with which doctor the patient saw or which procedures they had. You can imagine columns with time sheets of employees and things like that.
01:27:05
Oh, you had a great podcast with the person from Spotify. Remember that podcast?
Jon Krohn: 01:27:12
Spotify? Oh, yeah. Erik Bernhardsson. The guy-
Kirill Eremenko: 01:27:17
Erik Bernhardsson.
Jon Krohn: 01:27:18
… who developed the Spotify [inaudible 01:27:19]-
Kirill Eremenko: 01:27:19
What episode number was that? That was such a good podcast. I loved when he was talking about… He was talking about sparse columns, because they had a huge spreadsheet or dataset at Spotify where each column is a song and each row is a customer, and each value is how many minutes of that song did that customer listen to it? It’s mostly zeros because you don’t listen to all of the songs. It’s a very sparse dataset.
Jon Krohn: 01:27:47
Yeah, that was episode number 619. Yeah, it is definitely a great episode with Erik Bernhardsson. He’s an amazing speaker, an amazing technologist, an amazing entrepreneur.
Kirill Eremenko: 01:27:57
Yeah, I loved that episode. So there are sparse datasets all over business and industry. We don’t see them that much in practice tutorials or examples, and that’s why it’s important to do labs, like live labs or workshops, which expose you to these kinds of real-world scenarios. But they exist and they’re all over the place.
01:28:17
What LightGBM does — we’re not going to go into detail on this — is it basically combines sparse columns into fewer columns. You might have 100 sparse columns. It has a method for combining them, even if they’re talking about different things. One could be talking about salaries, another could be talking about kilograms, another one could be talking about, I don’t know, time. It has a way to combine them: from 100, you might cut it down to four or five columns. And that’s very beneficial for Gradient Boosting decision trees and other methods. It really speeds up the decision tree building process.
01:28:57
We talked about sacrificing accuracy. The way it sacrifices accuracy is that in some cases it will combine… like in some cases, a row might be more populated than other rows. It might have values in a lot of these columns. Well, LightGBM will cut out, drop some of the values in order to achieve this combination. So we’ll reduce accuracy by allowing data loss, but at the same time, the speed will go up. So that’s called Exclusive Feature Bundling.
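A toy illustration of the bundling idea: two sparse columns that are (almost) never non-zero at the same time can be merged into one column by offsetting one of them. This is roughly the intuition behind Exclusive Feature Bundling; the real algorithm works on histogram bins and tolerates a limited number of conflicts, so treat this as a hand-rolled sketch:

```python
import numpy as np

# Minutes spent with sales rep 1 and sales rep 2 -- mostly zeros, and a customer
# (almost) never talks to both, so the columns are "mutually exclusive".
rep1 = np.array([5, 0, 0, 120, 0, 0, 0, 15, 0, 0], dtype=float)
rep2 = np.array([0, 30, 0, 0, 0, 45, 0, 0, 0, 10], dtype=float)

# Bundle: keep rep1's values as-is, shift rep2's non-zero values by an offset
# larger than rep1's maximum so the two value ranges never overlap.
offset = rep1.max() + 1
bundled = rep1.copy()
bundled[rep2 > 0] = rep2[rep2 > 0] + offset

print(bundled)
# A tree can now split on this single column and still separate "talked to rep 1",
# "talked to rep 2", and "talked to neither".
```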
Jon Krohn: 01:29:27
Nice.
Kirill Eremenko: 01:29:28
Good. Cool. Next one is Gradient-Based One-Side Sampling. This is a really cool one.
Jon Krohn: 01:29:33
Oh, yeah. When you say next one, it’s another… We’ve had a few of the key ideas behind LightGBM — which also gives us the etymology: LightGBM stands for Light Gradient-Boosting Machine.
Kirill Eremenko: 01:29:48
Oh, great. Yeah, yeah, that’s right.
Jon Krohn: 01:29:51
So Light, the idea that it’s so much faster, 20 times faster, as you’ve said, relative to XGBoost, which already was super fast relative to the vanilla Gradient Boosting that you went into. So you have been enumerating most recently the main ideas behind how LightGBM is 20 times faster than XGBoost. The first one was histogram-based splits. The second one was Exclusive Feature Bundling. And now you’re going to go into a third one, which is…
Kirill Eremenko: 01:30:23
Gradient-Based One-Side Sampling, or GOSS, G-O-S-S, for short. It sounds complex. It’s actually very simple. Based on what we discussed, it’ll be very easy to get your head around it. Basically, when you’re building a decision tree, you have the gradients that you want to predict and you want to reduce your loss function. Instead of building a decision tree for all of the observations, or instead of sampling 80% at random like XGBoost can do, why don’t we be smart about it and look at these gradients? So we have 1,000 gradients. Let’s order them from largest to smallest, and let’s take the top 20% — the top 200 gradients — and use them, plus a share of the remaining. So the top share is a hyperparameter, call it a, and then there’s a second hyperparameter, call it b. You have the remaining 80%, the remaining 800 gradients. Just to have a representative dataset, take a random 10% of those — so take a random 80. As a result, you will have 200 observations with the highest gradients plus 80 random observations with lower gradients, and you now have a sample of 280. And you’re going to build a decision tree based on that.
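Here is a minimal sketch of that sampling step, with the top fraction a = 20% and the random fraction b = 10% taken from the example above. Note that conventions differ on whether b is a fraction of the full dataset or of the remainder, and the real implementation also re-weights the randomly kept low-gradient rows to keep the loss estimate unbiased — both points are only noted in comments here:

```python
import numpy as np

rng = np.random.default_rng(3)
gradients = rng.normal(size=1000)   # stand-in gradients for 1,000 observations

a, b = 0.20, 0.10                   # top fraction and random fraction (hyperparameters)
n = len(gradients)
top_k = int(a * n)                  # keep the 200 largest-gradient rows
rand_k = int(b * (n - top_k))       # plus a random 10% of the remaining 800 rows

# Keep the observations with the largest absolute gradients...
order = np.argsort(-np.abs(gradients))
top_idx = order[:top_k]
# ...and a random sample of the remaining, lower-gradient observations.
rest_idx = rng.choice(order[top_k:], size=rand_k, replace=False)

sample_idx = np.concatenate([top_idx, rest_idx])
# LightGBM additionally up-weights the sampled low-gradient rows so their
# contribution to the loss is scaled back up; omitted in this sketch.
print("rows used for the next tree:", len(sample_idx), "out of", n)
```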
01:31:46
And why is that? Of course, you’re sacrificing accuracy — you’re using fewer rows. But you’re gaining a lot of speed, and you’re not sacrificing that much accuracy because you’re keeping the top 20% of gradients, which are the ones you need to improve anyway. It’s like the 80-20 rule, right? Let’s improve these 20% of gradients because they’re the highest. So that’s what gradient-based means, and it’s called one-side sampling because the random sampling only happens on one side — the low-gradient observations; the top gradients are all kept. So in total, if your parameters are 20% and 10%, you’ll end up with 28% of your dataset. That’s gradient-based one-side sampling, the third main idea behind LightGBM. And the fourth one is called leaf-wise tree growth. So XGBoost and normal Gradient Boosting have depth-wise tree growth. You start at level one, the branches, you have an if-else condition. Then you have two branches, then you split each one of those into four.
01:32:40
So you get four branches in total. Then you split that, you get eight, and so on. It keeps going from top to bottom, and at each level it splits on all of the sides. Whereas LightGBM approaches this in a smarter way. It’s like, “Why should we build such a massive decision tree? Let’s do one leaf at a time.” So at the start, you split into two branches. Then you have one on each side, right? Now, instead of splitting both, you only split the one that will give you the best result, the best improvement for your model — the left side or the right side. Let’s say you split the right side; now you have three leaves in total instead of four. Now, you split one of those leaves.
01:33:25
After this split — let’s say you split the bottom-most leaf on the left — you’ll have four leaves in total. So you’re adding one leaf at a time, increasing the number of leaves by one, because you’re splitting one leaf at a time. Whereas in XGBoost, you’re doubling your number of leaves every time. And the way this can be beneficial is that in LightGBM you’ll have a parameter for the maximum number of leaves. Your maximum number of leaves might be, like, 30. So instead of getting there by doubling every time, you’re getting there in a more conscious way, and your decision tree will look very different because you’re using the best splits every time. So that was the fourth and final main idea behind LightGBM that makes it sacrifice a bit of accuracy for a huge gain in speed.
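In the lightgbm Python package, these ideas map onto hyperparameters roughly like this. A hedged usage sketch — the values are placeholders, GOSS is opt-in, and parameter names or defaults can vary slightly between library versions:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.3, size=1000)

model = lgb.LGBMRegressor(
    n_estimators=300,
    learning_rate=0.1,
    num_leaves=31,          # cap for leaf-wise growth
    max_bin=255,            # number of histogram bins per feature
    boosting_type="goss",   # Gradient-Based One-Side Sampling
    top_rate=0.2,           # keep the top 20% of gradients
    other_rate=0.1,         # plus a random 10% of the rest
)
model.fit(X, y)
print(model.predict(X[:5]))
```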
Jon Krohn: 01:34:14
Nicely done, Kirill. You clearly know this stuff well. All right, so we’ve covered two of the specialized, real-world Gradient Boosting approaches that we see winning lots of competitions and that people talk about: XGBoost, and LightGBM more recently in the last couple of years. And the final one, CatBoost, is definitely one to talk about. And that is also actually one I’ve done a standalone episode on before.
Kirill Eremenko: 01:34:39
Oh, wow. Cool.
Jon Krohn: 01:34:41
Yeah. Episode number 694. It was just a Five-Minute Friday episode. You understand it a lot better than I do, though. I need notes.
Kirill Eremenko: 01:34:49
Let’s see. Let’s see. Please correct me if you find anything out of order. All right. So let’s talk about CatBoost. CatBoost Stands for Categorical Boosting. It was introduced by a team at Yandex in 2017. A paper is called CatBoost, unbiased boosting with categorical features. If you want to have nightmares of mathematical formulas, have a look at that paper. It is so heavy on math. So let’s go through the main ideas right now. And there’s only just two main ideas that I wanted to highlight. There’s, of course, more to the algorithm, but these are two main ideas. The name categorical boosting is that it deals very well with categorical features. So if you have features, columns in your dataset that are categorical, it handles them automatically. A quick note is that LightGBM also handles… So XGBoost doesn’t handle them automatically, you need to one-hot encoding.
01:35:49
LightGBM handles categorical variables automatically. Probably avoid doing one-hot encoding with LightGBM. What you need to do with LightGBM is you need to convert your categorical variables to integers and point a LightGBM at the categorical variables and it will deal with them, but it’s not as good as CatBoost. So CatBoost is extremely adept at dealing with categorical variables and it does it automatically on the fly. How does it do it? Well, let’s talk about one-hot encoding. Let’s say we have a dataset with your customers, and your 1000 customers are from two different countries. They’re from the UK and from France, right? So you have a column where it says country and it’s UK, France. UK, France, France, France, UK and so on. So what you’re going to do with one-hot encoding is you’re going to replace that column because machines, normally, can’t deal with categorical variables like words. So you want to put numbers in there.
01:36:54
So you’re going to create one-hot encoded columns, basically dummy variables. You’re going to replace, you create two columns, one for UK and one for France. And so whenever the customer is from UK, it’ll be one in the UK column, zero in the France column. And whenever the customer is from France, it’ll be zero in the UK column, one in the France column. And then actually, you will drop the second column. You want to avoid something called the dummy variable trap because one column is enough to encode two categories. So you’ll just have one column, for UK one or zero, which means in France. But if you have, let’s say three countries, France, UK, Germany, you will have two one-hot encoded columns. If you have 10 countries, you will have nine one-hot encoded columns or dummy variables.
01:37:41
And as you can see, as the number of categories grows, the number of columns explodes. And that is very bad for decision trees, because they have more columns to consider. That’s kind of the opposite of what LightGBM was doing by combining sparse columns. So that can be bad for decision trees. There are also other reasons why one-hot encoding is not great for decision trees. For example, it’s hard to assess the importance of the feature country. You’ll end up with, let’s say, nine columns for different countries. You can assess how important Germany is, or how important France is, yes or no, in your final result, but it’s hard to assess how important country is overall. Another reason it’s bad is that if you split your countries, your categories, into different columns with one-hot encoding, you limit the kinds of splits your tree can make.
01:38:39
Let’s say the if-else condition, you can only say, “Is it France, yes or no? Is it UK, yes or no? Is it Germany, yes or no?” You can’t say. ” If it is UK or Germany, go left. If it is France or Monaco go, right?” You can’t do that. So it limits the splits, the decision trees can do. Also, you get loss of ordinality. Let’s say you have a categorical column with high school bachelor, master’s, or PhD degrees. If you split into one-hot encoded columns, there’s no longer that ordinality that high school comes before bachelor, bachelor comes before masters, masters come before PhD. You’ve lost that data, that information and so some other things. So basically, there’s reasons why one-hot encoding is not the best for decision trees. Just as a note, one-hot encoding is actually a great method for certain algorithms such as linear regression algorithms and algorithms based on dot product, I think.
01:39:33
That’s where it works really well, like neural networks. That’s where one-hot encoding is very, very relevant, works well, et cetera. Decision trees, not so much. So there’s a different approach to how you can… There’s actually many ways how you can encode categorical variables, it’s just one-hot encoding is the one we most of the time use and that’s why we’re so familiar with it. There’s an approach called target encoding, and it wasn’t introduced… It’s separate to CatBoost. CatBoost uses a special type of target encoding. But in general, what target encoding does is you take your column, let’s say, we have Germany, France, and the UK in our country column for our customers. And then you look at the target variable, and the target variable is how much did they spend in your store. So the “What we’ve been predicting this whole time?”
01:40:20
For target encoding, the simplest way of doing it is you take all of the rows that have Germany and you take the average of your target — let’s say it’s $57.33. Then you take all of the rows that have France and you take the average of your target — let’s say it’s $92.50. And you take all the rows that have the UK and you take the average of your target, which could be $125. And then in your categorical column, you replace the word Germany with $57.33, France with $92.50, and the UK with $125. It’s a bit mind-blowing when you think about it, because you’re replacing words, categories, with continuous values — the average of your target. It’s mind-blowing for that reason. Well, the second reason it’s mind-blowing is… we’ll get to that in a second.
01:41:29
So why are we replacing categories with continuous values? Instead of Germany, it’s $57.33; France is $92.50; the UK is $125. Well, it’s called target encoding because you’re substituting the category with the relationship that the category has to the target. There’s a clear relationship: Germany has a lower average target than France, which has a lower average target than the UK. All you’re doing is taking the average of your target and substituting the words of the category with that average. And the one thing to remember about machine learning is it doesn’t have to be perfect as long as it works. And this works — this really works and helps you split your decision trees better.
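The naive version of target encoding described here is a few lines of pandas. The toy data is made up, and a real pipeline would compute these means on training folds only, for the leakage reasons discussed next:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "France", "UK", "Germany", "UK", "France"],
    "spend":   [55.0,      90.0,    120.0, 60.0,      130.0, 95.0],
})

# Replace each category with the average of the target for that category.
country_means = df.groupby("country")["spend"].mean()
df["country_encoded"] = df["country"].map(country_means)
print(df)
```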
01:42:18
One thing that you might be wondering, and rightly so, is: “Hey, but we’re taking the target and we’re putting the value of what we’re about to predict into our features.” So we have what’s called data leakage, right? You’re leaking information from your dependent variable — the thing you want to predict — into your independent variables by using target encoding. And there are certain ways of combating that. For example, there’s k-fold target encoding, there’s weighted target encoding, there’s k-fold weighted target encoding. We’re not going to go into detail on that; we talk about it in depth in the course. But there are ways of combating it and reducing the data leakage to some extent. And this is where we get to CatBoost. The way that CatBoost deals with this problem of data leakage is it looks at one of two things.
01:43:13
If your data has a timestamp column — basically a column that says in which order these observations were recorded — then it will use that: it will order your data based on the timestamp column, or whatever time data you have, from the earliest to the latest. And then basically, what it will always do is take a row — for example, let’s say row number 50 is France — and it will take the average over all of the rows that had France before it. So from row 1 to row 49, it’ll look at all of the rows that have France, take the average of the target for those France rows, and put that as the target encoding for France in row 50. Then it’ll go to row 51, and let’s say it says Germany. For row 51, it will look at all of the rows between row 1 and row 50 that had Germany as the category, take the average of the target for those rows, and use that average as the target encoding in row 51.
01:44:16
So you’re never looking at the value of the actual target of the row you’re encoding, you’re always looking at the rows that came earlier. And you’re not leaking data because those rows actually came earlier in real life. There’s a cause and effect. They happened earlier, so you’re fine. You’re allowed to look at that data in your target encoding. So that’s the simplest option. Second option is if your data does not have a timestamp column or does not have some order in which the columns came. What CatBoost will do is it will shuffle your rows randomly and it’ll do the same thing. You’ll basically pretend that there is a timestamp column. And it’ll go down, and it will look at the previous rows and use the average as a target encoding for this row, and the average will be calculated on all the previous rows. It’s a bit more complex than that, we won’t go into detail on that.
01:45:10
There’s four permutations, so it shuffles your data four times. And then it uses these permutations, three of them in one way, the other one in a separate way. So we won’t go into detail on that. But the point is it uses something called ordered target encoding, and it replaces a categorical variable with that. There still potentially can be data leakage to some extent with any kind of target encoding, especially if you’re shuffling rather than using a data timestamp column. But again, the big takeaway for me, anyway, from all of these Gradient Boosting methods and their optimization techniques is that it doesn’t matter if it’s not perfect, machine learning doesn’t have to be perfect. It has to give it results.
01:45:55
And even if there’s some data leakage, who cares? If your model works. It’s not perfect, but it works. It gets you the result. It works on the test dataset, it passes your k-fold cross validation. Who cares that there’s a bit of data leakage, or a bit of loss of data, or a bit of inaccuracy? As long as it works. And that’s what these Gradient Boosting methods is so good at is it that trade off of, “All right. We have Gradient Boosting, which is theoretical, what’s the point of that?” Let’s do some trade-offs. Let’s make it faster. Less perfect, but faster. And boom, we get amazing results.
Jon Krohn: 01:46:27
Wow. Yeah, you definitely know this way better than I do. I just learned a ton about CatBoost. So all of those are the main ideas for why CatBoost works so well for categorical data. As a Gradient Boosting algorithm, I think it’s the obvious go-to choice for working with categorical features. And yeah, so you talked about how it doesn’t use the typical one-hot encoding that you would get with a regression model, you combat data leakage, you have all those permutations. And yeah, that’s how it’s so powerful.
Kirill Eremenko: 01:47:05
Perfect. Yeah, exactly. And the second reason why CatBoost is powerful is that it can… Excuse me. The main idea about CatBoost is it’s very straightforward, very simple. Yeah.
Jon Krohn: 01:47:17
Like other than the categorical part. So like [inaudible 01:47:19]-
Kirill Eremenko: 01:47:19
Yeah, other than the categorical. So-
Jon Krohn: 01:47:20
Yeah, yeah.
Kirill Eremenko: 01:47:21
So that was the main thing. That’s why it’s called CatBoost — that’s how it works with categorical variables, and it’s the big one. But in addition, one thing they threw in there is symmetric trees. You could build your own Gradient Boosting algorithm without symmetric trees; they decided to use symmetric trees because they’re faster. So just imagine a normal tree, as we discussed: you have a split at the top, then it goes down one way. You can split on salary at $47,000, for example. Then you go left, and you split by age at 25 or 45 years old, whatever the decision tree decides. But if you go right from the salary split, you’ll be splitting by, “Do they have a loyalty membership?” And then you go further down. So the tree spreads and the branches are independent — they do whatever they want, kind of thing.
01:48:05
That’s a traditional tree. With a symmetric tree, every time you split, the branches will make further splits on the same variable, on the same condition. So imagine, at the top, you have is this person earning less than $45,000. Okay. So if it’s a yes, then you go left. And the next condition you’ll find on the left is, for example, “Is this person younger than 25 years old?” Well, if you were going to go right from the original one about the salary, if you’re going to go right, well, the split on the right is going to be exactly the same. It’s going to be, “Is their age less than 25?” Now, from the age is less 25, you’ll get two splits on each side. So you’ll have four leaves after that, right? And there, the splits, let’s say, “Do they have a loyalty program, yes or no?”
01:49:01
Each one of them will be a yes or no loyalty program. And then from there you’ll have eight splits, the next time, so it’ll double again. But let’s say you’re using a different variable now, let’s say you have a variable on the, I don’t know. Let’s say, we’ve done age, salary, the distance to your… Or how much time they spend on your online store? So now, it’s going to be, “Have they spent less than 15 minutes on the online store in the past month?” Well, all eight will have exactly the same condition. So you’re building the decision tree in a way that every next level horizontally across the tree is the same exact condition. And why is that good? Well, of course, the tree becomes weaker because of that.
01:49:51
It’s a weaker learner than a traditional decision tree because it has less flexibility. But guess what? In ensemble methods, we prefer weak learner. So it’s actually not a bad thing that it becomes weaker. And the second thing, which is the main benefit is that it becomes much faster during inference. Not during training, you still have to build the tree during training, but during inference it becomes much faster. And the way to think about it is now, let’s say, you have four levels. So at the beginning you have a question about salary. The second question is about age. The third question is about loyalty program. And on the fourth level is about how much time they spend less than 15 minutes on your website or not, so time on the website.
01:50:34
So now, because all of the levels are the same, when a new customer comes to your store, you can take those exact variables — salary, age, loyalty program yes or no, and time on the website — and put them into one vector. So it’s a vector with their salary, however much they’re making; their age, whatever it is; whether they have a loyalty membership, one or zero; and how much time they spend on the website. And now, instead of going down through this decision tree and deciding yes or no, yes or no, yes or no — making all these comparisons one by one — you only have to make one comparison. You take the vector that you’ve built for this customer, and you compare it to the vector of conditions.
01:51:19
The vector of conditions will be the salary that you’re splitting on, let’s say $47,000; the age, let’s say 25 years old; whether they have a loyalty program, a one for yes; and the number of minutes, let’s say 15 minutes. So you’ll have a vector of conditions. You just have to compare the customer’s vector to the vector of conditions, and you’ll get a vector of ones and zeros — for each condition, was it met or not? Because all of the splits on each level are the same, whichever way you go, there is no difference in the splits. So you just need to do this check once: is the vector for this customer less than the vector of conditions? You do that comparison and you get a result, like one, zero, one, zero, for example. And from there, you’ll know that the answer for that customer is $76.40.
01:52:13
So it speeds up inference, meaning that instead of doing… Before with a traditional tree, you will have to do up to X comparisons, where X is the maximum depth of your tree. With this one, with CatBoost, with symmetric trees, you just have to do one comparison. You just compare one vector to another and that’s it. And then you get your result very quickly. So if you have an application where your speed at inference is important, whether it’s like real-time analytics or I don’t know, some gaming application, or something that has to happen fast in inference, then you might consider using symmetric trees or specifically CatBoost.
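To make the vector comparison concrete, here is a tiny sketch of inference through a symmetric (oblivious) tree of depth four — one threshold per level, so a single element-wise comparison yields the leaf index. The features, thresholds, and leaf values are all made up for illustration:

```python
import numpy as np

# One condition per level: salary < 47,000; age < 25; loyalty < 1; minutes < 15.
thresholds = np.array([47_000, 25, 1, 15])

# 2^4 = 16 leaf values, indexed by the four yes/no answers (made-up predictions).
leaf_values = np.arange(16, dtype=float) * 10.0
leaf_values[0b1010] = 76.40  # the leaf our example customer will land in

# New customer: salary, age, loyalty membership (0/1), minutes on site.
customer = np.array([40_000, 30, 0, 20])

# One vectorized comparison against all levels at once.
answers = (customer < thresholds).astype(int)        # here: [1, 0, 1, 0]
leaf_index = int("".join(map(str, answers)), 2)      # binary answers -> leaf index
print(answers, "->", leaf_values[leaf_index])
```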
Jon Krohn: 01:52:57
Nice. So that was obviously a ton of information. Kirill, you have really pushed the boat out on providing tons of technical content in podcast format. It’s been extremely well received so far. I thought it was going to be risky when you started doing that earlier this year for transformer architectures, but people have loved these episodes. So thank you for all of that detail. I understand that you’re going to include a cheat sheet in the show notes-
Kirill Eremenko: 01:53:25
That’s right.
Jon Krohn: 01:53:25
… for these three main Gradient Boosting approaches. XGBoost, LightGBM, and CatBoost. You’ll have a cheat sheet from your SuperDataScience exclusive course made available.
Kirill Eremenko: 01:53:38
That’s right. That’s right. It’ll be a cheat sheet, a one-pager that compares all three models side by side. XGBoost, LightGBM, CatBoost. It’s from one of the tutorials. Happy to share it. It’ll be in the show notes. Feel free to download it and keep it as a memento of this episode hanging out on the wall or something like that. And it has all the summary of everything that we discussed.
Jon Krohn: 01:53:56
Nice. So other than www.superdatascience.com/level2 — that’s with the numeral two at the end, as opposed to being spelled out. Level two, the number two.
Kirill Eremenko: 01:54:11
Yep.
Jon Krohn: 01:54:12
Superdatascience.com/level2, other than that, how should listeners reach out to you or learn more after this episode?
Kirill Eremenko: 01:54:21
Sure, I’ll mention that just now. I just wanted to say with this course… So in the course we talk about everything. We talked about today in detail, of course, visuals and stuff like that. In addition to those three methods, we talk a lot about… Of course, you’ll learn additional things like one-hot encoding, k-fold cross-validation, bias-variance tradeoff, hyperparameter optimization techniques like grid search, target encoding, AdaBoost. And there’s going to be lots of practice. Hadelin walks through each one of these models, XGBoosts, LightGBM, and CatBoost for both regression and classification. For regression, he uses an insurance dataset. And you get to see how these models perform differently on the same dataset. And for classification, he uses a churn dataset for customers, whether they will churn or not from a company. And again, you’ll see how these three models perform differently on that dataset.
01:55:10
And of course, once you get your membership at SuperDataScience, you get access to this course, exclusive to SuperDataScience, and to the Large Language Models A-Z course, which we released recently — also exclusive to SuperDataScience, not available anywhere else. Plus, at the moment we’re doing two live labs per month plus a career session. We’re gearing up to ramp it up — my goal is four live labs per month starting in the next few months. And basically, that’s a live experience where you get a coach, an experienced machine learning, AI, or visualization practitioner, who walks you through solving real-world business challenges. And from what we’ve heard from our students, live trainings like that are one of the biggest bonuses that helps them get jobs.
01:56:00
So that was a bit of a promotion for SuperDataScience. And in order to find me, you can connect on LinkedIn. But to be completely open about it, I don’t go on social media these days often, the best way to connect with me would be in SuperDataScience. We have a community, people ask questions, chat with each other, and I’m there every week, several times per week. And my goal is to be able to answer everybody and connect with everybody there.
Jon Krohn: 01:56:29
Nice. Thanks, Kirill. Well, it’s great to know that they have access to you through the www.superdatascience.com platform. It sounds like a very vibrant place indeed, with increasingly exclusive content like this Gradient Boosting course available just there. Super, Kirill, thank you so much for joining us yet again for another super technical episode. I’m sure the audience loved it, and I wouldn’t be surprised if you were back again doing another one of these sometime soon.
Kirill Eremenko: 01:57:00
Fantastic, Jon. Thanks again for having me. It was a lot of fun.
Jon Krohn: 01:57:09
Another incredibly rich technical episode from Kirill. He sure knows how to explain technical content well and even have a bit of fun while doing it. In today’s episode, Kirill filled us in on decision trees, bagging, random forests, AdaBoost, and the three leading Gradient Boosting algorithms: XGBoost, LightGBM, and CatBoost. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Kirill’s social media profiles, as well as my own, at www.superdatascience.com/771. And if you’d like to meet in person, as opposed to just through social media, next week I will be at the Data Universe conference at the massive Javits Center in New York City. That’s April 10th and 11th. I’ll be giving a talk on generative AI, and we’ll also be walking around interviewing attendees to capture what you think of this massive conference.
01:58:02
All right. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another extremely illuminating episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. You can support this show by checking out our sponsors’ links, which are in the show notes. And if you yourself are interested in sponsoring an upcoming episode, you can get the details on how by making your way to jonkrohn.com/podcast.
01:58:32
Otherwise, please share, review, subscribe, and all that good stuff. But most importantly, just keep on tuning in. I’m so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time. Keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.