Erin is Chief Machine Learning Scientist at H2O.ai, the cloud AI firm renowned for its eponymous open-source automated machine learning (AutoML) library. In this episode, Jon Krohn and his guest investigate how AutoML supercharges the data science process, the importance of admissible machine learning for an equitable data-driven future, and what Erin’s group Women in Machine Learning & Data Science is doing to increase inclusivity and representation in the field.
Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Erin LeDell
Dr. Erin LeDell is the Chief Machine Learning Scientist at H2O.ai, the company that produces the open source, distributed machine learning platform, H2O. At H2O.ai, she leads the H2O AutoML project and her current research focus is automated machine learning. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE) and Marvin Mobile Security (acquired by Veracode), the founder of DataScientific, Inc. and a software engineer. She is also founder of the Women in Machine Learning and Data Science (WiMLDS) organization (wimlds.org) and co-founder of R-Ladies Global (rladies.org). Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley and has a B.S. and M.A. in Mathematics.
Overview
In 2015, Erin joined H2O.ai as its only machine learning scientist, and in the years since, her position has evolved from designing algorithms to acting as a “data science headhunter”: searching out the latest algorithms and learning how to incorporate them into H2O’s operations.
One hurdle the company wanted to overcome was the repetitive process of finding the best tools for feature engineering. Erin and her team automated these processes with Driverless AI, an AutoML tool that runs extensive feature selection and creation and can be deployed to all the major cloud service providers (AWS, Google Cloud). For Erin, AutoML makes code much cleaner and streamlines the process, leaving data scientists with far less code to grapple with so they can focus instead on experimenting. AutoML can therefore also be a great learning tool for people who want to understand, at a high level, which algorithms perform best for their individual project’s needs. Of course, because an AutoML tool aims to be as widely applicable as possible, another tool will naturally outperform it for particular needs. Jon and Erin discuss this phenomenon as part of the “no free lunch” (NFL) theorem, which postulates that no single approach is optimal for all problems.
Erin and Jon also discuss the importance of admissible machine learning, which they hope will be a step in the right direction for regulatory approaches to handling sensitive variables like demographics, age, sex, gender and race. In admissible ML, potentially discriminatory factors are isolated from the training set, leading to results based on fairer assessments. Erin notes, however, that we should not leave all of our decisions to an algorithm; instead, we should act on the bias that the ML has surfaced, ideally by removing the features associated with the sensitive variables and thereby reducing unfairness and unconscious bias.
Listen in to this broad-reaching discussion as Erin discusses major ethical data issues, why she founded Women in Machine Learning and Data Science, and whether this might be the age of the generalist for the data science industry.
In this episode you will learn:
- The H2O AutoML platform Erin developed [07:43]
- How genetic algorithms work [19:17]
- Why you should consider using AutoML [28:15]
- The “No Free Lunch Theorem” [33:45]
- What Admissible Machine Learning is [37:59]
- What motivated Erin to found R-Ladies Global and Women in Machine Learning and Data Science [47:00]
- How to address bias in datasets [57:03]
Items mentioned in this podcast:
- Datalore – Use the code SUPERDS for a free month of Datalore Pro, and the code SUPERDS5 for a 5% discount on the Datalore Enterprise plan.
- H2O.ai
- XGBoost
- Gradient Boosting Machines
- Admissible Machine Learning
- GLM
- H2O Wave
- h2oai/h2o-3
- Women in Machine Learning and Data Science
- R-Ladies Global
- SDS 539: Interpretable Machine Learning with Serg Masís
- Viral Justice by Ruha Benjamin
- SuperDataScience Podcast Survey
- Jon Krohn’s Podcast Page
Podcast Transcript
Jon Krohn: 00:00:00
This is episode number 627 with Dr. Erin LeDell, Chief Machine Learning Scientist at H2O.ai. Today’s episode is brought to you by Datalore, the collaborative data science platform.
00:00:15
Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:46
Welcome back to the Super Data Science Podcast. We’ve got the exceptional Dr. Erin LeDell on the show today. For the past eight years, Erin has been working at H2O.ai, the cloud AI firm that has raised over $250 million in venture capital and is renowned for its open-source AutoML library. She currently serves there as Chief Machine Learning Scientist. Celebrated for her talks at leading AI conferences, Erin also founded Women in Machine Learning and Data Science, which today has more than a hundred chapters worldwide, and she co-founded R-Ladies, a global community for genders currently underrepresented amongst R language users. Previously, she was Principal Data Scientist at two AI startups that were acquired. She holds a PhD from UC Berkeley that focused on machine learning and computational statistics. Today’s episode is relatively technical, so it will primarily appeal to technical listeners, but it will also provide context to anyone who’s interested in understanding how key aspects of data science work are becoming increasingly automated.
00:01:50
In this episode, Erin details what AutoML, Automated Machine Learning is, and why it’s an advantageous technique for data scientists to adopt. She also talks about how the open source H2O AutoML platform works, what the no free lunch theorem is, what admissible machine learning is and how it can reduce the biases present in many data science models. She talks about the new software tools she’s most excited about, and how data scientists can today prepare for the increasingly automated data science field of the future. All right, you ready for this phenomenal episode? Let’s go.
00:02:29
Erin, welcome to the Super Data Science Podcast, this has been a long time in the making. I’ve wanted to have you on the show for ages and now you’re finally here. It’s awesome. Where are you calling in from, Erin?
Erin LeDell: 00:02:42
Hi. Nice to be here. I’m calling in from Oakland, California in the US today.
Jon Krohn: 00:02:48
Nice. Well, so at H2O, you are the Chief Machine Learning Scientist, and you’re creating an open-source, distributed, automated machine learning platform. So I’ve thrown a lot of terms out there. Perhaps you can fill us in on what H2O does, what the Chief Machine Learning Scientist does at H2O, and then what it means to be building an open-source, distributed, automated machine learning platform.
Erin LeDell: 00:03:17
Okay. So yeah, H2O is both the name of the software that I work on and also the name of the company, so H2O.ai is the name of the company, and we have a number of different, basically, machine learning platforms or products. But the one that I focus on is called H2O, the namesake, and kind of the original software that we were incorporated to produce. So I could talk a little bit more about that, but I think I’ll just start with what my title is about. So at some point, my title was just Machine Learning Scientist, when I first joined the company back in 2015. And I guess, well, at the time I was the only, I guess, Machine Learning Scientist there. Everybody else was pretty much a really hardcore engineer, plus a couple of sort of stats people.
00:04:21
But my experience is really just designing algorithms and getting them to work fast. And so I focused on that, and then eventually, after just being there a long time, at some point you get promoted, so that’s my new title. And I think actually I’m the only person still in the company with a Machine Learning Scientist type of title. Most of the other people are software engineers or data scientists, and the data scientists work a little bit more hands-on with the data and helping customers. So my job is, I do specifically work on one product, but I’ve also been at the company a really long time. I’m one of the earliest people to join the company that’s still there. So I know a lot of the historical knowledge about our products. And so I’m sort of an internal consultant in a way as well, to help other people either interface with H2O or just do other machine learning stuff.
00:05:35
I also do research, so that’s just another aspect of what I do. So I try to make sure I know what’s going on in the machine learning world and to try to take advantage of new algorithms as they come out and identify them and figure out how to maybe put that into our products. So it’s a lot of stuff. I used to do a lot of community work as well in the earlier days. Lot of meetups. There’s a lot of webinars, I tweet about it. So all the jobs really.
Jon Krohn: 00:06:17
Well, it sounds really exciting. You get to be involved in the development of some of the most powerful applications at a high-powered, well-funded, very popular machine learning company. So that’s super cool. And then you also get to be involved with, it sounds like, lots of projects across the company, given the institutional knowledge that you’ve accumulated over the eight years that you’ve been working there. And then, I don’t know, I mean, you’re kind of preaching to the choir here a bit with somebody who’s a podcast host, but I also love talking about and spreading the good news about data science and machine learning like you do. So that sounds like a really fun part of the job to me too. And that is actually what drew me to you initially: you create great content, you do amazing presentations at top conferences.
00:07:09
The specific thing that drew my attention and caused me to ask you to be on the show was that you’d posted this brilliant lecture that you’d given at NeurIPS, Neural Information Processing Systems, which is the most prestigious machine learning conference around. And you’re doing keynotes at places like that. I’m honored to be speaking with you, and it must be so fun being able to do those kinds of things. We’ve talked about H2O, we’ve talked about what it means to be the Chief Machine Learning Scientist there. Tell us about the open-source, distributed, automated machine learning platform that you’re responsible for developing.
Erin LeDell: 00:07:51
Sure. So the platform is H2O, and I’ve been involved in just generally H2O for a while, so not just the AutoML stuff; before that, just every aspect of it: what the API looks like, what the algorithms do, what kind of details we store in the model objects. Anything related to that I’ve been involved with since the beginning. And so that is a foundation for the automated platform. So basically, that’s just a whole bunch of algorithms. All the good ones, none of the bad ones. So your friends like GBM or XGBoost or random forest; we have some deep learning, not as extensive as a PyTorch or a TensorFlow, but basic deep learning that works on tabular data. And there’s a whole bunch of them. I’m not going to list them all. We have a lot of unsupervised algorithms as well; anomaly detection, things like that. So that’s kind of just the foundation.
00:09:02
So we have all these algorithms. And after we finished that part, I was sort of noticing that I kept writing the same code over and over again: new data set, let me try my whole thing. And I’ll try all the different algorithms, do grid searches or random searches, just do as much searching as possible, then kind of bundle all that together with a stacked ensemble to get the best performance. So I was already doing the same thing over and over again. And so then that-
Jon Krohn: 00:09:41
You were an AutoML algorithm?
Erin LeDell: 00:09:43
Yes. I mean, everyone is. Not everybody goes on and creates a whole software around it, but you should, because that’s what I did. So basically that became H2O AutoML, and now it’s just this sort of algorithm slash wrapper function that does a whole bunch of tasks, and it will modify itself based on resources. If you only have a short amount of time versus a long amount of time, or things like that, it will modify itself. And so the goal after… We first released that in 2017, and at the time it was basically TPOT and auto-sklearn and maybe a few other open-source libraries that are not as popular. And so the AutoML space was a lot different than it is now.
00:10:48
In my opinion, H2O is a platform that’s designed to be very robust and fast and enterprise level, low memory, all the optimizations that you would want in a machine learning library. So I think H2O AutoML represents the first enterprise-ready AutoML system, at least in the open-source. I can’t say what else is out there otherwise. So yeah, that was kind of the goal.
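The time-budgeted search-then-rank loop Erin describes can be sketched in plain Python. To be clear, this is a toy stand-in, not H2O’s actual implementation: `evaluate` returns a pseudo-score instead of training real models, and the algorithm names and `depth` parameter are invented for illustration. It shows the shape of what the AutoML wrapper automates, trying many configurations within a resource budget and keeping a ranked list:

```python
import random
import time

def evaluate(config, data):
    """Toy stand-in for training a model and scoring it on
    validation data; returns a pseudo-accuracy in [0.6, 0.95]."""
    random.seed(hash((config["algo"], config["depth"])) % (2 ** 32))
    return round(random.uniform(0.6, 0.95), 4)

def auto_ml(data, time_budget_s=1.0):
    """Try algorithm/hyperparameter combos until the time budget is
    spent, returning a score-ranked leaderboard of (score, config)."""
    search_space = [
        {"algo": algo, "depth": depth}
        for algo in ("gbm", "xgboost", "random_forest", "deep_learning")
        for depth in (3, 6, 10)
    ]
    deadline = time.monotonic() + time_budget_s
    leaderboard = []
    for config in search_space:
        if time.monotonic() > deadline:  # adapts to the resources available
            break
        leaderboard.append((evaluate(config, data), config))
    leaderboard.sort(key=lambda pair: pair[0], reverse=True)  # best first
    return leaderboard

board = auto_ml(data=None, time_budget_s=1.0)
print(board[0])  # best (score, config) pair
```

One function call replaces the repetitive try-everything code, which is exactly the cleanup Erin says motivated H2O AutoML.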
Jon Krohn: 00:11:18
And then, so if this amazing, robust, optimized AutoML tool is open-source and any of our listeners can be going out and using it, how does H2O make money?
Erin LeDell: 00:11:32
That’s a good question. I should check on that. No, I’m just kidding. So yeah, the company started as exclusively open-source. H2O was the only thing at the beginning. And in the beginning we did enterprise support contracts, focusing on big clients like big banks and insurance companies, healthcare companies, people that can afford to pay for something that’s essentially free. So that was the first revenue stream. Then we wanted to add more revenue streams. Of course, we’re a VC-funded company, so that’s how things go. So then we created a whole additional proprietary platform called Driverless AI, which is another AutoML platform. There are some similarities and differences between the one I work on and that one. I won’t go into too much detail unless we want to revisit that, or maybe-
Jon Krohn: 00:12:38
No, yeah, go into it. That’s kind of interesting.
Erin LeDell: 00:12:39
Okay, sure.
Jon Krohn: 00:12:40
So I’m kind of guessing maybe it leverages some of the ideas, the open-source things that you’re doing with the main H2O AutoML library, but maybe there’s some, I’m guessing, if I was doing this, I might want to be having some bells and whistles that specifically cater to enterprise clients, like security features or something like that.
Erin LeDell: 00:13:01
So the algorithm itself is a little bit different, but a lot of it’s also about the additional features that help you operationalize things better. And fundamentally, they’re both ensemble algorithms. One of them, so driverless AI, uses genetic programming to do very extensive feature selection and feature creation. So it’s kind of an automated feature engineering piece to that, which obviously helps with performance. So that’s probably, in terms of just the algorithm itself, what is the difference. In the open-source, we do some feature processing, but not this sort of proprietary thing. So one of the things about H2O is that we’ve hired a lot of Kaggle grand masters, and they’re the ones who are very good at this dark art of feature engineering. And so they are the ones that came up with the automated feature engineering. So that’s one thing. Driverless AI is also more focused on GPUs, whereas H2O is CPU.
Jon Krohn: 00:14:18
Got it.
Erin LeDell: 00:14:19
And in a sense, they’re quite comparable. So it’s just sort of what you’re looking for. Usually people either use both, or they just use one or the other. And so that’s another revenue stream. And then the last thing that we have now is the H2O AI Cloud. So that’s a whole other business basically within H2O where it just makes it… I mean, I think we could have even done this years ago, but we were more focused on building the actual algorithms themselves. And so now we’ve kind of done that, and we are still iterating and improving, but this is just the next iteration of: okay, now we have all this good software and these machine learning algorithms, let’s make this a lot easier to use on the cloud. And so we kind of just have our own cloud, because of course you can use either tool on whatever cloud you want, but there’s just a lot of model tracking and governance and other features that you would get.
Jon Krohn: 00:15:28
Today’s show is brought to you by Datalore, the collaborative data science platform by JetBrains. Datalore brings together three big pieces of functionality. First, it offers data science with a first-class Jupyter notebook coding experience in all the key data science languages: Python, SQL, R and Scala. Second, Datalore provides modern business intelligence with interactive data apps and easy ways to share your BI insights with stakeholders. And third, Datalore facilitates team productivity with live collaboration on notebooks and powerful no-code automations. To boot, with Datalore, you can do all this online, in your private cloud, or even on-prem. Register at datalore.online/sds and use the code SUPERDS for a free month of Datalore Pro, and the code SUPERDS5 for a 5% discount on the Datalore Enterprise plan.
00:16:18
Okay, cool. So the H2O AI Cloud, it can be deployed to any of the major cloud service providers like AWS, Google Cloud. So it allows model tracking and other kinds of governance features like you’re describing in any of those physical clouds.
Erin LeDell: 00:16:40
Yes, and with the caveat that we can also just do everything on our own cloud, which is hosted on Amazon, we have that, but we also have what’s called the hybrid cloud where it’s sort of also on-prem. So you can choose whichever meets your needs.
Jon Krohn: 00:16:58
Awesome. So, in terms of the H2O product universe, we’ve got the original [inaudible 00:17:09] H2O open-source library, which is what you are focused on primarily. And any of our listeners can use that AutoML library right now: go access it on GitHub, use that library. We’ve also got the Driverless AI commercial package, which also does AutoML, and it has automated feature engineering features. And it works well on GPUs; it’s optimized for GPUs. And then finally, you mentioned the H2O AI Cloud, which allows distributed training to happen either on-prem with the client’s own servers, or on a third-party cloud, or even on H2O’s own cloud service. So that’s super cool. Let’s dig into some of those terms just a little bit more.
00:18:04
When we were talking about driverless, we were talking about, we don’t need to dig into this too deeply because some of our listeners will be aware of them, but others might not be. So let’s talk a little bit about feature engineering and why it’s important to have algorithms potentially be able to work on GPUs. So the feature engineering is critical for a lot of models because, and you can explain this better than me, but just giving you a starting point, it allows us to take however the raw data is provided to us, and then run functions over those data to extract the features that are most likely to provide the most valuable signal to the machine learning algorithm downstream. Does that sound like a reasonable description?
Erin LeDell: 00:18:53
Yeah. And so the specific genetic algorithm that’s used, it automatically creates a bunch of candidate features and then kind of evaluates their usefulness and does that in the genetic algorithm type of way.
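That generate-candidates-then-evaluate idea can be sketched in a toy way (this is not Driverless AI’s actual, proprietary algorithm; the column names, transformations, and data below are all invented for illustration). Candidate features are proposed from the raw columns and each is scored by how strongly it tracks the target:

```python
import math

# Toy raw data: two numeric columns, and a target that secretly
# depends on the ratio x1 / x2.
rows = [(2.0, 1.0), (9.0, 3.0), (8.0, 2.0), (5.0, 5.0), (12.0, 4.0), (3.0, 6.0)]
target = [x1 / x2 for x1, x2 in rows]

# Candidate transformations, as automated feature engineering might propose.
candidates = {
    "x1":          lambda x1, x2: x1,
    "x2":          lambda x1, x2: x2,
    "x1_plus_x2":  lambda x1, x2: x1 + x2,
    "x1_over_x2":  lambda x1, x2: x1 / x2,
    "log_x1":      lambda x1, x2: math.log(x1),
}

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here to score a candidate feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (vx * vy)) if vx and vy else 0.0

scores = {
    name: abs_corr([f(x1, x2) for x1, x2 in rows], target)
    for name, f in candidates.items()
}
best = max(scores, key=scores.get)
print(best)  # → x1_over_x2, since the ratio feature matches the target exactly
```

A genetic algorithm replaces this exhaustive scoring with evolution over the space of transformations, which matters once the candidate space is too large to enumerate.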
Jon Krohn: 00:19:13
Let’s talk about that a little bit too, because that’s really fun. How do these genetic algorithms work?
Erin LeDell: 00:19:19
Well, I guess I can’t say specifically how the one in Driverless AI works, because that’s our thing. But the way genetic algorithms work is there’s an evolutionary process, and it sort of goes through the different stages of evolution, and whatever’s sort of left at the end is what you come up with. And it just uses standard ways to evaluate model performance during the process as well. I mean, the concept has been around for a long time. It’s just that I’ve been seeing a little bit more of it lately, with people using it in different machine learning applications.
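Erin’s general description, evolve a population, evaluate with a standard performance measure, keep what survives, can be sketched as a generic toy genetic algorithm (again, not the one inside Driverless AI). The fitness function here just counts 1-bits, a deliberately trivial stand-in for model performance:

```python
import random

random.seed(42)

GENES, POP, GENERATIONS = 20, 30, 40

def fitness(genome):
    """Toy objective: number of 1-bits (in practice, model performance)."""
    return sum(genome)

def crossover(a, b):
    """Mate two parents: random prefix from one, suffix from the other."""
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.02):
    """Flip each bit with small probability."""
    return [g ^ 1 if random.random() < rate else g for g in genome]

# Random initial population.
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Selection: keep the top-performing half as parents.
    parents = sorted(population, key=fitness, reverse=True)[: POP // 2]
    # Reproduction: children of random parent pairs, with mutation.
    population = [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(POP)
    ]

best = max(population, key=fitness)
print(fitness(best))  # close to the optimum of 20 after 40 generations
```

Swap the bit-string for an encoding of feature transformations and the fitness for validation-set performance, and you have the flavor of evolutionary feature engineering.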
Jon Krohn: 00:20:02
It can have really cool impacts. It can solve problems in situations where other kinds of approaches, like stochastic gradient descent, the kind of standard machine learning approach, might not allow you to converge on an optimal solution. Genetic algorithms can sometimes still do really well in those situations. And to dig into it a little bit more: it follows an evolutionary process, meaning specifically that you start with random starting points in a whole bunch of different situations. And then you take the best performers that just happen to randomly perform best at your task and then you mate them together. So you randomly take the best parts of one of your top-performing algorithms from the first iteration and parts of another of your top-performing algorithms from that first iteration, combine them together randomly, and then you see how all these children perform on the task.
00:21:01
And then the top-performing children, you mate them together, and after many generations of mating all these children together, you end up with descendants that just happen to, by chance, perform really well at whatever task you’re trying to get them to do, like in this case, feature engineering. I think they’re really fun. I’ve never really deployed them in any production situation, but I guess I know a little bit about them and they’re super fun. I’d love to find a use case. So that’s the feature engineering with genetic algorithms. And then you also talked about how the Driverless AI package works well on GPUs. So why might somebody need to be doing something on GPUs as opposed to just CPUs?
Erin LeDell: 00:21:47
It depends. I think it just depends on the software you want to run, and the size of your data, and a lot of different things. But for Driverless AI, one of the algorithms under the hood is XGBoost, and that’s a software that is optimized for GPUs; you can run it on CPUs or GPUs, but it can run a lot faster on GPUs. The other thing is the genetic algorithm, and that’s kind of a beast, so you want to just throw as much compute at that as possible. So yeah, I mean GPUs, they’re more expensive, they’re more resource-intensive, but they can also just solve problems quickly. So it’s just kind of whatever works for you. Some people don’t have a GPU. You can rent them, but they’re a little expensive to rent on the cloud. And H2O, the OG, was created in 2012, when the company was founded.
00:22:56
So if you can remember back 10 years in the data science world, Amazon EC2 was newly becoming a popular thing, and all of a sudden very, very cheap CPUs were widely accessible. And so that was kind of one of the goals: let’s build a library that can take advantage of that compute infrastructure at the time. And fast forward 10 years, we have a lot more advances in GPUs, and yeah, there are a lot of things you can do with both. H2O also can use GPUs when you’re using XGBoost, but that’s the only third-party algorithm that we have incorporated in the tool, and also the only one that can take advantage of GPUs.
Jon Krohn: 00:23:41
Well, I’m not surprised, given all the Kaggle Grandmasters at H2O, that XGBoost happens to be the algorithm that gets all that extra attention, because XGBoost is often the winning algorithm in a given Kaggle competition. And probably many listeners are aware of Kaggle, but we’ve mentioned it a few times now, so we should note it’s a platform that we’ll be sure to have in the show notes in case you aren’t aware of it. It allows you to test your chops at solving data science problems against other people around the world. And the top-performing algorithms in those competitions are often XGBoost. And the people who regularly top these competitions are Kaggle Grandmasters, like chess grandmasters, and yeah, H2O is famous for hoovering up all those grandmasters.
Erin LeDell: 00:24:33
Yeah, I mean they also like LightGBM and sometimes CatBoost, so I’ll just give them a shout out as well, not just XGBoost, but yeah, a lot of people like LightGBM too.
Jon Krohn: 00:24:43
That’s true. Yep. So, all right, so I’ve forced you to go off-piste and tell us tons about Driverless AI and the H2O AI Cloud. Let’s go back to H2O proper, the open-source tool that you’re primarily responsible for. So I know that part of this, something that you’re keen on encouraging users to do, is to use the auto-explainability feature to make sure that there’s still a human in the loop for important model choices. So why is that important, and how do you nudge people in the direction of using that auto-explainability feature?
Erin LeDell: 00:25:31
Yeah, so in the open source we have kind of an automatic explainability, kind of just a wrapper function. It’s a function, but it’s really just iterating through a number of what we call explainers that are available in the tool. And really that’s not meant to make people not have to think, it’s more just: write one line of code instead of a bunch of lines of code. Because that’s kind of something that I like to do, wrap things up. So yeah, I think the idea is you generate it with one line of code and then you get a whole bunch of explanations, but the explanations are technical, so there are things like Shapley plots or partial dependence plots or things like that. So you have to actually be some level of data scientist to understand what it’s saying. So I think we’re very far away from any kind of automated decision-making type thing. So, yeah, I think in that sense the human has to be in the loop, because there’s nothing more to do with it other than to interpret.
00:26:44
And you have to know a little bit about each of these different types of explainers and what they’re good for and what they’re telling you. But just on the human-in-the-loop topic, there are other ways to handle that. If you have a model governance platform, for example, you could have models generated automatically, then explained automatically, but then also set up some kind of controls where certain people in certain positions at the company have to approve things. So I think there are other things that we’re doing in the cloud that make it a little bit easier for non-data scientists to do that kind of thing. So there’s a new thing that we have coming out, called narrative explanations, where it’s basically just words, and it’s kind of explaining what’s happening with words so that a non-data scientist, like a business leader, could try to make sense of it.
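One of the explainers Erin mentions, partial dependence, is simple to sketch: for each grid value of one feature, override that feature across the whole dataset and average the model’s predictions, marginalising out the other features. The “model” and data below are made-up stand-ins, not H2O output:

```python
def model(x1, x2):
    """Toy fitted model standing in for a trained one."""
    return 2.0 * x1 + 0.5 * x2

# A small "dataset" of (x1, x2) rows.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def partial_dependence(grid, data):
    """For each grid value of x1, average predictions with x1 overridden
    across all rows, marginalising out the other feature."""
    return [
        sum(model(v, x2) for _, x2 in data) / len(data)
        for v in grid
    ]

pd_curve = partial_dependence([0.0, 1.0, 2.0], data)
print(pd_curve)  # → [2.0, 4.0, 6.0]
```

The resulting curve shows how predictions move, on average, as one feature changes, which is the kind of technical artifact the auto-explainability wrapper hands back for a human to interpret.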
Jon Krohn: 00:27:49
Nice. That sounds really cool. So we’ve talked a lot about AutoML and how H2O’s open-source product allows people to do it. And you even talked about how the genesis of that AutoML tool was you being kind of an AutoML algorithm yourself. So if our listeners aren’t yet using AutoML, why should they be considering it? And a related question: can you please explain to us the No Free Lunch Theorem?
Erin LeDell: 00:28:29
Okay. So first, why should people use AutoML? I think because it’s a tool that can be useful for anyone, regardless of where they are in their data science career. It’s useful to somebody like myself; that’s why I wrote it, because I wanted to have cleaner code. I knew what I wanted to do. It was pretty much the same process every time. There are variations on it, which are now sort of bundled into the algorithm, but it just makes everything cleaner and more reproducible. We generate what we call a leaderboard. So that’s an idea from Kaggle, where you have everybody ranked. So in this case we just rank all the models by whatever metrics you’re interested in. And yeah, I think it’s just kind of automating not the hard parts, but almost the boring parts of experimentation. When people are learning algorithms, that’s a good thing to do, because you want to kind of investigate more on a manual level.
00:29:41
If I change this, how does that affect things? And you kind of build up this intuition over time. But for somebody who’s kind of got their thing down pat about how they approach a data science problem, it’s just less code basically. And then for people who are not as experienced, it can actually be a learning tool, I think, because you have this easy function where you just run it and then you get to see the results and you’re like, okay, this is interesting. Why are the GBMs on top and the deep learnings below? You kind of start to notice trends in the leaderboards and which algorithms perform well. And then you can look at, okay, the top three models are XGBoost models. Let me go ahead and look at the parameter settings: what’s different about these versus those? And then you could even have more context that you might not even think to address.
00:30:41
For example, prediction speed is something that, if you’re using machine learning in the enterprise, you might care about. It’s not stuff that new data scientists always think about. So if you have this leaderboard and you have models ranked by performance, and then you also have a column with prediction speed, and then you start to look at things, you can see: oh, what have I learned about very deep trees? Oh, they’re slow to predict. So let me keep that in mind when I’m building things. So there are little sort of nuggets that automatically bubble up to the surface that I think you can learn from. And then you don’t have to be an expert on every single algorithm, because you might be very good at, let’s say, just XGBoost, but you don’t know how to do anything else.
00:31:34
There’s a lot of people like that out there. If you’re going to pick one tool, it’s not a bad choice. So it takes a long time to become an expert at a single algorithm. So then with the AutoML process, you get all the algorithms basically, and you don’t have to know. If you don’t know anything about deep learning, that’s okay, because we’ve already thought about what would make sense to do there and sort of generated what we think is useful. And then it can also just be used as a baseline. You could just start there and see what you could do on your own to outperform the AutoML.
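The leaderboard idea Erin describes can be illustrated with a toy table: rank candidate models by the primary metric while keeping secondary columns, like prediction speed, visible so trade-offs surface. The model names and all numbers below are invented, not real H2O results:

```python
# Hypothetical results after an AutoML run: each entry records the
# model, its validation AUC, and per-row prediction time.
results = [
    {"model": "XGBoost_1",       "auc": 0.912, "pred_ms": 0.08},
    {"model": "DeepLearning_1",  "auc": 0.874, "pred_ms": 0.31},
    {"model": "GBM_2",           "auc": 0.905, "pred_ms": 0.05},
    {"model": "StackedEnsemble", "auc": 0.921, "pred_ms": 0.47},
]

# Rank by the primary metric, but keep secondary columns visible so
# trade-offs (e.g. the ensemble wins on AUC but predicts slowly) show up.
leaderboard = sorted(results, key=lambda r: r["auc"], reverse=True)
for row in leaderboard:
    print(f'{row["model"]:16} auc={row["auc"]:.3f} pred_ms={row["pred_ms"]:.2f}')
```

Reading such a table is exactly the learning exercise Erin describes: noticing which families rank high, then asking why.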
Jon Krohn: 00:32:12
What do you think about the Super Data Science podcast? Every episode I strive to create the best possible experience for you and I’d love to hear how I’m doing at that. For a limited time, we have a survey app at www.superdatascience.com/survey where you can provide me with your invaluable feedback on what you enjoy most about the show and critically about what we could be doing differently, what I could be improving for you? The survey will only take a few minutes of your time, but it could have a big impact on how I shape the show for you for years to come. So now’s your chance. The survey’s available at www.superdatascience.com slash survey, that’s www.superdatascience.com/survey.
00:32:51
Super cool. I had never thought of how valuable that kind of AutoML leaderboard is for learning about algorithms. It’s a super cool application that I hadn’t thought of. I always think of the things that you did mention as well, like making model development easier, less boring, cleaner code. Those are all the kinds of things that I expected you to say. I didn’t expect the model leaderboard answer, which was so illuminating. And I can definitely see how that would work. It would be so interesting as somebody getting into data science or even like you say, somebody who’s maybe expert in some specific algorithms and not others, to see how all these algorithms compare against each other with respect to model accuracy as well as efficiency. Very cool. I like that. Then, so Erin, what’s the no free lunch theorem?
Erin LeDell: 00:33:47
Well, so the no free lunch theorem, in the machine learning context, it’s like a theorem. There’s a paper related to it, and one of the authors is this guy named David Wolpert, who is the inventor of stacking, so that’s Stacked Ensembles. So probably the reason that he came up with such a thing is because he understood this, he had this kind of notion about algorithms. And the no free lunch theorem states that any two optimization algorithms are equivalent if you average over all the problems that they could be applied to.
00:34:32
So there’s no one best algorithm, is the other way to think about it. And so that’s kind of the approach that you take when you’re a person who does stacking. You’re kind of like, well, first of all, there could be a single best algorithm on a specific data set, but you don’t know that in advance. So what we like to do is just try a whole bunch of different algorithms and then stack them together. And that’s actually going to give you, in a reliable way, better performance than if you just choose GBM all the time or something like that. So that’s kind of where it comes from. And yeah, there’s no trick, and that’s true of AutoML tools as well. There’s not just going to be one AutoML tool that wins on every single problem, it’s just more levels of that.
Jon Krohn: 00:35:25
But AutoML is kind of a step in the direction of evaluating at least all of the possible models in the universe. Maybe not in the universe, but all the popular algorithms that are likely to be useful. So AutoML is an avenue to try to minimize the impact of the no free lunch theorem. So to kind of state back to you something that you already said, you could have somebody who is expert at XGBoost or expert at GBM, and so they are always looking to use that hammer on any nail that they confront, even though there could be situations where, oh, deep learning actually would’ve been a better choice here than XGBoost or GBM, or maybe even something simpler.
00:36:12
Because of characteristics of this particular problem, a really simple logistic regression model is just going to be able to do it super efficiently and get you the same kind of accuracy anyway. So the no free lunch is that no one single hammer is going to be the best hammer when you encounter any possible nail.
Erin LeDell: 00:36:35
Correct.
Jon Krohn: 00:36:35
I guess a better analogy would be that some models are like hammers, some are like saws, and you encounter different kinds of situations in this home-building exercise analogy for some reason. And yeah, so AutoML is a step in the direction of being able to try out all those different tools, given the new situation that you’ve encountered for the first time. But then it was a really nice caveat overall for you to mention that even then, a given AutoML tool like H2O, while it presumably endeavors to be as broadly useful as possible, will inevitably hit some situations where some other AutoML tool might actually be able to outperform it.
00:37:26
And you talked about stacking a little bit there, and so that might be a term that listeners aren’t totally familiar with, but we can actually put a pin in that, because I’m going to be asking you some very specific questions about stacking later on. So listener, don’t worry, we will get back to that. Before we get there, I wanted to talk about a concept that you’ve been promoting called, admissible machine learning, and I had never heard of that before I started doing research for this episode. So can you fill us in on what admissible machine learning is?
Erin LeDell: 00:38:01
So yeah, admissible machine learning is a new topic, which is why you haven’t heard of it yet. So there was a paper in the journal, Machine Learning, that came out in January 2022. So that’s this year, for anyone who’s watching, I guess.
Jon Krohn: 00:38:23
In the future.
Erin LeDell: 00:38:23
In the future. So yeah, it’s not super well-known. So this is something that we are the first people to kind of implement. So there’s not admissible machine learning in scikit-learn or anything like that yet. So yeah, H2O is the only place that has it. So the idea is, we were thinking a while ago about what we could do with regards to fairness, the topic of fairness in machine learning. And we were working with a consulting researcher who came up with this whole very dense, very theoretical foundation for what’s now called admissible machine learning. And so it basically relies on information theory and non-parametric methods. One of the things that it makes extensive use of is the conditional mutual information calculation. The idea is, you have your set of predictors X, you have your response Y, and then you might have some other variables that we consider to be sensitive.
00:39:42
So that could be something like gender, race, other demographic variables, which could be regulated in some way — in the lending industry it’s fully regulated that you cannot use anything deriving from that — or maybe you’re just trying to make better models that are not discriminating against people on various properties. So yeah, the idea is the mutual information of Y and X, given these sensitive variables, will be zero if and only if there’s basically no sort of leakage from the sensitive variables through X into Y. So what happens often is you have your predictors, which don’t look like they have anything scandalous in them. Then we have the sensitive or protected variables, and they do. But then a lot of times, if you draw a directed acyclic graph, kind of going through where everything is influencing everything else, you still get information coming through the X variables that is from the sensitive ones, even if you remove them from the training set.
00:41:11
So the idea is we can come up with a metric for how much the sensitive variables are sort of leaking through the other ones, and then we can rank things based on that. So rank models, and that can be used. One way that we use it is as kind of a feature selection algorithm. So what it can do is identify, of what you think are your safe variables, are there any that are, let’s just call it, leaking that information? And so one thing you can do is apply it as a feature selection method, where you just remove those that go beyond some threshold that you’re comfortable with in terms of mutual information. So that’s kind of the main application of how you might use this. And then the goal there is, you’re then training models on this subset of the original data set.
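The leakage check Erin describes can be sketched for discrete variables with an empirical conditional-mutual-information estimate. This is a minimal illustration of the idea, not H2O's implementation; the variable names and the 5% noise rate are made up for the example. A feature that is a near-copy of the sensitive attribute S contributes almost no information about Y once S is conditioned on, so thresholding this quantity would flag or drop it, exactly the feature-selection use she mentions.

```python
import numpy as np

def cmi(a, b, c):
    """Empirical conditional mutual information I(A; B | C), in nats,
    for discrete (integer-coded) arrays."""
    total = 0.0
    for cv in np.unique(c):
        mask = c == cv
        p_c = mask.mean()
        ac, bc = a[mask], b[mask]
        for av in np.unique(ac):
            for bv in np.unique(bc):
                p_ab = ((ac == av) & (bc == bv)).mean()
                if p_ab == 0.0:
                    continue
                p_a = (ac == av).mean()
                p_b = (bc == bv).mean()
                total += p_c * p_ab * np.log(p_ab / (p_a * p_b))
    return total

rng = np.random.default_rng(1)
n = 20000
s = rng.integers(0, 2, n)            # sensitive attribute
x_clean = rng.integers(0, 2, n)      # genuinely informative, independent of s
flip = rng.random(n) < 0.05
x_leaky = np.where(flip, 1 - s, s)   # near-copy of the sensitive attribute
y = s ^ x_clean                      # outcome depends on s and x_clean

# The clean feature still carries information about y once s is known;
# the leaky feature's apparent signal vanishes after conditioning on s.
print("I(Y; x_clean | S) =", cmi(y, x_clean, s))
print("I(Y; x_leaky | S) =", cmi(y, x_leaky, s))
```

Here the first quantity comes out near ln 2 ≈ 0.69 nats while the second is near zero: everything the leaky feature appeared to know about Y was really information about S.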
00:42:11
And then ideally they should be more fair. And when I say more fair, that’s a very vague, not well-defined term, but it’s up to the user to decide: what kind of fairness metrics are you interested in, what do you care about, what’s okay, what’s not okay? We’re trying to basically make a general toolkit for people to have all the tools they need to make better decisions, without trying to tell people how to do that. We don’t want to try to do automated fairness, for example. I think that’s probably a bad idea. So yeah, that’s kind of the idea behind it.
00:42:52
And yeah, in H2O, there’s something that we call the infogram, and that is a graph of, basically, sort of variable importance on the x-axis and then this net information, as we call it, on the y-axis. And it gives you a tool also to decide, okay, if I’m choosing between these two features that are leaking stuff equally, is one of them not even that important? So it gives you this visual tool to say, okay, this is great, it was a bad feature anyway, it’s not even actually that valuable. So let’s get rid of this one, but maybe let’s keep this one, or things like that. So yeah, it’s kind of just a whole set of tools. So basically, if you want to learn more, go to the H2O user guide and there’s a section called Admissible Machine Learning, and that will have a link to the paper and you can read more about it.
Jon Krohn: 00:43:56
Nice. Super cool. I just learned so much about this new topic that, as I was saying, I didn’t know anything about before researching this episode. And I’ll try to recap back to you what you said, and you can tell me what I get wrong. So admissible machine learning relies on information theory and non-parametric methods to quantify how much sensitive variables are influencing the model outcome Y via the inputs X. And so this allows us to potentially remove features that are associated with these sensitive variables that could be associated with unfairness. Is that a reasonable recap?
Erin LeDell: 00:44:35
Yeah. And so then the idea would be if you’re building on a subset, hopefully you’re kind of removing some of the bias at least. I think with all these problems there’s a lot of different ways to attack the problem, like on the data, after you train the model, et cetera. So it could just be one tool in a whole suite of things that you try to get rid of the bias.
Jon Krohn: 00:45:01
Nice. And then the H2O infogram enables visualization of everything that you’ve just been describing. And so it allows people to be making these evaluations, potentially removing some unfairness, some bias, from the modeling approaches. So tangentially related to the topic of unfairness or bias: you’ve founded or co-founded two massive and highly respected organizations to champion and promote underrepresented communities in the field of data science. So you founded Women in Machine Learning and Data Science and you co-founded R-Ladies Global. And so I’m actually indebted to you personally, because Women in Machine Learning and Data Science was my initial break on the speaking circuit. The New York community for Women in Machine Learning and Data Science was the first meetup that gave me a shot at giving public meetup talks.
00:46:00
And so yeah, that was kind of like a starting point that I was able to leverage and say, “Hey, I did this talk, maybe you’d like to welcome me and do a talk with you as well.” So thank you very much Erin, for creating that spectacular community. This was before the pandemic and so I’m sure things have changed as things have become more virtual. But pre-pandemic, it was such an engaged community. You saw the same people coming out to meetups regularly and supporting each other around learning machine learning concepts, around getting great employment opportunities in data science and machine learning. So great organization. I’m less personally familiar with R-Ladies Global, but I see it everywhere. So yeah, truly it’s remarkable to have been able to found or co-found these two massive organizations. So what motivated you to create them and what challenges did you overcome along the way?
Erin LeDell: 00:47:05
Well, it’s great to hear that story, by the way. We hope that this is a venue for people to get experience and then they can go… It’s exactly that. And yeah, so I’m glad to hear that. So why did I found each of these? Women in Machine Learning and Data Science was founded in 2013, and the idea for it comes from the Women in Machine Learning workshop, which is at NeurIPS. And I went to that in 2012 and then 2013 as a grad student. And I just thought, this is so great, but if I go back, I don’t have this community as much. They do a little bit more online stuff now, but back in the day, it was a one-day conference, once a year, and it was exclusively focused… Well, I shouldn’t say exclusively, but primarily focused on academics and a lot of grad students.
00:48:07
So at the time I was going to a lot of meetups and there were mostly men there. So I thought, why not try to make this Women in Machine Learning concept into a meetup, and then we can meet once every two months or so with the women’s community, basically. So I just… I don’t know, I think I waited a year and I was like, if somebody else will do that, that’d be cool. And then it didn’t happen. So right after NeurIPS in 2013 and the Women in Machine Learning workshop, I went on Meetup, just created a meetup, and I called it Bay Area Women in Machine Learning. And I added the “and Data Science” because I did want to extend it to more software, data science, just generalizing it a little bit. And way back when, we did have a few talks with WiML, well at least one person at WiML, that was like, “Okay, maybe we could do chapters of WiML.” But that just never panned out, so I just took it on my own and did it.
00:49:27
Yeah, there were probably 15 people that came to the first meetup, maybe, and then we just kept doing it. And a couple of years later, somebody in… I think New York was the second chapter. So someone in New York just contacted me and said, “Hey, I like this thing that you’re doing. Could we do one in New York?” And I was really excited that anybody cared. And so then that’s when the New York chapter started, and I think maybe a year later, one in North Carolina started. And then, I don’t know, just over time, there’s a lot now. So there’s like a hundred and something.
Jon Krohn: 00:50:12
Wow.
Erin LeDell: 00:50:13
Yeah, so we tweet a lot @WiMLDS. And so I think that’s maybe where people find out about it these days and then they see it online and they think, okay. We have stuff on our website, like contact us if you want to start your own chapter in your city. And we pay for the meetup fees and all the infrastructure. So you basically just focus on running your own group. So yeah, I guess the goal was to create this community beyond just a single day of the year and kind of expand it a little bit more and make it localized so you get to know your local folks. And then-
Jon Krohn: 00:50:52
That’s super cool.
Erin LeDell: 00:50:52
Yeah.
Jon Krohn: 00:50:53
And then R-Ladies-
Erin LeDell: 00:50:58
So R-Ladies Global is basically a nonprofit that is, in spirit, very similar to PyLadies. So that’s another one, and that’s where the name comes from. So it’s groups of people that are interested in learning R and using the R language. And it’s a little bit the same vibe as Women in Machine Learning and Data Science, but just exclusively focused on R. And that kind of came together at… there’s a yearly conference in the R world called useR! It’s the big conference, once a year. And so me and, I think it was five other people, could be six, I can’t remember how we were counting because people kind of dropped out a little bit. But me and maybe five other people got together. And there was one chapter already existing in San Francisco, and then some people from London met us from San Francisco at the useR! conference. And they were like, “Well, we want a chapter too, how do we do that?” And then other people at the conference were like, “Well, we want a chapter.” So then it just became very big very quickly, because everyone’s like, “Oh, we want that too.” And so we just started a nonprofit, applied for some grants to get money to pay for the meetups, because meetup.com is not cheap, don’t know why.
00:52:35
But anyway, that’s a whole other rant I could go on. But anyways, super expensive for what it is. But everybody’s on it, so it’s a good place to launch meetups. So we basically spent a lot of time at the beginning just on developing the values that we wanted in the organization. What would be okay, what’s not okay. Some of the things that are important to us: there’s no commercial agenda whatsoever. So one of the things about women-in-tech groups is there are always companies that want to come and use you to get diversity points in the world. So we didn’t want that, or promotional stuff. So everybody was on the same page in terms of values, of what we were trying to do here and what was okay and not okay. We spent a lot of time on branding at the beginning. I remember looking at many hex codes of purple and voting on which purple was the best purple for the R.
00:53:39
And we had people that were good at making logos or branding, and then we have somebody running the Twitter. So I think that’s probably why you see them everywhere. We did, from the beginning, put a lot of focus on brand and consistency, and we wanted the same values propagated through all the chapters so that it’s not just this disjointed thing. So it’s a little bit… PyLadies is a little bit different, in that it’s a little bit more decentralized, whereas R-Ladies is a little bit more centralized and there’s one group that’s running the show.
Jon Krohn: 00:54:15
Kind of how R has a central CRAN repository.
Erin LeDell: 00:54:18
Yeah, which we know, there are good things and bad things about that. Yeah oh, what a horrible comparison.
Jon Krohn: 00:54:25
Exactly.
Erin LeDell: 00:54:25
Oh man. Never been compared to CRAN before.
Jon Krohn: 00:54:34
All of the R-Ladies meetups must have a PDF in exactly this format and it must be approved to get in. No, I’m joking.
Erin LeDell: 00:54:44
No.
Jon Krohn: 00:54:46
So yeah, so amazing organizations, Women in Machine Learning and Data Science, and R-Ladies Global. Thank you so much for founding and co-founding these organizations respectively; they make a huge impact worldwide. And so obviously for our female listeners out there, these are organizations that you should be thinking about getting involved with. But also for other folks, I’ve been involved with Women in Machine Learning and Data Science and there are ways that you can help out, even if you aren’t female, for example, by giving talks to these organizations. I’m sure there are other ways too. Yeah, really amazing organizations. Thank you so much, Erin.
Erin LeDell: 00:55:28
I’ll just say one thing on the topic of who these communities are for. So back 10 years ago, we weren’t thinking a whole lot about it; it was just women that we were focused on. And now I want to be clear that we’re open to… Basically, we’re open to everybody, anyone can come to our meetups, but we do try to promote women and non-binary speakers. Not exclusively, but the idea is that that’s who we’re trying to help. But I just want to give a shout out to the non-binary folks who may not feel represented.
Jon Krohn: 00:56:04
For sure. Thank you for calling that out. And so a related question, tying together your work as a grassroots community leader in these organizations with your machine learning background. There are well-publicized gaps today in diversity and income in a lot of industries, including in machine learning and data science. And while we hope that in the future those will be completely resolved, do you worry about how historical data that we use to train our models encodes past practices and behaviors related to these gaps in diversity and in pay? How do you think we should address bias in data sets and the data generation process? I mean, I guess we have some approaches; now I know about admissible machine learning, so that sounds like one really cool way to be doing it. Do you have any other thoughts on that topic?
Erin LeDell: 00:57:17
Well yeah, first of all, I do think it’s a big problem. Because pretty much a lot of the data sets that are out there weren’t necessarily created with the intention of doing machine learning on them. And the data might be created in a weird way that could encode biases, or there’s just historical bias represented in different industries that’s getting encoded into the models. So yeah, it is a huge problem. People have been talking about this a lot, very much so in the last five years at least. It was talked about before then, but yeah, I’m really noticing more prominent people speaking out about this. But there hasn’t really been any huge progress. We still see brand new systems that were just designed in the last year, some of these image generation platforms like Midjourney, DALL-E, et cetera, and they have really objectionable content in there.
00:58:27
A vanilla example is, type in “Show me a CEO” and it will show a white man in a suit. Same with Google Images. I guess, yeah, it’s concerning, because at this point we know all of this and there’s still not anybody really regulating it in any way. And it’s kind of a weird situation to be in, because we’re kind of trusting… Not that we are doing this intentionally, but the way that society is set up is we’re trusting the companies that are producing these to kind of self-police. And I don’t think they are. I think they sometimes respond to things when there’s some kind of media story about, oh, this system did this and it’s racist or sexist or whatever. And then they say, “Oh, we’ve been working on that, we’re trying to fix it.” And it’s like, but we have all the tools and the researchers who are knowledgeable about this stuff that work at all these companies. So I don’t know, there’s a disconnect there. So I don’t really know.
00:59:32
And it’s hard to measure these things anyway: how fair is this? And if it’s below this threshold, are we going to trash it? There’s not really an easy answer. But the thing that I’m doing, because this is my skill set, is trying to make tools that can make this stuff a little bit easier, so that it’s not this mysterious thing that you don’t know about; it’s just right there in whatever library you’re using, and you start to learn more about it that way. So I think we have a long way to go. And then once the data scientists know a little bit more and have the tools, then we have to figure out the other piece, which is: if it’s more profitable to show this type of thing, even though it’s racist or sexist, it’s probably still going to happen, because that’s just capitalism.
Jon Krohn: 01:00:27
Yeah, it is somewhat encouraging to see in recent years that conferences like NeurIPS have tracks specifically focused on these bias issues, which prior to five years ago you didn’t see at all. It was really a fringe topic. And so it is hopefully encouraging that it is something that is being discussed. However, as you say, it’s amazing how so many major production algorithms that come out today still have these issues. And so I guess… I don’t know if there’s just not enough of a sense of urgency, even though these are important issues.
01:01:05
I think perhaps part of the issue, which will take some time to resolve is that these algorithms that you’re describing, the people who are working on them are still disproportionately white men and so they get released and these kinds of issues like you’re describing about using an image generation algorithm and typing that you want to see a CEO and it comes out as a man and they don’t bat an eyelid at that. And so I don’t know if that’s part of the issue and hopefully that… I don’t know, I guess there’s a lot of different parts of these historical systems that reinforce these issues and it is disappointing that we’re not making faster progress. But hopefully at least by having all these conversations, more meaningful strides can happen more and more in the future. Yeah, I don’t know if you have any other thoughts on that?
Erin LeDell: 01:02:05
Yeah, I think the last thought would be, rewarding people for looking at these topics is a big, important thing. So it’s a big deal that NeurIPS had this new Datasets and Benchmarks track last year, which is the one that I gave the keynote at. And yeah, a lot of the people that work at the big research labs need to publish papers, and so if they’re not able to publish papers at the top conferences on these topics, then they’re probably not going to work on it, or will be less inclined to. So I think that was a big step and I hope that track can grow, and I’m glad that it’s become… I mean, I’m also very passionate about the benchmarking side of things because, I don’t know, I like justice. I like it to be clear where everything is, and I don’t like a lot of marketing, I like numbers. So yeah, I think that’s a big contribution and I’m happy that NeurIPS did that.
Jon Krohn: 01:03:10
Nice. All right, thank you for that insight at the convergence of both your community involvement and your machine learning expertise. Really nice to be able to ask you questions and get your expert perspective. So moving on to another part of your expert perspective, something from the past. Prior to what you’re doing nowadays, you were doing a PhD at the University of California, Berkeley. So at Berkeley your focus was to reduce the computational burden of, I promised we would get back to stacking, the stacking methods developed by your advisor Mark van der Laan, to scale machine learning to very large data sets. So could you explain this topic of stacking methods, and then maybe more specifically your research on how you were able to adapt stacking methods for very large data sets?
Erin LeDell: 01:04:11
Yeah, sure. So that’s one of the things that I focused on in my PhD, and it really just came about from using the software that was available to do stacking, and it just taking a really, really long time to run my models. And there was one point where I calculated it was going to take two weeks to run this model that I wanted to run, with multi-level cross-validation and stuff like that. So yeah, I was just a little bit… I think one of my first classes at Berkeley was with my advisor, and he introduced, he calls it the Super Learner algorithm, because he has a paper called Super Learner, which basically proves all the theory behind stacking and why it works. So stacking existed a lot earlier than people knew exactly how it worked and were able to publish theoretical guarantees of performance and things like that.
01:05:10
So that’s why he calls it super learner. So he is presenting this algorithm and he says, “This algorithm provably does better than any single algorithm.” And I raised my hand and I’m like, “Okay, well if that’s true, why have I first of all never heard of it? And second of all, why isn’t everybody using it all the time?” If you had one algorithm that was always better, why wouldn’t you just use that one? It doesn’t make any sense to me. And he’s just like, “I don’t know. I don’t know why they don’t use it.” So it’s like, this doesn’t make any sense. So from early on, actually before I went to Berkeley, I knew about super learner and I just didn’t understand why people weren’t using it. But then I started to try to use it in my research. I also did some applied research and it was slow.
01:06:06
It was training a whole bunch of algorithms in R, and all the code was in R, so it just wasn’t scalable to big data sets. And I worked in biostatistics, so a lot of the data sets are smaller anyway; people aren’t testing it on a million rows or anything like that. So yeah, basically I just was like, this needs to have a better user experience, because I like the algorithm but the software wasn’t working for me at the time. So one of the things I did was discover H2O around that time, which had an R library, and that was through some other projects and benchmarking and another startup that I used to work at. So I discovered H2O and I realized, through extensive benchmarking, that this platform is really fast, really low-memory, and has a bunch of algorithms, not as many back then as we do now. But I was like, maybe use this instead of a bunch of random R packages that are implemented by all different people with different skill sets, where some of them are really slow and you don’t even know which ones are slow in advance.
01:07:21
So basically I saw that as a platform that I could develop the super learner slash stacking algorithm on top of, because it didn’t have it. So I started just writing that code and making that software. It first started out as an R package that was using the H2O R package. So all the code I wrote was in R, but really it’s just kind of instructional code. It’s just saying, train this algorithm, then take these cross-validation folds and do this with them. So it didn’t really matter in terms of performance that it was written in R, but it was an R package. And then eventually I graduated and went to work at H2O, and then we re-implemented everything in Java.
01:08:09
So yeah, H2O is written in Java, all the algorithms are written in Java, but the APIs are Python and R, which is how people mostly use it. So yeah, we got it all super optimized in Java with some Java engineers that I worked with. So that was one approach. Then there were a couple other things. There’s another algorithm called Subsemble, which is kind of a version of stacking, but it partitions the data set into subsets and then uses this kind of fancy version of cross-validation where you’re able to score all the rows like you need to in stacking. But you’re able to do that while training on smaller subsets, and it actually does pretty well compared to if you just use the whole data set. So that was another approach, and there’s an R package that does that. It would be great to have it in H2O. We were just talking about it the other day. It’s been a minute, but maybe we could get that into H2O. It would be cool. Yeah.
Jon Krohn: 01:09:21
So that was all super interesting, but I still don’t feel like I know exactly what stacking is.
Erin LeDell: 01:09:27
Oh, okay. Yeah. Let me tell you, let me tell you. Okay, stacking, it’s really simple. So you just cross-validate a set of algorithms. Let’s say you have five algorithms, or not five algorithms, but five models, like a random forest with these particular parameters and a GBM with these, et cetera, et cetera. And then you train and cross-validate those K times if you’re doing K-fold cross-validation. And then from that process, you have the holdout predictions from each fold. And so you have a prediction for each row in the data. And then you take… So basically you imagine a new matrix that’s your training matrix. And there are still N rows, as in the original training frame. But now, if you had five algorithms, there’d be five columns. And each column is just the cross-validated predictions from each of the algorithms.
01:10:33
And so you have this new thing, you can stick on the response column. And then you have this new training matrix, and then you train another algorithm to learn how to best combine the predictions from the base learners to predict the outcome. So there are two machine learning problems going on: the original one on the original data set, and then this meta-learning process. And you can use any kind of algorithm you want for the meta-learner. Often people just use a GLM, but you could use whatever you want.
01:11:04
So it’s basically just a two-level system where you have the base learners and you have the meta-learner. And then when you go to use that in production, or wherever, to score new data sets, you have to score all the base learners, get the outputs, put them into the meta-learner, and then get the final prediction. So one of the things that can be slow about stacking is if you have, let’s say, hundreds of base learners, and especially if any of them are particularly slow at generating predictions. Hopefully you could do that in parallel, and maybe it wouldn’t be a big deal, but you’re limited in that way. So sometimes people think it’s too complex or something like that, but I don’t think it’s any more complex than deep learning, and people love that.
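Erin's two-level description can be sketched end to end in plain numpy. This is an illustrative toy, not H2O's Stacked Ensembles: two hypothetical base learners (least squares and a brute-force k-NN, invented for this example) produce out-of-fold predictions via K-fold cross-validation, those prediction columns form the level-one training matrix, and a least-squares meta-learner (standing in for the GLM she mentions) learns how to combine them. Scoring new data runs all the base learners first, then the meta-learner.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression data with both a linear and a nonlinear component.
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

def fit_linear(X, y):
    """Ordinary least squares with an intercept (toy GLM)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.column_stack([Xq, np.ones(len(Xq))]) @ w

def fit_knn(X, y, k=10):
    """Brute-force k-nearest-neighbour regression."""
    def predict(Xq):
        d = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return y[np.argsort(d, axis=1)[:, :k]].mean(axis=1)
    return predict

base_fitters = [fit_linear, fit_knn]
K = 5
folds = np.arange(len(X)) % K

# Level-one matrix Z: one column of out-of-fold (holdout) predictions per
# base learner -- same N rows as the original training frame.
Z = np.zeros((len(X), len(base_fitters)))
for j, fit in enumerate(base_fitters):
    for fold in range(K):
        tr, te = folds != fold, folds == fold
        Z[te, j] = fit(X[tr], y[tr])(X[te])

# Meta-learner: learn how to combine the base learners' predictions.
meta = fit_linear(Z, y)

# Scoring new data: run every base learner (refit on all the data),
# then feed their outputs through the meta-learner.
base_models = [fit(X, y) for fit in base_fitters]
X_new = rng.normal(size=(50, 4))
Z_new = np.column_stack([m(X_new) for m in base_models])
stacked_pred = meta(Z_new)
```

The out-of-fold step is the crucial detail: the meta-learner must see base-learner predictions on rows those learners did not train on, otherwise it would learn to trust whichever base model overfits the hardest.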
Jon Krohn: 01:12:00
Yeah, no, that is actually quite intuitive. And so I wasn’t aware, but, so at my machine learning company, Nebula, our main production algorithm is a stacked algorithm. I just didn’t know that it was called that.
Erin LeDell: 01:12:10
Oh, nice, nice.
Jon Krohn: 01:12:10
Yeah, so we do exactly that. We have multiple base learners that feed into a generalized linear model that allow us to have this meta learning output. So we intuitively stumbled upon that. We’re like, “Oh, we get, in some situations, one of these models performs really well and other situations, this other model performs really well. Let’s try to find a way to have the best of both worlds.”
Erin LeDell: 01:12:38
Yeah, I feel that it’s been discovered independently many times I think. Which is again, probably why my advisor discovered it and then proved all the theory behind it. And I’m not… I don’t know if he actually knew about it before that actually. So yeah, it’s become a popular technique on Kaggle. That’s pretty much where you see it a lot. But yeah, people are using it. A lot of our customers use it.
Jon Krohn: 01:13:09
It makes a lot of sense.
Erin LeDell: 01:13:11
Yeah.
Jon Krohn: 01:13:11
It’s like to me, intuitively it, yeah, as I just explained, some kinds of models like a GBM or XGBoost, they happen to excel in particular kinds of situations. And by averaging together models that excel in different kinds of situations, yeah, it’s unsurprising to hear that they’re doing really well in Kaggle competitions too. Anyway, I spoke over you.
Erin LeDell: 01:13:37
That’s okay. I’m glad you’re stacking. You should try H2O AutoML though. That’s my pitch.
Jon Krohn: 01:13:48
Yeah, no, for sure. I should probably be using H2O AutoML. I agree. I am… Based on everything you’ve said today, I feel like an idiot that I am not. So yeah, I’ll get on it. So in addition to techniques like stacking and of course the H2O library that you develop, what other kinds of tools do you use daily?
Erin LeDell: 01:14:12
Well, I use a lot of R, a lot of Python. There’s a new tool that we have at H2O that’s open source that’s called Wave, and I’ve been using that a lot lately. It’s basically an application building framework, so it’s in the browser. So the equivalents that people would have heard of are Streamlit in Python, or Shiny in R. So it’s that type of thing. But I actually do like it better than the other two. I’m not an expert at either of those other two. But I found it to be very easy to make UIs for your models or make UIs for whatever you want. There’s tons of little built-in things that are already there for you. So you say, “I need this kind of plot of this nature, and here’s the X and the Y,” and that type of thing. So I’ve been building a lot with that tool. There’s a… We’re building… Oh sorry, go ahead.
Jon Krohn: 01:15:15
Does it kind of provide you with interactive buttons and that kind of thing too for your users?
Erin LeDell: 01:15:21
Yeah.
Jon Krohn: 01:15:21
Cool. Yeah, so that… Of the tools that you mentioned, Shiny is the one that I’m most familiar with. So does Wave work with either R or Python or it’s a Python library?
Erin LeDell: 01:15:35
It’s both.
Jon Krohn: 01:15:35
Oh it’s both.
Erin LeDell: 01:15:36
I think the R library is a little bit further behind in terms of some features right now. So primarily the library is in Python and then we have an R API to it as well. So yeah, I’ve been building a UI for H2O AutoML in it, so that’s going to hopefully come out soon. So rather than writing code, I’m trying to get even one level beyond that. And then you just click to upload your data set and then it trains, and then it will also combine all the automatic visualizations. So you get everything all at once just by clicking. And there’s lots of little tabs you can click on for the different areas. It’s fun to build that. But yeah, other tools, yeah, I don’t know. I use a lot of R, Python, Bash. Yeah, that’s about it.
Jon Krohn: 01:16:31
Cool. Well, I’m delighted to learn about Wave, which I hadn’t heard of before. So I’ve learned about a whole bunch of new techniques and approaches in this episode. Wave, admissible machine learning, even the Driverless AI package, I wasn’t aware of that before. So some really cool things coming up for me and probably a lot of our listeners. Great to learn about how I could be using Wave in Python or R to be building applications in a point-and-click way for my users in the browser. I’ve loved doing that with R Shiny in the past. And so it’s nice for me to now know about one that I could be using to build Python applications that way as well. It allows me, as a terrible software developer, to be able to have interactive data science applications made easily. And fun dashboards. You can typically whip them together pretty quickly. So yeah, it’s definitely a tool for our listeners to check out.
01:17:24
And so speaking of my terrible software developer skills, I noticed that H2O has lots of engineering roles open right now that I would not be qualified for. So when you’re hiring software developers or data scientists, Erin, what do you look for in the people that you hire?
Erin LeDell: 01:17:46
I would say are they… What’s their experience? Have they been doing this a while? Are they new? I do also have interns that are in college and were able to still get a lot done. Actually my intern who’s in New York, he was the one that started to build the H2O AutoML Wave app over the summer. So yeah, I think it depends what I’m trying to hire for. But generally I highly prioritize if the person is nice and they don’t seem like they’re going to cause any major drama or issues. Especially as a woman, it’s important to screen for stuff like that, any kind of red flags in those departments. But yeah, I would say, at H2O we do hire a lot of Kaggle Grand Masters. That’s an easy way to show that you know what you’re doing.
Jon Krohn: 01:18:48
It’s an easy way. Oh sorry. The easy route to a job at H2O is to be one of the top performing people on Kaggle on the planet. That’s an easy way.
Erin LeDell: 01:18:57
Yeah. But we do… That’s just one subset of the company. So we have engineers all over the world and we have positions open, I think, pretty much anywhere. We’re a very remote company, especially since COVID. So it depends on what they’re trying to work on, because we’re building this whole cloud right now, so we are hiring a lot of cloud engineers. I would just hope they’ve worked on some other cloud before. That would be good.
01:19:25
But for data scientists, you don’t need to be a Kaggle Grand Master. You do need some way of demonstrating your skills and ability to think through a problem. There’s different ways that you can test for that. But yeah, I think programming ability is important because you don’t want to be slowed down by that. But I would say not the most important thing. I look for people that are more generalists and can adapt to whatever it is that we’re working on. Unless I’m hiring for a specific position where they have to know Java because they’re going to be working on the H2O Core or something like that.
01:20:06
So yeah, I think also great if people are on Twitter and talking about data science and promoting knowledge, I think that’s a very appealing, especially at a company which has a lot of open source at it, to hire people that would like to do that kind of work as well.
Jon Krohn: 01:20:29
Yeah, that makes a lot of sense to me. And that’s something that for listeners that are looking for a new opportunity or maybe their first opportunity in data science, getting comfortable writing tweets or LinkedIn posts or Substack newsletters, whatever. It doesn’t need to be at the start, the most groundbreaking insights. But just getting into the habit of articulating new things that you’ve discovered or cool technologies, this makes you more confident about being able to communicate these things and it forces you to understand them better than if you just read about them. And so even if you never got a single Twitter follower or a single LinkedIn follower or whatever, it would still be really valuable when you come to interviews. But then as it happens, if you do do it regularly, you’ll probably iterate and improve. You’ll figure out what kinds of content that you create resonates with some of your listeners or, listeners in my case, readers in most people’s cases. And you can iterate from there and start to create more of that stuff that resonates. And yeah, over time you probably will just develop a following organically while practicing your ability to communicate and understand complex data science concepts.
01:21:51
Yeah, again, using that same phrase that I used at the beginning of your episode, but preaching to the choir a bit about evangelizing about data science, but I certainly agree. I think it’s something that everybody could be doing and would probably benefit from doing. So we’ve talked a lot in this episode about AutoML. We’ve had some glimpses into really cool emerging approaches like admissible machine learning. So Erin, I’d be interested to get your take on how you think data science will evolve in the coming years and decades, perhaps because of things like AutoML and stacking. What actions could data scientists be taking today to prepare themselves for the data science job of the future?
Erin LeDell: 01:22:42
Yeah, that’s a good question. I think we can look at how the field or position has evolved maybe over the last 10 years. If we think about 10 years ago, people needed to be good at five or six or 10 different machine learning libraries. There was already scikit-learn, but if you needed something else, everything was a little bit disjointed. I remember just talking to a lot of people that were implementing their own algorithms, even though there were other open source things available. So I think the data science job started at implementing algorithms, then it’s like, okay, now we have scikit-learn or whatever. So then that makes it a little bit easier. But you’re still doing a lot of the tuning or whatever fancy stuff that you want to do to get good models. And then now we have more tools that automate that process, like automated hyperparameter tuning, that type of thing. So then that became less important. And then I think now we have AutoML as well, so it just keeps getting a little bit more automated each time.
01:23:57
So I guess what we could expect next is we have, I guess, I don’t know what comes after AutoML. I think it’s more like using the insights from the models and really expanding in terms of explainability and interpretability and all of that, fairness, bias. All of these topics that were sort of way too niche and overlooked in the past. Now we have more time to do that. You don’t have to spend all your time training models, you can actually think about problems better. Even when you’re starting out a problem, how to structure the data in the data warehouse or things like that. So I would hope that people could focus on fairness issues and explainability. And I think that’s going to go a long way because there’s still this disconnect between the business or whatever other application of machine learning, if it’s in science, et cetera, and the data scientists. There’s still that gap. And so whatever can better fill that gap with these different new methods, I think that would be good to focus on.
Jon Krohn: 01:25:14
Yeah, that’s a great answer. I love that idea that because we’re trending towards increasing automation of parameter tuning, obviously that’s kind of machine learning at large, and then hyperparameter tuning and then model selection, it should free up time for data scientists to be focusing on the big issues that matter for society, like removing bias and having fair algorithms. These sound like really valuable things for us to be focusing on as data scientists.
01:25:42
And so we talked a lot in this episode about approaches related to fairness, particularly when we were talking about admissible machine learning earlier in the episode. But if listeners would like another episode that is largely dedicated to explainable AI, you can check out number 539 with [inaudible 01:26:03] in which we talked about specific XAI (explainable AI) tools. So that could be one worth checking out if you’re interested in learning more about that. So Erin, you have been extremely generous with your valuable time today. The time has finally come for us to start winding down the episode, which means that it’s time for me to ask you for your book recommendation.
Erin LeDell: 01:26:29
So yeah, I’m going to give a book recommendation. I’m actually listening to the audiobook right now, which is Viral Justice by Ruha Benjamin. And she’s a professor at Princeton and she has another book that people might have heard of called Race After Technology. And so she spends a lot of time dissecting some of these issues of how technology can contribute to racism and how race informs technology, all the sort of intersections of these two topics. And she just has a… I don’t know. The book just came out, so a friend of mine tweeted about it and I started listening to it, and it’s pretty good. It’s also a little bit of a biography or autobiography as well. So yeah, I’m enjoying it.
Jon Krohn: 01:27:18
Cool. Nice recommendation, and it ties together a lot of the themes that we have been discussing in this episode, so that’s convenient. And then, so Erin, as I’ve stated in this episode, you are a luminary in this space. You are an extremely well known content creator in data science and I’m sure there will be lots of listeners who want to follow what you’re up to after this episode. Where should they be following you online?
Erin LeDell: 01:27:47
Twitter is the main place I hang out, so @ledell. L-E-D-E-L-L. And I don’t spend a whole lot of time on LinkedIn or anywhere else, so that’s the best place to find my content.
Jon Krohn: 01:28:02
That’s great. That’s why I asked the question.
Erin LeDell: 01:28:03
Yeah.
Jon Krohn: 01:28:04
So now people know where to hang out with you. Yeah, Twitter for sure. We’ll be sure to include your Twitter handle in the show notes. Erin, thank you so much for being on the program. It’s been a super interesting episode for me. Yeah, thank you for taking the time.
Erin LeDell: 01:28:20
Thanks for having me.
Jon Krohn: 01:28:26
Super informative episode today. In it, Erin filled us in on how AutoML enables you to save time, have cleaner, more reproducible code, get more accurate models, and view model leaderboards that can deepen your understanding of individual ML techniques. She talked about how XGBoost and gradient boosting machines are often the most accurate ML approaches, but how the no free lunch theorem postulates that no one particular approach is optimal for all problems. She talked about how admissible machine learning relies on information theory and non-parametric methods to quantify how much sensitive variables are influencing a model’s outcome via its predictors, potentially enabling the identification and removal of model features causing unfairness. She talked about how stacking multiple base learner models under a meta-learner such as a GLM can outperform the accuracy of unstacked models, and how the open source H2O Wave software library enables you to quickly build interactive, browser-based UIs in either Python or R.
01:29:24
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Erin’s Twitter profile, as well as my own social media profiles at www.superdatascience.com/627. That’s www.superdatascience.com/627.
01:29:40
Every single episode, I strive to create the best possible experience for you and I’d love to hear how I’m doing at that. For a limited time, we have a survey up at www.superdatascience.com/survey where you can provide me with your invaluable feedback on the show. Again, our quick survey’s available at www.superdatascience.com/survey. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another phenomenal episode for us today.
01:30:15
For enabling this super team to create this free podcast for you we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can find our contact details in the show notes as well. Or you can make your way to jonkrohn.com/podcast.
01:30:37
Last but not least, thanks to you for listening all the way to the end of the show. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the Super Data Science podcast with you very soon.