SDS 661: Designing Machine Learning Systems

Podcast Guest: Chip Huyen

March 14, 2023

This week, we’re diving into the ins and outs of building production-ready machine learning applications with best-selling author, Claypot AI co-founder and Stanford alum Chip Huyen. If you’ve ever wanted to get more ML models out into the real world, tune in to learn the critical factors that impact an ML model’s journey.

Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Chip Huyen
Chip Huyen is a co-founder of Claypot AI, a platform for real-time machine learning. Previously, she was with Snorkel AI and NVIDIA, and taught CS 329S: Machine Learning Systems Design at Stanford. She’s the author of the book Designing Machine Learning Systems, an Amazon bestseller in AI.
  

Overview

Why do so many models never make it into production? It’s a question that puzzles many data practitioners, and this week, Chip Huyen is shedding light on the issues that plague these projects.
The success rate of a model is often directly related to its ability to solve business challenges. To address this recurring issue, she often recommends that her students ‘focus on the problem.’ This approach keeps data practitioners aligned with the efficacy of a model rather than the sophistication of the solution. “Why bring out the big guns when a cheese knife will do,” she says.
Another topic that Chip covers in her best-selling book is how to align business intent, context and metrics in an ML project. Team members must first correctly understand the problem and its roots, ensuring that a machine learning solution is indeed required rather than, say, a UX-based one. Indeed, she emphasizes that machine learning is often the last resort as a solution.
If you’re ready for some of Chip’s enthusiasm to rub off on you, then tune in to this week’s episode and learn more about feature engineering, getting the most out of your training data and how to supercharge your productivity.

In this episode you will learn: 

  • Why Chip wrote ‘Designing Machine Learning Systems’ [08:58]
  • How Chip ended up teaching at Stanford [13:18]
  • About Chip’s book ‘Designing Machine Learning Systems’ [21:12]
  • What makes ML feel like magic [30:53]
  • How to align business intent, context, and metrics with ML [37:55]
  • The lessons Chip learned about training data [42:03]
  • Chip’s secrets to engineering good features [53:19]
  • How Chip optimizes her productivity [1:07:48]
 
Items mentioned in this podcast:

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 661 with Chip Huyen, co-founder of Claypot and bestselling author of Designing Machine Learning Systems. Today’s episode is brought to you by Pathway, the reactive data processing framework and by epic LinkedIn Learning instructor Keith McCormick. 
00:00:21
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:52
Welcome back to the SuperDataScience Podcast. Today we are graced by the presence and wisdom of the superstar Chip Huyen. Chip is co-founder of Claypot, a platform for real-time machine learning. She’s the author of the best-selling book Designing Machine Learning Systems, which was published by O’Reilly and based on the Stanford University course that she created and taught on the same topic. But that isn’t the only Stanford course she created: she also created and taught the eminent university’s TensorFlow for Deep Learning course. Previously, she worked as an ML engineer at the data-centric development platform Snorkel AI and as a senior deep learning engineer at the microchip giant NVIDIA. She runs an MLOps community on Discord with over 14,000 members, and her helpful posts have earned her over 160,000 followers on LinkedIn. Today’s episode will probably appeal most to technical listeners like data scientists and machine learning engineers, but anyone involved in or thinking of being involved in the deployment of machine learning into real-life systems will learn a ton. 
00:01:52
In this episode, Chip details her top tips for designing production-ready machine learning applications, why iteration is key to getting more machine learning models out of the lab and into the real world, what real-time machine learning is and the kinds of applications it’s critical for, and why large language models like ChatGPT and other GPT-series architectures involve limited data science ingenuity but do involve enormous ML engineering challenges. All right, you ready for this magnificent episode? Let’s go. 
00:02:26
Chip, welcome to the SuperDataScience podcast. It’s awesome to have you here. When I posted last week that you were going to be a guest on the show and asked the audience if they had any questions for you, it was my most viewed post of all time. It had over a hundred thousand impressions in 24 hours. So I’m excited about this episode. I know our audience is excited about this episode. Chip, welcome to the show. Where in the world are you calling in from? 
Chip Huyen: 00:02:50
Yeah, I’m so excited. Maybe you should have me on the show more often then. Yeah, I’m calling from San Francisco. I’m from Vietnam, but I’ve been living here for a while. 
Jon Krohn: 00:03:02
Nice. 
Chip Huyen: 00:03:02
You’re in New York, right? 
Jon Krohn: 00:03:04
I am in New York. Thank you for asking. You know, it’s not very often that the guests are asking me questions, but you did warn me before we started recording that you might get into teacher mode and start asking me lots of questions. So I appreciate you giving me a softball upfront. 
Chip Huyen: 00:03:19
So, you know, it’s funny, because before I got into tech I was actually working as a reporter. So I’m very used to asking people questions instead of being asked questions. So if you give me a chance, I was just going to ask you a question, because you do run a very popular podcast and I do want to get in on the secret. 
Jon Krohn: 00:03:39
Nice. Well, that relates to something I found while doing research on you before bringing you on the show. This isn’t something I was planning to talk about on air, but is it from your reporting career that you wrote four books in Vietnamese before you were in data science?
Chip Huyen: 00:03:53
So I found out early on that writing is one of the things you could do without having to interact with employers, right? My editor actually didn’t find out that I was still in high school. I had been writing for a while, and one day there was a tech conference with some high-profile people from the government, and the person who was supposed to take the shorthand notes for one of them couldn’t show up. My editor was like, “Hey Chip, can you come here really quickly? Be here in 30 minutes?” And I was like, yeah, sure. And when I showed up she was like, “Really? Is that you? Your face is so young, I can’t possibly put you next to that person.” But yeah, that’s how I started, and it was definitely a good, how do I say, a good trial run for me so that I could write a book later. 
Jon Krohn: 00:04:45
Nice. Very cool. Yeah, you have a lot of interesting stuff in your background. We’re gonna dig into a lot of it in this show. So I was introduced to you through Harpreet Sahota very kind of him.
Chip Huyen: 00:04:56
Oh, he’s great. Yeah. 
Jon Krohn: 00:04:58
He’s wonderful. He was in episode number 457, one of the first episodes when I took over as host of this show. 
Chip Huyen: 00:05:04
He’s so fashionable, he inspires me 
Jon Krohn: 00:05:06
He is very fashionable. 
Chip Huyen: 00:05:08
To be more colorful. Yeah. 
Jon Krohn: 00:05:09
Yeah. And his beard is always immaculately trimmed. 
Chip Huyen: 00:05:12
I know. He looks so good. 
Jon Krohn: 00:05:14
He does, and he’s got a great voice too. 
Chip Huyen: 00:05:17
Yeah, no, he’s a good podcast host.
Jon Krohn: 00:05:19
And then I think we have somebody, well, we’re about to have somebody else in common soon too. As I was researching for the show, I was reading the testimonials at the front of your book, and one of those people is Josh Wills. His name is on my mind because we’re also doing research for his episode, which is coming up very soon as well. So I was wondering, Chip, if you have a message for Josh… 
Chip Huyen: 00:05:44
Josh is my hero. I wish we could hang out more. He is such an interesting person, and the things he says can be very short but very memorable. He has some very famous tweets, just very interesting ways of looking at things. So Josh is great and I’m very happy you are getting him on the show. 
Jon Krohn: 00:06:04
Yeah. His tweets are epic and he seems like a very funny guy. I can’t wait to get him on. 
Chip Huyen: 00:06:09
And very kind as well. 
Jon Krohn: 00:06:11
Oh, cool. It seems like a lot of people in this industry are kind. 
Chip Huyen: 00:06:16
I mean, maybe it’s easy to be kind. As somebody was telling me, “I could be kind too if I made a lot of money.” So it’s easy to be kind when you are in a more… Okay, sorry, I didn’t want to make it sound like I’m dismissing people’s kindness. It still takes effort to be kind, but I do think, say, if you make… okay, I’m not gonna go into this. 
Jon Krohn: 00:06:42
No, I understand what you’re saying. It relieves some of the stresses, and so people can be kinder. But interestingly, and this is maybe gonna be kind of controversial, people working in high finance are famously mean, right? So that’s an interesting thing. So I don’t think that’s 
Chip Huyen: 00:07:01
Good sample. Yeah. 
Jon Krohn: 00:07:03
Yeah. It’s not the only factor. I think a big part of it is that in data science there’s a lot of intellectual stimulation, and the field is moving so quickly, with so many people with so many different areas of expertise, that you’re quickly humbled. 
Chip Huyen: 00:07:21
Yeah. 
Jon Krohn: 00:07:22
And you kind of have to be nice to everyone, because you know, deep down, that you know such a tiny little slice of the whole field and you’re gonna need a lot of help. And actually, this is another thing that’s very different from finance: in finance you’re trying to keep all of your secrets to yourself. 
Chip Huyen: 00:07:39
Yeah. Yeah. 
Jon Krohn: 00:07:40
But in data science, we’re publishing open source code, making courses and books about all of our secrets, and sharing them with the world. 
Chip Huyen: 00:07:48
Yeah. I do think the open source nature really helps with collaboration and bringing people together.
Jon Krohn: 00:07:54
So speaking of giving things to the world, you are the author of the best selling book Designing Machine Learning Systems, which is based on your lectures for your Stanford course of a very similar name, not exactly the same name, but pretty close. 
Chip Huyen: 00:08:08
I couldn’t get the same name. 
Jon Krohn: 00:08:10
Yeah. If people google… 
Chip Huyen: 00:08:11
I wish I could. Yeah. 
Jon Krohn: 00:08:12
Yeah, at Stanford the name of the course was Machine Learning Systems Design when you last offered it in winter 2022. The reason why I say giving things away is because O’Reilly, your publisher, has very kindly agreed to give away five digital copies of your book Designing Machine Learning Systems. The morning that your episode is published, I will do a post on my personal LinkedIn account announcing the episode, and the first five people who comment that they would like a copy of the book will get a free digital copy. 
Chip Huyen: 00:08:48
Nice. 
Jon Krohn: 00:08:49
Yeah. So let’s talk about your book. So machine learning engineering has existed as a profession for a few years now. Why did you write this book? Why was now the right time for this book?
Chip Huyen: 00:09:02
So yeah, I was just looking it up yesterday. Machine learning engineering started picking up around, if you go on Google Trends and search for machine learning engineer, you see it start picking up around 2015. It really started getting momentum around 2017, 2018. And it was about the same time that I joined NVIDIA. At first I was just making notes of things that I experienced and learned. I actually didn’t plan on writing a book. It was just like, okay, here’s a note, and I thought the best way to learn about something is to just post it out there, right? You know, like we were saying, the best way to get the right answer is to give the wrong answer, because people will be very quick to correct you. 
00:09:49
So I thought, I was younger at that time, okay, I’m here to learn, and I made a disclaimer: I’m here to learn. And it got a lot of good feedback. It was like an 8,000-word note I put on GitHub and I think it got quite popular. Then it evolved into a course at Stanford, and during the process of teaching the course I also wrote a lot of lecture notes, so I continued getting a lot of feedback. And over time it just started to look like a book. That’s how the book came around. There’s a rule of thumb I’ve heard: if people ask you the same question three times, it’s time to write a blog post, right? Over the last years, ever since machine learning engineering became a thing, I kept encountering the same questions over and over again. So I thought maybe my book could help answer some of those questions. 
Jon Krohn: 00:10:48
Yeah. It’s an extraordinary book. It has such a crazy amount of detail on all the topics people need to know about bringing machine learning into production and having that production system work effectively. We’re gonna talk a bunch in this episode about it. Really quickly, you mentioned your NVIDIA background there, and at NVIDIA your title was Deep Learning Engineer.
Chip Huyen: 00:11:11
Yeah. 
Jon Krohn: 00:11:12
So how does that relate to the machine learning engineering field? You can disagree with my kind of definition of this, but the definition that I tend to give on air is that the machine learning engineer is the person on a team that’s responsible for taking the model weights after they’ve been trained and putting them into a production system. So how is a deep learning engineer different from a regular machine learning engineer? 
Chip Huyen: 00:11:38
So I do think that a role really depends on the organization. Even in the same company, two people with the same title on different teams might have very different jobs. In my job as a Deep Learning Engineer, I was part of the applied research organization, so my job wasn’t really about bringing models to production. It was more about helping our customers prototype and showing that, “Hey, this idea has the potential to make a great impact in production.” A lot of our team’s work was about publishing papers, and my job was to help research scientists run experiments and build things so that they could publish papers as well. It was through that job that I realized a very big challenge in the industry is how to actually bring those prototypes into production. And I learned a lot from our customers; a lot of the questions I address in the book also came from the customers we were working with. 
Jon Krohn: 00:12:37
Nice. Yeah, that makes a lot of sense. Are you moving from batch to real-time? Pathway makes real-time machine learning and data processing simple. Run your pipeline in Python or SQL in the same manner as you would for batch processing. With Pathway, it will work as-is in streaming mode. Pathway will handle all of the data updates for you automatically. The free and source-available solution, based on a powerful Rust engine, ensures consistency at all times. Pathway makes it simple to enrich your data, create and process machine learning features, and draw conclusions quickly. All developers can access the enterprise-proven technology for free at pathway.com — check it out! 
00:13:15
And to understand a little bit better how you were able to create a course at Stanford, surely that isn’t the kind of opportunity that everybody gets, right? You know, you’re like, “Oh yeah, I was doing this machine learning engineering or deep learning engineering and people were asking me questions. So I started writing blog posts and putting things on GitHub, and then that started to look like a course, and so Stanford let me make a course.” So how did that come about? I guess from reviewing your LinkedIn profile, it looks like you had been an instructor of other courses, maybe ones that other people had created already at Stanford, so you kind of built up this reputation as an instructor? 
Chip Huyen: 00:13:54
I feel like a lot of things are just luck. My friends and I used to say that everyone who ends up in Silicon Valley has had a lot of luck, and I think that’s true for me as well. Of course it’s not easy to start a course at Stanford, but it’s a lot easier if you already went to Stanford and knew the processes there, right? For me it started when I was studying computer science: I started out as a section leader, which is more like a teaching assistant role, for an introductory CS course at Stanford. So I got to know teaching, and I really enjoyed it.
00:14:40
And then after that, Stanford has this very interesting concept that I actually learned about from my friends, called a student-initiated course. As a student you can initiate a course and teach it if you get approval from the curriculum committee. My first student-initiated course was a TensorFlow course, because for my sophomore summer I was interning as a machine learning engineer and I got to know TensorFlow. TensorFlow came out in November 2015, so by 2016 there was not a lot of material about it. I thought it was really, really useful as a tool and I wanted to learn more about it. So I came back to school and asked my professors, “Hey, can you teach a course on it?” And my professors were like, “I don’t have time. Why don’t you teach a course on it?” So they did help me get the course on TensorFlow started. That’s how I learned about the concept of a student-initiated course. 
Jon Krohn: 00:15:40
You created a TensorFlow course before your Designing Machine Learning Systems course? 
Chip Huyen: 00:15:45
Yeah, I think I did that when I was in college. 
Jon Krohn: 00:15:47
Wow. When you were an undergrad? 
Chip Huyen: 00:15:49
Yeah. It was a pretty steep learning curve, because at some point I feel like I was doing three jobs: I was doing my degree, I was a teaching assistant for another course, and I was teaching my own course.
Jon Krohn: 00:16:02
Oh my goodness. 
Chip Huyen: 00:16:02
So it was quite a bit of juggling, but I feel like I learned a lot. I would say the biggest thing I learned, the lesson that cut really deep, is that tools get outdated very quickly. Remember, I built my course on TensorFlow, right? It was TensorFlow 1.0, and at the time nobody called it 1.0, right? But then they changed to 2.0 and I realized that if I wanted to teach the course again, I would have to redo like 80% of my tutorials. 
00:16:32
And it was just like, okay, tools go in and out of fashion very quickly. So I chose not to focus on tools for my next course, but to focus more on the fundamentals. And that’s the same thing for my book. I do get a lot of comments from people saying, oh, there are not a lot of code snippets in the book. So yes, it’s not a tutorial book; it doesn’t teach you how to use whatever tool is in fashion nowadays. It’s more about asking a lot of questions to help you pick the right tools.
Jon Krohn: 00:17:04
Yeah, I think the topic that you picked, machine learning systems design, is evergreen. 
Chip Huyen: 00:17:10
I hope so. I mean, what is evergreen in a world that’s changing so fast nowadays? 
Jon Krohn: 00:17:15
I don’t know. I think these kinds of principles that you cover in the book aren’t gonna go away anytime soon. And I can really empathize with what you went through, because similarly, I got started in content creation by creating a course about TensorFlow 1, and then had to redo it all for TensorFlow 2. I did redo it all, and I’m really happy to have had that experience. But that is also why my second big content area is about the mathematical foundations of machine learning, because I’m pretty confident that things like linear algebra, calculus, probability theory, that’s not going to go away from machine learning. So yeah. 
Chip Huyen: 00:18:00
Yeah. I do think those are good skills to acquire as well. For example, probability and statistics: I think everyone should know more about them regardless of whether you’re doing machine learning or not, because having a good understanding of probability helps a lot in choosing careers, for example. Some careers have returns that follow more of a normal distribution. So if you become a teacher or a software engineer, you might not become super rich, but you won’t become super poor, right? But some careers, like writing or being an artist, follow more of a long-tail distribution, so only a few people actually make it. Having a good understanding of that also helps a lot. I think I wrote a post on how to apply statistics to life. 
00:18:52
Another is the view on luck. It taught me a lot about how to be humble. One example, right: if you deal a thousand people two cards each, then just by random chance some of them will end up with two aces, right? They don’t have to do anything; it’s pure chance. So we do see a lot of people succeed just by pure chance. For example, a lot of people have ideas about social media, social networks, but only some of them will emerge to be successful. I’m not gonna dismiss the success of other people, but I do say there’s a big component of luck in it. 
Jon Krohn: 00:19:32
Yeah, for sure. But as you say, you can make decisions that take advantage of those probabilities. So for example, you were talking about how for a career like being a software engineer, the compensation is kind of like a normal distribution, whereas for something like being a writer, a content creator, there’s a very long tail on the distribution. And so then you could be quite strategic. You could say, okay, I’m gonna have my stable software engineering job, I’m gonna commit to doing that for my eight hours a day, and then I’m gonna have my two to four hours a day in the evenings or weekends or whatever to focus on my creative writing career and give myself a shot at that long tail. 
Chip Huyen: 00:20:11
Yeah. So understanding things like the risk factors, and how to best structure my life to reduce those risk factors, is, I think, very useful knowledge to have. 
Jon Krohn: 00:20:24
Yeah. And if people want to access my probability theory for machine learning course, it’s on the O’Reilly platform already. I open up the course by saying that I think it’s the most valuable topic to take, because even if you’re not in machine learning, these concepts are so useful for thinking about your regular life. 
Chip Huyen: 00:20:46
Yeah. 
Jon Krohn: 00:20:48
Yeah, and we also have, through the relationship that the podcast has with O’Reilly, a special 30-day free trial of the O’Reilly platform; the code is SDSPOD23. So if you want a copy of Chip’s book and you don’t manage to be one of the first five people to comment on the post and get one, you can use SDSPOD23 to read it there. Anyway, let’s dig more into the book, Chip. The subtitle of your book is An Iterative Process for Production-ready Applications. Could you parse that out for our listeners? What is a production-ready application? How important is iteration? Yeah, tell us about it. 
Chip Huyen: 00:21:29
I think it’s funny you use the word “parse” because I feel like it’s a very software engineering term. I don’t think normally… Do people outside tech use that word? 
Jon Krohn: 00:21:38
I think it’s technically correct, though probably not as widely used outside tech. 
Chip Huyen: 00:21:43
Yeah, I see. No, that’s funny. So I’m gonna be absolutely honest here: I dream of the day when the phrase production-ready becomes redundant. We don’t say something like “eating-ready food,” right? What is the point of making food if it’s not ready to be eaten? Sorry, I’m a big fan of food, so my metaphors are food related. Even our company’s name, Claypot, is inspired by food. So, production-ready applications: I feel like if you build an application, it should be production-ready by default. And the reason the qualifier is still necessary nowadays is that a lot of people still don’t think about production when they build machine learning models or run experiments. Maybe things have changed over time, but I would say, when I was interning at Netflix… people have heard about the Netflix Prize, right? 
00:22:38
I was very eager, so I came in as an intern and went to the director of algorithms and was like, “So how is the winning solution doing?” Because it seemed so glamorous, so great. And he was like, “We never actually deployed it.” And the reason is that it’s very hard to deploy, because it was an ensemble algorithm, and even though it could edge out really good performance on a leaderboard, leadership board, leaderboard…
Jon Krohn: 00:23:06
Yeah, leaderboard. 
Chip Huyen: 00:23:06
Too much leadership talk. It’s not really easy to bring to production. So that was my first lesson, and I’ve since seen a lot more of it. Another issue is that models can be brought to production, but they just don’t stay in production for very long. One classic example I’ve seen a lot is that a company brings in a consultant to build an ML model, and it takes maybe six months to build and a few more months to deploy, and it works great in the beginning, but after a few months it just doesn’t work anymore. 
00:23:47
It’s because things have changed. The environment and the business requirements have changed. For example, the company might be, say, a supermarket, and they might bring in someone to build a model for inventory management. After a few months or a year in production, there are a lot of new items, right? What people want to buy has changed over time, there are new trends in the market, so the model just doesn’t predict correctly anymore. Now the supermarket has several options, right? You can either bring in the same consulting firm and pay them a lot of extra money, or you can build something in-house from scratch. I think that’s a painful lesson that a lot of companies have experienced. So when you think about building a model, you think about, first, whether it can be deployed, and second, how to make it continually work in production. That’s where the “iterative” comes from. 
Jon Krohn: 00:24:47
Cool. That was a really great explanation and I love the analogy to food. It’s such a crystal clear way to describe it: people don’t spend all that time in the kitchen to then throw out the food. But that’s what happens in machine learning. Something like 80% of the time a machine learning model is created, or at least a machine learning project is started, and it doesn’t make it into production. Nobody gets to eat the model. 
Chip Huyen: 00:25:12
So I would like to challenge that statistic, because it’s a very hard thing to measure, right? Because the nature of machine learning is experimentation. You experiment with a lot of models, so by nature a lot of them are not gonna make it to production, because you probably want to deploy the best one, not all of them. 
Jon Krohn: 00:25:29
Right. 
Chip Huyen: 00:25:30
Yeah. 
Jon Krohn: 00:25:31
That’s a good point. I guess a way that the food analogy is quite different is that typically we’re making the food one meal at a time, and we only spend like an hour preparing dinner, whereas with a machine learning model you might spend months, or years in some cases. So you wanna make sure that that was time well spent, especially if you have different competing options for your machine learning solution. 
Chip Huyen: 00:26:03
Yeah. So I think for a lot of companies, machine learning is a tool, right? And when you invest in a tool, you want to see a return out of it. It doesn’t matter how fancy the model is: if you can’t really show concrete returns for the business, you’re not gonna get promoted. In a lot of cases we see companies invest in machine learning and then they just don’t really get good returns, and the whole team gets laid off. I think a very painful lesson is the Zillow estimates, Zestimates, right? Are you familiar with that case study? 
Jon Krohn: 00:26:41
Oh yeah. Oh yeah. 
Chip Huyen: 00:26:42
Yeah, so let me recap. They had a competition similar to the Netflix Prize, with like a million-dollar winning prize, and they did a lot of marketing campaigns on it: okay, now we are using the state of the art to predict housing prices really well. And a year later they reported a loss of like 800 million dollars and they laid off, I think, the entire Zillow Offers team or something like that, around 2,000 people. 
Jon Krohn: 00:27:10
Yeah. I guess with that kind of problem, and I don’t know the problem super well, you might know better than me, but perhaps some of the issue was feature drift, where the world changes. The model works perfectly on historical data, but then something like the Covid pandemic happens and the factors that are driving house prices completely change, while you’ve still got the same model weights from pre-pandemic. 
Chip Huyen: 00:27:33
Yeah. So it’s very important to make the model adaptive to the changing environment, right? You need to constantly iterate on the model. I think Covid is one of the extreme examples, but there are a lot of things that can affect a model’s performance in production: trends change over time, or competitors launch new marketing campaigns, or you launch in a new market, or things just change over time. 
Jon Krohn: 00:27:58
Yeah. Recently, in Episode #655, Keith McCormick and I discussed the importance of managing the machine learning lifecycle effectively. To allow you to learn about Keith’s approach to all phases of the lifecycle, he’s kindly making his “Predictive Analytics Essentials” course available for free. All you have to do is follow Keith McCormick on LinkedIn and follow the special hashtag #SDSKeith. The link gives you temporary course access but with plenty of time to finish it. Mastering machine learning project management is just as important as learning algorithms. Check out the hashtag #SDSKeith on LinkedIn to get started right away. 
00:28:37
The world is always changing and our machine learning models aren’t necessarily always learning and adapting, though I suspect a big part of your approach to designing machine learning systems is to have our machine learning models continuously adapting. 
Chip Huyen: 00:28:51
Yeah. I feel like for every application the rate of change is somewhat different, right? Some applications change slowly, but some change a lot faster. So one important thing is to measure how much of a performance hit you’re gonna get if you delay updating the model. Say Facebook in the early days, maybe 10 years ago, when the internet didn’t move so fast: they ran experiments and found that for their click-through prediction model, delaying the model update by a week cost a significant performance hit compared to updating the model every day. So they chose to update the model every day. But that was 10 years ago, and nowadays everything moves even faster. We’ve seen experiments, like LinkedIn found that letting feature staleness go from a minute to an hour would reduce performance by three point something percent for their recommender system. So I think the rate of change very much depends on the application, on where you are and what industry you’re in. 
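To make the point about measuring staleness concrete, here is a minimal illustrative sketch, not from the episode: the DataFrame, column names, and model choice are all assumptions. It backtests a model at different retraining intervals so you can compare how much performance each update cadence gives up.

```python
# Hypothetical backtest: how much accuracy do we lose if we delay model updates?
# Assumes a DataFrame `df` with a "timestamp" column, feature columns, and a "label" column.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def staleness_backtest(df, feature_cols, update_every_days):
    """Retrain every `update_every_days` days, then score on the data that
    arrives before the next retrain (i.e. while the model is getting stale)."""
    df = df.sort_values("timestamp")
    start, end = df["timestamp"].min(), df["timestamp"].max()
    scores = []
    cursor = start + pd.Timedelta(days=update_every_days)
    while cursor < end:
        train = df[df["timestamp"] < cursor]
        test = df[(df["timestamp"] >= cursor) &
                  (df["timestamp"] < cursor + pd.Timedelta(days=update_every_days))]
        if len(train) and len(test) and test["label"].nunique() > 1:
            model = LogisticRegression(max_iter=1000)
            model.fit(train[feature_cols], train["label"])
            preds = model.predict_proba(test[feature_cols])[:, 1]
            scores.append(roc_auc_score(test["label"], preds))
        cursor += pd.Timedelta(days=update_every_days)
    return sum(scores) / len(scores) if scores else float("nan")

# Compare daily vs. weekly retraining; a large gap suggests the model goes stale quickly.
# print(staleness_backtest(df, feature_cols, 1), staleness_backtest(df, feature_cols, 7))
```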
Jon Krohn: 00:30:05
Yeah, it’s wild. I think if you do a Google search right now, or maybe a LinkedIn search, you expect current news to be reflected in that search right away. There’s some major news event happening and you’re like, oh my goodness, I wanna learn all about this. I think people don’t appreciate how complex that is and the amount of compute that has to go into constantly scouring the internet and looking for updates so that your Google search reflects current news. It’s crazy. So these kinds of machine learning systems feel like they work magically, but obviously they aren’t magic; it’s just machine learning engineering happening under the hood of these different platforms. So what are the many parts that make a machine learning system feel like it works so magically, Chip?
Chip Huyen: 00:30:59
Okay, so I really like a quote by Arthur C. Clarke that says any sufficiently complex technology is indistinguishable from magic. Just because we don’t understand it, it seems like magic to us, right? And I do think there’s a lot about machine learning that is hard to understand. There’s a shift from traditional software to machine learning; I think it has been talked about a lot, so I hope you don’t mind me repeating it. For traditional software, usually you take the input and you specify the transformation, like a function: okay, take the input, multiply by three, add something, and so on. You do exact calculations and you get the output. Whereas for machine learning, you do it the other way around: you have the input and the output, and you tell the algorithm, hey, figure out the function. 
00:31:52
So it can produce the output from the input, and because of that, we don’t really understand the function that the machine learning algorithm comes up with. And it’s not a small function, right? A model has many, many parameters, billions of them in certain models nowadays. So it’s really hard for us to just look at all those numbers and understand exactly what’s going on. There’s a whole industry just trying to understand how exactly these machine learning systems make predictions. In some industries, being able to understand how a machine learning algorithm makes decisions, the explainability factor, is important, so they tend to gear towards simpler models with fewer parameters that are easier to explain. On the other end of the spectrum, there are use cases with incredibly large models where they care more about accuracy and the output than explainability. So it really depends on your use case: how much magic do you want to tolerate? Yeah. 
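As a rough illustration of the contrast described here (a hypothetical sketch, not code from the episode), traditional software specifies the transformation by hand, while machine learning is handed input-output pairs and asked to fit the function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional software: the human specifies the transformation explicitly.
def price_with_tax(price: float) -> float:
    return price * 1.1 + 2.0          # "multiply by something and add something"

# Machine learning: we only supply inputs and outputs and ask the algorithm
# to figure out the function that maps one to the other.
X = np.array([[10.0], [20.0], [30.0], [40.0]])        # inputs
y = np.array([price_with_tax(x) for x in X.ravel()])  # observed outputs
model = LinearRegression().fit(X, y)

print(model.predict([[25.0]]))        # ~29.5, recovered without being told the rule
print(model.coef_, model.intercept_)  # the learned (approximate) transformation
```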
Jon Krohn: 00:33:07
Right, yeah. This is often a trade-off today, right? We could say, okay, let’s go with the regression model that has maybe slightly less accuracy, but we understand exactly how each weight contributes to the outcome of the model. Whereas on the other side of the spectrum, like you’re saying, we could have some deep learning architecture with billions of model parameters, and even if we use explainable AI techniques to try to understand how some model weights contribute in some particular circumstances, there are so many factors that it will never be fully explainable, but it might have slightly higher accuracy on some tasks. 
Chip Huyen: 00:33:46
So I do think the requirements for explainability will change over time, depending on the level of trust we can build with users, right? An interesting example is microwaves: we have such a high level of trust in microwaves that we are okay putting food inside and using them without understanding exactly how they work. I don’t think many people understand how microwaves work. So I do think in the future, if we have a lot more trust in AI, a lot of applications will open up, because then we can be comfortable pushing really complex models with really good performance without having to explain to people how they work underneath the hood. But I feel like we’re still a little bit far from there. 
Jon Krohn: 00:34:30
Yeah. Well, something cool that’s starting to happen more and more today is these model cards. Have you come across those? 
Chip Huyen: 00:34:36
Yeah. 
Jon Krohn: 00:34:36
They explain where a model is strong or where it might have issues. So for example, ChatGPT is obviously really incredible at surfacing some answers, but it is capable of very confidently lying. So you shouldn’t be making an important business decision based on some ChatGPT output. 
Chip Huyen: 00:34:59
Now you’re talking about the Google Bard advertisement. 
Jon Krohn: 00:35:03
Yeah, I, yeah, that’s, yeah, go ahead. 
Chip Huyen: 00:35:07
I do think there are certain aspects of machine learning outputs that make it harder to bring things into production, but I don’t think they are unsolvable. First of all, one nature of ML algorithms is that they are predictive algorithms, right? Their answers are approximations. They’re not exact computations, where you know the answer will be exact. 
Jon Krohn: 00:35:29
Exactly. 
Chip Huyen: 00:35:29
They’re approximations. And if you treat approximations like exact computations, then you’re in for a world of trouble. 
Jon Krohn: 00:35:36
Exactly. And so people at my own company, not on my data science team, but more business-minded people, they’re like, “How long is it gonna be, Jon, before we can just have the GPT-3 API in our platform, or the ChatGPT API when it comes out, and we don’t have to worry about accuracy?” And I’m like, “I don’t know. It’s gonna be a really long time, because every single token, every single word that it’s outputting to you, it’s doing that probabilistically.” 
Chip Huyen: 00:36:08
Yeah. I think people have come up with designs, UX designs, to make it work. For example, because the outputs are approximate, you can generate a lot of them and let the user choose the best one. So instead of entire automation it’s more like augmentation. Instead of showing the model’s response directly to users, you can show users three different options and they can choose the best one, or they can edit on top of it. Maybe it’s hard to write the entire email or the entire response from scratch, but if you can just change it a little, it can speed up productivity a lot. And we have seen quite some interesting cases; for example, we know that GPT can also generate code, right?
00:36:51
One of my friends has a startup that lets small business owners build their own website without knowing how to write code: you just describe the website you want as the input and it outputs the code. A lot of times the code doesn’t work perfectly, right? But they can show users, like, 20 different options for what the code would render into as a real website, and they choose the best one and then iterate on top of that. So, like, [inaudible 00:37:04], something similar. I think that could be a very interesting way of interacting with things like approximate ML results.
Jon Krohn: 00:37:32
For sure. Yeah, I think augmentation is the key there: using these kinds of models for augmenting your creativity in particular tasks, as opposed to trying to get a definite answer on something. So, related to content from your book, we’ve been talking for a while now about this kind of magic thing. 
Chip Huyen: 00:37:54
Yeah. Yeah. 
Jon Krohn: 00:37:54
But another great topic area that you cover in your book that I’d love you to tell our audience about is: how do you align business intent, context, and metrics with a machine learning project? 
Chip Huyen: 00:38:09
Yeah. So it’s pretty funny. We talked about how it doesn’t matter how fancy some machine learning models are: if they don’t solve business problems, they’re not gonna be used, right? So one thing I keep telling my students is to focus on the problems and not the techniques, right? A technique is just a tool to solve the problem; be more solutions oriented than technology oriented. For example, one actual problem we ran into: a company wants to increase the conversion rate from visitor to registered account, right? Somebody new visits their website and they want that visitor to sign up and create an account. So the first thing you do shouldn’t be, “Hey, I heard about this magical, fancy machine learning model, let’s just use it to solve this problem.”
00:39:09
The first thing you should do is understand why people are not signing up and then try to come up with a solution for that. And a lot of times the solution doesn’t need to be machine learning. For us, we identified three different challenges for why conversion wasn’t happening. The first one was that the conversion rate was very low for people with smaller screens. We checked it out and realized we didn’t do enough testing on the UI side: on smaller screens the button was half hidden and pretty hard to click on. So it was not a machine learning problem at all; it was a very easy fix, and it increased the business metric by a good chunk. Another was that we saw a lot of people started the sign-up process but did not complete it, and the reason is that the process was pretty long. It’s not a UI challenge, it’s a UX challenge: we needed to shorten the process. And now we have a trade-off, right? Because we want to get more information from users so we can give them better services, but at the same time, if we make it too long, people are gonna drop out. Or if we make it too easy, another problem can arise: there’ll be a lot of bots. We want the process to make sure they’re real users, because if people can just sign up without phone verification or whatsoever, they might be bots. 
00:40:28
And we see this challenge a lot. For example, one of the companies that I work with gives out promotions for signing up: you sign up, you can claim a certain discount. So if everyone can just sign up without sufficient verification, then you get promotion abuse, and it could cost them, I think, tens of millions of dollars a year. Which is clearly a trade-off, right? You want to increase sign-ups, but at the same time you don’t want to bleed money. And only the last solution really involves machine learning: maybe people don’t find content relevant to them, so they just leave without signing up. And even that you don’t have to solve with machine learning in the beginning; you can just show the popular content, and then after that you can make it more complex. Okay, now let’s do more personalization, right? Recommendations for each user, and the first iteration could be more content-based, and the next one could be more like session-based recommendations. So there are many different ways to solve a business challenge, and I don’t think people should bring out the big guns when, I dunno, a cheese knife can solve it. 
Jon Krohn: 00:41:36
Right. Right, right, right, right. 
Chip Huyen: 00:41:38
Yeah. 
Jon Krohn: 00:41:38
Cool. So that helps us understand how important it is for our business to be aligned with a particular machine learning problem. A related question from your book: your book devotes a chapter to training data and some important subtopics like sampling, labeling, class imbalance, and data augmentation. So what are some of the lessons that you’ve learned about training data in a business context? 
Chip Huyen: 00:42:08
Training data is everything. It’s really interesting that you ask these questions [inaudible 00:42:16] at all. Okay, so I love this story, so I wanna tell it. So there was the AlexNet paper in 2012.
Jon Krohn: 00:42:26
Oh, of course I talk about it all the time.
Chip Huyen: 00:42:27
Yes. It’s a very influential paper, right? If you look at the most influential papers of the last decade, I think AlexNet is very high up. There’s one line in that paper that everyone pretty much just ignores, and that line is something along the lines of: all our experiments suggest that we should just wait for more compute and more data to make better progress, right? Those two things, compute and data, pretty much set the tone for the next decade. A lot of the progress we’ve seen since has come from more compute and more data. And the funny thing is, the second author of the AlexNet paper, do you know who that person is?
Jon Krohn: 00:43:15
Well, so Alex Krizhevsky is the first and then the second is Ilya Sutskever, right? 
Chip Huyen: 00:43:20
Yes. Do you know who Ilya is?
Jon Krohn: 00:43:22
I mean, not like personally, I know that he went to work at OpenAI afterward. 
Chip Huyen: 00:43:29
Yes. He’s a co-founder of OpenAI. So I feel like if you look back, it was pretty much laid out 10 years ago. A lot of people feel like, oh wow, ChatGPT is such a shock to everyone, but I do think a lot of people saw it coming, right? More compute, more data. And for a long time, I just want to point out, OpenAI did not get papers published, because for a while the research community was like, oh, scaling up models, throwing more compute and more data at them: yes, of course it brings good results, but it’s not enough novelty, it’s not a sufficiently interesting new technique to be published. And a lot of the time people argue, I think as humans we like to think that progress comes from intelligent human design, right? We don’t want to think that these brute-force solutions are gonna be the best way forward. And Richard Sutton has a very interesting blog post, a very short one, “The Bitter Lesson.” It looks like you know what I’m talking about.
Jon Krohn: 00:44:31
I don’t know the blog post, I know Richard Sutton. Again, not personally. 
Chip Huyen: 00:44:39
Well you might, maybe you should bring him on the podcast. 
Jon Krohn: 00:44:41
I would love to. 
Chip Huyen: 00:44:42
He’s amazing. Yeah. Yeah. 
Jon Krohn: 00:44:43
I agree. 
Chip Huyen: 00:44:44
So the bitter lesson is that there are two approaches, right? He noticed that over the last 70 years of AI research, a lot of progress came from more compute and more data, and the other approach is intelligent human design. And of course in theory you can pursue both, but time spent on one approach is gonna take away time from the other. So a lot of people who believed in intelligent human design learned the lesson: they’re not gonna be able to make as much progress as the people who chose the other approach and [inaudible 00:45:24]. 
Jon Krohn: 00:45:24
Yeah. And that’s actually something that we have seen a huge amount of in recent years. Like, okay, the transformer architecture is a novel way of setting up your deep learning architecture. But then this idea of just scaling it up, adding orders of magnitude more model parameters: yes, it’s not intelligent human design, it’s just brute force, but it works unbelievably well. And GPT-3 then has all these emergent capabilities that we were surprised by, and I’m sure GPT-4 is gonna have even more. And a similar kind of thing happened in my PhD, which was in genomics, applying machine learning to genetic data. It was the same kind of idea: let’s get orders of magnitude more participants and get their genetic information, and that approach alone, that brute force alone, will yield more interesting genes for a particular trait. 
Chip Huyen: 00:46:17
To be clear, to be fair, brute-forcing is not easy. It’s an incredible engineering challenge. 
Jon Krohn: 00:46:24
It’s a big engineering problem.
Chip Huyen: 00:46:26
And a coordination challenge. And it’s very hard. It’s just that people think of it as more of an engineering challenge than a research challenge. But also, when you start operating at a scale unprecedented before, a lot of new problems arise that you actually need novel techniques to overcome. So I do think it’s very hard, it’s not easy at all. I was just reading this book this morning before our call, a book called Chip War, and I saw a lot of similarity. 
Jon Krohn: 00:46:56
They’re fighting over you? 
Chip Huyen: 00:46:58
Huh?
Jon Krohn: 00:47:00
They’re fighting over you?
Chip Huyen: 00:47:02
Well, I’m not sure who is fighting over me. That’s gonna be fun. But it’s about how a lot of the progress in integrated [inaudible 00:47:05] is just an engineering challenge: you need to pack a lot of transistors onto the same board, and it’s incredibly hard. 
Jon Krohn: 00:47:18
Yeah. I think with these kinds of models, and I’m probably gonna mispronounce this, but one of the biggest NLP models around today is Wu Dao 2.0, which is Chinese. And GPT-4, I think, is gonna be one of the biggest ones when it comes out. Even just the number of chips required for training these models, just acquiring that many chips, is crazy. 
Chip Huyen: 00:47:42
Wow. They’re taking away all the chips that the crypto miners give up on, so… 
Jon Krohn: 00:47:47
Yeah, it’s a good thing. The Bitcoin price being down is good for being able to train AI models relatively more affordably. 
Chip Huyen: 00:47:57
So anyway, I want to go back to training data. 
Jon Krohn: 00:48:00
Yes. 
Chip Huyen: 00:48:00
You have a lot of [inaudible 00:48:01] on it, right? So I do believe in it a lot. It’s really funny, because when I graduated I thought, okay, hardware is the way to go, hence NVIDIA. And then I was like, okay, with the next company it’s training data, which is when I joined Snorkel, which does weak supervision to create more training data. So I do think that as models get bigger nowadays, not many people can afford to train them from scratch, so we see a lot of foundation models. So what actually is a competitive advantage for companies could be training data: how to get data that only they can get their hands on. And that is big. 
Jon Krohn: 00:48:46
Yeah, for sure. So with training data being so important, what kinds of data preparation methods or techniques could our listeners be employing that are underutilized out there? 
Chip Huyen: 00:49:07
So I do think that we need to change the way we look at data. It’s a very important thing. AlexNet was trained on ImageNet, right? ImageNet was created by a lot of people labeling images. On Amazon Mechanical Turk [inaudible 00:49:24], how much was it, I think something like 2 cents per image labeled. So traditional annotation, dealing with training data, is considered low-skill labor: if you label a lot of images per hour, super fast, you probably make around minimum wage. But nowadays, you can think of labeling like a teacher and a kid. The kid does some homework, and the teacher looks at it and says, okay, this is good, this is bad, here’s how to improve it, right? 
00:50:02
If you want the kid to be really smart, you want to send them to good teachers. And similarly, if you want the AI to be smart, you want really smart people to annotate, right? A lot of annotation is easy to do, like whether a certain object is in an image or not, right? But a lot of annotation is really hard to do. Look at a ChatGPT response: how do you know that it’s a good response, or how can you make it even better? So you need smart people. How do you change the perception of data annotation so that smart people want to do it? 
Jon Krohn: 00:50:43
Yeah, this is definitely a big challenge. So for example, maybe you want to train a model on a radiology dataset: you’re going to need to pay radiologists to label your images, as opposed to Mechanical Turk workers. It’s gonna be orders of magnitude more expensive to have those experts do it, but if you can pay them to do it, you’re going to end up with a really valuable dataset. 
Chip Huyen: 00:51:07
Yeah. So I think that’s one of the main use cases of Snorkel, and I thought it was very interesting: weak supervision, this idea of how to encode domain expertise so that it can be reused. For example, a domain expert like a doctor can look at a note and say, hey, this note mentions something like pneumonia or some dangerous disease, so it should be promoted to urgent. You can encode that: if the doctor’s note contains certain keywords, give it the weak label, or noisy label, of being urgent. And you don’t have just one rule; if you have, say, a thousand rules like that, you can resolve the conflicts by looking at the noisy labels each simple rule gives, which can help you generate some pretty accurate labels. 
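To give a flavor of the weak-supervision idea described here, the following is a plain-Python sketch of keyword rules plus a simple vote. It mimics the concept rather than Snorkel’s actual API, and the keywords, labels, and example notes are made up:

```python
# Minimal weak-supervision sketch: keyword rules emit noisy labels, a vote resolves conflicts.
URGENT, NOT_URGENT, ABSTAIN = 1, 0, -1

def lf_mentions_pneumonia(note: str) -> int:
    return URGENT if "pneumonia" in note.lower() else ABSTAIN

def lf_mentions_sepsis(note: str) -> int:
    return URGENT if "sepsis" in note.lower() else ABSTAIN

def lf_routine_checkup(note: str) -> int:
    return NOT_URGENT if "routine checkup" in note.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_pneumonia, lf_mentions_sepsis, lf_routine_checkup]

def weak_label(note: str) -> int:
    """Majority vote over the rules that fire. Real systems (e.g. Snorkel's label model)
    instead weight each rule by its estimated accuracy and correlation with other rules."""
    votes = [v for v in (lf(note) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return URGENT if votes.count(URGENT) >= votes.count(NOT_URGENT) else NOT_URGENT

notes = ["Suspected pneumonia, fever 39C", "Routine checkup, no complaints"]
print([weak_label(n) for n in notes])  # [1, 0] -> noisy training labels for a classifier
```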
Jon Krohn: 00:52:08
Cool. Yeah. So that’s very cool. Tools like Snorkel can be used to encode domain expertise into our training data. We had a guest on not too long ago named Shayan Mohanty who talked a lot about this. You know him? 
Chip Huyen: 00:52:25
No, I do not know him. 
Jon Krohn: 00:52:27
In episode 635, he talked a lot about problems with training data and even how these Mechanical Turk kinds of systems are quite exploitative. So we can use cleverer technology like Snorkel, or in his case Watchful, which is the name of his company, to create higher-value data without necessarily that exploitation of people. 
Chip Huyen: 00:52:52
Yeah, true. 
Jon Krohn: 00:52:52
Nice. Related topic. Once you have your training data, for some kinds of models, like some deep learning models, we put our data in more or less raw and don't engineer features out of it. But for a lot of model types and a lot of problems, engineering features out of our data is critical for the machine learning model to work well. You have a chapter devoted to feature engineering in your book. What's the secret, Chip, to engineering good features? 
Chip Huyen: 00:53:25
Oof. How much time do we have? I think feature engineering is pretty hard. I would say, first, look at the type of use case. There are certain use cases where a lot of feature engineering can be automated, especially use cases with text or images, where most of the data comes from maybe a blob of text or a big image that you can feed into a model to get embeddings. That type of feature engineering is pretty much automated. The challenge with feature engineering is how to engineer features from more tabular data. For example, for fraud detection, you have data coming from different sources. A lot of it is numerical, a lot of it isn't, and of course there's some text as well. 
00:54:22
Some people treat the entire transaction as a blob of text that you can feed into a model to get an embedding for that transaction. Another approach is to look at users' activities, what they call behavioral features, to make predictions. A very simple example: say you as an account holder have been making very few transactions, maybe one a day on average over the last six months, and suddenly in the last 10 minutes you've done a hundred transactions. That's very suspicious. So the number of transactions you made in the last 10 minutes is a feature, or the average amount you have spent per day over the last six months is a feature. Those are the kinds of features you usually need to think about, they usually require domain expertise to come up with, and they can blow up really, really quickly. It's pretty common in the companies we work with to see thousands of features, and we've talked to a few companies where the majority of their cost comes from feature computation. 
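As a rough illustration of the behavioral features Chip mentions, here is a sketch using pandas time-based rolling windows over a hypothetical transaction log for a single account (in practice you would compute these per account and at far larger scale); the column names, windows, and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical transaction log for one account; in practice you would group by account_id.
tx = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-01-05 10:00", "2023-02-20 09:30", "2023-04-11 14:00",
        "2023-06-30 17:58", "2023-06-30 18:02", "2023-06-30 18:05",
    ]),
    "amount": [20.0, 35.0, 15.0, 500.0, 480.0, 470.0],
}).set_index("ts").sort_index()

# Feature 1: number of transactions in the trailing 10 minutes.
tx["tx_count_10min"] = tx["amount"].rolling("10min").count()

# Feature 2: average transaction amount over the trailing 180 days (roughly six months).
tx["avg_amount_180d"] = tx["amount"].rolling("180D").mean()

print(tx)  # the burst of large transactions on June 30 stands out on both features
```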
Jon Krohn: 00:55:48
Right? From feature computation.
Chip Huyen: 00:55:50
Yeah. The traditional thinking in feature engineering is that data scientists try to create as many features as possible, because usually having more features gives somewhat better performance. But at the same time, we work in the real-time machine learning space, which means we want to help companies leverage fresh data to make better, more accurate decisions, right? Take fraud predictions: of course you can do slow predictions. For example, somebody steals a card, you observe them for 30 minutes and then make a prediction. But in those 30 minutes the hacker, the scammer, can cause a lot of damage. We've often seen cases where people say that if they can just reduce that time from seconds to milliseconds, it can save them a lot of money.
00:56:40
Another use case is email account takeover for a business account. If somebody gets hold of an uber.com email or an apple.com email, they can pretend to be that person and send malicious links to the rest of the company, which can cause a lot of damage. So if you can catch it when the person logs in and immediately say, wait a second, two minutes ago you were logging in from California and now you're logging in from, I don't know, some very random island very far away, then maybe it's suspicious. If you can leverage information very fast, you can save a lot of headache and get more accuracy, because the model can use fresher data to make more accurate predictions about what's happening right now. However, when you do things at that scale and speed, you can end up spending a lot of money. We saw some companies, not even that big, doing things very inefficiently: they just store everything in memory, like data from the last three months, and it ended up costing them millions, or, it depends on the scale, I don't want to just throw a number out there, but a lot of money a month or a year. And we saw that we could come in, optimize that process, and reduce the cost not by 20% or 30%, but by 10x.
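The account-takeover signal Chip describes is essentially an "impossible travel" check: was this login too far away, too soon, relative to the previous one? Here is a minimal sketch, with made-up coordinates and an illustrative speed threshold:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(prev_login, new_login, max_speed_kmh=900.0):
    """Flag a login if the implied travel speed from the previous login is implausible.
    Inputs are dicts with 'lat', 'lon', 'ts' (seconds since epoch); the threshold is illustrative."""
    dist_km = haversine_km(prev_login["lat"], prev_login["lon"],
                           new_login["lat"], new_login["lon"])
    hours = max((new_login["ts"] - prev_login["ts"]) / 3600.0, 1e-6)
    return dist_km / hours > max_speed_kmh

# Two minutes apart, California to a point on the other side of the Pacific: clearly suspicious.
prev = {"lat": 37.77, "lon": -122.42, "ts": 1_678_000_000}
new = {"lat": -8.34, "lon": 115.09, "ts": 1_678_000_120}
print(impossible_travel(prev, new))  # True
```

In a real system this would be one feature among many, computed against the freshest login events rather than a periodic batch snapshot.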
Jon Krohn: 00:58:07
Yeah. So let’s talk about your company and this realtime machine learning thing in more detail. So your startup is called Claypot AI, you talked about that earlier. How it’s, how does that relate to food? Claypot is like for cooking food. 
Chip Huyen: 00:58:18
Food, yes. Have you never eaten food from a clay pot? When you come to Vietnam, sorry, when you come to the Bay, I will take you. 
Jon Krohn: 00:58:28
Nice, now you have to take me to Vietnam. You already offered. 
Chip Huyen: 00:58:31
Okay. Okay. Tomorrow. Yeah. 
Jon Krohn: 00:58:33
Nice. So Claypot AI is developing a platform for real-time machine learning, and you've alluded to what real-time machine learning is. It aims to make online prediction more accessible. So tell us, what is online prediction? How is it different from the more traditional batch prediction? And what are the challenges you're tackling at Claypot? 
Chip Huyen: 00:58:57
"Online" is pretty much an overused term; online can mean many things, and I think the meaning has changed over time as well. I would say it's about [inaudible 00:59:06] helping companies be more reactive, right? Batch prediction usually has some scheduled batch job that runs maybe every day or every hour, and a lot of use cases don't need online predictions. If you're doing something like churn prediction, predicting which users are going to cancel their subscription, you can probably do that once a day, right? Look at all the users and see, hey, which ones are likely to cancel in the next week or so, so you can offer them some discount or benefit to pull them back in. 
00:59:48
But for use cases like recommender systems, when people go on your website, you want to be able to leverage information about those users to make relevant recommendations that keep them there, right? You don't want to wait until the next day, or four hours later after they've already left, because then what are you going to use the recommendations for? Another big use case is things like dynamic pricing. You want to come up with the best price to show users, and I think it's very big on Amazon. On Amazon, one product might have multiple sellers, and only a few sellers make it into the buy box, right?
01:00:33
So as a seller, you want to customize the price so that you get into the buy box. That could be a big use case. Or Uber and Lyft use dynamic pricing as a way to regulate supply and demand: if right now there are not many drivers on the road, they might want to increase the surge charge to get more drivers on the road, and vice versa. So there are a lot of use cases where it just makes a lot more sense to make predictions based on what's going on right now instead of doing it every four hours or every day. Another very big benefit of online predictions is that they can save cost. 
01:01:22
Say you run a delivery app; I think GrubHub has published the statistics, but the idea is that if you have a lot of users and every day you predict what restaurants each user is going to order from today, you're going to waste a lot of compute. The reason is that not all of those users are going to log into the app and order. For GrubHub, I think only about 2% of users order on a given day. So if they generate predictions for all of their users, then 98% of the predictions are wasted, which costs [inaudible 01:02:04] compute. And there's also delay, because if it takes, say, six hours to generate those predictions, then they can only refresh them every six hours. 
Jon Krohn: 01:02:12
Cool. So real-time machine learning enables people, in some circumstances like fraud detection, to save a lot of money, and it makes it a no-brainer to be using a real-time machine learning platform like Claypot. So what's Claypot's angle? How does Claypot enable real-time machine learning in a way that people might not be able to do on their own?
Chip Huyen: 01:02:37
First, let me go into why it's so hard for companies to do this. One thing we've realized is the cost aspect, right? Going from [inaudible 01:02:50] can cost a lot more, both in upfront infrastructure investments and in operational cost, whether that's good people to do the job or infrastructure cost like storage and computation. So we look at how to simplify that as much as possible and make it as cheap as possible. We spend a ton of time on optimization to reduce the cost significantly, and it's crazy to look at the benchmarks, like, whoa, companies are wasting a lot of money on this. And the second is usability. As a data scientist, do you use pandas?
Jon Krohn: 01:03:28
Of course. 
Chip Huyen: 01:03:29
Yeah. So, so it’s like, it’s very, when assignments go with batch features, it’s pretty easy to experiment with a batch feature, right? You just like get into our CSV file or some of our like table and you just like experiment with some new features and train a model on it. And if it works then it’s great. But like when, when you switch to like streaming having more as a time sensitive feature, for example, like the number of transactions you’ve made in the last three minutes or the number of views this product has had the last 30 hours, like the last 34 hours [inaudible01:04:00] more time sensitive. Once you like [inaudible 01:04:04], go back in time and like get it like it’s correct a certain point in time. So, so say like if you want to make this operation happened at 10:30 AM yesterday, you don’t want your use of feature just what’s available as like 10:31 AM right? 
01:04:19
Because it’s might be like picking into the future. So so, or like when we do compute feature in productions, you gonna be connecting to like data sources like a streaming topic, like a Kafka topic, a Kinesis topic, and it’s just not that easy to, to like for data scientist to like use them and like experiment on top of them. So, so what we do is, is like we focus on how we can make data science, it’s like experiment with streaming features as easy as how the good experiment with batch features because we have seen in some companies like, hey, it’s a process for the machine learning experiment and deploy a streaming feature would take like two months or like a quarter. And in the meantime it was like, okay, this is so much pain for me, what if I just use like tent batch features instead? So like it’s, it’s, so the difficulty in usability makes users not want to do streaming features, even if streaming features are like bring a lot better return on investment. So, so yeah. So that’s what we do. Like first we do optimizations and second we increase usability. 
Jon Krohn: 01:05:25
Nice. Yeah, so at Claypot you're spending your R&D time figuring out how your clients can cut down on their operational costs and do real-time machine learning cost-effectively. 
Chip Huyen: 01:05:38
We want to make it just as easy as using pandas. 
Jon Krohn: 01:05:43
Nice. So sounds like an amazing company. Obviously working for you would be amazing. Are you doing any hiring right now? 
Chip Huyen: 01:05:54
Ooh, aggressively. We are looking for very strong engineers who are interested in streaming, solid engineers, because we do build things at scale. A lot of our customers operate at large scale, and when you do things at scale, you want them to be reliable. So we want people with good engineering practices, good engineering craftsmanship, as our CTO says, yeah. 
Jon Krohn: 01:06:24
Engineering craftsmanship. Cool. Sounds like an amazing opportunity. And for all of our listeners out there who would like to be as productive as you, running your own company, until recently teaching your Stanford course, writing your book, and being a huge contributor to the data science community with your posts on social media platforms and GitHub and so on... 
Chip Huyen: 01:06:49
Oh, we have a group on Discord. It's 14,000 members now. It's pretty big. 
Jon Krohn: 01:06:55
Oh yeah? So you’ll have to give me a link to that so I can make sure we have it in the show notes. And so yeah, yet another thing that you need to be on top of. So I think that you might have a really interesting answer to this kind of productivity tips question because before we started recording so if people are watching the YouTube version of this, you might be able to tell that Chip is standing and prior to recording I had to ask her to stop walking because she’s on a treadmill, and so I could hear kind of the treadmill noises as and as I was like we’re not gonna be able to do that when we’re recording. So clearly you’re getting some cardiovascular work in while you do calls. What other kinds of productivity tips do you have for us Chip?
Chip Huyen: 01:07:51
Okay, so I don't think I'm that productive, because I have so much admiration for people who both work full-time and have kids. Having kids just takes so much time, and I feel like... 
Jon Krohn: 01:08:10
That’s a good productivity tip. Don’t have kids.
Chip Huyen: 01:08:12
Okay. I feel like people might take it the wrong way if I say that, but I do think I'm able to focus on work because I don't have a lot of distractions outside of work. So one productivity tip is just being a little bit more disciplined: understanding what causes me to not be productive and trying to reduce that time. For example, one reason I like working remote a lot is that I don't have to commute from one place to another, and I do think commuting is a big time sucker. I also pay attention to my energy level. Certain conversations give me a lot of energy to do more productive work, but some conversations drain me. So I look at what kinds of conversations make me happier or more energized and do more of those, and look at the conversations that drain my energy and do less of those. 
Jon Krohn: 01:09:13
That is such a great tip, and all of those kinds of things are in my productivity arsenal as well. I would like to have kids someday, but I'm like, man, that is going to be a productivity hit. 
Chip Huyen: 01:09:25
Wow, but life is not all about productivity, you know? 
Jon Krohn: 01:09:28
I know. So I’ve read and yeah, avoiding a commute, this one is huge. And yeah, I love this idea. One that I hadn’t maybe thought of actively, but engaging more in conversations that leave you feeling energized as opposed to draining your energy. I think that that is really key. Nice. So I hope that this conversation today has been one of your energizing ones in the day, but I also know that I’ve eaten up a lot of your precious time today and then you urgently need to get going. So we usually ask for a book recommendation, maybe you can just throw out the words the name of the book, but the key piece of information that our listeners need to go on from this episode is how they can be following you after the show. 
Chip Huyen: 01:10:13
I do like to read about how really smart people think. I actually recently shared on LinkedIn a list of books that I learned a ton from last year. One of them, I have it right here, is Complex Adaptive Systems. It's a very good book about systems thinking. For example, how do you set incentives so that every individual can focus on their own interest but at the same time make progress for the whole organization? I think that is really, really hard to do, and it's a very interesting book that explains it. I like some other books too, let me look at the bookshelf here. How Not to Be Wrong: The Power of Mathematical Thinking. 
Jon Krohn: 01:10:57
Oh yeah. 
Chip Huyen: 01:10:57
Uncommon Genius is very interesting. The author studies a lot of MacArthur grant recipients to see what helps them be creative, because in hindsight it's very easy to see which ideas succeeded, right? But in the [inaudible 01:11:14] of it, having worked on something for years without recognition, how do you gain the conviction that it's the right thing to do and keep going? When do you give up, and when do you keep being perseverant? I think that's a very hard question. Another very interesting book I like is Fads and Fallacies by Martin Gardner, another mathematician. He talks about different fads and fallacies and how people fell for them. It was written about 70 years ago, but I feel it's so relevant today, like, why do people fall for anti-vax or for flat-Earth ideas when there is so much strong evidence against them, right?
01:11:56
So it’s, it’s really good book and very interesting book. He’s, he’s a real funny person as well. Another book and I really like is From One to Zero, which is not a misspell for people keep asking me do you mean like zero to one? No, no, no, no From One to Zero. So it’s about like how different cultures like arrive at like number systems. So actually it’s a really big mental jump mental shift from having the numbers that scale from 1, 2, 3 to like having the concept of zero nothing. So I think it’s a very interesting book. So yes, a lot of books like that. 
Jon Krohn: 01:12:26
Nice. All right. And if people want more book recommendations from you in the future, and lots of other kinds of guidance, especially related to real-time ML systems and that kind of thing, how should they follow you? I know you have a course coming up on a platform called Sphere, yeah? 
Chip Huyen: 01:12:42
Sphere, yeah. The information is on my LinkedIn. I'm pretty active on LinkedIn now, also on Twitter, and I'm very active on Discord. It's called the MLOps Discord, and you can also find the information on my LinkedIn and on my website. I really like the Discord community because it's a way to discuss a lot of things about MLOps, and it's not just my perspective. There are many members who are very, very helpful; you can ask questions and you'll get a lot of really helpful answers. I usually go there when I'm thinking about something, and I usually get some good perspectives on it. 
Jon Krohn: 01:13:20
Nice. Clearly a lot of people enjoy your perspectives, with, as you said, 14,000 people in the Discord channel, and you have 158,000 people following you on LinkedIn at the time of recording. I'm sure it will be many more by the time this episode is published. So Chip, thank you so much for taking the time to be on the SuperDataScience Podcast. I understand it's only your second-ever podcast appearance, so we're delighted, and there are so many more questions I had for you. We'll have to bring you on again some time in the future and hear how things are coming along at Claypot. 
Chip Huyen: 01:13:52
I would love to. Maybe you can bribe me with food and I would come back. Thank you so much for making time for me, and thank you everyone for listening. Feel free to ask me any questions about the things we discussed here, and have a nice day. Bye-bye. 
Jon Krohn: 01:14:15
Well now you know why Chip is so enormously popular. She's unbelievably intelligent and also very fun. In today's episode, Chip filled us in on how most machine learning models never make it to production, though she's bent on fixing this; how some kinds of ML models need their weights updated in production at a much higher frequency than others; and how orders of magnitude more parameters have yielded stunning results like we've experienced recently with ChatGPT, though that corresponds to overcoming ML engineering challenges more than to data science ingenuity. She talked about how domain expertise can be encoded in training data labels to create higher-value data, how online learning involves ML models learning in real time from individual training points as opposed to batches of data, and how she ratchets up her productivity by engaging in more energizing conversations and fewer draining ones. 
01:15:04
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Chip’s social media profiles, as well as my own social media profiles at www.superdatascience.com/661. That’s www.superdatascience.com/661. If you enjoyed this episode, I greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. And of course, subscribe if you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another magnificent episode for us today. 
01:15:55
For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors whom I’ve hand selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get all the details on how by making your way to jonkrohn.com/podcast. And thanks of course to you for listening. It’s only because you listen that I’m here. Until next time my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 