Kirill Eremenko: 00:00:00
This is episode number 385 with Lead Data Scientist, Scott Clendaniel.
Kirill Eremenko: 00:00:12
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. Each week, we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Kirill Eremenko: 00:00:44
Hello everybody, and welcome back to the SuperDataScience podcast. Super excited to have you back here on the show. Today, I’ve got a super amazing treat for all of us. I just got off the call with Scott Clendaniel. Scott is a lead data scientist at Franklin Templeton. He has a huge amount of experience in this space of data science and machine learning, and he’s always happy to give back to the community. This podcast is going to be an advanced podcast. It’s specifically going to be useful to you if you are an intermediate practitioner in data science or an advanced practitioner in data science.
Kirill Eremenko: 00:01:18
You’re interested in things like models, cross-validation, oversampling and things like that. With that said, here are some of the topics that you’ll hear about today. You’ll hear about Scott’s story and how he got into the space of data science. We’ll talk about fraud detection because it’s a big part of the financial services industry. We’ll talk about some specific examples of ways to detect fraud, including Benford’s law. We’ll talk about oversampling the minority class, the multiplicity of good models, and the tools that Scott uses on a daily basis.
Kirill Eremenko: 00:01:51
We’ll talk about data preparation techniques. Specifically, we’ll talk about target mean encoding and one-hot encoding, what they are, which one is better, and why. Then we’ll talk about model drift, and we’ll discuss why models decay over time. Scott will give his recommendations on how often he checks up on models. We’ll talk about population stability reports. We’ll talk about some real world examples, cross-validation. Then we’ll cover off Scott’s advice on some of the softer skills like data science leadership, what it means to manage data science teams, how to best structure a data science team, the hub and spoke model for that.
Kirill Eremenko: 00:02:28
We’ll get Scott’s ideas and visions for what is coming for data science in the future. A very exciting podcast coming up ahead, I can’t wait for you to check it out. Without further ado, I bring to you Scott Clendaniel, lead data scientist at Franklin Templeton.
Kirill Eremenko: 00:02:51
Welcome back to the SuperDataScience podcast, everybody. Super pumped to have you back here on the show. Today, we’ve got a super special guest joining us, Scott Clendaniel. Scott, welcome.
Scott Clendaniel: 00:03:02
Thank you so much. I’m really happy to be on.
Kirill Eremenko: 00:03:05
That’s awesome. I forgot to ask you, where are you located right now?
Scott Clendaniel: 00:03:09
I am actually in Havre de Grace, Maryland off the Chesapeake Bay, about 45 minutes north of Washington, DC.
Kirill Eremenko: 00:03:16
Havre de Grace, Maryland.
Scott Clendaniel: 00:03:20
Absolutely. [crosstalk 00:03:23].
Kirill Eremenko: 00:03:23
Havre de Grace is the name of the city.
Scott Clendaniel: 00:03:26
Yes. Port of mercy, I believe, is the loose translation.
Kirill Eremenko: 00:03:32
Very interesting. How did you end up there? Have you been there for a long time?
Scott Clendaniel: 00:03:38
Actually, we just moved here. We had an opportunity, because my job allows me to work remotely, to be able to change location. My wife, who is a children’s librarian, just got a position up here near Cecil County. We just moved here, and we’re pleased as punch. It’s really pretty.
Kirill Eremenko: 00:03:57
All right, tell us… Before that, where were you located before that?
Scott Clendaniel: 00:04:03
I was actually born in Baltimore, Maryland, lived there for a long period of time. Then I was in Delaware, Pennsylvania, and a few years in Honolulu, Hawaii.
Kirill Eremenko: 00:04:15
Amazing. I’ve always wanted to go to Hawaii. What is Honolulu like?
Scott Clendaniel: 00:04:20
Honolulu is really a fantastic city. I really loved living there. Obviously, a lot of tourism, but the people there are just so warm and inviting, really had a good time there, a big fan of Hawaii.
Kirill Eremenko: 00:04:38
What island is Honolulu on?
Scott Clendaniel: 00:04:39
Oahu, which is the main island. About three quarters of the population lives on that island. There’s the total [crosstalk 00:04:48].
Kirill Eremenko: 00:04:47
Wow. I heard there’s a great restaurant on Maui, and it’s called Mama’s Fish House, I think. Do you know that?
Scott Clendaniel: 00:04:56
No, I don’t know that one. I haven’t been there, but I visited four of the eight islands while I lived there. Maui is very pretty. Each island has its own personality, which is fun.
Kirill Eremenko: 00:05:10
What’s the differences in personality?
Scott Clendaniel: 00:05:10
Well, Kauai is the garden island, so it tends to be much more laid back. It is probably the greenest of the islands. That’s great. The big island has all kinds of different topography. It’s probably the only place in the world where you can go snow skiing and water skiing on the same day.
Kirill Eremenko: 00:05:28
Wow.
Scott Clendaniel: 00:05:29
Each island has its own unique factors to it. Maui’s a lot of fun too. Oahu is where most people go. That’s where Waikiki is. That’s probably the most popular of the group.
Kirill Eremenko: 00:05:41
Oh, fantastic. Very cool. The mountains are tall enough for skiing?
Scott Clendaniel: 00:05:46
Only on the big island. When I say snow skiing, I’m not talking about Aspen or Vail, Colorado. I mean, you can forget about that, but you can at least tell people, “Hey, this is fantastic. On the same day, I went water skiing and snow skiing.”
Kirill Eremenko: 00:06:02
Amazing. Okay. Got you. Well, Scott, it’s a pleasure to have you here. We’ve got a lot to go through. For those who are listening and maybe don’t know, I posted on LinkedIn just, “Hey, Scott Clendaniel is coming to the show in 24 hours, so post your specific advanced data science questions.” In that 24 hours, there are now 56 messages in there. Thank you very much for taking the time to really answer some of them. There’s been a lot of cool discussions and [crosstalk 00:06:33].
Scott Clendaniel: 00:06:32
Absolutely.
Kirill Eremenko: 00:06:34
I really want to go through [crosstalk 00:06:36].
Scott Clendaniel: 00:06:35
Very few of them were from my bill collectors, which I really appreciated.
Kirill Eremenko: 00:06:41
Gotcha. What I want to start with though, so we’ll definitely get to those, and there are some really cool advanced machine learning questions, models and cross-validation and things like that. Before we get there, give us a bit of a background around yourself. Who is Scott Clendaniel?
Scott Clendaniel: 00:07:02
Sure. Actually, I have a bit of an unusual background. My undergraduate degree and my MBA are both in strategic planning. They’re not in statistics or computer science, which makes my background relatively different from a lot of other folks in the field. I was in financial services for a long period of time, all the way up through vice president of consumer lending at Bank of Hawaii. Unfortunately, I had a family emergency where my ex-wife at that time decided to take my son, and so I had to leave.
Scott Clendaniel: 00:07:33
She had indicated to my son, who was only three years old at that time, that I hadn’t actually gone, that I was just packing his toys in Hawaii. I was like, “Why did you do that?” Anyhow, so I got a phone call from my three-year-old son one day. He said, “Daddy, are you done packing my toys yet?” It was the worst thing ever. I was like, “Gosh, I’ve got to figure out how to give up my career in financial services and do something else.”
Scott Clendaniel: 00:08:00
I thought to myself, “I wonder if anyone’s interested in this machine learning artificial intelligence stuff. I wonder if I could do that.” For the next 16 years, I became a consultant. To all those folks out there who are trying to break into data science, if I can get past that, you can get past whatever you’re facing currently. I encourage everybody to give it a shot.
Kirill Eremenko: 00:08:23
Wow. Wow. What a story. You just packed your stuff, gave up a vice president position at Bank of Hawaii, and moved back to mainland U.S. How did that go for you?
Scott Clendaniel: 00:08:36
That was rough [inaudible 00:08:38], but it opened up a world of opportunity. It also gave me a whole new perspective on what’s important in life and what isn’t, and also gave me a lot of focus on persevering and problem solving and how I could add value to others. It was a tough thing to go through, but it provided a lot of opportunities for me later on in life.
Kirill Eremenko: 00:09:03
Wow. Wow. Amazing. 16 years passed. You were consulting in this space. What happened next, and where are you now?
Scott Clendaniel: 00:09:16
Absolutely. Morgan Stanley actually recruited me to be their first full time data scientist in their Baltimore office. That was great. Unfortunately, they had a situation with some internal fraud, where… I can share this because it was on the front page of the New York Times. Someone walked out the front door with $11 million. They suddenly changed my role to focus exclusively on internal fraud, which wasn’t really what I wanted to be doing at that point in time.
Scott Clendaniel: 00:09:46
I wanted to stick with machine learning. I was recruited away by Legg Mason, and have been there for two and a half years. We just got purchased by Franklin Templeton, so I’ve been trying to help build functions for those organizations.
Kirill Eremenko: 00:10:02
Just to give a bit of background on Franklin Templeton, this is one of the world’s largest global investment firms.
Scott Clendaniel: 00:10:11
Yeah. Once the transaction is completed, which will probably be about the time this podcast is released, it’ll be approximately $1.4 trillion in assets making us the sixth largest in the world.
Kirill Eremenko: 00:10:23
Wow. That is crazy. As a lead data scientist at this company, what is your role like? Do you actually look at how to invest this money, or are you looking for fraud, or is it a broad scope of applications? What exactly do you do?
Scott Clendaniel: 00:10:41
Most asset management firms have separate groups who do the actual portfolios of investments, so I’m not involved in that so much. I help work on other types of business problems, like optimizing sales, profiling customers and opportunities, and occasionally some small pieces of fraud detection. I also try to educate the organization as a whole on best practices in analytics, and to bridge the gap between the academic component of what’s going on out in the world and the real world. I try to bring people up to speed, do a lot of training for folks, help them form their own business plans and help them build their models. It’s a lot of outreach.
Kirill Eremenko: 00:11:28
Gotcha, so quite a broad scope of applications, but not specifically the investment management. Interesting. Very interesting.
Kirill Eremenko: 00:11:40
I hope you’re enjoying the podcast, and we’ll get straight back to it after this quick announcement. This announcement is going to be a bit tough for me because it is about my own book, so please excuse the shameless plug. However, I do believe in it so much that I want to get the word out there. This book is designed in a way to get anybody and everybody up to speed with data science, covering pretty much everything that is important and needed to get going.
Kirill Eremenko: 00:12:05
The unique proposition of this book is that it doesn’t require coding. There are a lot of books out there on data science where you need to sit in front of the computer and code in Python or R. This book, you simply take and read on your lap, in a car, in a plane, in your backyard, on a couch. You can read it anywhere. There is no coding in the book. It focuses on intuition. If you’ve taken our courses and you like those intuition tutorials about how an algorithm works and why, rather than what the code behind it is, then you’re going to love this book.
Kirill Eremenko: 00:12:41
It’s going to be a great way to solidify that knowledge. It covers pretty much everything in a data science project lifecycle from asking the right questions to data preparation, to machine learning, to visualization and finally presentation. Pretty much everything you need in your career is covered. If that sounds exciting, check it out. It’s called Confident Data Skills available on Amazon. It’s the data science book with the purple cover, and please enjoy.
Kirill Eremenko: 00:13:11
Well, on that note, I think let’s dive into these questions because there’s quite a lot to go through, and I think that’ll take us-
Scott Clendaniel: 00:13:23
Let’s go.
Kirill Eremenko: 00:13:23
All right, awesome. I’ve gone through the questions. This was on LinkedIn. I’ve sorted them in order of… We’re going to start with the most advanced machine learning, AI, deep learning stuff first.
Scott Clendaniel: 00:13:34
Fine. No pressure on me. That’ll be great. Okay, go ahead.
Kirill Eremenko: 00:13:38
Here we go. Vighnesh, if I’m pronouncing that correctly, Vighnesh asks, “Can you give us a brief about the real world applications of data science in the investment industry? How do you approach a particular problem in this space?”
Scott Clendaniel: 00:13:58
Sure. I think one of the big components in any field is actually making sure what the business problem is and defining that first, before you define your modeling approach. In investment management, there are a bunch of different problems where data science can be applied, and also, financial services has been involved in advanced analytics since at least the 1960s, so it’s a great field to be in. A couple of examples would be fraud detection. How do you tell whether a particular transaction is someone spoofing, whether they’re the real person or not?
Scott Clendaniel: 00:14:31
There’s developing portfolios, deciding which assets should be in there. You can look at time series forecasting of how an individual investment is going to perform. One of the problems I’ve been working on a lot recently is sales optimization: how does a financial advisor look at the broad palette of customers and potential customers, figure out who should be prioritized in terms of what their needs are, and come up with a product recommendation on what would fit those needs?
Kirill Eremenko: 00:15:05
Gotcha. It ties in well with what we just spoke about, that there is a broad scope of applications and business problems. Gotcha. What specific AI technologies have… This is Matthew. Matthew asks, “What specific AI technologies have changed the investment industry, and which do you predict will shape the industry in the next five years?”
Scott Clendaniel: 00:15:28
Sure. I think the development of additional algorithms that are available to us has changed AI quite a bit. Deep learning, perhaps not as much as others, but things like XGBoost and algorithms that allow for ensembling have really helped the industry quite a lot. Also, approaches in terms of anomaly detection for fraud detection have been huge contributors as well. Those are probably the changes in AI that have had the most impact.
Scott Clendaniel: 00:16:00
Also, the growth of open source has made it very difficult for organizations to say no. There was a time many, many decades ago where people would say, “Oh, no, I’m sorry. We can’t do anything without a software package that costs $90,000.” Now, they can’t say that anymore. I think that actually had probably the biggest impact on the growth of AI overall.
Kirill Eremenko: 00:16:23
Gotcha. Gotcha. In the next five years?
Scott Clendaniel: 00:16:27
Next five years, I think there are going to be huge opportunities in terms of predicting credit performance and also fraud detection. Those are extremely difficult problems to solve, and having the more advanced AI technologies, I think, is going to continue to help in that arena, especially fraud detection, because it keeps changing. What appears to be fraud in one given quarter may look very different the next quarter, because the fraudsters are always adapting and changing their approaches.
Scott Clendaniel: 00:16:58
So you need to have a technology that allows the models to continue to grow over time. You can’t just pick a point in time and say, “Okay, we know what fraud is,” because it won’t be the same next quarter.
Kirill Eremenko: 00:17:09
Gotcha. Before I forget, I wanted to say that Scott is sharing his comments today on behalf of himself and not on behalf of the organization that he works at. These are just opinions at the end of the day, our opinions.
Scott Clendaniel: 00:17:27
It’s all my fault. I want to make myself really clear. Everything I say is my fault. No one else’s.
Kirill Eremenko: 00:17:32
Thank you. Thank you, Scott. Speaking of fraud, do you know what the size of this problem, globally or in the U.S. is per annum?
Scott Clendaniel: 00:17:47
I don’t have recent statistics on that, but it runs into the several billions of dollars. The challenge is the fact that you not only lose the profit on the given transaction that would come in if it turns out to be fraudulent, but you lose the entire dollar amount. In financial services, your inventory is actually the dollars that you manage. If you have a fraudulent transaction, you lose every bit of that inventory along with any type of profit you would have made.
Scott Clendaniel: 00:18:18
It runs into many, many billions of dollars, so it is a huge issue. It’s also really complicated because the fraud rate, the percentage of transactions that are fraudulent, tends to be very low, but its financial impact is ridiculously large. It’s a real class imbalance problem.
Kirill Eremenko: 00:18:36
I did a quick search. The global fraud detection and prevention market is valued at $17.3 billion. As you said, lots and lots [crosstalk 00:18:51].
Scott Clendaniel: 00:18:51
That’s just the market to stop the fraud. That’s not the product sales. You’ve got a real clear picture of how big of an issue it is.
Kirill Eremenko: 00:18:58
That’s a good point. Intuitively, I only know one… I’ve read about one specific fraud algorithm that I could confidently explain, and that’s Benford’s law. Are there any algorithms that you can share with us?
Scott Clendaniel: 00:19:23
Sure. Actually, I’ll make a recommendation of something to be careful with. There has been a huge amount of press about using anomaly detection for fraud detection. That is very helpful. It does have some pretty severe limitations though, in that a given fraudster is going to try very hard to not look like an anomaly. In other words, to some extent, the data is actually fighting against you. The fraudster is trying to look as similar as possible to the mean of any given transaction.
Scott Clendaniel: 00:19:57
They’re actively fighting you to not look like an anomaly. The problem is the false positive rate on anomaly detection is enormous, and it’s very difficult to fight against. Just using anomaly detection, regardless of how sophisticated the version that you’re using is, tends to have some severe limitations that you’re going to come up against, so just be aware it should be one tool in your arsenal. It shouldn’t be the be-all and end-all.
Kirill Eremenko: 00:20:28
Gotcha. Anomaly detection is one. That’s fantastic. I’ll share Benford’s law. Benford’s law is more of an aggregate tool. It doesn’t look at individual transactions. It looks at them as a whole. The way we were taught it at Deloitte is that if you take a… I might be paraphrasing what they told us back then. Again, I’m speaking from my opinion as well. You take all the transactions, all of the dollar values on a balance sheet of a company.
Kirill Eremenko: 00:20:59
Just take all of them. Mix them all up. Put them into a bag, and then look at the distribution of the first… Is it the first? No, of the first digit in all of these transactions. What is the first digit of all of these amounts? How many number ones are there? How many number twos are there? How many number threes are there? The leading digit, and Benford’s law says the distribution should be roughly 30% ones, 17% twos, 12% threes, 9% fours.
Kirill Eremenko: 00:21:40
It drops off as you go further. It’s an intuitive thing, and that is something that is really hard to fake, right? If you’re faking a balance sheet and you’re making up numbers, you don’t have that distribution in mind. You might make your numbers look really believable, but overall, when you take the distribution of the first digit across all the numbers, it’s not going to follow Benford’s law. That’s how a qualified expert can tell that, “Hey, there’s something going on here.”
Scott Clendaniel: 00:22:10
In forensic accounting, that becomes really important. That is definitely one of the warning signs. There are a couple of interesting things about Benford’s law in addition to what you said. I think you gave a great explanation of it. Benford’s law seems to apply even if you change the base unit. If it’s not a decimal system, if you used a base eight system, you will tend to see very similar patterns. The percentage of digits will change, but I just find it amazing that even if you change the base number that you’re using, that it tends to show up.
Scott Clendaniel: 00:22:45
It’s great for things like reviewing the balance sheet, just as you mentioned. What becomes tricky is the fact that when you’re dealing with consumer transactions, for example, the actual transactions, it doesn’t apply as much, especially for the trailing digits. Everybody wants to charge $1.99 or $1.95. They don’t tend to charge $1.03. You actually have the opposite problem there, so you have to be careful. Benford’s law is extremely powerful when you’re looking at a whole collection of numbers given in one instance such as the balance sheet. It becomes much harder when you’re trying to look at individual transactions.
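For readers who want to try the first-digit check described above, here is a minimal Python sketch. It is an illustration of Benford’s law itself rather than any specific forensic-accounting tool Scott or Kirill uses, and the `amounts` series is an assumed input of balance sheet values.

```python
import numpy as np
import pandas as pd

# Expected leading-digit frequencies under Benford's law: P(d) = log10(1 + 1/d).
BENFORD = {d: np.log10(1 + 1 / d) for d in range(1, 10)}

def benford_first_digit_check(amounts: pd.Series) -> pd.DataFrame:
    """Compare the observed first-digit distribution of positive amounts
    with the distribution predicted by Benford's law."""
    x = amounts[amounts > 0].astype(float)
    # Leading significant digit via logarithms: 1234.5 -> 1, 0.042 -> 4.
    first_digit = np.floor(x / 10 ** np.floor(np.log10(x))).astype(int)
    observed = first_digit.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
    return pd.DataFrame({"observed": observed, "expected": pd.Series(BENFORD)})

# Example: run the check over every dollar value pulled off a balance sheet.
# amounts = pd.Series([...])
# print(benford_first_digit_check(amounts))
```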
Kirill Eremenko: 00:23:24
Awesome. Awesome. Thank you. Well, there we go. That’s two techniques in fraud detection. Speaking of fraud detection, we have a question from, again, Vighnesh, who says, “While working on fraud detection problems, most of the time, we come across imbalanced datasets. Can you please shed some light on how to overcome such problems or how to resolve them?” Maybe to start off, what does it mean that the dataset is imbalanced?
Scott Clendaniel: 00:23:52
Sure. This is typically referred to as class imbalance. In other words, if I’m trying to do a classification problem, let’s say that I’m trying to do fraud versus non-fraud. If you look at the distribution of how many transactions are fraud and how many transactions are non-fraud, your fraud rate tends to be relatively small, down into the tenths or hundredths of a percent. The problem is if you try and compare the fraud transactions versus the non-fraud transactions, a lot of algorithms are going to choke on that unless you adjust the balance of the dataset so that you can have more of a 50/50 ratio between fraud and non-fraud.
Scott Clendaniel: 00:24:33
Otherwise, what the model’s going to do is go, “Let’s see. I’ve got 1,000 transactions. 990 are not fraud, and 10 are fraud. I’ve got it. They’re all not fraud.” It’s going to be right 99.9% of the time. It’s just going to mess everything up, so you need to make [crosstalk 00:24:52].
Kirill Eremenko: 00:24:51
It’s going to have fantastic metrics.
Scott Clendaniel: 00:24:54
Yes. That’s amazing. I’m done. I just said there is no fraud, and I’m going to go home and have lunch. That doesn’t work out too well in the real world. What you do is you tend to over-sample what’s called the minority class, so in this case, the fraud transactions. I might take every fraud transaction I can get that is in my training set. I’m going to compare it against an approximately equal number of non-fraud. That means that the model can’t just arbitrarily say, “Okay, everything is not fraud.”
Scott Clendaniel: 00:25:27
That’s the technique that I use the most. There are other techniques that you can use, including doing all sorts of complicated things with synthetic data or SMOTE techniques or those types of things, but oversampling the minority class, lots and lots of the fraud and much fewer of the non-fraud, has been the technique that has worked the best for me. It’s also very simple.
Kirill Eremenko: 00:25:50
What’s the drawback? What’s the potential danger of using this technique?
Scott Clendaniel: 00:25:57
Part of the problem is you may not have enough fraud cases to use in the first place. You may have such a small number of records that you may not be able to use that in its purest form, but you’re definitely going to want to move your sampling as close to a 50/50 ratio as you can.
Kirill Eremenko: 00:26:17
Gotcha. As long as you select the other ones, the non-fraud ones, at random, you shouldn’t have any bias in your model, even though you didn’t use all of the available records.
Scott Clendaniel: 00:26:29
Correct. Use of random becomes really important. Some of the experts in the group who have a classic statistical background can talk a lot more about random versus non-random and how there is no true random, but there are all sorts of techniques you can use to make sure. For most of the work that I’ve done, I just use the random functions in Python or SQL. That has worked pretty well for me, and I haven’t run into [inaudible 00:26:55] situations.
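A minimal pandas sketch of the rebalancing Scott describes: keep every fraud record and pair it with a random, roughly equal-sized sample of non-fraud records. The column name `is_fraud` and the 50/50 target ratio are illustrative assumptions rather than anything from his actual pipeline.

```python
import pandas as pd

def rebalance_training_set(df: pd.DataFrame, target: str = "is_fraud",
                           seed: int = 42) -> pd.DataFrame:
    """Keep every minority-class (fraud) row plus a random, equal-sized
    sample of majority-class (non-fraud) rows, giving roughly a 50/50 split."""
    fraud = df[df[target] == 1]
    non_fraud = df[df[target] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the two classes are interleaved before training.
    return pd.concat([fraud, non_fraud]).sample(frac=1.0, random_state=seed)

# Only the training split should be rebalanced; the held-out test data keeps
# the true class ratio so evaluation still reflects the real world.
# train_balanced = rebalance_training_set(train_df)
```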
Kirill Eremenko: 00:26:57
Awesome. Speaking of Python and SQL, what kind of tools do you use in your day to day?
Scott Clendaniel: 00:27:06
Python and Spark are the most common. Because I am older than dirt, I started doing this all the way back in 1986, God forbid. I actually grew up using GUI tools like SPSS, which is now owned by IBM, Salford Systems, which is now owned by Minitab, and lots of other GUI tools. There’s actually a free one, especially if you’re just starting in the field and you don’t have a developer background, called Orange Data Mining, which is included in the Anaconda distribution. Those are a couple of places that you can start, but eventually, it tends to turn into a lot more Python as a starting point, and probably Spark if I’m using a distributed system to build.
Kirill Eremenko: 00:27:55
Gotcha. You said Orange Data Mining.
Scott Clendaniel: 00:27:58
Orange Data Mining, yeah. It is not the prettiest program you will ever encounter. The interface looks at least about 20 years out of date. Don’t be fooled by how the interface looks, because there is a lot of power underneath it. A lot of people get turned off. They’re like, “Oh, this doesn’t look cool. This doesn’t look like something that was built for Apple. I’m not going to use it.” I would say that’s a mistake. They’ve actually done a really good job of creating a GUI to sit on top, and then underneath, it’s primarily Python. [crosstalk 00:28:29].
Kirill Eremenko: 00:28:29
Is this primarily for building models or fraud detection?
Scott Clendaniel: 00:28:36
Any type of model.
Kirill Eremenko: 00:28:38
Gotcha. Awesome. Awesome.
Scott Clendaniel: 00:28:40
I’ve been really happy with it. People laugh at me when I show the screenshots, but it actually works pretty well.
Kirill Eremenko: 00:28:46
Gotcha. What kind of models have you noticed work the best with fraud detection? We’ve got a huge range. K-means clustering. We’ve got Naïve Bayes. We’ve got logistic regression. We’ve got XGBoost, and so on. What would you say are your go-to models? When you have a fraud-detection problem, what’s your first, second, third choice?
Scott Clendaniel: 00:29:12
First of all, I’m going to throw out an old theory that I hope people look into, which is essentially called the multiplicity of good models, which means if you have the right data and you’ve prepared it the right way, all sorts of algorithms are likely to give you a very similar, positive result. If you haven’t done the data prep correctly, you will start to see wide variances. That being said, ensembling techniques of any type would be my favorite, probably the most common being XGBoost.
Scott Clendaniel: 00:29:44
Also, as a data prep technique, I recommend target mean encoding. That can be extremely helpful. In terms of the final technique, I always recommend ensembling because each algorithm has its own strengths and weaknesses. If you ensemble a group of different models together, you’re likely to end up with a better result. The final model is usually a logistic regression based on the inputs of the XGBoost and other types of techniques in the family.
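One way to sketch the ensembling pattern Scott outlines, a logistic regression stacked on top of XGBoost and other base learners, is scikit-learn’s StackingClassifier. This is a generic illustration, not his production setup, and the hyperparameters and variable names are placeholder assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Base learners with different strengths, blended by a logistic regression
# meta-model trained on their out-of-fold predictions.
ensemble = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)),
        ("forest", RandomForestClassifier(n_estimators=300)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions keep the meta-model from overfitting
)

# ensemble.fit(X_train, y_train)
# fraud_probability = ensemble.predict_proba(X_test)[:, 1]
```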
Kirill Eremenko: 00:30:15
Wow. Thank you, very, very detailed and advanced. This data prep technique, I think you mentioned, target mean encoding.
Scott Clendaniel: 00:30:21
I got really passionate about this stuff, so I may bury you in detail, and I apologize.
Kirill Eremenko: 00:30:24
That’s okay. That’s okay. That’s okay. I want this to be an advanced discussion, well, advanced learning for me.
Scott Clendaniel: 00:30:33
That’s fine. That’s fine.
Kirill Eremenko: 00:30:34
I wanted to ask you about this data prep technique, target mean encoding. I don’t know it. I haven’t heard of it before. Could you tell us a bit about it, if you can just explain what it does and how [crosstalk 00:30:44]?
Scott Clendaniel: 00:30:44
Sure. Let me give you a really simple example. Let’s say that we are an auto insurance company, and we want to test the old myth that red cars always cause more problems than others. I’ve got a red car, a blue car, a yellow car, a white car and a black car. I’ve got five different car colors. I want to encode this categorical variable into my model, so rather than use the original values of the colors in the variable, you use a transformed variable, a new variable. Instead of recording the car color there, you actually put the claim rate for each color.
Scott Clendaniel: 00:31:29
If a blue car has a 0.1% claim rate, you put 0.1 at any time it’s blue. If it’s really red, you’ll use 0.2% any time red shows up. In other words, you convert the original categorical input into its actual target mean. What’s the mean rate that this issue is going to come up with? You use the transformed variable as opposed to the original variable. It tends to work better than one hot encoding, which you would usually use. Also, you can use it in pretty much any algorithm, including algorithms that only take numerical inputs like the original XGBoost.
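Here is a minimal sketch of target mean encoding along the lines of Scott’s car-color example. The column names are invented, and the smoothing term is a commonly added safeguard against rare categories, not something he mentions; in practice you would also fit the encoding on training folds only, to avoid leaking the target.

```python
import pandas as pd

def target_mean_encode(train: pd.DataFrame, column: str, target: str,
                       smoothing: float = 10.0) -> pd.Series:
    """Replace each category with a (smoothed) mean of the target,
    e.g. each car color with its historical claim rate."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Shrink rare categories toward the global mean to limit overfitting.
    smooth = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return train[column].map(smooth)

# Hypothetical usage: encode car color by claim rate before training.
# df["car_color_encoded"] = target_mean_encode(df, column="car_color", target="had_claim")
```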
Kirill Eremenko: 00:32:10
Wow. Wow. That’s awesome. I hadn’t heard of that one. I’ve heard of one hot encoding, but still, do you mind refreshing my memory on that, please?
Scott Clendaniel: 00:32:18
Sure. One-hot encoding, so for each of our five car colors, we’re going to take the original variable, and we’re going to move it out of the dataset. We’re going to create five new variables. One variable is going to be, “Was the car color white, zero or one?” If it’s white, you put one. If it wasn’t white, put zero. Then you have a second column that says, “Was the car color blue?” The third variable says, “Was the car color red?”
Scott Clendaniel: 00:32:42
You can run into issues with that, in that you can end up with a sea of categories, and you can overload your model with way too many inputs.
Kirill Eremenko: 00:32:55
Gotcha. Gotcha. You also got to be careful of the dummy variable trap there, right?
Scott Clendaniel: 00:33:00
Absolutely.
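As a quick sketch of the one-hot encoding just described, pandas can generate the dummy columns directly; drop_first=True keeps one column fewer than the number of categories, which is the usual way to sidestep the dummy variable trap for linear models. The toy data below is made up.

```python
import pandas as pd

df = pd.DataFrame({"car_color": ["red", "blue", "white", "red", "black", "yellow"]})

# One column per color, 1 if the row is that color, 0 otherwise.
# drop_first=True drops one category to avoid perfect collinearity
# (the dummy variable trap), which matters for linear and logistic models.
encoded = pd.get_dummies(df, columns=["car_color"], drop_first=True)
print(encoded)
```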
Kirill Eremenko: 00:33:00
You have to have one less than the original number of categories. Gotcha. Very interesting. Thank you. That’s very exciting. Let’s move on. Now, let’s talk a bit about models. Sonam, Sonam, I hope I’m pronouncing it right, asks, “What are the parameters to look out for and/or tests to perform that indicate model drift while monitoring a production AI or machine learning model?” To start off, what is model drift?
Scott Clendaniel: 00:33:33
Model drift is a silly name, but basically, it means that the performance is going to drift over time. All models tend to decay over time, and different models decay at different rates. There are a couple of things that I’d suggest with this. Number one, assume that your model is going to decay over time. Don’t guess it’s going to decay. Just assume it’s going to decay, and set up a schedule of when you’re going to retrain any model you have.
Scott Clendaniel: 00:33:59
That’s the first component. Plan for obsolescence before you put the first model in. That’s very important. The second thing is I use what’s called a population stability report, which sounds like some bizarre sociology experiment, but essentially, what it’s doing is saying, “My data that I started off with in one period, how similar does it look to the data that I’m looking at right now? Are those two populations similar or not?”
Scott Clendaniel: 00:34:27
When the population stability report comes in and says, “Hey, wait a minute, your data starts to look different,” you definitely need to retrain your model, in that the world that the model is trying to represent has changed. Therefore, the accuracy of its predictions has changed. Just assume it’s going to happen. I get irritated with folks who just want to go out there and say, “I have created the world’s greatest model, but it’s based on data from seven years ago. And I don’t know why it’s performing poorly.”
Scott Clendaniel: 00:34:54
Come on, seriously? Once you put in your model, test it constantly, and use that population stability report to keep your eye open to see whether the world has changed.
Kirill Eremenko: 00:35:04
Gotcha. A couple of questions here. First one will be, “Why…” This might be a naive question, but I’m just tempted to ask, “Why do models decay over time? Why do they never get better? Why is it that way?”
Scott Clendaniel: 00:35:21
Trick question. They decay over time because they represent the world as they knew it, and the world changes. The model does its absolute best to represent the world as it saw it in your training set. When the world changes, the type of data that shows up in your training set may drift. Let’s pick an easy example. Inflation, prices tend to rise over time for lots and lots of things. If your original model was based on prices from three years ago, the prices now look different, so the model needs to adapt to reflect that change to come up with representation of what happens today.
Scott Clendaniel: 00:36:02
That’s why all models tend to drift. However, that isn’t necessarily a bad thing in that you may have learned more information over time. You may have a larger data set to look at. You may have found new variables that you didn’t see before. You may have new algorithms you want to test out. Actually in the end, you can end up with a model that is more accurate than the last model was at its peak. Over time you actually can get better. That’s one of the things that’s exciting to me.
Kirill Eremenko: 00:36:32
Once you update it, of course, right? If you leave it alone, it’s not going to get better.
Scott Clendaniel: 00:36:37
No. I wish.
Kirill Eremenko: 00:36:39
Gotcha. In your answer on LinkedIn, so it was really inspiring to see that you went through and answer to everybody, you do a huge part for data science.
Scott Clendaniel: 00:36:52
Tried. If I missed anybody, send me a message.
Kirill Eremenko: 00:36:55
Awesome. You mentioned that six months is your magic number to look at. Why is that?
Scott Clendaniel: 00:37:05
It is completely arbitrary. I’ll tell you why. If you try and make it annual, it tends not to happen. In the real world, organizations are like, “I don’t have time. We did it last year,” or whatever. If you make it six months, you keep it top of mind with everybody. Six months is the outer limit, and then if the population stability report says, “Wait a minute, the world looks different,” or the performance of the model starts to fall off. You’ve got a fraud model, and your fraud rate keeps inching up.
Scott Clendaniel: 00:37:36
With either of those two events, you look at it in a shorter period of time. But if you actually build that into the calendar on the front end, and also explain to your stakeholders, “Model drift is a thing, and you need…” Modeling is a process. It’s a process of learning, and it’s adapting to change as information changes. Just bake that into the schedule from the beginning, and you’re going to be in a lot better shape.
Kirill Eremenko: 00:38:00
Been there. I was in an organization once where they didn’t look at the model. Some consultants delivered a segmentation model. They did… oh, no, a prediction model, predicting who will churn and who won’t churn. They didn’t look at it for 18 months. When we had to look at it…
Scott Clendaniel: 00:38:21
Ouch.
Kirill Eremenko: 00:38:22
Its accuracy dropped from 78% or something like that down to 49, so it was better to flip a coin.
Scott Clendaniel: 00:38:32
Well, let me throw out a made-up word for everyone. That’s nonstop optimization. Instead of optimization, if you’re constantly, nonstop trying to optimize, you call it nonstop optimization. Senior management loves phrases. I’m kidding. But if you take on that theory that I am always going to be improving my model, that it’s an organic process, that it’s something that you put in as a regular business process as opposed to the snapshot one-time event that’s going to fix all our problems, I think you’ll be in better shape.
Kirill Eremenko: 00:39:06
Great. The population stability report, what do you look at there? Do you look at means, distributions? What are the [crosstalk 00:39:13] or maybe-
Scott Clendaniel: 00:39:14
It’ll actually give you an indicator on a scale of zero to one and on how much things have changed.
Kirill Eremenko: 00:39:21
Is it like a library in Python?
Scott Clendaniel: 00:39:24
Yeah. For example, if incomes on the original data set were $62,000, and now it’s $140,000, something’s wrong. It also helps you to figure out if your data stream has been corrupted. In other words, let’s say the model works fine when it has accurate data, but somehow, something has happened, and now the data isn’t as good as it was before, you can then go in and say, “Okay, wait a minute. Something’s wrong. The population looks different from what it was before.”
Scott Clendaniel: 00:39:54
Maybe we’ve got a problem with the database. Maybe we’ve got a problem in how the data was collected. Maybe we’ve got a metric versus English units issue that suddenly popped up.
Kirill Eremenko: 00:40:06
Gotcha. Can anybody go and download this population stability report, or does everybody need to build their own version of it?
Scott Clendaniel: 00:40:14
Yes. The formulas are out there. I can’t recite them off the top of my head, but if you just type “population stability report” into Google, it will give you a walkthrough of that.
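For reference, here is a minimal sketch of the population stability index that such reports are typically built on, using the textbook decile-binning formula rather than any particular vendor’s implementation; the thresholds in the comment are a common rule of thumb, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

def population_stability_index(expected: pd.Series, actual: pd.Series,
                               bins: int = 10) -> float:
    """PSI between the training-time distribution ('expected') and the
    current scoring population ('actual') for one numeric feature."""
    expected, actual = expected.dropna(), actual.dropna()
    # Bin edges come from the training-time data (deciles here).
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# A common rule of thumb: PSI < 0.1 is stable, 0.1-0.25 is worth watching,
# and above 0.25 is a strong signal that the population has shifted and the
# model should be retrained.
# psi = population_stability_index(train_df["income"], current_df["income"])
```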
Kirill Eremenko: 00:40:22
Awesome. Fantastic. Speaking of data collection, there was a question from Santosh: “How do we check stability and consistency in the process before using the data it generated for model building?” There was a bit of a mix-up there: he meant one thing, you first answered another question, and then he asked the second one. Let’s start with the first one. The first one was like, “How do we check for stability and consistency in terms of the data collection process?”
Scott Clendaniel: 00:40:52
Sure. Usually, this is done sort of further upstream, before we get the data. This is usually done by the folks who are doing your data ingestion or your ETL process on the data upfront. The point he’s making is extremely valuable. It’s as simple as garbage in, garbage out. If the data you’re putting into your model has flaws in it, your model isn’t going to work. One of the things is actually to sit down with the people who are the stewards of that data.
Scott Clendaniel: 00:41:20
How is this collected? How often? How long have we been keeping track of this? This is also one of the reasons why data visualization is so important. See if the data makes any sense before you start loading it into your model. His point, I think, was data scientists are so excited. They want to have a model. They want to have results. They want to use their area of expertise. They want to pull the algorithm out of the quiver and start shooting.
Scott Clendaniel: 00:41:47
That is a problem if you haven’t checked the data consistency upfront. I will give you a real world example from a client who shall not be mentioned, where the original data stream, there was some type of corruption when data was migrated from one database to the next database. If it was a dollar amount, they literally physically typed the dollar sign in the value. If there was a comma, they type the comma. If there was a decimal, they type the decimal.
Scott Clendaniel: 00:42:16
At other points in time, this wasn’t true, so it was just the raw number. This is why you really need to understand your data. To his point, you need to make sure that data seems to make some type of sense. Sit down with the steward of that data to make sure that you understand what you’re dealing with before you get too far down the pipe.
Kirill Eremenko: 00:42:34
Awesome. Gotcha. Then you also talked about, in your answers, something that intrigued me. You said that if for some reason the data is corrupt, then cross-validation of the model should also fail. Basically, we could probably use cross-validation, as I understood, as an indication of whether there are problems [crosstalk 00:42:58].
Scott Clendaniel: 00:42:57
Absolutely.
Kirill Eremenko: 00:42:59
Tell us a bit more about that.
Scott Clendaniel: 00:43:01
The great thing about models and cross validation is it’s virtually impossible to come up with a great cross validated model based on bad data, because it just won’t work.
Kirill Eremenko: 00:43:14
What is cross validation?
Scott Clendaniel: 00:43:15
With cross validation, you’re taking the original data set, and you’re dividing it into what they call folds. Let’s say we’re going to take your dataset. We’re going to break it into five folds. You build the first model on four of the folds, and you test it on the fifth that you left out. On the second one, you may use folds two through five, and you test it on the first. You want to make sure that those results look very similar.
Scott Clendaniel: 00:43:38
If they don’t, you’ve got a problem in the model, so you either average the results of the folds, or what I tend to do is definitely do that first part, but also go back and say, “What is weird about that one fold that doesn’t seem to be working very well? Why is the performance off for this version versus that version?”
Kirill Eremenko: 00:43:56
Gotcha. Is the data in the folds randomized? Before you select the folds, you randomize them.
Scott Clendaniel: 00:44:03
Absolutely.
Kirill Eremenko: 00:44:04
Gotcha. Just to recap, let’s say we have, I don’t know, 500,000 records. Then you break it down into five groups of 100,000 each. In the first version, you train the model on the first four groups of the data, and then on the fifth one, you test it. In the second version of the model, you train it on, say, the first, second, third, and fifth groups. Then you test it on the fourth. Then you train it on the first, second, fourth, and fifth, and you test it on the third.
Kirill Eremenko: 00:44:38
You’re always shifting this window. Ideally, you should be getting the same results throughout, similar results.
Scott Clendaniel: 00:44:48
They should be really similar. Also, your final model should take the average of each prediction.
Kirill Eremenko: 00:44:54
Gotcha.
Scott Clendaniel: 00:44:56
There are some folks who just use cross validation for testing hyperparameters and things like that. If production allows me to be able to use all five models and take the average of the scores, that’s what I like to do.
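A minimal scikit-learn sketch of the five-fold procedure discussed above, with shuffling as the randomization step, per-fold AUC to spot an odd fold, and the fold models kept so their predictions can be averaged at scoring time. The synthetic dataset and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in for a fraud dataset.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # shuffle = the randomization step
fold_models, fold_aucs = [], []
for train_idx, test_idx in cv.split(X, y):
    model = clone(LogisticRegression(max_iter=1000)).fit(X[train_idx], y[train_idx])
    fold_aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))
    fold_models.append(model)

# The per-fold scores should look similar; one fold that is far off is worth investigating.
print("AUC per fold:", np.round(fold_aucs, 3))

# At scoring time, average the fold models' predictions, as described above.
# avg_score = np.mean([m.predict_proba(X_new)[:, 1] for m in fold_models], axis=0)
```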
Kirill Eremenko: 00:45:15
If you have corrupted data or probably errors in the data, and you randomize the data before you make the five bands, wouldn’t the corrupted records distribute equally across the five bands, and therefore the model would still perform identically, but still there would be-
Scott Clendaniel: 00:45:37
Well, it would be identically terrible performance. You’re not just looking to see if it differs among the five. If my area under the curve is 51%, I’ve got a problem. I need to be really aware of that. To your point, yes, they will look similar, and they will look similarly awful.
Kirill Eremenko: 00:45:59
Gotcha. Gotcha. Could you do the cross validation without randomizing first? Then you would more likely have one of those bands that is definitely underperforming.
Scott Clendaniel: 00:46:12
No. I would not recommend that because you lose… The real advantage of the randomization is the fact that it eliminates or greatly reduces the chance that you’re just looking at a few records that are off, or that you’ve got outliers or that type of thing.
Kirill Eremenko: 00:46:29
Gotcha. Gotcha. Awesome. Thank you. Next one was… This is an interesting question. This one is more about complexity of machine learning models. What is your experience… This is a question from Desmond. “What is your experience with the complexity of machine learning models and alpha or outperformance? Do the most complex models, XGBoost, neural nets, yield the highest alpha, or are there other factors that yield higher alpha? For example, type of data, feature engineering, et cetera.” First of all, what is alpha?
Scott Clendaniel: 00:47:14
Well, I’ll tell you what. I am going to do what all good interviewees do. When they don’t have the pure understanding of the answer, they change the question. I’m going to treat this as a question on overfitting as opposed to the pure definition of alpha. I’m going to regard overperformance as a function of overfitting the model, which means that it’s basically memorizing the data. As I said, I am older than dirt. When I went to school and studied math, they used to do this weird thing where they would give you the answers to the odd-numbered questions in the back of the book, but they wouldn’t tell you the answers for the even-numbered questions.
Scott Clendaniel: 00:47:54
What would happen is if you just tried to memorize the answers, you’d only be 50% right. Models tend to be very greedy. If they get the chance to memorize the answers, they will do it. What you want to do is to make sure that it’s very difficult for the model to actually do that. Otherwise, what it’s learned is the answers to the specific records that you gave it, as opposed to identifying the true patterns that can be applied to other folks. If I went and got a suit, and it was completely custom fit but for somebody else, it’s not going to fit me very well.
Scott Clendaniel: 00:48:32
It’s overfitted to the wrong person. I want to make sure that my model generalizes well. I want to make sure that my model applies to all kinds of different people. Overfitting is a big problem. The more complex the model, such as deep learning, if I’ve got a 600-layer deep learning neural network, and I’ve got 10,000 records, I’ve got a problem, because in many cases, it’s going to try and memorize the data itself as opposed to learning the patterns. The algorithm can be a component.
Scott Clendaniel: 00:49:03
You can also have data leakage issues, where it’s memorizing the answer because it’s actually included in the original data set. So to answer the question, the algorithm can be a problem, the data can be a problem, and there are all kinds of things that can lead you down that path. The way to get around that is to use very robust validation methods, including cross-validation, to try and eliminate the possibility of the model actually memorizing the answers in the back of the book.
Kirill Eremenko: 00:49:35
Wow. Wow. That’s fantastic. I just got transfixed on your answer. [crosstalk 00:49:43].
Scott Clendaniel: 00:49:43
Oh, sorry.
Kirill Eremenko: 00:49:44
That’s awesome. Thank you. Definitely an important point to look at. At what stage would you say people should keep that in mind when they’re building a model?
Scott Clendaniel: 00:50:01
From the beginning. As data scientists, we tend to be really tempted to jump in and build the model. I’m a model builder, so I want to build a model. That might not be the first thing that you do. You really have to have a solid design in terms of what your model building process is going to be. Coming up with the validation strategy should be one of the first things you do because you’ve got to segment your data out into what’s going to be test, what’s going to be training, what’s going to be validation or cross validation before you start the modeling process, or you’ve already contaminated the experiment, so to speak.
Scott Clendaniel: 00:50:39
In your process design, before you sit down, that should definitely be a component that you look at.
Kirill Eremenko: 00:50:47
Gotcha. How much time should be spent on designing the process versus implementing the model?
Scott Clendaniel: 00:50:54
Well, if you come up with a standardized process in terms of how you select variables and your randomization, you can actually bake it into the process. Once you do it once, you shouldn’t have to repeat it a whole bunch of times, but you should definitely be very robust on your first model and see what components you can redo. Randomization should be something you should be able to do in every model in a specific script. You shouldn’t have to reinvent the wheel every time.
Kirill Eremenko: 00:51:21
Gotcha. The follow-up question from Santosh was, he said, “Give an example. Let’s say I want to build a forecasting model for daily pizza sales, and say I have data for the past year.”
Scott Clendaniel: 00:51:36
After my own heart, pizza sales.
Kirill Eremenko: 00:51:40
My question is, unless the processes that drive the pizza sales are consistent, we can’t rely on the data. For example, there was a change in employees every month, a change in the tools being used, and so on. On the other hand, if the pizza store started just a few weeks ago, it might not have reached this stage because… The main thing is, how do we know? You have data for a whole year, but then there are changes in the process, employees, how we do sales, and so on. Can we still use that for modeling?
Scott Clendaniel: 00:52:15
Sure. What I’ll say to that is you’re absolutely right, very important point, but it is very rare that a modeler is ever going to have a perfect data set to start with. The trick is, “What is my decision making process now, and can I improve it with the model, and if so, by how much?” You never get to perfect. You try and get to a perfect representation of the system for your pizza prediction, but every organization has employee turnover. Every organization has some of those elements in play.
Scott Clendaniel: 00:52:50
That’s why you also have to be very careful with your validation strategy to make sure that your model holds up on data it’s never seen before. That’s why I keep banging that drum. The question is, “Is there something that I can learn from the data that I have right now using standard validation procedures?” If I can, and I increase the performance of my decisions, if I make 10% better decisions, does that help my business? If it does, there’s the Teddy Roosevelt quote: “Do what you can with what you have where you are right now.”
Scott Clendaniel: 00:53:26
Make sure that your client, whoever is going to be using your model, understands. This is what I think I know. This is what I don’t think I know. This is how much I think I can improve performance with the model over where you are right now. Then you work with the client to say, “Is it worth the effort? Is the juice worth the squeeze, so to speak?” That’s why you need to work with your client as opposed to just turning this into some academic exercise off in an ivory tower.
Kirill Eremenko: 00:53:52
Nice. Fantastic. Scott, thank you. Those were all the advanced questions. We’re moving on to the world-
Scott Clendaniel: 00:54:02
Okay.
Kirill Eremenko: 00:54:03
You did well. You did really well. We’re moving on to more soft skills and predictions and forecasts for the future. A question from Muhammad, “What is the difference between a data scientist and leading a data science division?” Basically, what is the difference in skills required to be a data scientist and to lead a data science division?
Scott Clendaniel: 00:54:33
To lead a data science division, I think you need the skills of a data scientist plus a couple of other things. One is strategy, making sure that you’re setting the appropriate goals across all modeling projects, not just your individual modeling projects. The other is management. You need to be able to work with people, to coach them to get the absolute best performance that they can achieve, not just what you can do best on what you’re working on right now.
Kirill Eremenko: 00:55:01
Are the people skills required for a data scientist, such as communication and presentation, still the same people skills? Are they different, and how are they different from the management people skills required for a lead data scientist?
Scott Clendaniel: 00:55:15
I don’t think that they are different. I think the problem is people skip over them altogether. The quality of management in general in most organizations can be somewhat appalling, not just for people who manage data scientists, but for people who manage any type of group. I think it’s a real chink in the armor for all types of organizations. I think the difference would be that you may communicate to data scientists in their own language.
Scott Clendaniel: 00:55:43
You must be able to speak their language and be able to establish their trust and be able to work with them to get them to their highest performance. If you were dealing with an accounting team, you need to be able to speak the accounting language to be able to help them reach their highest performance. It’s not so much different skills. It’s applying the right skills to the type of game you have, I think.
Kirill Eremenko: 00:56:07
Gotcha. I know you’re passionate about leading data science teams. Why are you passionate about that?
Scott Clendaniel: 00:56:15
Because I think so much can be done outside of just the algorithm, and I think there has been such a push, especially in the past 10 years, on the type of algorithm you use. The algorithm isn’t necessarily what’s going to lead you to the best performance. I’m going to steal a story from Stephen Covey. He said to pretend you have a bunch of folks, and they’re trying to cut a trail through the jungle. They’re like, “Okay, we’re going to have Fred in the front of the group because he’s really good at dealing with a machete, and we’re going to have Michelle. She is the absolute expert at machete sharpening, and she’s really good at that part, and such and so forth and everything else.”
Scott Clendaniel: 00:56:53
The leader is the one who climbs the ladder up and shouts down, “Wrong jungle.” You got to be able to change. You need to be able to figure out if you’re in the right jungle or not. A lot of managers are not terribly good at that, and so you need to have that holistic view in addition to the expertise of data scientists.
Kirill Eremenko: 00:57:17
Fantastic. What advice do you have for data scientists or advanced data scientists who want to become leaders or who want to become data science managers?
Scott Clendaniel: 00:57:32
It’s understanding a fact that all data scientists need to come to grips with, and that is that data science is not about data. It’s about people. Let me explain what I mean by that. You’re trying to solve people’s problems. You’re trying to help people. You’re trying to communicate with people. Whether you’re in accounting or data science or medicine or nuclear physics or social work or anything else, it’s about people. A lot of people come into our field unfortunately, because they don’t really like dealing with people. They like numbers more.
Scott Clendaniel: 00:58:04
It needs to be a blend. At the end of the day, you’re always trying to help people meet their needs. Data scientists use data and algorithms and techniques to be able to achieve that goal, but the goal isn’t different. The more you understand people, communication, storytelling, visualization, the so-called soft skills, that is going to be what greases the wheels to be able to get people to where they need to be, and to solve their problems in the best possible way.
Scott Clendaniel: 00:58:35
You can't give up on that and lead a data science team and still be terribly effective.
Kirill Eremenko: 00:58:42
Love it, so not just data science leadership, but data science itself is about people. If you one day want to become a data science leader, then start now. As a data scientist, start honing your people skills.
Scott Clendaniel: 00:58:59
Absolutely.
Kirill Eremenko: 00:59:00
What's the recommendation? How does somebody go about… There aren't many online courses on people skills, a lot on technical skills. Where do you learn the people skills?
Scott Clendaniel: 00:59:10
It depends on where you look. I’m actually going to push back on that one a little bit. We in the data science community like to read our data science blogs. If you focus there, that’s where you’re going to find those results. There are all kinds of resources. I’ll tell you my particular favorite is the work that was done by Stephen Covey, and also my book recommendation, I’m going to sneak in here while you’re not looking, which is the Seven Habits of Highly Effective People.
Scott Clendaniel: 00:59:36
It talks a lot about people skills and a lot about problem solving, which can be applied to data science or accounting or physics or sociology or anything else. Focus on those types of skills. Also, take data visualization classes, not just to learn how to visualize data, but because you choose a visualization based on your audience and what that audience's needs are. Focus on that piece. What do we need to communicate?
Scott Clendaniel: 01:00:02
What do they need to be able to solve their problem, and how do I give that back to them? As opposed to, "Here's my really cool analysis that I did, 700 pages long, that no one's ever going to read." That's not the solution to the problem.
Kirill Eremenko: 01:00:17
Gotcha. Absolutely. Let's move on to questions about the future. Snehal asks, "What will come next after advanced AI?"
Scott Clendaniel: 01:00:34
If we’re not careful, we’re going to run into another AI winter. Let me tell you what I mean by that. If you look at Gartner’s Hype Cycle, where they talk about the stages of a technology, you tend to get overly hyped expectations, and then you end up falling off the cliff into what they call the trough of disillusionment. We can bicker over whether those are good names or not. But if you set people’s expectations too high, and then you don’t meet them, they don’t tend to say, “Scott Clendaniel’s particular model didn’t do very well.”
Scott Clendaniel: 01:01:07
What they tend to say is, "I knew that AI stuff was a bunch of hooey, and we never should have invested in it. I don't want to do a model again. I don't want anyone coming in here talking about statistics. I don't want to hear about machine learning. I don't want to hear… It's all garbage, because Scott Clendaniel's first model didn't do well." You need to really be aware of that. According to Gartner, 85% of all models do not reach production.
Scott Clendaniel: 01:01:29
Think about that for a second, 85%. Our industry currently has a 15% success rate. I don't know of too many fields that can survive that. My big concern about AI is that unless we get back to actually solving the organizational issues and fixing the problems, as opposed to, "Gee, look at my AUC. Doesn't it look great?", the future of AI is going to go into a dark period for a while.
Kirill Eremenko: 01:01:57
What about on the flip side? Jacques asks, “There’s a fear that AI could replace humans in their jobs. What would you tell a concerned human being about that?”
Scott Clendaniel: 01:02:11
I would look at a lot of the research that's come out of RPA, robotic process automation. To the largest extent possible, from what I've seen at the conferences I've been to, people's jobs don't get replaced. In other words, people don't get replaced. They get different jobs. Meaning that if they're working in the accounting department, they stop copying stuff in Excel from spreadsheet A to spreadsheet B, and get to work on, "Do we need the spreadsheet in the first place?"
Scott Clendaniel: 01:02:37
That's not a bad thing. There was a lot of concern that AI was going to replace all kinds of people, and I haven't seen a lot of that happen yet. I don't think that radiology is the first career I would jump into right at the moment, because a lot of that is being automated, but you may end up with different types of jobs that a radiologist might take on, like how you apply the results from lots of different analyses, from all kinds of different x-rays, in terms of diagnosing a disease.
Scott Clendaniel: 01:03:05
I think some jobs always get replaced by technology. There aren't a lot of buggy whip manufacturing jobs left anymore, but I don't see AI replacing all kinds of people. The head of Stanford's AI lab had a great quote that said, "We're a lot closer to discovering a smart washing machine than we are to Terminators taking over the world." I think that's true. I think we need to be careful of it, but I think people are perhaps overly concerned at this point in time that AI is going to replace everyone's job.
Kirill Eremenko: 01:03:42
Gotcha. There was a report by the World Economic Forum in 2018 that predicted that, I think by 2025 or 2022, I'm not sure of the exact year, AI will displace 75 million jobs worldwide, whereas it will create 133 million jobs. That's a ratio of roughly 1.8 new jobs for every job displaced.
Scott Clendaniel: 01:04:07
I think that’s a much better example than the one I just gave.
Kirill Eremenko: 01:04:13
I think they're both absolutely valid. What are your thoughts on AI replacing data scientists themselves, specifically AutoML and products like DataRobot?
Scott Clendaniel: 01:04:25
I need to be careful here. I need to choose my words correctly in terms of that.
Kirill Eremenko: 01:04:29
We can skip that question. That’s okay.
Scott Clendaniel: 01:04:32
I think that a lot of AI functions can be helped by automation. I think ensembling, picking the correct algorithm, tuning hyperparameters, a lot of those will become automated, but there's a lot of room for creativity on the feature engineering side. There's a lot of room for creativity that's hard to replace. Even something as simple as ratios: models are terrible at calculating ratios. They just are.
Scott Clendaniel: 01:05:02
For example, think of a credit score. If you take everything I currently owe as one input, that's not a great predictor. If you take how much total credit I have available, adding up the credit lines from all my credit cards and so on, that's also not a very good predictor by itself. A classic algorithm might say, "Okay, let's throw them both out, because they don't have a high correlation to my result." The trick is the percent utilization: out of that big pile of available credit, what percentage am I actually using? That ratio is hugely predictive of your credit score.
Scott Clendaniel: 01:05:35
It's those types of things, as simple as ratios, that I think are going to be hard to automate away. I think many functions may be assisted by automation, even in data science, but if we focus on the right skills, the problem solving aspects, and those types of things, it is less likely to be automated away, at least in the short term.
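To make the utilization ratio concrete, here is a minimal sketch in Python of engineering that kind of feature by hand. The data and column names are hypothetical, invented purely for illustration; this is not Scott's actual model code, just the general pattern of a ratio feature that two weak raw inputs hide between them.

```python
import pandas as pd

# Hypothetical applicant data: the two raw inputs described above,
# each a weak predictor on its own.
applicants = pd.DataFrame({
    "total_balance": [500, 9500, 300, 4000],             # everything currently owed
    "total_credit_limit": [10000, 10000, 1000, 20000],   # total credit available
})

# The engineered ratio: what fraction of the available credit is being used.
# This is the kind of derived feature a model rarely constructs on its own.
applicants["utilization"] = applicants["total_balance"] / applicants["total_credit_limit"]

print(applicants)
# utilization comes out as 0.05, 0.95, 0.30, 0.20 for the four rows
```

Fed into a model alongside, or in place of, the two raw columns, a ratio like this is exactly the sort of feature engineering step that automated tooling tends to miss.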
Kirill Eremenko: 01:06:00
Gotcha. Thank you, very, very cool answer. Adly asks, “Does every business need to adopt AI?”
Scott Clendaniel: 01:06:11
No is the short answer. [inaudible 01:06:16] not every business does. I think it's silly for us to pretend that every business in the world needs AI. I think every business could stand to make better decisions than it does right now, and to the extent that AI helps with that, great. To the extent that AI doesn't help with that, no. Also, there are some businesses that don't have a lot of good data. If you don't have good data, you can't really build great models, so AI isn't going to be a particular help.
Scott Clendaniel: 01:06:45
I think that every business needs to make better decisions, and businesses that have access to good AI should take advantage of it. Those that don’t, don’t worry about it.
Kirill Eremenko: 01:06:57
What are your thoughts on Andrew Ng's comment that AI is the new electricity? Similar to how 100 years ago only 50% of the U.S. was electrified and now everything uses electricity, AI will also, only faster, be adopted by virtually all organizations; otherwise, they'll lose out under competitive pressure. What are your thoughts on that, keeping in mind that not every business needs AI?
Scott Clendaniel: 01:07:25
Well, let’s follow that example through. You used to have organizations in the 1920s who had a CEO, but it wasn’t a chief executive officer. It was a chief electricity officer, whose sole responsibility was how to figure out… I don’t know a whole lot of organizations that are still hiring chief electricity officers. I think that better decision making, again, is the key more than AI itself. I think it’s going to help more and more industries.
Scott Clendaniel: 01:07:55
I just don't think that you're going to have VIKI from I, Robot making all the decisions for the planet. I think that is an overblown fear. I think it's going to impact more and more organizations, but I think that we tend to swing on this pendulum, from "No AI. It doesn't help anything," to "AI solves everything." The answer tends to be somewhere in the middle, so just be aware of that.
Kirill Eremenko: 01:08:25
Gotcha. Understood. One final question, and this one will be from me. There are so many ways to structure your data science division. This is a data science management style question. One is to integrate individual data scientists across different functions like sales, finance, operations and so on. Another is to have a centralized data science team that services all those functions. What is your preferred style and why?
Scott Clendaniel: 01:09:04
I'm going to steal this one from Harvard Business Review. That's to use a hub and spoke model. You have a central core of folks who help the rest of the organization work on data science projects. These are the folks who make sure that people have the right tools they need, and who help establish some processes, some standards, and so forth. That team is very small, and most data scientists sit in the individual group that they need to serve.
Scott Clendaniel: 01:09:31
Your hub supports the people in the spokes in the different departments and helps them achieve their goals. I think that is the best way to do it. It is so easy for folks in a central data science group to be out of touch with the needs of their clients that actually making them physically sit in the department they serve solves a lot of those problems. That's the way I would do it.
Kirill Eremenko: 01:09:58
Wow. Fantastic. Thank you. Scott, it's been amazing. We've actually gone over time, but it was totally worth it. Loved these questions and your answers. Before I let you go, before we wrap up, I wanted to ask you: do you have a recommendation or just some piece of advice specifically for the advanced data scientists out there who are listening to this podcast, any parting thoughts?
Scott Clendaniel: 01:10:26
I do. That is that… I’ll tell it through another story. The first time I was ever invited to participate in an AI conference, I went running into my co-worker’s office. Her name was Beth. I said, “Beth, it’s fantastic. I’ve been invited to speak at an artificial intelligence conference. Isn’t that great?” She folds her arms across her chest. She leans back in her chair. She raises one eyebrow and says, “Do you really feel qualified to speak at such a conference?”
Scott Clendaniel: 01:10:57
I was like, "I did 10 seconds ago." There are a lot of folks in our industry who act like Beth. That's a bad idea. We need to be inclusive. Get down off your high horse. It's a technology. It's an area of expertise. We need to be inclusive, not exclusive. Try and be nice to people. Try and help them achieve their goals. Common manners, being polite, and listening to people are really important in any field. But if you're hoping for AI to have a big impact in your company, trying to prove how smart you are and how unsmart they are is a really bad idea.
Scott Clendaniel: 01:11:37
There are way too many of us who do that. Be inclusive. Incorporate as many people as you can. Be as helpful as you can, and stop taking the approach that, "I am some type of god of intellect because I know how to build a model." You can build a model, so teach folks. A lot of people could do that if only someone would take the time to show them how to do it. Be the person who's helping bring more people into the fold, not the one explaining to everybody else why they're wrong.
Kirill Eremenko: 01:12:07
That's amazing advice. You don't just talk the talk, you actually walk the walk, right? That's the saying. You live by that yourself.
Scott Clendaniel: 01:12:17
Thank you.
Kirill Eremenko: 01:12:19
I look at your comments on LinkedIn, and you're always there supporting people, answering questions every time. Even in this thread, when people asked you a question, you didn't just answer and say thank you, you actually posted a little image of a thank you, a written thank you. Every time, a different one. I can just imagine you have a whole library of these that you can use at any given time. It's really cool. Why do you do it? Why do you help the community so much?
Scott Clendaniel: 01:12:48
I think because I was treated so poorly by the experts in our field when I tried to break into it. I like to joke that for the first half of my career, people told me I couldn't do this because I didn't have a PhD in statistics. For the second half of my career, everyone has told me I can't do it because I don't have a PhD in data science. I was like, "But some of my models seem to be working pretty well, I don't know." I think that it's just a way of bringing more people in and being helpful, because people need encouragement.
Scott Clendaniel: 01:13:22
We've got enough people out there in the world telling everyone else that they're wrong. I think a little bit of kindness and support for other people goes a long way. I think we'd be in better shape if we all just treated one another with a little more respect, a little more kindness, and a little less roughness and intellectual aloofness.
Kirill Eremenko: 01:13:42
Awesome. Fantastic. Well, Scott, thank you.
Scott Clendaniel: 01:13:44
I don't want anybody else to have a three-year-old on the phone saying, "Did you pack my toys yet?" with no prospects of finding a new job.
Kirill Eremenko: 01:13:54
Thank you very much for sharing that. I think it’s-
Scott Clendaniel: 01:13:57
Thank you. I really enjoyed this.
Kirill Eremenko: 01:14:00
Awesome. Me too. Tell us how can people get in touch, follow you, connect with you?
Scott Clendaniel: 01:14:07
The best way is to follow me on LinkedIn at T.Scott Clendaniel. I can’t accept all the invitation requests because I’m almost at the 30,000 limit. They won’t allow more people in, but please follow me. If you have questions, send me a message. I can’t answer everybody, but I do my best. I think I answered, gosh, over two dozen questions in the existing forum, and I will continue to try and do that.
Kirill Eremenko: 01:14:30
Fantastic. Thank you very much. You already gave your book recommendation. Could you just remind us? I think it was Seven Habits of Highly Effective People.
Scott Clendaniel: 01:14:38
Seven Habits of Highly Effective People by Stephen Covey, who is no longer with us-
Kirill Eremenko: 01:14:44
Awesome. That’s-
Scott Clendaniel: 01:14:44
… but his legacy lives on, great advice in there.
Kirill Eremenko: 01:14:48
Gotcha. Wonderful. On that note, once again, thank you so much. We’ll share all the links in the show notes, and please guys and everybody listening, connect with Scott. This has been a great opportunity to have you on the podcast. Thank you for coming.
Scott Clendaniel: 01:15:07
Thank you. Take care.
Kirill Eremenko: 01:15:14
There we have it, everybody. Hope you enjoyed this conversation as much as I did. I learned a ton from this discussion. There were so many cool advanced things that I didn't know about before, and I was just blown away. Thank you so much, Scott, for coming on the show and sharing all these insights with us. Perhaps my favorite part was when we spoke about oversampling the minority class. I could feel Scott's confidence, and it's quite a tricky technique to just throw away a lot of your data in order to make sure that the positives and negatives are roughly equal.
Kirill Eremenko: 01:15:48
It's a difficult decision to make, but the confidence with which he spoke made it clear that he's done this many times. It obviously works for him. I really liked the discussion about data science leadership, and the point that if you want to be a data science leader one day, start now, because you need soft skills as a data scientist, not just as a lead data scientist. There we go.
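What Kirill describes here, discarding rows so that the positive and negative classes end up roughly even, is the down-sampling side of rebalancing an imbalanced dataset. A minimal sketch in Python, with a tiny hypothetical fraud-style dataset invented purely for illustration:

```python
import pandas as pd

# Hypothetical, heavily imbalanced dataset: label 1 = fraud, 0 = legitimate.
df = pd.DataFrame({
    "amount": [12.0, 250.0, 8.5, 40.0, 9999.0, 15.0, 75.0, 3200.0],
    "label":  [0,    0,     0,   0,    1,      0,    0,    1],
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Randomly throw away majority-class rows until both classes are the same size.
majority_downsampled = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle so the model does not see the classes in blocks.
balanced = pd.concat([majority_downsampled, minority]).sample(frac=1, random_state=42)

print(balanced["label"].value_counts())  # two rows of each class remain
```

In practice you would typically rebalance only the training split, never the evaluation data, so that reported metrics still reflect the real-world class mix.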
Kirill Eremenko: 01:16:12
As usual, you can get the show notes at www.superdatascience.com/385. That’s www.superdatascience.com/385. There, you can get the transcript for this episode, any materials we mentioned, including a URL to Scott’s LinkedIn. Make sure to connect with him, or just look him up on LinkedIn. It’s T.Scott Clendaniel. He’s always, always very helpful. Just recently, he shared some amazing cheat sheets for machine learning. Even that is worth checking out.
Kirill Eremenko: 01:16:45
I had a look at them. These were some cheat sheets shared by Stanford University. He shared them on his LinkedIn, and there are some really cool cheat sheets there, including ones on cross-validation. Check that out. As always, if you enjoyed this episode, share it with somebody, especially if you know an intermediate data scientist who's looking to become advanced or an advanced data scientist who wants to further their skills in the space, a colleague maybe, a friend, a family member.
Kirill Eremenko: 01:17:17
Send them this episode, very easy to share. Send them the link www.superdatascience.com/385. On that note, my friends, thank you so much for being here today for sharing this hour or just more than an hour with us. I hope to see you back here next time, where we will be continuing to deliver on the promise of amazing episodes with very interesting, incredible guests. Until then, happy analyzing.