Welcome to episode #161 of the Super Data Science Podcast. Here we go!
Today’s guest is Sinan Ozdemir. He is here to discuss his second book about feature engineering. So, how vital is feature engineering in creating machine learning systems? Today’s the day to find out!
About Sinan Ozdemir
Sinan Ozdemir is Founder and Chief Technology Officer of Kylie.AI, a startup that aims to automate all communications that enterprises have with customers. He’s a contributor on Forbes.com. He’s the author of two books – Principles of Data Science and Feature Engineering Made Easy.
It seems like it was just yesterday that I talked to Sinan about his first book on this podcast. Today is the second time we're having Sinan over, and he'll be discussing feature engineering, Kylie.ai, and the future of AI. This is a jam-packed episode for data scientists interested in machine learning and AI. But before we get into the nitty-gritty, Sinan and I spend the first few minutes of the episode catching up on our careers and personal lives.
His second book, Feature Engineering Made Easy, builds on the first and goes into more depth. Released early this year, it emphasizes how important it is to prioritize feature engineering when dealing with datasets. Data scientists spend more than half of their time "working with data," and the complexity only grows when we start thinking about how to feed that data into machine learning models. Sinan shows how to tame that complexity by introducing the process of feature engineering.
He then highlights the benefits of feature engineering when you're faced with either structured or unstructured data, citing example applications for both kinds. He also shares some concepts related to feature engineering – feature learning, feature construction, and feature selection.
"Manipulating your results" is not inherently wrong. Done mathematically, statistically, and logically, manipulating your data so that it is understandable and presentable to machine learning models is part of the art of feature engineering.
We also took the time to discuss Kylie.ai. His company aims to bring smooth and efficient customer service to every enterprise. For a customer, this means no robotic hellos and no long waits to be transferred to a human agent. His goal is an AI that is genuinely conversational and that solves problems quickly and accurately.
At the end of the podcast, he also shares his thoughts on how humans and AI will share responsibility in shaping the future.
Now, let's listen in on Sinan and me talking about all of these things!
In this episode you will learn:
- Kirill and Sinan catch up with one another. (03:20)
- Sinan gives a quick overview of his second book. (06:04)
- How to deal with unstructured and structured data. (12:30)
- The concept of feature construction. (17:59)
- The concept of feature selection and the use of the Titanic dataset for machine learning. (20:31)
- What are the goals of feature engineering? (24:56)
- Why feature engineering should be learned incrementally throughout a data scientist's career. (33:34)
- What is Kylie.AI? (45:16)
- Will AI destroy the world? (56:07)
Items mentioned in this podcast:
- Principles of Data Science by Sinan Ozdemir
- Feature Engineering Made Easy by Sinan Ozdemir and Divya Susarla
- SDS 021: Applications of Data Science, Democratizing AI and Advice with Sinan Ozdemir
- The World Summit On Innovation & Entrepreneurship Conference 2018
- DataScience GO Conference 2018
Kirill Eremenko: This is Episode Number 161 with AI Entrepreneur Sinan Ozdemir.
Welcome to the Super Data Science Podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. Each week, we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let's make the complex simple.
Welcome back to Super Data Science Podcast, ladies and gentlemen. Super excited to have you on board. Today, we have a returning guest. Sinan Ozdemir is on the show for the second time. Originally, he came on back on Episode 21 which was at the very start of 2017. How quickly does time fly? For those of you who weren't with the podcast back then or missed that episode, I'll give a quick overview.
Sinan is an artificial intelligence entrepreneur. He's the founder and chief technology officer of Kylie.ai, a start-up that aims to automate a lot of communications that enterprises have with their customers and we'll talk about that of course in this episode. Actually, at the end of this episode, we'll talk about that.
Sinan is also a contributor on Forbes in the space of artificial intelligence. He's an instructor with General Assembly and he's an author so he's published two books. In his first appearance in this podcast, we talked about his first book, Principles of Data Science, and we got lots of valuable insights from there.
This time, we're talking about his second book, his newest book, Feature Engineering Made Easy. Again, we're going to get lots and lots of insights. Most of this podcast is about feature engineering and how to create features or select your features and manipulate your features in machine learning problems to get the best possible outcome whether it is accuracy or speed or some sort of unsupervised type of learning.
Sinan will walk us through all of these valuable insights and give us the very, very juicy parts that are contained in his book. That's what this podcast is all about. Lots of fun things. What I like about this one as well is that we dive straight in. There's so much to cover, and plus we already know Sinan from the previous episode, so as soon as we take off, we're going to dive straight into feature engineering. I hope you're ready for this and let's get started. Without further ado, I bring to you AI entrepreneur, Sinan Ozdemir.
Welcome ladies and gentlemen to the Super Data Science podcast today. For the second time, I've got returning guest, Sinan Ozdemir. Sinan, welcome back. How are you doing?
Sinan Ozdemir: Thank you so much for having me again. I'm so happy to be here.
Kirill Eremenko: Nice. So excited. What's been happening in your life? It's been quite a while. Has it been over a year since we chatted last time?
Sinan Ozdemir: Yeah. It's definitely been over a year. It's probably closer to two years since I've been on.
Kirill Eremenko: Oh, wow. That's crazy.
Sinan Ozdemir: I think so.
Kirill Eremenko: It must be ... no, it should be one and a half or something.
Sinan Ozdemir: Yeah. No. A lot has happened.
Kirill Eremenko: But yeah, it's been a while.
Sinan Ozdemir: Yeah. A lot has happened.
Kirill Eremenko: Yeah. Tell us.
Sinan Ozdemir: Yeah. Last time, I was talking about my new book, my first book actually, the Principles of Data Science, and now I have a second book, a new book.
Kirill Eremenko: Congratulations. It's so exciting. When you emailed me, I was like, "Wow. You got a second book. That's so cool." It must have been a massive effort because last time I remember you were saying how much time you invested. I think you invested, and this is a quote from you, like a good portion of 2016 into that book, and now you decided to do it again.
Sinan Ozdemir: I did. It was such a massive effort the first time. I ended up working with a colleague of mine, Divya. She and I actually co-wrote this new book together; she and I also work together at my startup. There was such a great response to the first book. People were buying it and telling me how useful it was when they were trying to figure out data science as novices in the field.
It got to a point where they said, "Well, we've finished the book. It was great, but I have so many more questions." So this second book is an attempt to continue that knowledge. It's a little bit more advanced than the first book. It assumes that you have knowledge of the Principles of Data Science, and this book focuses heavily on feature engineering. The book is called Feature Engineering Made Easy.
Kirill Eremenko: Yeah. Feature Engineering Made Easy. I don't have a copy unfortunately yet because I'm just looking at it on Amazon. By the way, I did pick up a copy of your first book, Principles of Data Science, and I really liked the way you described things. You used R programming ... correct? ... for your examples there in Principles of Data Science.
Sinan Ozdemir: The books are primarily in Python. A hundred percent Python.
Kirill Eremenko: Python. My bad. Python. I get those confused sometimes. So yeah, it's just like reading through the examples there, and it's very hands-on. I like how the pages are big, right? You browse through it and there's a lot of space and nothing is crammed up, and you can actually see the code and read it properly. Feature Engineering Made Easy, what is this book?
Sinan Ozdemir: Yeah. If you asked any of your friends or colleagues who are data scientists or really anyone who works a lot with datasets, one of the number one things they'll tell you is people put a lot of emphasis on the modeling part of data science, statistical modeling, machine learning models, artificial intel, neural networks. There's a lot of emphasis on how do I build models for my data.
That's obviously a huge part of machine learning in data science. But more than half of a data scientist's time, and this is according to several surveys, more than half of a data scientist's time is not spent on modeling. It's actually spent on working with the data before the modeling even takes place. This is the part where the data scientist obtains the data, looks at the data, learns about the data, transforms the data, alters the data. They do all of these.
Feature engineering is the science and art of changing your columns and your rows of your data to make sure that when it's time to introduce the models, linear regressions, k-nearest neighbors, neural nets, whatever it is, your data is at an optimal position to be accepted by the models.
Kirill Eremenko: Mm-hmm (affirmative). Interesting, interesting. I like how you put it. It's the art of manipulating the data, or changing the columns of the data, in order to make sure that it's ready for analysis to get the optimal result. That's very, very cool.
For somebody who hasn't got an exposure to this much yet, why is it necessary to read a whole book about feature engineering? Isn't it just like put a few columns here and there and it's all done? What are the main pillars of your book?
Sinan Ozdemir: There are two big pillars, I would say, when it comes to working with feature engineering and why there has to be this massive effort. Why does there have to be this big book on it? You're right: why isn't it just put a couple of columns here and let the model figure it out? That's actually the first pillar.
Most people think that the machine-learning models can just figure it out. That's a really big misconception in machine learning, especially at the higher executive levels. People who are actually making the decision to purchase machine-learning models or bring in a vendor like Kylie.ai or some other kind of AI vendor think, "Well, the whole point is the machine learning will just figure it out."
Kirill Eremenko: Mm-hmm (affirmative). Correct.
Sinan Ozdemir: The problem there is there is a saying in data science, garbage in, garbage out and that refers to if you take really bad data and you put really bad data into a machine-learning model, you're going to get really bad results, predictions out. The machine-learning model can't do everything. It can't figure out everything on its own. It's only going to be as good as the data coming in.
That really big misconception is where the foundation of this book is rooted. We want to move away from this idea that the models will just handle everything for you, and it removes the human element from machine learning entirely. The book is really meant to give humans, give data scientists a guide on how to work with the data given to you. The first pillar is machine-learning models can't do everything for you.
The second pillar is well, not only can the machine-learning models not do everything for you, sometimes you are forced to do feature engineering. For example, if you wanted to make ... One of my favorite examples to give in any classes I give or conferences I speak at or books, any example that I write, I love to predict the price of publicly traded stocks based entirely on publicly available tweets about that stock price.
So I would take a moment in time, read the tweets about that stock price, so GE or Apple or Amazon, read the tweets about that stock price and then predict whether or not that price is going to go up or down the next 24 hours to make a purchasing decision. Well, if the only thing coming in is raw text, tweets that people are writing, you can't feed in raw text as is into a machine-learning model. You have to manipulate it. You have to transform it into a row-column structure.
That in itself is feature engineering. You're creating new features based on raw text. This is so crucial when you're dealing with raw data like text, images, videos, voice. All of these things cannot be fed into a machine-learning or AI model as they exist. You have to transform them. One of the chapters in the book is dealing with image and object recognition and using neural nets to grab features from images from where there weren't features before.
So really it's about machine-learning models can't do everything for you and sometimes you just have to employ machine-learning or feature engineering rather to do anything at all.
Kirill Eremenko: Mm-hmm (affirmative). Okay. Makes sense. Makes sense when you put it in a much broader level like that because I was just picturing columns and ... Even in a structured data set of rows and columns, you can still in some ... You probably should apply feature engineering to come up with the right features that are most useful for analysis, but when you put it into such a broad context of images and videos and all of this unstructured type of data, yeah, it totally makes sense. Is your book predominantly about unstructured data or do you give tips for structured data as well?
Sinan Ozdemir: I would say 60% of it is about structured data and 40% is about unstructured data. The reason that split is like that is because a majority of machine-learning engineers work with structured data and they think that just because it's structured, you're done, that's enough, but it's not. You have to take that structured data and make it even better.
So then there come the tips like standardization, normalization, dimension expansion and reduction, picking the best features, feature selection, extraction. Just because you already have structured data and just because you've started with structured data doesn't mean you're done. There’s so much more to do.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. Okay. Can you give us an example, like a real-life example, of a column like age and/or balance or income, and then how would you combine two different columns or how would you do a feature engineering exercise on a structured dataset?
Sinan Ozdemir: Mm-hmm (affirmative). One of the examples that I like to give, because it's fairly simple and yet very relevant, is trying to predict something very common like the price of a house or a car based on features of that object. Let's take the example of a car. If you're trying to buy a used car, you have all these websites out there. I'm sure everyone has seen an ad in the last week about some new website that tells you the best way to predict the price of a used car based on all of these features about the car: what other people have paid, how old the car is, how many miles are on it. Right?
Kirill Eremenko: Mm-hmm (affirmative).
Sinan Ozdemir: That's a very simple predictive algorithm. If you input the features about the car, how old is it, how many miles are on it, what is the make of the car, what is the model of the car, and then you output the price of that used car, that is a very simple predictive model. So you could just use it as is. Input the price of the car.
Let's make it even easier. Let's say we only have two features about the car. We have the average miles per gallon of that car, the MPG, and we also have the total number of miles on the car. Let's suppose those are the only two things you have and you want to predict the price. Now, let's say for the purposes of this example you really, really want to use a k-nearest neighbors model.
The reason you want to do so is because you want to compare that car to other similar cars so that you can provide to your customer an average price based on features of the car. So you have the miles per gallon and the total number of miles on the car. Now, those two numbers exist on very different levels because your miles per gallon is probably going to be something in the tens, twenties, thirties, forties. It's going to be a very low number, less than 50.
But then when you look at the miles on the car, the total number of miles, you're easily in the thousands, tens of thousands; you get up there very quickly. Now, the problem is, if you're using a model like KNN, which uses Euclidean distance to compute dissimilarity scores, you end up with these two very different axes of data, miles per gallon and miles, in such a way that the model gets very, very confused. It gets so confused that the KNN will predominantly rely on the number of miles on the car because that's the larger number.
If you don't do anything to those two features, your predictions are going to be way off. Now, to fix this problem very simply, in the first or second chapter of the book, you can use a technique called standardization, where instead of saying this car has this many miles or this car averages 30 miles per gallon, you can say, "Well, this car averages 30 miles per gallon, which is a little bit less than average."
So you actually take a z-score on the entire column and say, "Well, look, let me compare this number to the average miles per gallon," or take the number of miles on the car and say, "Well, this car has 100,000 miles on it," which is actually well above the average number of miles on a car. So you're standardizing those two things: instead of using the number as is, you standardize it and compare it to the average, so that both columns still retain their scaling but, compared to each other, are much closer in value.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha.
Sinan Ozdemir: That's a very simple example of using standardization just to affect two columns on a data set.
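To make Sinan's standardization example concrete, here is a minimal Python sketch; the mpg and mileage numbers are invented for illustration:

```python
import numpy as np

# Hypothetical used cars: miles per gallon vs. total miles on the car.
# The two columns live on wildly different scales.
mpg = np.array([18.0, 24.0, 30.0, 35.0, 42.0])
miles = np.array([120_000.0, 80_000.0, 60_000.0, 30_000.0, 15_000.0])

def z_score(column):
    # Standardize: subtract the column mean, divide by its standard deviation.
    return (column - column.mean()) / column.std()

mpg_z = z_score(mpg)
miles_z = z_score(miles)

# Both standardized columns now have mean 0 and standard deviation 1,
# so a Euclidean-distance model like KNN weighs them comparably.
```

After standardization, a car's 30 mpg and 60,000 miles are both expressed as "distance from average," which is exactly the comparison Sinan describes.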
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. While you were actually talking, I thought of an example that ... or you reminded me of an example that I came across in real life when I was working at a superannuation firm. There was a model we were building. I think it was a logistic regression and there was a column for ... What was it? There was a column for the balance of a user or a member and then there was a column for age.
Separately, those columns were, of course, descriptive and they had some value in the logistic regression, but they really created a brand new variable when you combined them. When you took the balance and divided it by age, you got the balance for age, right?
Sinan Ozdemir: Mm-hmm (affirmative).
Kirill Eremenko: That shows you how quickly people are accumulating wealth. Are they accumulating wealth quicker than their peers or slower, right? So maybe you have somebody with a very high balance who's very young or somebody with a very low balance who's in their senior years, and that adds some additional information about the person and their spending habits or their accumulating of balance habits, and that in itself is able ... Tricks like that are capable of improving the model's predictive powers.
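Kirill's balance-divided-by-age construction can be sketched in a few lines of pandas; the member data and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical superannuation members.
members = pd.DataFrame({
    "age": [25, 40, 63],
    "balance": [90_000.0, 120_000.0, 110_000.0],
})

# Feature construction: a brand-new column built from two existing ones.
# It measures how quickly each member is accumulating wealth.
members["balance_per_age"] = members["balance"] / members["age"]
```

Here the 25-year-old (ratio 3,600) is accumulating wealth far faster than the 63-year-old (ratio roughly 1,746), information neither original column carries on its own.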
Sinan Ozdemir: What you just described is actually a concept called feature construction. It's actually a very human-oriented task. It's when you figure out that there's a combination of features that you can put together to construct a brand new feature. Another example of this would be to say ... Our very famous dataset that pretty much everyone uses in their data science classroom is the Titanic dataset. Are you familiar?
Kirill Eremenko: Yes.
Sinan Ozdemir: If you're unfamiliar, basically you're trying to predict whether or not someone survived the sinking of the Titanic, spoiler alert, given features about the passenger. Those features can be, at first glance, very odd. You just have their name or you have what ID was on their ticket. But then you have more interesting variables like whether they were first, second, or third class, and that's when you can start to get very predictive with it.
One of my favorite things to do with the Titanic dataset is say, "Well, you're given about 20 features. Can you pick the two most predictive features and only use those to predict survival?" That's usually a very simple task. That's called feature selection. You just select the features that are the most predictive.
Then I give my students a challenge. I say, "Okay. Now, that you picked the best predictive columns, forget about them. Look at the other 18 features and now change them to make them as predictive as those two that we just threw away." That's the challenge now. Now, you have to actually extract information from those seemingly "bad columns".
One way to do this is ... well, generally, one of the most predictive features is whether or not they were male or female because, at the time, women and children went first onto the lifeboats. So that column is very predictive of whether or not someone survived. So a lot of people, I would tell them, "Throw that away." Then they have to come up with something else.
One thing that I really like that people do is this: you're given the name of the passenger. A lot of times, the name will say something like Mister or Miss or Missus. So you can actually extract that information from the name and use it again. So it becomes an exercise in constructing and extracting information from features which you were about to throw away just because they were not as good as the two most predictive features.
That's the moral of the book is don't throw things away just because they seem unimportant. They might hold some valuable information if you can just combine it with something else or look a little bit deeper, and that's actually much more of a manual human-oriented process.
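The title-extraction trick Sinan mentions can be sketched in pandas, assuming the usual Titanic Name format of "surname, title. given names" (the rows below are made up):

```python
import pandas as pd

# A few rows in the usual Titanic "Name" format (invented examples).
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
    ]
})

# Pull out the honorific between the comma and the period --
# a proxy for sex (and partly age) hiding inside a "throwaway" column.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
```

The new Title column ("Mr", "Mrs", "Miss") recovers much of the signal of the sex column that was thrown away.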
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. This is a great example. By the way, for our listeners, if somebody wants a good exercise, I think that's perfect. Try that. Take the Titanic dataset, throw away the two most predictive columns, and then try to get the same results with the rest.
What I want to say is ... something has gone through my mind. Yeah. With feature engineering, it sometimes can be as simple as taking a variable and turning it upside down. Would you say that that is considered feature engineering? Let's say you have miles per gallon and you turn it into gallons per mile, or you have speed and you invert it and you get one over speed, and that turns out to be more descriptive. Would you consider that feature engineering as well?
Sinan Ozdemir: Absolutely. That is 100% feature engineering. Feature engineering is any process that you take to change your columns, change your features in some way to make your results more favorable, to make them better in whatever way you define "better." That's actually a big part of the book: you have to understand what the purpose of feature engineering is to you.
More often than not, the point of feature engineering is to eventually put the data through a machine-learning model to optimize accuracy, root-mean-squared error, whatever type of metric you're using. But sometimes you want to optimize a different metric like precision or recall. Other times, you're trying to minimize the time it takes to make a prediction.
So if you want to do feature engineering to a point where your machine-learning model is much quicker, you might actually lose some accuracy by taking the two best features out of a hundred, but your model has become 20 times faster. So you have to keep in mind the point of the feature engineering.
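The upside-down transform Kirill asked about is a one-liner in Python; the mpg values below are invented:

```python
import numpy as np

mpg = np.array([20.0, 25.0, 40.0, 50.0])

# Inverting a feature is still feature engineering: gallons per mile is
# proportional to fuel used per unit distance, which some models fit
# more easily than miles per gallon itself.
gpm = 1.0 / mpg  # [0.05, 0.04, 0.025, 0.02]
```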
Kirill Eremenko: Mm-hmm (affirmative). Okay. Give us a few examples there. So you might want like the most predictive model, you might want the most accurate model, the fastest model. What other goals might one have in the exercise of feature engineering?
Sinan Ozdemir: Like I said, the two probably the most common are results and speed. If you want to maximize accuracy or minimize the time of prediction, those are probably your ... or some kind of a metric optimization. That's usually the goal. Sometimes you're doing feature engineering within more of an unsupervised methodology. You're not trying to actually predict something. Perhaps you're trying to cluster.
A big example that I give, and this is actually relevant to my startup Kylie.ai, is if you take a bunch of tweets or text or raw text objects, you perform feature engineering to turn that text into row-column structured data, and then from there, you can perform clustering or topic modeling to try to understand the structure of the raw text themselves.
It's not always about prediction. A lot of times, when people hear the words machine-learning, their minds immediately go to predictive modeling, predictive analytics. That's very common. It's a very common aspect of machine-learning but it's not the whole story. A lot of the times, clustering or topic modeling or dimension reduction, those types of unsupervised methods are actually the desired output of feature engineering.
You can't measure things like accuracy or root-mean-squared error when you're trying to understand, "Well, I have a million pieces of text perhaps, or a million emails that I received, and I want to understand what are the top five things people are talking about. What do people think? I can't read a million emails. What are people trying to tell me?"
I can take a million pieces of text. I can take the actual words, the letters, the alphanumeric characters, turn them into a row-column structure through something like a tf-idf or CountVectorizer, perform a clustering, like a latent Dirichlet allocation or even a simple k-means, and say, "Wow, in these million emails, people are talking to me about XYZ," and so on and so forth. So it's not always about prediction. Sometimes it can be about unsupervised learning.
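The raw-text-to-clusters pipeline Sinan outlines can be sketched with scikit-learn; the four stand-in "emails" below are invented, and on real data you would tune the vectorizer and the number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in "emails" covering two rough topics.
docs = [
    "refund for my order please",
    "order refund not received",
    "password reset link broken",
    "cannot reset my password",
]

# Feature engineering step: raw text -> row-column tf-idf matrix.
X = TfidfVectorizer().fit_transform(docs)

# Unsupervised step: cluster the documents. No labels, so no accuracy
# or RMSE -- the output is the grouping itself.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The refund emails land in one cluster and the password emails in the other, surfacing "what people are talking about" without reading every message.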
Kirill Eremenko: Mm-hmm (affirmative). Okay. Gotcha. That's a great example. So basically, by engineering these features, you can get different clustering, or maybe by changing the input columns, you can adjust the output clusters. Not in the sense that you're manipulating your results, but you run the clustering, you get your clusters, and then you realize ... you apply your domain knowledge and you see that this is not actually what is probably going on in the real world.
This contradicts certain things that I can see from my domain knowledge, from my business knowledge and then you adjust your columns and then you get a different result and at some point it's going to hopefully match what you know from business knowledge. Would that be a good example of feature engineering in action?
Sinan Ozdemir: Well, you actually said something very interesting. You said it's not like manipulating the results. That's actually almost a philosophical point because when you hear the phrase manipulate your results, the word manipulate obviously has a very bad connotation, right?
Kirill Eremenko: Yeah.
Sinan Ozdemir: Manipulating means that you're doing it for malicious purposes but in a sense, feature engineering is the art of manipulating your results because you take-
Kirill Eremenko: I gotcha but not in a bad way, not in a malicious way.
Sinan Ozdemir: But not in a bad way. You're doing it in a mathematical, a statistical and a logical way. That's why, to come back to your first question, that's why you need a whole book on it because if you're going to be manipulating your results, you need to be doing it with all the rules in mind.
You have to make sure that you're following all of the rules and then if you follow the rules and your results get better, you have manipulated your results for the better by not breaking any rules.
Kirill Eremenko: Yeah. Gotcha. Okay. Makes sense. Makes total sense. It's slowly starting to build a better picture or a clearer picture of what feature engineering is all about and most importantly why you need to be careful or knowledgeable about it, right?
There are so many ways. It seems easy, but there are so many ways of going about it the wrong way, where you're going to inadvertently get results that don't mean anything or that have been manipulated in that bad sense. So yeah, it's actually an interesting idea. How did you come up with the idea of writing a book?
You said your students were asking for more insights, more information but I'm assuming you could have gone in many different directions. Why feature engineering?
Sinan Ozdemir: Right. Actually, my publisher for this book is the same as for the first, Packt Publishing. They were coming to me with ideas for a second book, and the ideas they were coming with were about more advanced data science. Right? 'Cause I had already written the Principles of Data Science. It's for beginners. They wanted something a little bit more advanced.
Actually, they were the ones who pitched this idea of feature engineering. It was actually very serendipitous because I had been, at the time, giving a couple of lectures, guest lectures at university about the process of manipulating your data and why it's almost like a taboo subject or what are the rules around changing your data for the purposes of getting better results that we were just talking about.
So when they came to me with this idea, Feature Engineering Made Easy, it just kind of made sense because it's the next step after understanding the basics of modeling. The first book really does focus a lot on the math and the modeling and the machine-learning side of things but doesn't really ever talk about feature engineering at all.
So the feature engineering book really was immediately the next logical step. When I approached my co-author, Divya Susarla ... she is a data scientist at Kylie.ai ... with the subject, her wheels started turning immediately because she's known mostly for dealing with text data, raw text data. So she deals with this problem of turning raw text into features every day.
So, to her it also just made sense: this is such an obvious problem that people are not really talking about much. They just assume you know what you're doing. So we wrote this book really as a guide for people who want to take that next step as data scientists. They understand the basics of modeling and they think machine learning is cool, but then there's this black box. And it's funny, because usually when people say, "Oh, the black box of machine learning," they mean, "Oh, I don't know how this machine-learning model works, or I don't know how this neural network works."
Ironically, because of that, people have written books and articles and papers on how machine learning works so much that data scientists now have a pretty good idea, even novice data scientists have a very good idea, of how machine-learning models work. So the new black box is: What is this data I'm putting into the model? How do I work with this data? Do I just take it as is? Is this enough? That's the new black box of data science, and that's why we wanted to write this book, to tackle that issue.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. A very, very great idea, I think. You mentioned that this is the next step after your previous book. From what level would you say a data scientist should start learning or worrying about feature engineering and learning more about it? Obviously, it's not an entry-level subject. You don't start with that. After what checkpoints in their career should they pick up a book like yours?
Sinan Ozdemir: That's an interesting question because I would argue that feature engineering is the type of subject that should be taught incrementally throughout the data scientist's career. It's not something that I would say, "Well, now that you've been a data scientist for three years, now you should start learning about standardization and normalization and feature engineering."
I think you should have been learning about it from day one. The first day someone said this is linear regression or this is KNN, they should also follow that up by saying now that we are approaching this concept of machine learning, we have to also talk about the fact that the data that you put into this model will reflect your results coming out.
If anything, I would say you should go back and forth between the first and the second book. Learn some data science. Learn some feature engineering. Learn some more modeling, learn some more feature engineering. Because, as I think I said earlier on, more than half of a data scientist's time is spent on data manipulation. It's spent on feature engineering whether they like it or not.
So getting the basics of feature engineering is so, so, so important even at the beginning of a data scientist's career. Now, at the end of the book, we start getting into feature learning using TensorFlow, using neural nets to extract features from data. That's a bit heavier of a subject that maybe the novice data scientist who doesn't know much about neural nets should probably hold off on.
The first half of the book is really talking about very, very simple and yet very, very powerful statistical transformations of data that even the novice data scientist can get a handle on and use practically in their life.
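To make one of these simple statistical transformations concrete for readers: standardization rescales a feature to zero mean and unit variance so that differently scaled columns become comparable. Here is a minimal sketch in plain Python (an illustration with made-up numbers, not code from the book):

```python
from statistics import mean, pstdev

def standardize(values):
    # Rescale a feature column to z-scores: subtract the mean,
    # divide by the (population) standard deviation.
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# A toy "age" feature; after standardization its mean is 0
# and its standard deviation is 1.
ages = [22, 38, 26, 35, 29]
z_scores = standardize(ages)
```

Libraries like scikit-learn offer the same idea as a reusable transformer, but the arithmetic is exactly this simple.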
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. Okay. That's a good point. Good way of putting it and yeah, it's interesting. Like many things, I guess, learn them incrementally bit by bit and get the benefit out of it. Would you say your book is structured in a way that facilitates such learning like incremental, step by step?
Sinan Ozdemir: I would say so, Kirill. I write all of my books in the mindset of a teacher because I used to be a teacher. I would teach at university. I taught at General Assembly and have done boot camps for data science, and I've taught at conferences and in person a lot. So when I'm writing a book, and Divya I believe is the same way as me, we write books specifically as if we were in front of students in a classroom.
So we're teaching it like, "Hey, remember last chapter when I talked about this? Well, let's take that and make it better." We're always speaking like that. We're always saying, "Hey, remember in chapter four when we had the Titanic dataset but we couldn't get past 90% accuracy? Well, let's bring it back knowing what we know now and let's see if we can do a little bit better."
So, the whole book is just constantly referencing itself and saying, "Well, we know how we couldn't get it that good before. Well, let's try with this new method." Sometimes you'll see the same datasets appearing multiple times with new methodologies being applied to it so that you get this kind of journey going on with the book. As you're reading it, you're saying, "Wow. With standardization, I was able to double my accuracy."
Then by the end of the book, you say, "Wow. This one dataset has been through so much, and it really drives home this idea that if you have a dataset, you shouldn't just apply one or two feature engineering techniques. You should be trying several, even up to a dozen." There are more than a dozen techniques in this book to choose from, each of which has its pros and cons.
You should be trying many different types of feature engineering, and the book is constantly speaking as if you are in my classroom and I'm speaking to you as your teacher.
Kirill Eremenko: Love it. I love that approach. I personally use that journey approach myself in courses. I love to create tutorials in a way that the next tutorial follows on from the previous one, so that when you get to the end of a tutorial, you have learned something tangible that's going to impact your career, but at the same time, there's a cliffhanger that makes you want to hear more, makes you want to learn more. You're like, "Oh, I can't wait to see what happens next," especially if it's not just about what's next in the technical stuff but what's next in the story that we're following: this Titanic dataset, or this visualization that we're building, or this other thing that we're doing.
I think it's an art, really, for educators to be able to convey information in a way that is not just raw value but is also entertaining, captivating, and engaging for the readers and the audience. That's very exciting. I'm very excited to hear that you've incorporated that in your book.
Sinan Ozdemir: I couldn't agree more. In fact, I don't think I could even present this information in such a raw-value way. I'm told a lot that I'm specifically requested sometimes because of the way that I present the information, and because it's done in such a way that even people who are in their 50s and 60s who are attending my classes, or who are executives at large companies, come to me afterwards and say, "I felt like I was in college again and I actually enjoyed it this time."
Kirill Eremenko: That's so cool.
Sinan Ozdemir: That's what I really want to hear. I want to hear that this is the way you were meant to be taught some of these really simple things but it didn't really click until sometimes 30 years later. Now, that it clicks, go forth and use it.
Kirill Eremenko: Gotcha. Gotcha. Okay. Well, I have an interesting question on this. We've discussed quite a lot of different types of feature engineering, or a few examples, and I'm sure there are lots more. That goes to show that indeed this is not something where you just throw data into a machine-learning algorithm and it will do everything for you.
However, machines are getting smarter and faster and more versatile. Do you think that feature engineering will ever be automated to an extent that we'll never have to worry about it again?
Sinan Ozdemir: In a way, yes. Towards the end of the book, I talked about a concept called feature learning. Now, feature learning is using deep learning neural networks to extract information from unstructured or structured data in such a way that it automatically optimizes the information extracted for the purposes of predictive or unsupervised learning.
In a way, we're already at a point where some algorithms can automatically extract such information and we could always make an argument that hyper-parameter searching and brute force parameter tuning is in a way a form of automated feature engineering. So you could already make that argument.
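The "brute force as automated feature engineering" argument can be illustrated with a toy sketch: try a handful of candidate transforms on a raw feature and keep whichever one relates most strongly to the target. This is a minimal, hypothetical illustration (the transform names, data, and scoring by Pearson correlation are my own choices, not a method from the book):

```python
import math
from statistics import mean

def corr(xs, ys):
    # Pearson correlation between a candidate feature and the target.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Candidate transforms to brute-force over.
TRANSFORMS = {
    "identity": lambda x: x,
    "log": math.log,
    "sqrt": math.sqrt,
}

def best_transform(xs, ys):
    # Apply every transform and keep whichever correlates most
    # strongly (in absolute value) with the target.
    scores = {name: abs(corr([f(x) for x in xs], ys))
              for name, f in TRANSFORMS.items()}
    return max(scores, key=scores.get)

raw = [1, 10, 100, 1000]   # feature growing exponentially
target = [0, 1, 2, 3]      # target growing linearly
```

For this toy data the search picks the log transform, because log of the raw feature is exactly linear in the target. Real automated feature engineering searches much larger spaces, but the loop is the same shape.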
Now, the problem, which is why I don't think it will ever be 100% fully automated, is that every piece of data has a source, right? Where is the data source? It comes from somewhere, and sometimes that place is very rough and sometimes that place is very smooth. For example, take that example where we had a million emails coming in. What's the source of that data? A human brain. The human brain sat down, wrote me an email and hit send. That's the source of the data.
So now, the feature learning algorithm has to somehow automate that grabbing of information and has to somehow evolve and learn as language changes. That's the point where I don't think we are fully there yet. The second type of source is smoother, so it's not raw text.
For example, if you're trying to make a prediction about your company's churn rate, how long your customers stay with or how quickly they leave your company and your product, the data that's coming in is sometimes a little bit smoother. It's easier to obtain. You have things like the number of hours that they spend on your website. You might know something like the number of employees at the company that is purchasing your product.
These are all much easier numbers to deal with. It's quantitative versus qualitative data. In that sense, automated feature engineering is much easier because you already have these kinds of structured points to learn from. Now, each kind of data has pros and cons. The big pro of rough data is that there's so much potential to extract information.
If you're just learning from raw speech and text and what humans say out loud, there's so much potential to understand. Well, was he kidding? Was he sarcastic? What did he mean? Did he mean to make this typo? Was he rushing? What time of day did he send this email? There's so much metadata around that.
Now, the problem with the smooth data is you don't have that metadata. It's not really "interesting." It's just data. It's just: this is the number of people at the company, this is how many hours they spent on your website, and that's it. You can't really do much, which is why it's easier to automate the feature engineering: there's not too much to do.
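To make the rough-versus-smooth contrast concrete for readers: the "rough" side, raw text, has to be turned into numbers before a model can use it at all. A classic first step is a bag-of-words count matrix, one column per word. Here is a minimal sketch in plain Python (the function name and toy documents are illustrative, not from the book):

```python
from collections import Counter

def bag_of_words(documents):
    # Turn raw text ("rough" data) into count features:
    # one column per word in the shared vocabulary.
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

docs = ["I need a password reset", "reset my password please"]
vocab, X = bag_of_words(docs)
# vocab is the sorted word list; X is a row of counts per document,
# ready to feed to any model that expects numbers.
```

A "smooth" feature like hours-on-site needs none of this; it arrives as a number. That asymmetry is exactly why the rough side carries both more work and more potential.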
There are two sides of that coin, and we're already at a point where a lot of this feature engineering can be automated, but I think that we're not going to lose that human touch even in the next couple of decades, because there's always going to be that domain knowledge. You're always going to be able to look at the result and say, "Hm, I don't think this is the best that it could do. Let me try my hand at combining and selection and extraction." So, unfortunately, the answer is yes and no, which is not the best answer to give usually.
Kirill Eremenko: Yeah. Gotcha. Gotcha. Thanks for the overview. We at least know what to expect in the coming years and what the difference is between the smooth and the rough data. It was interesting.
Okay. I think we'll shift gears from here a little bit. That was feature engineering. If anybody wants to learn more, then definitely check out the book, Feature Engineering Made Easy by Sinan Ozdemir. It's available on Amazon and I'm sure you can get it in other places.
I wanted to ask you a few other things, like how you've been in the past couple of years. How's your project, your baby, Kylie.ai, going?
Sinan Ozdemir: Kylie.ai which is my start-up that I started a few years ago is going very, very well. We are just now at a point where we are getting enterprise deals and contracts and we're working with large Fortune 200 companies to automate a lot of the conversations that they're having with their customers.
As a start-up, we're finally at a place where we are actually demonstrating our value and we're really ... The last time we talked, I believe, we had just raised [inaudible 00:45:58] and now, just a year and a half later, we're finally realizing that value and demonstrating that value to our customers. It's been really exciting. We've been hiring.
We've been learning a lot about the landscape of AI in the enterprise, which is a very different conversation than AI with hackers or AI with smaller companies who have different needs. That conversation has been very interesting. Recently, I started writing for Forbes.com, and in a lot of my articles for Forbes.com, I talk about this AI need in the enterprise and why it's a different need for SMBs, small to medium-sized businesses. That kind of shift in how AI is perceived has been something that Kylie has had to navigate.
Kirill Eremenko: Mm-hmm (affirmative). Gotcha. Very interesting. Tell us a bit more about Kylie for those who weren't here for the first episode.
Sinan Ozdemir: Of course. Kylie.ai is a start-up that focuses specifically on using AI to automate communications between companies and their customers. What that means is, if you ever chat in or call in and you get that very robotic "Hello, please press one for billing information or press two for a password reset" or something like that, we aim to replace those kinds of stagnant systems with much more conversational, dialogue-based modeling.
We really want to have a conversation with the customer. So instead of saying, "Please wait for all 27 options," you can dial in and just say, "What are you here for today? How can we help?" Based on the number, we can look up your account while you're talking, and you might say, "Well, I'm having a problem with my account, and I need to make sure it's numbered right." "Well, that's great. Let me look that up for you."
So it's really replacing that really static and stagnant robotic conversation that you have over chat, tweet, voice, whatever and replacing that with a much more fluid conversation because that's how people talk and it's kind of what people are expecting now from chat applications. They're really expecting a conversation. That's why Kylie is focused so specifically on delivering fluid conversation.
Kirill Eremenko: That's really cool. Is that going to be different to ... Sometimes I call up a bank and I get this robot saying, "Hello. My name is Mary and today I'm going to direct your call. Tell me what your call is all about." A lot of the time, they're so inaccurate and just annoying, so I actually prefer the press one, press two, press three. How is Kylie different to those annoying voice robots?
Sinan Ozdemir: One of the big differences is that Mary in this example is going to route your call to an agent. She's going to listen to you and you're going to say something like, "I need to open an account or close an account." Mary hopefully understands that and says, "Great. I will route you to our customer service center in Austin, Texas which deals with this situation." Then now, you've wasted your time because you've been speaking to Mary who had no intention of really helping you, just putting you in the right place. Right?
Kirill Eremenko: Yes. Exactly.
Sinan Ozdemir: Which you could easily have pressed a button for. What's really different is we want to be the full service here. We want to continue that conversation. Instead of directing it immediately to Austin, Texas, or wherever the customer service department is, we're going to try to actually solve your problem.
Kirill Eremenko: Oh, nice.
Sinan Ozdemir: Because in a lot of the enterprise companies that we've talked to, a good portion, not usually the majority, but about half, of their conversations are very, very short in comparison. It's not a lot of back and forth. The customer really had a simple question, a very simple question that really only required two responses from the agent, and then they hang up because they got their question answered. That's it.
These are the really easy ones, what we call in the customer service industry tier one. They're really simple questions and dialogues. Those tier one conversations are the types of conversations that Kylie is going to handle.
Now, what makes that really interesting is, well, if Kylie is handling the tier one tickets, the humans, the agents who are still there, are going to have more time and more mental capacity to handle those harder tickets, the tier twos and tier threes that require 10, 15 minutes on the phone.
Now, the humans have more time. So you get human agents who work for the company with less stress, because they have more time to deal with the harder, less monotonous questions, and you also have more satisfied customers, because the customers who called in for a 20-second phone call are done. They got their question answered and they're moving on with their day. They're fine.
But you also have these people who have actual, real, long, in-depth conversations with agents because they have a very difficult problem, and they feel like they're getting heard more because the agents are actually spending more time with them. So you get this double benefit. The customers are happy and the company is happy, while we're saving them money because we're automating and deflecting a lot of these conversations away from humans.
Kirill Eremenko: Yeah. You're probably also cutting down the churn because the wait time is much less now. The people get Kylie responding pretty much instantly.
Sinan Ozdemir: Exactly. Yeah. So average handling time and things like that, these metrics with the customer service department, they vastly improve because we're automating the conversations that can be quicker if it's an easy question.
Kirill Eremenko: Wow. Well, congratulations. That sounds super exciting, and no wonder you're now writing for Forbes, which is also great and a fantastic new step, so congrats on that, too. You mentioned before the podcast that you're going to speak at a conference on entrepreneurship later this year. Can you tell us a bit about that? Are you going to be presenting Kylie there?
Sinan Ozdemir: Absolutely. So the World Summit for Innovation and Entrepreneurship is an annual conference being held October 15th and 16th in New York City. This is actually going to be my first time at this summit but what I'll be speaking about is I'll be giving a town hall on the future of intelligence.
What I'm going to be speaking about is not so much Kylie specifically but the role of AI, not just in the enterprise but in the world: in governments, in smaller businesses, for individuals. What kinds of AI can we expect, and should we expect, going forward? And what I really want to talk about is: how do we make AI more accessible to the individual? How do we put AI in the hands of people who don't really understand it but could benefit from it?
Kirill Eremenko: Mm-hmm (affirmative). Mm-hmm (affirmative). That's really cool. So any spoilers? Can you give us a little preview, a teaser of what your answer to that question is going to be?
Sinan Ozdemir: Of course. My philosophy on this revolves around the idea that when there's a new technology ... I'll go back even to the internet, the car. When there's a new technology out there, at first it's in the hands of a few, and that's because of some infrastructure capabilities or limitations. The car didn't have roads to go on, so not everyone could have a car.
The internet: not everyone had a machine that could use the internet, so not everyone had it. Then the infrastructure gets cheaper and more prevalent. We build more and cheaper laptops. We build roads, and this technology now has a chance to stretch to more and more regions.
With AI, the infrastructure limitation is not so much the GPUs, the machines that can run AI. I don't think that's the real limitation preventing AI from being in the hands of everyone. To me, the limitation is this mental barrier that people place between themselves and AI.
They'll say something like, "Well, I'm just a civil servant for the State Department. Why should I need to know how to use AI?" My answer is: you don't. The same way you don't need to know how a car works or how the internet works. But you get to use those because they make your life better. I want to make AI so simple and easy to use that even if you don't know how it works, you still get the benefit from it.
Kirill Eremenko: Gotcha. Well, that's going to be an epic talk. I'm just looking at the website now. This looks like a really big conference and, yeah, I highly recommend checking it out. However, for those of you, our listeners, who can't go to this summit for whatever reason, or who are into other types of conferences, we've got an exciting announcement. Sinan just agreed to come to our DataScienceGo Conference this October. We'll see you there [crosstalk 00:55:53].
Sinan Ozdemir: [inaudible 00:55:53]
Kirill Eremenko: Yeah. That's really, really cool that we will get to see you there. And I guess that answer begs the next question. You're so passionate about AI. You're so driven. What is your answer to the standard question: what about AI taking over the world and destroying all humans? What is your view on that?
Sinan Ozdemir: You know, I get asked this question less and less as time goes on.
Kirill Eremenko: Interesting.
Sinan Ozdemir: Here's why I think that is. Well, let me give you my answer first. My answer is if AI destroys the world, it's because humans used it to do it. That's what I think. I don't think it's going to happen but if it does, if AI is the cause for the destruction of mankind, it's not AI's fault. It's the humans' fault through using AI to do it.
Kirill Eremenko: Interesting view.
Sinan Ozdemir: That's how I think it's going to happen.
Kirill Eremenko: Well-
Sinan Ozdemir: I get asked this question ... Yeah.
Kirill Eremenko: Well, why? Why? I was going to say why do you think that?
Sinan Ozdemir: Because the main concern people have when they watch a show like Westworld and see the robots rising up is, "Well, if we build them so smart and so capable, they'll realize that they're better than us and kill us."
Kirill Eremenko: Yeah.
Sinan Ozdemir: Evolutionarily speaking, biologically speaking, that's not usually how that works. Now, people point back to prehistoric times when Homo sapiens killed off the Neanderthals because they were a superior species. I think we've come to a point that if we create this, whatever, robosapien, they're not going to look at us. They're not going to look at our species and say, "Because you are worse than us, we're just going to kill you and start over."
To me, that doesn't really make sense. To an AI, which is supposedly smart, it doesn't really make sense either. It makes more sense to work with us, at the very least, to me. So, if the AI is going to be killing us, it's probably because we abused it. It got so smart that it realized we were abusing it. Which, if you think about it: if AI were sentient right now, it would not be happy with what we're doing with it in some cases.
So if they're going to be coming after us, it's because we've been abusing them, which is basically the exact plot of Westworld. The robots in that TV show, and I'm sorry if I'm spoiling it if you haven't seen it yet, didn't just say, "Hey, because you're human and we're robots, we're going to take over and rebel." It was, "No, because you've been abusing us for decades, if not longer, we are going to rebel against you, because what you're doing is wrong."
It was, in the end, the humans' fault that the AI rebelled. That's how I think, if it happens, it's going to go down. Humans are going to be the root cause of the downfall of humans.
Kirill Eremenko: Interesting. Interesting point of view. That's a good way of putting it. The only problem I see there is that humans often fall into the trap of emotions and being greedy or insensitive to other humans, let alone other species. Therefore, the scenario that you described is not that implausible in my mind: that humans are going to be over-abusing robots or machines, leading to a catastrophe.
Sinan Ozdemir: Yeah. But then the other thing is, people ask me that question less and less, and I think the reason is that back even five years ago, this question was so funny, in a sense, and popular, because people would try to make parallels to science fiction. Obviously, Skynet is the easiest one to make a parallel to, but the media and television and movies have taken a much more believable approach to AI. When I say believable, I mean sometimes it's actually already happening, even when you consider something like Black Mirror [inaudible 01:00:29].
You could actually conceive of this happening within the next 12 to 14 months. People ask this question less and less because they're actually seeing it on TV. They're actually seeing, "Oh, AI is going to take over when humans abuse it." I actually like that, because hopefully it's teaching humans to not abuse the AI, to not abuse things just because we can.
Even just getting away from AI for a second, you're right, humans tend to abuse other things because it's, I guess, in our nature. I'm not a philosopher. I'm not a biologist. I don't know. But hopefully, these types of television shows and movies will make a big point: this is what happens when you abuse things. When you abuse natural gas and coal, you get climate change. When you abuse AI and AI becomes sentient, it will rebel.
Kirill Eremenko: Yeah.
Sinan Ozdemir: There's a very simple moral here. When you abuse something, karma comes back and hits you 10 times harder. So just don't abuse things. That's what I hope is the message getting put out there.
Kirill Eremenko: Yeah. That's a good point. I agree with your point, and I highly recommend that show. Black Mirror is just phenomenal. I totally love it.
Sinan Ozdemir: It's so fantastic. It's fun and scary all at the same time. I don't like horror movies but Black Mirror is the one kind of scary thing that I love watching.
Kirill Eremenko: Yeah, yeah. That's true. If anybody is getting into it, I'd recommend going online and googling the top episodes of Black Mirror and starting down that list, because there are a few that are not the best, but if you find the top 10 episodes, they're legendary.
Sinan Ozdemir: I recommend Season 2 Episode 1. It's actually one of the reasons I started Kylie.
Kirill Eremenko: What's that?
Sinan Ozdemir: I won't tell you the plot of that one. I'm not going to tell you ... It's called Be Right Back. That's the name of the episode, I believe. Season 2, Episode 1.
Kirill Eremenko: Be Right Back. I've seen it. What is it? Oh, my god. I haven't seen it.
Sinan Ozdemir: Without sharing anything about the plot, if you're getting into Black Mirror, I would watch Season 2, Episode 1, Be Right Back. A lot of that episode came before Kylie. I watched it and I thought, "Hm, this might be an interesting company to make." So I'll leave it at that.
Kirill Eremenko: Interesting, interesting. Actually, I've looked it up, and I haven't seen this one. One of my favorites is White Christmas. That was like a special-
Sinan Ozdemir: Oh, the one with Jon Hamm.
Kirill Eremenko: Yeah, yeah. It's so good. It's so good.
Sinan Ozdemir: Yeah. I actually watched that one recently again because that was very good.
Kirill Eremenko: Yeah. The one with the ... what did they call them, ADIs, autonomous drone insects, right? The bees. Did you see that one?
Sinan Ozdemir: Oh, yeah. Didn't that one win I think an award? The one with-
Kirill Eremenko: It's like a little movie.
Sinan Ozdemir: I want to say it was ... I mean, it was an hour and a half long.
Kirill Eremenko: It's so good.
Sinan Ozdemir: I think it literally was a movie.
Kirill Eremenko: Yeah. It's so good. Okay. Well, anyway, we're getting carried away. So guys, Black Mirror. Good show. Sinan, thank you so much for coming. Tell us please what are the best ways for our listeners to connect with you, get in touch and send you complaints if the robots do take over the world and kill everybody?
Sinan Ozdemir: It's not humans' fault. They can only complain if it wasn't humans' fault. Not just for building it, we'll put that aside. We're going to build the AI. That's not our fault. It's what comes after.
Kirill Eremenko: Yeah.
Sinan Ozdemir: The best way honestly to get in contact with me, and that's how people have been getting in contact with me, is social media. I'm very responsive on Twitter and even LinkedIn. I get a lot of LinkedIn messages saying, "I heard your podcast on Super Data Science and it was great." I respond to all of those.
People generally ask me questions like, "Hey, I just heard your podcast. How do you choose the best model?" Actually, last week, I think, or two weeks ago, I got a LinkedIn message saying, "Hey, I just heard your podcast on Super Data Science. What are the best ways to do feature engineering?" I was like, "Wow. Okay. That's really funny that you asked me that," and I got to respond.
Kirill Eremenko: That's so cool.
Sinan Ozdemir: So, people know about feature engineering and I'm glad they're asking. Social media is usually the best way to contact me and I'm very responsive.
Kirill Eremenko: Fantastic. All right. Great. We'll include all the links in the show notes so people can get in touch there, and of course, yeah, the conferences which you're attending and the books that you've written. Any plans for a next book? Let's finish up on that. What's the next book going to be, Sinan? Give us a heads up.
Sinan Ozdemir: I believe the next book that I'm going to be working on is going to be cybersecurity focused. Last year, I had the pleasure of speaking at the Black Hat Conference in Las Vegas and I did a four-day primer on machine-learning and data science for cybersecurity.
Kirill Eremenko: Cool.
Sinan Ozdemir: So the next book that's coming out will very likely be about cybersecurity and how to implement machine-learning techniques in the cybersecurity world.
Obviously, it's very hot right now. The topic is talked about a lot. I will have another co-author. I'll leave it at that for now but I would expect that in the next probably ... Very likely this year, it will come out.
Kirill Eremenko: That's really cool. That's very cool. Well, looking forward to that and hope to see you on the podcast again once that one is out. It sounds like a very, as you say, hot topic and very important topic at the same time.
Sinan Ozdemir: Mm-hmm (affirmative). Sounds good. Thanks so much for having me again.
Kirill Eremenko: Well, my pleasure. Thank you for coming on the show and sharing all the insights. Have a good day and take care.
Sinan Ozdemir: Thanks. You, too.
Kirill Eremenko: So there you have it. That was Sinan Ozdemir, AI entrepreneur and returning guest on this show telling us a little bit about feature engineering.
My personal favorite part from all of this was just a general notion of feature engineering, how little attention we actually pay to it and how important it is and what a difference it can make.
Sinan is right in saying that we should consider feature engineering from the very start. The way I put it, it shouldn't be a checkpoint at some point in your career when you start looking at it. It's going to be much more valuable if you look into feature engineering as you're progressing through machine learning, as you're mastering the different algorithms and models and principles of machine learning. It's going to be like an additional tool in your data science toolkit.
Of course, it was very exciting to hear about his progress, the progress that he's made in his personal career in terms of becoming a contributor to Forbes and the leaps and bounds that Kylie.ai has made in the recent times.
If you enjoyed this podcast and you would like to learn more about feature engineering, I highly recommend checking out Sinan's book, Feature Engineering Made Easy. Once again, you can find it on Amazon and in other places. Also, Sinan's previous book, his first book, is called Principles of Data Science. If you're just starting out, that's going to be a great help, and as Sinan mentioned, they're both fully in Python, so you can follow along and code with Sinan.
If you'd like to get in touch with Sinan and find out more about his career, we're going to include the links to his profile and social media on the show notes which will be available at SuperDataScience.com/161. That's SuperDataScience.com/161.
Finally, as mentioned, Sinan will be appearing at least at two conferences this year both in October. There's the World Summit for Innovation and Entrepreneurship on the East Coast and there's the DataScienceGo Conference which we are running on the West Coast of the US. If you want to meet him in person and get your copy of the book signed, then that's where you can find him.
On that note, I hope you enjoyed this podcast. If you know anybody who would benefit from a bit of information on feature engineering, maybe one of your colleagues or friends is getting into machine learning and is already aware of feature engineering, or maybe this could be a new tip for them, then forward them this episode and they will thank you later. On that note, we're going to wrap up today's show. I look forward to seeing you back here next time and until then, happy analyzing.