SDS 279: Embedding Data Science in Business

Podcast Guest: Kevin Perko

July 17, 2019

This is a great episode for anyone interested in moving into a data science manager or data science leader role.

About Kevin Perko
Kevin Perko is the Head of Data Science at Scribd, the leading subscription reading service. He focuses on evaluating search engine performance, building data pipelines, and democratizing access to data through various initiatives including Reddit-style AMAs, emails, and individual outreach.
With over a decade of analytics experience, Kevin has worked for a multitude of Bay Area startups including Eventbrite, GREE, and Education.com. He has a background in Finance from Santa Clara University and has volunteered with The University of Cape Town to teach computer skills in the townships of South Africa.
Overview
One thing Kevin is incredibly interested in is the interdisciplinary nature of data science, where different applied fields come together: psychology, geology, philosophy, physics, and virtually any field where there is a study of information. The main point there being: what problem do you want to solve? When you think about data science this way, there are few limits to where data science can have massive impact. A great example is Kevin’s own history in the industry. He started in finance, where he found data and the numbers fascinating. From there he began work at tech companies at the onset of true data science, when social media and search engine optimization took off. The industry showed up around him, and it was exactly what Kevin wanted to do.
The biggest difference between being a data scientist and being a data science manager, Kevin says, is going from solving problems yourself to figuring out how you can help others solve those problems. It’s less technical: a typical day for Kevin is meetings from 10:30 to 3:30, one-on-ones with the team, and very little time for actual technical work. The work at Scribd, which is a reading subscription service, is a combination of user data and content metadata. A typical project for the team is something along the lines of using historical interaction data to improve the success metric that the search ranking model is trained against, with retention as the outcome to optimize.
One interesting project Kevin is working on is the Book Genome Project as a way to take ambiguous, vague elements of books and find ways to relate such terms as “dystopia” to other books. They’re not looking for your typical recommender system where you’re trying to offer two like things, but rather offering a sequence of multiple elements you might enjoy (such as “dystopia” plus “set in London” or “science fiction” plus “noir characters”). If someone continues to pick up science fiction books, it doesn’t necessarily mean science fiction is what they’re after. And readers might not even know this about themselves. They’re starting this through publisher data and human curated keywords for preliminary training. In terms of neural networks vs machine learning, Kevin doesn’t see them as different. They complement each other and balance out what the other doesn’t have. In the past the mentality has been: if we can make a good decision then we’ll go do it. As machine learning and neural networks start to work in tandem it becomes: if we can explain this decision in a good way, then we’ll go do it. The decision might be less objectively “good” but more societally ethical. You need to ask the question: is this explainable AI?
Across the board, Kevin’s work has helped encourage companies to realize that data science itself is a product, rather than a tertiary part of the company. Having profuse collaboration across departments in a business, working directly with the data science group, allows a business not only to have success but to scale. Data scientists shine when they’re working consistently with other departments to produce the best product.
In this episode you will learn:
  • Kevin’s view on the interdisciplinary elements of data science [6:00]
  • How did Kevin get into data science? [10:49]
  • Being a head of data science [14:07]
  • What is Scribd? [16:55]
  • The Book Genome Project [26:30]
  • Neural networks vs machine learning [34:30]
  • Kevin’s various types of roles and advice [45:53]

Podcast Transcript

Kirill Eremenko: This is episode 279 with Head of Data Science at Scribd, Kevin Perko.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. Each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now, let’s make the complex, simple.
Kirill Eremenko: This episode is brought to you by our very own data science conference, DataScienceGO 2019. There are plenty of data science conferences out there. DataScienceGO is not your ordinary data science event. This is a conference dedicated to career advancement. We have three days of immersive talks, panels and training sessions designed to teach, inspire, and guide you. There are three separate career tracks involved, so whether you’re a beginner, a practitioner or a manager you can find a career track for you and select the right talks to advance your career.
Kirill Eremenko: We’re expecting 40 speakers, that’s four, zero, 40 speakers to join us for DataScienceGO 2019. And just to give you a taste of what to expect, here are some of the speakers that we had in the previous years: Creator of Makeover Monday Andy Kriebel, AI Thought Leader Ben Taylor, Data Science Influencer Randy Lao, Data Science Mentor Kristen Kehrer, Founder of Visual Cinnamon Nadieh Bremer, Technology Futurist Pablos Holman, and many, many more.
Kirill Eremenko: This year we will have over 800 attendees from beginners to data scientists to managers and leaders. So there will be plenty of networking opportunities with our attendees and speakers, and you don’t want to miss out on that. That’s the best way to grow your data science network and grow your career. And as a bonus there will be a track for executives. So if you’re an executive listening to this, check this out. Last year at DataScienceGO X, which is our special track for executives, we had key business decision makers from Ellie Mae, Levi Strauss, Dell, Red Bull, and more.
Kirill Eremenko: So whether you’re a beginner, practitioner, manager or executive, DataScienceGO is for you. DataScienceGO is happening on the 27th, 28th, 29th of September 2019 in San Diego. Don’t miss out. You can get your tickets at www.datasciencego.com. I would personally love to see you there, network with you and help inspire your career or progress your business into the space of data science. Once again, the website is www.datasciencego.com, and I’ll see you there.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen. Today I’ve got a super exciting guest, another speaker who will be joining us for DataScienceGO 2019, at the end of September, this year. If you haven’t gotten your tickets yet, check out www.datasciencego.com. Today we have Kevin Perko. Kevin is the head of data science at Scribd, and he is leading a team of approximately 13 data scientists, between San Francisco, and Toronto. We had a fantastic chat today, so here are a couple things that you will take away from this conversation.
Kirill Eremenko: You will learn what it’s like to be a data science manager, or a data science leader, and what it’s like to manage a team, and more so two teams, in two different locations, and how that is different to actually doing the technical work. If you’re thinking of progressing as a data scientist to a data science manager, or to a head of data science, this will be very valuable for you. Also, you’ll learn about the Book Genome Project, that they’re doing at Scribd, which is a very exciting undertaking. You’ll learn what it’s like when a company sees data science as a product, as opposed to an auxiliary function.
Kirill Eremenko: If you’re a business owner or an executive, you’ll learn a very valuable concept of decentralized, or embedded teams, versus core data science teams. What’s the difference when your data scientists or machine learning experts are embedded throughout your organization, versus when they’re in one core centralized team of data scientists, what are the advantages and disadvantages of each approach, and what stage of the business should you be doing each one in, and what should you be aiming for.
Kirill Eremenko: Finally, if you are in Toronto, or San Francisco, and you are looking for a job or considering a new role in data science, then stay tuned for this podcast, because Kevin will announce that they’re hiring, and you might just like this company, and might just want to check them out. On that note, very exciting podcast coming up. Can’t wait for you to check it out. Let’s get straight into it. Without further ado, I bring to you, Kevin Perko, Head of Data Science at Scribd.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen. Super excited to have you on the show here today, with my lovely guest, Kevin Perko, calling in from San Francisco. Kevin, how are you doing?
Kevin Perko: Doing great. I’m doing great.
Kirill Eremenko: It was fun chatting just now about like, a book, and you haven’t written one yet. If you were to write a book, what would it be about?
Kevin Perko: Oh, that’s a great question. If I was going to write a book, I think I would focus on kind of how interdisciplinary data science is, and how that is really kind of what makes it come alive. You’ve got elements from psychology, you’ve got these general things around just being curious, and you’ve got to really program, and build models, and sort of represent in the world, and I think all of those things kind of come together in this nice sort of like, systems thinking, complex systems type fields of study, that people don’t usually study who do data science. I also think it’s why people who do study something like physics, which is literally the building blocks to the universe, tend to do really well in data science.
Kevin Perko: I think my book would be try to capture more of these elements and kind of interweaving them, and showing how these things are building on each other, and why neural networks are kind of something that’s really interesting, comes out of something from ’70s really, even before that. It’s not like a new thing, but just to give people a sense of understanding on how everything is interrelated, and it’s all towards understanding how we model these things, and that while people like to talk about AI, there isn’t really anything that approaches general intelligence yet. Still really mapping these functions to output values.
Kevin Perko: I think understanding the systems in which these operate are really, really interesting.
Kirill Eremenko: Very, very true. Do you feel that data science kind of came together as a chain of development … not even chain, like a group of developments in different fields. You know, there’s elements of data science that come from economics, there’s elements that come from physics, as you mentioned, there’s elements that come from neural networks and IT, there’s elements that come from mathematics, even biology. Some of the statistical apparatus, especially in R, originally came from AB testing, and random sampling in biology, or in medicine.
Kirill Eremenko: Do you have this feeling that data science kind of right now it’s a separate science, there’s arguments to support that, but originally it independently grew in all these different fields? 
Kevin Perko: Right, right, absolutely. I think a good correlator for this is that if I was going to recommend what somebody should study, I would rather see them study computational biology, mathematics, physics, as opposed to data science itself, because then you’re kind of removing yourself from the actual subject you’re studying, and data science is always applied. We’re never just, at least in industry, thinking about how to make [inaudible 00:08:06] to sound more efficient, think about how to apply it to solve a problem. You come up from a computational X area, that’s really what you’re going to be doing.
Kevin Perko: I see sometimes people come out of this sort of generic data science programs, like, I want to go do NLP, and it’s like, what problems do you want to solve with that, like why do you care about having this tool, so you can leverage it for solving a problem, whether it’s in health care, or physics, or business. I think that’s where it gets really exciting, is when you mix those applied fields together. Somebody, I’m kind of remembering here, somebody was studying glaciology and they were actually applying data science methods, and they were able to map how glaciers are moving, where people previously hadn’t been able to. It’s like, that’s where data science really shines.
Kevin Perko: That’s where it gets really exciting. Yeah, I think that that’s kind of like my … My thing is that I almost think it shouldn’t … It can’t be a separate thing. It has to be in all of these things, because it can help all of these various fields move forward faster, as opposed to just itself.
Kirill Eremenko: Wow, very interesting perspective. Applied data science, great way to get started into the field. I guess if you combine it with something that you’re passionate about, something that like somebody who’s doing glaciology has to be excited about glaciers, and there has to be some story behind it, why they’re doing it, I guess if you do it that way, you get the extra boost of seeing how applying data science to this field that you’re very interested in, can make massive progress and massive impact in that field.
Kevin Perko: Absolutely, absolutely. I think that’s really where people drive breakthroughs, is when they bring a couple different fields together. Data science is a great one that you can bring it to almost any field. It can help you rather infer, compute, figure out what is the true structure of all of these different areas, and that’s really powerful. There’s not a lot of fields that do that, but if you’re just focusing on where do I run around and how to apply data science and algorithms to, you get a lot of interesting things. You see a lot of the voice to face, or the deepfakes, and all this stuff.
Kevin Perko: There’s people that, well, there’s social media, and I can get a lot of press if I do this thing that’s going to freak people out. Then that’s what happens, and we end up building something that kind of scares people about AI, and also has a debatable social value, rather than like really pursuing trying to building up breakthroughs in hard sciences, which is really exciting and really valuable to the world. That’s where I see the trade off.
Kirill Eremenko: What’s your story? How did you get into the space of data science? What did you study?
Kevin Perko: I actually studied finance. For me, data science really happened to me. I was always interested in numbers, and thinking about numbers, and I had picked up programming when I was younger, sort of on and off. Then in school, kind of switched over to like, I really just want to do banking, because I love the stock market because it had so many numbers associated with it, and all this future value of money, and all these kinds of things that are really interesting. I didn’t have … For me, it didn’t click, like oh, I should do computer science yet. Then I got out and I was like, I definitely should have done computer science. 
Kevin Perko: Just ended up, I was like I just got to work at a tech company, which I did. Did a variety of roles there. I got into building an application. It was powered by data though, so I got to interact with the data. I was building what we call ETL pipelines now, but nobody really had a name for it then. [crosstalk 00:11:38]-
Kirill Eremenko: [crosstalk 00:11:38], right?
Kevin Perko: Hmm?
Kirill Eremenko: Extract, transform, load.
Kevin Perko: Exactly, exactly. Nobody really knew what the next thing, nobody was like, let’s do analysis on top of it. We did a little, like a very light statistical analysis. I did a little work with SEO, because we had more of a long tail application. From there I basically knew that I wanted to do more of this, but I still didn’t really have a name for it. People are thinking it was FP&A, which is definitely not what it was, because it was much more computer science oriented. Roughly around this time, Facebook started to come out. The term data scientist got popularized, but it was only for PhDs at this point, for the most part.
Kevin Perko: They were solving these really, really massive problems at scale, that didn’t previously exist. They also had a ton of users, and so all these unique problems that most start ups didn’t have, they couldn’t really do this. I kind of went from there to the next company. I again did something similar, but I was closer to the analytics this time. That kind of gave me the freedom to do all this analyses, finally get into building some models, doing some fraud modeling, some graph analysis, and that’s really where I was like, “Ah, this is incredible.”
Kevin Perko: Booting up Gephi first time, and loading a graph in there, and really seeing the representation of these relationships, and how you could walk down the node, and see how people are related and how fraud circles form, fascinating stuff. This kind of hooked me and then I was like, I again need to do more. I’m going in the right area, even though I’m not really sure what it is now. It’s finally called data science, really diving in to learning Python, everything else I need to.
Kevin Perko: From there then I worked for a gaming company. I was, all right, it’s like in a lab. It’s like a science lab for running experiments. Really interesting. Don’t necessarily feel that great, but you learn a ton about how people respond very quickly to incentives, and game play function, and game play economies, and all of these really interesting areas. That’s kind of in my path and then from there I’ve continued on at Scribd. For me, it was kind of this route that I was sort of on, and I didn’t know it. Then the industry just showed up, and I was like, “This is exactly what I want to do.”
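The fraud graph analysis Kevin mentions, where linked accounts form "fraud circles", can be illustrated with a small sketch. This is not Scribd's or any former employer's method; it simply assumes that two accounts are connected whenever they share some attribute (the account names and edges below are hypothetical) and treats each connected component of three or more accounts as a candidate circle worth inspecting in a tool like Gephi.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes that are linked, directly or indirectly, by the given edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk from an unvisited node collects one component.
        stack = [start]
        seen.add(start)
        group = set()
        while stack:
            node = stack.pop()
            group.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(group)
    return components


# Hypothetical edges: two accounts are linked if they share, say, a device or card.
edges = [("acct_1", "acct_2"), ("acct_2", "acct_3"), ("acct_4", "acct_5")]

# Keep only groups large enough to look like a circle rather than a single pair.
circles = [c for c in connected_components(edges) if len(c) >= 3]
```

A real investigation would weight edges, look at how the graph changes over time, and combine this with other signals; the component search is only the first step of "walking down the nodes".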
Kirill Eremenko: That’s awesome. Right place at the right time.
Kevin Perko: Absolutely.
Kirill Eremenko: Yeah, very interesting story. You’ve been now at Scribd for what, like over five years?
Kevin Perko: That’s right, five and a half years.
Kirill Eremenko: That’s really cool. You start off as a data scientist, data science manager, and now you’re head of data science. Tell us what that feels like.
Kevin Perko: It’s great, it’s great. I mean, it’s both exciting to grow in a company and to watch the company grow while you’re there. It’s been a total mindset shift when you’re going in and doing the ground level work versus having a team of people. We’re in San Francisco and Toronto, in terms of the data science team, and that’s just … kind of have to … Most of my career has been sort of figuring out how to do things while I’m doing them, and so managing a team is no different. You really have to sort of change your job every six months to a year. Nobody tells you that you’re supposed to do that, but you definitely are. Otherwise, you’re going to get stuck. [crosstalk 00:14:44]-
Kirill Eremenko: What do you mean by change a job?
Kevin Perko: What I mean is like, as a data scientist, you’re really thinking about the models, and the business problems you’re solving, and as a manager now you have to think about how you help people solve those problems, and what the communication around that looks like, and how you’re setting expectations, and what you’re delivering. Then once you’re kind of managing the whole team, you have to think like, what are we not even thinking about, what’s the culture, how do I kind of delegate, so I have more people on the team who are aligned with me and thinking the same way, and I can be a multiplier effect, because I can’t be everywhere anymore.
Kevin Perko: Most of my day is kind of like sitting in meetings from 10:30 to 3:30, very typical day, and whether I’m doing interviewing, or meeting with other PMs, or meeting with other executives, all of those things kind of add up, plus one on ones for the team, and so the day just kind of fly by, so I can’t really be there providing any sort of technical leadership. I have to build that out on the teams so the team has some senior people who can do that. These are sort of things are like, okay, well now I had to change my job. Previously I was much more involved in this. Now I’m not involved at all.
Kevin Perko: Now I’m working with the team in Toronto, really making sure that they get up and running, and we’re working on newer things, like we’re working on building a machine learning platform internally. Now we’re going to use some tools for this. We’re not going to write the whole things ourselves. That’s like a whole new area. Okay, okay, now we really have to think about this, and we really want to focus on getting everybody more into the full stack data science side. We’ve always sort of had the full stack data science term that we’ve used internally, of like how we think about we kind of go end to end, but this is like we want to go, take that to the next level where we’re working with Scala, and we’re really being able to productionalize anything at any point. Really kind of pushing the team in that direction, to enable new opportunities for us.
Kirill Eremenko: Very cool. How big is the team right now?
Kevin Perko: The team including myself is 13 people right now.
Kirill Eremenko: Oh, okay, gotcha. 13 across Toronto was it, and San Francisco?
Kevin Perko: That’s right, Toronto and San Francisco.
Kirill Eremenko: Very cool. I think it would be a good segue or a time to mention a few things about Scribd, I guess. Tell us a bit, what is Scribd, and what kind of product services does that company offer?
Kevin Perko: Right. Scribd is a reading subscription service. It’s $8.99 a month, and you get access to books, audio books, sheet music, articles, as well as user uploaded content, which could be really anything, letters of recommendations, people’s physics theses that they’ve published, and just a wide collection. Game strategy guides of content that people have decided to upload on the internet, and so Scribd enables you to get access to that.
Kirill Eremenko: Oh, nice. What kind of data would you be working with, or does your team work with on a daily basis?
Kevin Perko: We really work with I think of it like a couple different types of data. One is sort of like the application level data, whereas who’s paying us, where are they from, all of that kind of demographic type information, what devices are they on, etc … Then you have this sort of user interaction event stream data of what did they do, what did we show them, and how did they interact with that, and how does that mix with what we know about whether or not they’re logged in, or logged out, or a paying subscriber, what they’ve done in the past, what other people have done. That’s kind of like one part of it, and then the other part is understanding the content.
Kevin Perko: All of the books, and audio books, and user generated content that we have that people have uploaded, really understanding what that is, what language is that in, what categories are they. For books, publishers typically provide us with categories, whereas for documents, users do not provide us with any information, so it’s up to us to decide, okay, what is this document actually about, and how should we use that information when we’re building a search index, to serve search results, or showing recommendations.
Kirill Eremenko: Okay, very, very diverse. Two diverse areas, user interaction, and understanding how they use the platform, and also understanding the content. What would a typical project look like for your data science team?
Kevin Perko: That’s a great question. I’m actually going to just use a project that we’re doing right now. Somebody identified our success metric, which is our target for our GBM. For search, for re-ranking items, once all the candidate sets are generated, so think rows of books, audio books, documents, then it kind of goes into this GBM, and it decides how should I actually rank these items, within each module. Today a lot of that is, after all the routing and candidate generation happens, it’s all based on historical data for the most part.
Kevin Perko: Our best approximation is to try to understand how those interactions correlate with retention. For our business, that’s what we want to optimize for right now. One of the data scientists said, “You know, this previous success metric, we did really great work on it, and I think we can make it better.” They kind of mapped out the project, what should that be. They did like the whole analyses, they presented to the team, they got a bunch of feedback, they continued to improve the success metric, and they’re continuing now to get it into production. Once they do that, then the next step will be to retrain the GBM, so that we can actually see, is this better, because obviously it’s easy to say, “Offline looks better.” Like we’ve reduced our main [inaudible 00:20:24], but that doesn’t really mean anything if we didn’t make the user experience better.
Kevin Perko: That’s kind of why I say, “There’s not really a typical project, but this would be a good representation of like, okay, there is some clearer variables that you want to optimize for.” Maybe somebody is giving you the project, or you’re creating it, and then you need to kind of go down, break them down, figure out how to represent them. A pretty collaborative environment, so you’re going to go present. You’re definitely going to take some feedback. I think that always sort of hardens the project, gets you to question your assumptions, and then you’ve got to go and write the code to get it shipped out, so we can actually use it in the product.
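The re-ranking step Kevin describes (candidates are generated first, then a model scores and orders the items inside each module) can be sketched as follows. This is an illustrative assumption, not Scribd's pipeline: the feature names, the weights, and the `stand_in_score` function are all hypothetical, and the hand-written scorer stands in for a trained gradient boosting model whose target would be the success metric.

```python
def rerank_modules(modules, score):
    """Sort the candidate items inside each module by descending model score."""
    return {
        name: sorted(items, key=score, reverse=True)
        for name, items in modules.items()
    }


def stand_in_score(item):
    # Placeholder for a trained model's prediction of the success metric.
    # The features and weights are invented for illustration only.
    return 0.7 * item["completion_rate"] + 0.3 * item["click_rate"]


# Hypothetical candidate sets, one list per module (row) on the results page.
modules = {
    "books": [
        {"title": "A", "click_rate": 0.9, "completion_rate": 0.1},
        {"title": "B", "click_rate": 0.2, "completion_rate": 0.8},
    ],
    "audiobooks": [
        {"title": "C", "click_rate": 0.5, "completion_rate": 0.5},
    ],
}

ranked = rerank_modules(modules, stand_in_score)
```

Here item "B" moves ahead of "A" because the stand-in target favors completion over clicks, which mirrors Kevin's point: changing the success metric changes what the ranker optimizes for, and only an online test shows whether the user experience actually improved.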
Kirill Eremenko: Okay, got you. GBM is Gradient Boosting Machine. Is that right?
Kevin Perko: Right.
Kirill Eremenko: Why do you use a GBM in this specific example? Is there any reasons for that?
Kevin Perko: You know I would say honestly there’s not a great reason. We sort of inherited this model. There was a previous search team, and the model was in place; we trained it, and it worked the best for doing this. Kind of our bigger contributions have been to improve the success metric that it gets trained against.
Kirill Eremenko: Okay, gotcha. With a team of 13 data scientists, including yourself, do you find that you have multiple projects going on at the same time? How many projects is the team involved in approximately?
Kevin Perko: There’s so many people, it’s hard for me to even pull that number out of the air. There is a lot going on at any one time.
Kirill Eremenko: How do you keep track of everything?
Kevin Perko: Yeah, that’s a great question. We have a couple support structures for that. We have squads. People that are working on product facing squads, they have somebody that they’re working with, like a product manager, and a technical project manager, who are working on what’s the task flow, what are we shipping, what are the deadlines on all that kind of stuff. We get some similar apparatus on the search and recommendations teams, so that I don’t have to be responsible for all of it, because it’s too much for one person to make sure everything is on track, and all the deadlines are being met. That really helps a lot.
Kevin Perko: The other thing is to just … beyond individual projects, is having higher level goals that you’re … or higher level targets for the quarter that you want to move towards. Those are easier to check in on rather than a specific, okay, did we analyze this test, did we learn from this test.
Kirill Eremenko: Okay, gotcha. You probably have like managers in the team as well who take on some of the responsibility that then report to you?
Kevin Perko: Right, right. We’ve got a manager out in Toronto.
Kirill Eremenko: Gotcha. Okay, okay. Very interesting. You mentioned you guys are hiring at this stage, so if anybody listening is interested, what’s the best way for them to apply?
Kevin Perko: Yes, we’re hiring in San Francisco and Toronto, and the best way to apply is to go to the jobs page, I would say. Just in the cover letter mention that you listened to the podcast, and I’ll see that. I actually review all of the applications that come in. I’m very passionate about hiring the best people I possibly can. I’m reviewing all the applications that come in. I can kind of take more risks, and really see if somebody is showing something that someone else who was looking for a very specific profile, may not be able to pick up on.
Kirill Eremenko: Nice. By listening to this podcast obviously people are already ahead of the game.
Kevin Perko: Exactly.
Kirill Eremenko: Okay, cool. Well, thank you for that, and guys, girls, everybody, ladies and gentlemen listening, if you’re interested to go in Toronto or San Francisco, make sure to check out Scribd. Let’s shift gears a little bit. You’re coming, which is very exciting, I’m excited to announce this to our listeners, you’re coming to DataScienceGO this September, 2019, in 27th, 28th, 29th, September, and you’re doing Keynote. Super pumped about that. Congrats. I can’t wait to hear your Keynote and to meet you in person over there.
Kevin Perko: Thank you, thank you. I’m definitely looking forward to that. This will be my first Keynote, so it’s a very exciting experience for me as well.
Kirill Eremenko: That’s awesome. Tell us what is this Keynote going to be about? Can you give us like a quick, I don’t know, maybe preview or some spoilers about what you’re going to be talking?
Kevin Perko: Well, I can’t give any spoilers of course. In terms of a preview, what I’m going to focus on is kind of two things. I want to get people generally excited about what’s happening in data science, as well as how that’s intersecting with what we’re doing in Scribd. I think one of the best ways I can do that is to talk about an initiative we have internally, around learning how to represent our content better, which we’re calling the Book Genome. That’s really obviously taking from like the Music Genome from Pandora, from way back, and applying that to books. It’s scaled what some companies have done. I don’t know if anybody’s used the term, book genome, but we really want to think about how we represent our content.
Kevin Perko: I want to talk about how we’re doing, how that’s going to enable really amazing things for our users, and for data science in general, as well as how that intersects with like a curiosity culture. Eric Colson at Stitch Fix, totally has written lots of very good articles on this, and I really am trying to bring that into my team, into my organization, and intersect these things, because there’s so much opportunity in data science, that there’s no way that top down you can see all the opportunities and correctly allocate all the resources.
Kevin Perko: You want people on the ground, being curious, asking questions, saying, “Hey, I actually have a couple extra hours, and I’m going to see if this variable is correlated with this variable, or if I can map this out with a regression, or a neural network, or whatever it happens to be, and if we can learn something new, and I really believe that that’ll add much more value to the business than us trying to pick the best projects every single time.”
Kirill Eremenko: All right. I haven’t heard of this music genome project. Can you tell us, what is the end goal of the book genome? What does it look like?
Kevin Perko: The end goal is for us to really understand books on a deep level. When you talk about a book, you talk about books that you enjoy. You say things like, “It moved really slow,” or maybe it was really dense, like very … when I say [inaudible 00:26:51] Slavic words, lots of technical jargon going on. You don’t necessarily say, like, “Okay, well, it was like a front list book.” For anybody who’s not familiar with that, that’s a book that’s come out in the last year. Publishers, they care a lot about that. That’s where a lot of their money comes from. We think about a lot of this internally in all of these things, but readers don’t think about that necessarily.
Kevin Perko: They’re thinking about, “I’m reading a book that people are talking about. I’m reading a book that is relevant in the media, or that my friends recommended, or that’s a murder mystery and I love murder mysteries. It has these elements that I like.” We want to take those, when people are saying these kind of ambiguous and vague words, “these elements that I like.” Well, what are those elements? Is it dystopia? 1984 is definitely a dystopia, so if you read that, what are you interested in learning? If we can represent dystopia as an embedding, how can we relate that to other books, and then understand that you’re not just going to read dystopias? You’ll have a very depressed outlook on the world if you do that.
Kevin Perko: That’s just like a … not a thing, because lots of recommender systems, they want to find similar items, but we need to introduce this serendipity. It’s really going to become like a sequence type model, because people, even if you read a data science book and you’re getting into data science, you don’t only read data science books, because that again will kind of drain your brain power there. You have to sort of recharge with something else, whether it’s a biography, or a science fiction book. When you read those, not only do they kind of go together in a sequence, but you have specific elements you like about your science fiction books.
Kevin Perko: To you it’s less about science fiction, and maybe it’s more about dystopia plus science fiction, plus a futuristic setting. We want to be able to represent that in words that we can both share with our readers, on why they were recommended this book, and what we know about this book, and to help them find other books. Whereas today you may browse by genre, perhaps in the future you could browse by something more stylistic like books set in London, or fast-paced books, or easy reads for the weekend.
Kirill Eremenko: Gotcha. For instance, like your example with science fiction, somebody might be interested in like they’re picking up science fiction book after science fiction book, but really deep down inside what they like might be a certain type of character, like the lead character has a certain background, or they are passionate about certain things, or the manner that they … how they are heroic, or things like that. Really, the reader might not even know this about themselves. They just happen to be picking up these books, and liking them based on other people’s recommendations. You can’t really express that in words. 
Kirill Eremenko: I guess what I was going to ask is, are you going to look for this information from people? Are you going to get people to complete a quick survey after they finish a book, what did they like about it? Or are you going to have natural language processing, some AI, or machine learning, that’s going to go through the book, and actually look for these gems, or these parameters inside, autonomously?
Kevin Perko: Right, right. The current approach that we’re thinking is given that we get a lot of good publisher data, we’ll start to build it. This includes some kind of human curated keywords, like dystopia, that’s associated with 1984. We can start to train on those words, and kind of build, and understand how that represents across, we’ll call them words, because we want to kind of get it more into a tree. It’s much more of a graph system. We don’t want to think of it as a flat system. Dystopia has a relationship to the environment, and cooking, so it’s not very related to cooking, but if you just have a flat group bank of words, it doesn’t really mean anything, but when you start putting them in a graph, and it’s a little bit more directed, then oh, you can see cooking is way over here, and you’ve got your werewolf romance way over here, and those things aren’t really related.
Kevin Perko: Actually your dystopia which could kind of go either way, is maybe much closer to this hypothetical werewolf romance, for whatever reason. Being able to understand those things is much more valuable, because that’s how people think about the books. They’re not putting these hard boundaries on them, like we tend to do when we model them out. We’re like, oh, this is cooking, or that’s not that, and so they would never want that. It’s like, okay, well, the world is a little bit more complicated and subtle than that. By bringing this out, we’ll really be able to get at the heart of what people want.
Kevin Perko: I think you kind of brought it up, it’s going to be a two step process. We’re going to be bootstrapping it. We haven’t planned on doing a survey, but that’s a great idea. Honestly, I might steal that.
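To make the idea above a little more concrete, here is a minimal Python sketch. It is not Scribd's implementation. It assumes each keyword has already been mapped to an embedding vector; the three tags, their three-dimensional vectors, and the helper names `cosine_similarity` and `nearest_tags` are all made up for illustration. Cosine similarity is one common way to express that dystopia can sit much closer to a werewolf romance than to cooking.

```python
import math

# Hypothetical tag embeddings. The numbers are invented for illustration;
# a real system would learn them from publisher keywords and reading data.
TAG_VECTORS = {
    "dystopia": [0.9, 0.4, 0.0],
    "werewolf romance": [0.7, 0.6, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_tags(tag, vectors):
    """Return the other tags, sorted from most to least similar to `tag`."""
    query = vectors[tag]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != tag
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for other, score in nearest_tags("dystopia", TAG_VECTORS):
    print(f"{other}: {score:.2f}")
```

With these made-up vectors, werewolf romance ranks well above cooking as a neighbour of dystopia, which is the kind of soft boundary described above.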
Kirill Eremenko: Sure.
Kevin Perko: Because like you’re saying, we don’t necessarily have the language to represent the things that we want to today, so we’re going to have to go figure out what that’s going to look like. It makes a lot of sense it’ll be a collaboration with the data we get, the data we’re able to acquire, how we’re able to learn things internally as well as what our users tell us.
Kirill Eremenko: Gotcha. Is it going to be similar to the Netflix recommender system?
Kevin Perko: I would say, “No.” At a high level all recommender systems have this … they share similarities. Given that we’re in the process of building, and I wouldn’t really be able to say, I think that the bigger goal of extracting the metadata, and learning how to represent it, that’s very similar to what Netflix did. I think they actually had like rooms of people watching movies at one point, like labeling them. We’re not there yet to have rooms of people reading books. It also takes a lot longer time, so I’m not sure if that’s feasible. We’re going to continue to try to increase our sophistication, so yes, I’m sure we’ll be using similar methods that Netflix has pioneered.
Kirill Eremenko: Okay, very interesting. Yeah. It looks like you’re going to have a lot of algorithms that you’re going to be trying out. What’s your view on that? How’s your approach going to be? Which model, which algorithm is going to be the best? Are you just going to try out a lot of things, or do you already have some things in mind?
Kevin Perko: Yeah, that’s a great question. I feel like I sort of have two views. One is that I’m agnostic. If you use TF-IDF, and that represents the problem, and solves it, then you should always use the simplest tool for the job. My second view is that a lot of the things we’re seeing with these kind of next generation language models, that’s coming out with like BERT, and [Inaudible 00:33:34], and I haven’t even had enough time to dig into them, as much as I’d like, but I can see that their ability to represent language is incredible, as well as OpenAI’s.
Kevin Perko: They only released a small version of a model that was … I believe it was writing articles, and it did too good of a job of producing fake news basically, so they didn’t want to release the full model, but then they understood within a certain amount of time, people would be able to recreate it. They’re just sort of buying some time hopefully, before they unleash this thing on the world. Which is nice to see somebody having a thoughtfulness, that hey, this thing could actually be used for bad things.
Kevin Perko: I think a lot of those models will definitely come in here, because they will enable us to represent things in really interesting ways, that we may not think about. I think the simpler approach is nicer in the sense that it lets you actually say, “Hey, we extracted this part of the book, and that means this.” That’s really valuable, that interpretability piece. That being said, neural networks are starting to get that. People are doing active research. They’re starting to say, “Okay, this is actually what it learned, this is how it represented it, this is your pixels that it took out and learned.”
Kevin Perko: Then you start to understand, oh, this is why when we turn a bus on its side, now it may think that it’s a zebra instead of a bus, because it just learned like two pixels in the image, and so there’s a huge risk that when things change slightly you get very, very wrong outcomes from these neural network type models. That’s why I like this idea of having a mix: a model that we really deeply understand, not as sophisticated, plus something that’s really pushing the edge, and they can also act as a check on each other. You can sort of see when the bus is on its side, or if a book is clearly about romance, and this is saying, “It’s science fiction,” and we have people look at it and it’s like, oh, this is science fiction. Then we understand what’s going on.
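As an illustration of the "simplest tool for the job" end of that mix, the sketch below computes plain TF-IDF weights (term frequency times the log of inverse document frequency) using only the Python standard library. The three one-line "book descriptions" and the function names `tf_idf` and `top_terms` are invented for this example, and the weighting formula is one common variant, not necessarily the one any particular team uses. The point is the interpretability: every weight can be traced back to simple counts.

```python
import math
from collections import Counter

# Made-up one-line descriptions; real inputs would be much longer texts.
DOCS = {
    "book_a": "a totalitarian state watches every citizen in a bleak future",
    "book_b": "a chef shares recipes for quick weeknight dinners",
    "book_c": "rebels resist a totalitarian regime in a ruined future city",
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def top_terms(weights, doc_id, k=3):
    """Highest-weighted terms for one document (ties broken alphabetically)."""
    ranked = sorted(weights[doc_id].items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:k]]


WEIGHTS = tf_idf(DOCS)
print(top_terms(WEIGHTS, "book_b"))
```

A word such as "a", which appears in every description, gets a weight of zero, while a word that appears in only one description is weighted highest. That is the sense in which a simple model lets you say which part of the text drove the result.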
Kirill Eremenko: Wow. Very cool. Well, if anybody wants to find out how this story ends, DataScienceGO 2019, end of September, in San Diego. That’s where you can catch Kevin. I wanted to ask you, Kevin, you mentioned neural networks. What’s your view in terms of the work you guys do … There’s a lot of … especially in the part of understanding the content, I’m assuming there’s a lot of working with text, and language processing. What is your view on neural networks versus machine learning approaches?
Kevin Perko: I think that for the most part, they complement each other, and that really, neural networks use a lot of machine learning. They’re not these separate worlds of things. When you’re setting up a neural network, people have kind of said it’s much more like differentiable programming. It’s like a config file, especially if you’re working with Keras, you’re sort of setting up, okay, like what are my activation units, how many layers do I want. You’re deciding these things and it’s like, what are you deciding when you’re thinking about this. Okay, well you’re thinking about maybe a linear model, or a logistic model, in terms of how you want to represent a thing.
Kevin Perko: The difference is that what you’re thinking about is one part of the model. You’re not thinking about the whole model anymore. The neural network kind of takes all that. It adds its hidden layers, and it does extra things that aren’t really represented here, but you’re kind of guiding it, so you’re more of a guide rather than like, oh, rather than logistic regression, I learn these features, however I learn them, and I put them in all and it gives me something very interpretable. Outputting probabilities, which are very understandable and that’s what the model is, versus neural networks just trying to kind of map something really probably non-linear, and understanding that without …
Kevin Perko: It’s not going to give you that nice interpretability component yet, but it uses the same I would say mathematical approaches under the hood. Then it kind of adds on its own layer. I think that like I was saying, they really complement each other, and there’s no like, this is better than this. It just depends on the use case. The truth is in industry most of the time you don’t actually need any neural networks. Like I was saying, it’s better to stay on the old stuff that people have proven out, that works really well, that you can actually communicate with, because it’s really hard to talk to somebody about neural networks given their … It’s like, all the machine learning stuff combined into this other box, and then put that inside another box, and then you kind of ship that out.
Kevin Perko: Then people ask you, “Well, how did this decision get made?” and you don’t really have a good answer for them. Whereas if you’re using random forest, or logistic or linear regression, you can say something much more confident about, “Oh hey, this is how this model made this policy or this decision, and I really understand what that means, and what it’s trained on. We can debate if that’s right or wrong.” This is how we go there. That doesn’t exist with neural networks. That’s why I think they’re a balance, when you think about traditional machine learning techniques.
Kevin Perko: Same thing with support vector machines. Given it’s a margin-based classifier, you pretty much understand how it’s making these decisions. Whereas with something like neural networks, you really … That’s kind of the core thing today, you don’t. I think in the future, people are going to sort of break through that wall and we will understand these decisions, well enough anyway that people will get much more confidence in the models. That’s proving to be increasingly important, as these things get incorporated, like doing facial recognition for all sorts of use cases. When a model’s impacting sentencing guidelines, you really want to have a lot of interpretability behind that model.
Kevin Perko: These are things that I definitely worry about, that people use these kinds of tools without understanding like, oh wow, people, there is a lot of ambiguousness between how this model is working, and there’s lots of opportunity for this to go awry, when you don’t have a good kind of interpretability, and a good transparency layer. I think that was sort of a big thing for data science in general is to get much better at that, especially as data science permeates all parts of business, and culture. People want to know, “Hey, how did this happen? If we’re going to delegate this to an algorithm, how did it make the decision?”
Kevin Perko: In the past it was just, if we can make a good decision, then we’ll go do it. In the future it’s like, if we can make a good decision that we can explain, and people will agree with it, we’ll go do it. Sometimes we’ll make a less good, but perhaps a more societally fair decision that people agree with. We’ll have the ability to adjust the knob and do that, whereas today we may not.
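The interpretability argument made above for logistic regression can be sketched in a few lines of Python. This is a hedged illustration only: the two feature names, the coefficients, and the intercept are hypothetical and hand-picked, not taken from any real model, and the helper `explain` is invented for this example. It shows why such a model is easy to talk about: the prediction is a sigmoid of a weighted sum, so each feature's contribution to the log-odds can be read off directly.

```python
import math

# Hypothetical, hand-picked parameters. A real model would learn these from data.
COEFFICIENTS = {"pages_read_last_week": 0.02, "days_since_last_visit": -0.15}
INTERCEPT = -0.5


def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def explain(features):
    """Return (probability, per-feature contribution to the log-odds)."""
    contributions = {
        name: COEFFICIENTS[name] * value for name, value in features.items()
    }
    log_odds = INTERCEPT + sum(contributions.values())
    return sigmoid(log_odds), contributions


probability, contributions = explain(
    {"pages_read_last_week": 100, "days_since_last_visit": 2}
)
print(f"probability: {probability:.2f}")
for name, value in contributions.items():
    print(f"{name}: {value:+.2f} log-odds")
```

Here the answer to "how did this decision get made?" is the list of contributions itself. A neural network stacks many such units with hidden layers in between, which is where this direct reading is lost.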
Kirill Eremenko: That’s a whole explainable AI. [inaudible 00:39:56] becoming more of a trend we’re seeing that even this year, more questions are being raised, more companies or agencies, government agencies including, are asking the question, “Is this explainable AI? Do we know how it’s making these decisions?,” because as you mentioned, with data science becoming more and more part of our daily lives, and society, there’s so much that can go wrong in terms of recognition of even facial recognition, and any kind of associated racism that can be incorporated in that, or sexism, and when you can explain how the model works, you can point that. When you can’t explain, then you’ve got a whole different can of worms that you’re going to open.
Kirill Eremenko: A lot of it also comes, especially in neural networks, comes from labeled data. Like, the AI might be the neural network is … just the architecture is very neutral, but then the data that it was labeled already has some kind of bias, so has some sort of discrimination in it. Then the AI learns that, and try to go in there and make it unlearn that if you can’t get … You don’t know which neuron responds … correlates to which features. It’s pretty insane.
Kevin Perko: Exactly. Exactly. That’s a great point, that the algorithms are just representing a bias, and when we have bias as society, that is represented in the data sets. The algorithms don’t … they’re amoral. They don’t know that that’s not the ideal outcome. They actually think that’s the outcome they’re supposed to learn and reinforce.
Kirill Eremenko: Yeah. Then you’ve got that whole trend. Have you seen those images when people take like a stop sign, and they put some stickers on it, and a self driving car doesn’t recognize it as a stop sign anymore?
Kevin Perko: I have not, but that does not surprise me at all, because I see those self driving cars around San Francisco all the time, and they really struggle.
Kirill Eremenko: Oh wow. Where is it … I haven’t been in San Francisco for a while. What company is that through, Uber or self driving Ubers?
Kevin Perko: Typically what I’m seeing are the Cruise vehicles.
Kirill Eremenko: Okay. What do they do?
Kevin Perko: Cruise, I think GM bought Cruise, and-
Kirill Eremenko: Oh okay, gotcha. It’s like a [inaudible 00:42:24] transportation company.
Kevin Perko: Right, right. They have SUVs drive around San Francisco with a ton of sensors, and they’re logging in an incredible number of miles in the city. You can see how much they struggle at intersections, and it’s like a bike goes by, then they’re like suddenly swerving, and you’re just like, technology is not right. People talk about level five, in like 10 years. I’m like, level five is just like, we can’t even think about that. This is, these cars are just … they are not ready. I mean, I get it. Urban environments are really hard, but the core thing is you can’t learn everything in advance, and I think that’s where we’re just kind of pushing the current limits of what we have with vision and AI, is that we’re trying to.
Kevin Perko: We’re trying to have incredible lidar that can respond super fast, instead of a general intelligence that understands how to value different objects. These cars can’t do that, so they treat a cat the same as a bicyclist, the same as a semi truck. It’s just an object, and there’s no association or learning with it. Now, I’m sure that’s changing. I think that’s kind of the key problem, is until you do that, then you’re going to react the same to a cat, or a squirrel, that you are going to react to a semi truck, which is a problem.
Kevin Perko: The other thing is if you just had like a whole network and it was all autonomous, then you’d be kind of fine. The machines could do weird things, but you’d figure out how to solve that. When you’re interacting those with humans, and the machines don’t have a way of relating to the humans, then you get all these new problems. My favorite one was they had to make the driving system more aggressive at intersections in California, because we all do the rolling stop, especially in San Francisco. The car would just sit there waiting for its turn to go, and it would never go, because there was never a point where all four cars came to a 100% complete stop.
Kirill Eremenko: Okay, gotcha. Okay, yeah, okay, because the rules are kind of different. It’s following strict rules, whereas humans are more flexible with the rules I guess.
Kevin Perko: Right. We think about the spirit of the rule, are we causing harm, and try to interpret that within the context of the situation, like is it sunny, or raining, or am I surrounded by bikers or little kids, whereas literally the machines, they don’t have any of that context. They’re just like, this is the rule. If the speed limit is this, and it says this, then I do this.
Kirill Eremenko: Yeah, wow. Okay, very, very interesting observation. Must be pretty scary dodging these cars.
Kevin Perko: It is, it is. Sometimes it concerns me to think that they are actually going to try to have that ready to go. I think that they do have some in … maybe it’s in Arizona, but it’s on kind of like a closed track, where they know exactly what the variables are going to be, and that works fine. It’s just urban environments are really hard, even for human drivers who have a lot of experience. They’re very challenging. For machines, they’re incredibly difficult, because the number of things you have to learn each second, it changes every second.
Kirill Eremenko: Yeah. Well, technology, data, it’s interesting to see how they are coming. Data is becoming more and more recognized as something that’s driving business, and these two things, technology and data, are coming closer together. They’ve always been propelling one another, but now we’re trying to use data everywhere where we can, and technology as well. Then what I notice about your background is that it looks like you’ve changed careers very consciously I would say, that you’ve selected different companies, or different roles, in data science to work, but they’ve never been along the same line. Let’s say, developing self driving cars, or in the case of Scribd, like working with recommender system, or understanding content.
Kirill Eremenko: It feels like you’ve moved around the space quite a bit. Can you comment on that? Why these choices of roles and careers? Were you searching for something? Did you consciously decide on what you want to learn next before progressing further?
Kevin Perko: Right. I think it’s easy to look back historically and see a narrative. I’ll say at the time it was really kind of like an exploration, give it much more like a gradient descent. I’m taking these steps, some of them good, some of them not as good, and just learning, and gathering more information. They’re all really valuable steps, because now I know if I’m walking up the hill, or down the hill. What they’ve kind of given me in aggregate is this really unique view of all of the different parts of the system, in terms of how companies actually can use data science, how we think about this idea of a full stack data scientist, kind of comes from my past experience of seeing, well okay, somebody can’t ingest this data right, then there’s no data science.
Kevin Perko: If you don’t have good data that’s clean, then you spend all your time doing that and so you spend very little time applying it to a model. These are the kind of the key systems of like, oh, if you can’t deploy your model, then you’re just beholden to another group, and you’re not like a data science business unit. You’re not shipping product, you’re really more of a support function if you’re constantly bound by somebody else, to go put the thing that you made into a product. That really limits your scope and your ability.
Kevin Perko: That’s kind of what I’ve seen across my experience across all these organizations, getting to see how different organizations treat data science. It’s really kind of a key thing, that you have an organization that the executives believe in data science. They believe that you can use experimentation and machine learning, not just to make their product better, but to be the product. That’s something I very much see has to come from the top. When it does, it makes your life much, much easier, and the company is on board, and you’re pushing the edge more than just trying to say, “This is why we should exist.”
Kevin Perko: Kind of having my experience in hindsight has given me a lot of these really unique perspectives. Going forward as I build it, I just thought this is a really interesting opportunity, let’s try this, let’s try this. I didn’t really see how it was going to connect. Looking back, I can kind of see that it’s been a really nice connection by working at these different companies, seeing different approaches, how all this works together, seeing different organizational structures where you have it really split up, where data science doesn’t have access to any systems, and how limiting and suboptimal that is for a data science group.
Kevin Perko: To have those restrictions, whereas if you think about the other side, of well, what if they have engineers with data scientists, and they’re shipping product. That is really where you want to be for every data science team, because then you get really to this true full stack data science org, that can ship product, that can support change, that can do whatever it needs to do within the business, rather than having something that’s very kind of boxed in, into its very specific niche.
Kevin Perko: It does that and maybe it does it really well, and creates a ton of value for the business, but in my opinion it’s always going to be suboptimal to structure it that way.
Kirill Eremenko: How would you advise somebody who is looking from without an organization? From externally, and maybe looking for a job, or looking to move into that organization, change career, how would you recommend for a person like that to determine the answer to that question? Is data science seen by the executive as a product or not, because when you’re inside it might be quite obvious, but when you’re outside, and you’re trying to understand if this is the right company for you to work in, it might be difficult to see.
Kevin Perko: That’s a great question. I think that it is always going to be difficult to judge something like that from the outside. What you can do are like little … You have to look for signals, kind of build your own pattern recognition system, and ask questions, really simple things like, do they have a blog, does it get updated, is the company … are any executives talking about data science or machine learning, any public interviews ever, do they have maybe a chief data officer, or a VP of data science. If you’re able to talk to people, if you’re in the interview process and you’re talking to someone who’s maybe director, executive level, what do they think about data science, how do they think it’s driving the business, and really listen to how they answer that question, and what they say.
Kevin Perko: Do they have a vision? Have they thought about it at all? Or is this like, we don’t know, we want you to come in and do it, and we’re open. How they answer those questions will tell you a lot about how the organization views it. Most people will be pretty honest there and say, “Okay, we really think that this will help us increase our lead generation by 3X for our business, if we’re B2B SaaS, and that’s more money, and that’s how we see it. That’s the end of our data science at the company.”
Kevin Perko: Then you can make your own decision, once you get that. I think it’s really kind of being able to talk to somebody from a more senior leadership position, and getting good answers on, have they thought about this deeply, and they actually believe in it, or they see everybody, and they just want to hire, because it’s usually pretty clear, when somebody’s trying to hire a data scientist, because they think they should have a data scientist, and yet they have no idea what the data scientist will do. They won’t actually be able to tell you what any of the projects are, or any of the vision is for data science in that job unit, or what have you.
Kevin Perko: I think those are kind of the key signals. You can kind of start parsing out. You can also just sort of ask people how the teams are organized, is it in engineering, which might be really important at a smaller company, is it in product, is it in marketing, is it in finance. I’ve seen all of these structures. They mean really different things for the data science group. Is the data science group totally decentralized and everybody’s embedded within a specific team, that’s a really different data science experience rather than joining a data science team, and then working within different areas of the business.
Kevin Perko: I think all of these things are areas you can look for, and questions you can ask to try to suss that out.
Kirill Eremenko: I love what you mentioned about the decentralized embedded data science team, where you’ve got data scientists, or machine learning engineers, in different functions of the business, versus a stand alone data science team, something like what you have at Scribd. What would you say are the advantages, disadvantages of either of the approaches?
Kevin Perko: Right. At Scribd we would have something approaching a hybrid model of this. I think that … The advantage of having a core data science team is that it really has to think of itself as a business unit, and go around, and connect itself to the business, and understand what the priorities are, and where it can drive value, and what opportunities exist. Then can kind of track those out into near term, medium term, long term initiatives, whereas your long term initiative is like trying to ship really exciting state of the art products, and then short term is something very clearly defined.
Kevin Perko: You’re working with GBM you can re-rank something better, or represent something better with some vision that’s already been solved, using a pre-trained model, and you know you can ship that in a month, and help the business in this way. The kind of key thing right there is you have to align yourself really tightly with the business. When you’re embedded, it’s really easy to say, “Okay, well a product manager brought a road map, or somebody brought a road map, and we’re executing on it. You told me to build this algorithm, and so I’m going to go build it.”
Kevin Perko: We have a recommendation system, and we’re going to try to make it three percent better, rather than asking if we even have the right system, and then taking three to six months to rebuild it, which is what you’re going to get from the business unit approach. Whereas the embedded approach is much more likely to be iterative. There’s going to be other factors in there, but that’s sort of what I’ve seen, is that it drives this iterative approach, which makes it hard to make bigger gains. It’s certainly valuable for the business to have iterative gains in the near term. However, it kind of limits your ability longer term to sort of go after bigger opportunities [crosstalk 00:54:38].
Kirill Eremenko: When you say you have a hybrid model, what do you mean by that?
Kevin Perko: Right. When we have a hybrid model at Scribd, we have data scientists that are embedded on product facing squads, as well as search and recommendations. They work with those squads really tightly. Those squads have road maps. They are doing some of the iterative thing, and what we’re doing now is to really pair that more with like, well, let’s drive road map, let’s think how we can kind of reimagine the system instead of just making an existing system we inherited a little bit better. Maybe we can actually make it a lot better, however we’re still working within the constraints of that system, without really deeply questioning if that system should exist. Which, like I said, is just a function of being embedded.
Kevin Perko: In Toronto, we’re really focusing on the more business unit type approach. I’m going to bring that approach to San Francisco as well, so we’re really thinking about how to reimagine the system, in addition to driving iterative improvements.
Kirill Eremenko: Okay. Very useful information, especially for business owners, or executives listening to this. Would you say there is kind of like a threshold, when a company should maybe for instance as a smaller firm, a smaller organization, a start up, start with the embedded approach, and then at some threshold, switch over to the core data science team approach, or the stand alone data science team approach? Would you say there’s a time in the life of any business when that should happen, or this really depends on the type of business, the nature of the industry?
Kevin Perko: It depends on the business. My personal view is that going to the business unit sooner is always going to be better. The trade off with that is you don’t get the really embedded focus that that brings. If you’re trying to ship something, say you’re a start up, and you need to raise your next [inaudible 00:56:41], you need to hit very specific goals in the next six months. The embedded unit can really help you align everybody, really clearly, assuming you already know what needs to happen. When you’re in much more of the greenfield space, the embedded units, it’s harder for them to deliver that kind of work, especially from the data science side, because data scientists really kind of shine when they’re working with other data scientists consistently, and bouncing their ideas off, and thinking about things.
Kevin Perko: They’re like outside of the bounds of what people are envisioning of the next version. That’s where it just becomes, you could have a product manager who totally gets data science and they can do that in that model, and it works, and you don’t actually need this. You can get the same gains that you would get. What I’ve seen practically is that there’s not very many people like that. Depending on your organization, and what you’re solving for, there is kind of like a point where you want to think, okay, when … am I getting enough out of the data science team? If not, then it’s kind of the business unit approach, and it’s important to pair that approach with a full stack approach. It’s data scientists, engineers, whoever they need to ship their product.
Kevin Perko: Maybe in your company it’s front end engineers and designers. They should have it, and they should be accountable just like a product org, and run it the same way. It’s no longer a support function. It’s now a unit shipping product that’s driving your company forward, and you can’t have them sort of constrained by other parts of the organization, because then you’re not going to really get to see what they can do. I think that it’s simply a trade off for the business, depending on what you want to achieve. It’s not like, oh, you have to do this, or you have to do this. It really depends on the goals of the business.
Kevin Perko: I’m always going to say that the business unit is going to be really more powerful. Longer term it’s going to create more value. I feel very strongly about that. I think though in the short term, and in the medium term, that can be very iffy. If those are really where the business is focused, they can have different ways of approaching that.
Kirill Eremenko: Okay, gotcha. Thank you for that overview. I got like a question, a philosophical question for you, where do you think the field of data science is going, and what should our listeners prepare for to be ready for the future that’s coming in the next three to five years?
Kevin Perko: I think we talked about this a little bit earlier, where data science is starting to pervade every part of our daily lives, and so people are now asking these big questions about, hey, how does it impact my privacy, how did the model make this decision. I think privacy and interpretability are going to become increasingly important. I think you see this a little bit with Android and iOS, and you can do some on device training, or serving, depending on how you set it up, that can really actually drive user privacy, and machine learning. Those two things used to be opposed. Now they can be united. I think privacy is generally becoming a big worldwide thing as people realize the value of the data, and the value of their privacy that they’ve just kind of given over to corporations and governments, so they want it back.
Kevin Perko: I don’t think that’s going away. You have things like the blockchain, which is, at a high level, sort of a universal trust and verification system. It’s really exciting to think how can data science intersect with that, can we actually write contracts with Ethereum, that are socially enforceable, and build models, and have all of these sort of units served where we have general ledgers of trust, and where does data science play in that, like how can we think about what kind of society we can have, and what data science can enable within that. These are really big questions for us to ask, because I think the models, it’s sort of both, they’re already there in the incredible things they can do, and they’re really far away in the things that we think that get hyped a lot, like actually having autonomy in self driving cars.
Kevin Perko: Computer vision is still very, very early. I think that it’s going to get deployed in a lot more situations where it’s actually making decisions for classifying people, where it’s probably not ready. That’s just going to happen. The best thing that we can do is to really push the interpretability, so people can say, “Oh, it’s kind of clear this algorithm isn’t ready, but we can pair it with humans.” That’s what a lot of businesses that use AI, do. They pair it with huge amounts of people labeling the data, and evaluating the decisions the model made, and understanding if it’s right. We need to continue to do that same thing as it gets out into society in general.
Kevin Perko: Everybody needs to be able to evaluate a model, and understand if the decision it made based on this information is reasonable, and have debates about it, as it comes into society. I think that’s real exciting, because people are now building … You have [inaudible 01:01:20] processing units, but this computer is specifically dedicated for serving and in some cases training models, and that’s real exciting, because most of the limit I think of like, there’s machine learning and neural networks, and general AI has really been on the compute, this is like pushing it back to the algorithm. Then you see once that happens, every kind of six months people are sort of pushing the state of the art, and that’s going to continue to happen, as long as we don’t run into another compute wall.
Kevin Perko: I think the future can be sort of whatever we make it. It can be a dystopia 1984 type situation, where we’re all getting bound by this facial recognition that we don’t know how it works, and the government’s using it, or we can create this real incredible future where we can be revolutionizing how food is grown, and how water gets preserved, and how we’re tackling climate change, and data science can move into all of these fields, and it should, and it can help. We can help people understand what’s actually behind all these decisions, and make better allocations of our resources, using data science models, and using a lot of models that already exist today.
Kevin Perko: It’s kind of getting them into government, getting them into these really large companies that move really slowly. That’s sort of a really big piece, is kind of the pervasiveness as much as pushing the state of the art of data science. That’s really exciting work, can open up new implications and new technologies, and new products. I think that there’s also a lot of gains to be made on just increasing the pervasiveness of data science among existing industries like schools, and governments. That can have a very large positive effect.
Kirill Eremenko: Gotcha. It seems like we’ve gone full circle here on the podcast, that we came back to where we started from, that applied data science is kind of the answer, don’t just learn data science for the sake of learning data science, but see what impact you can make in the world, whether it’s through various industries, and exciting projects, or it is through bringing data science to government, and society, in a very understandable, secure way, that respects people’s privacy.
Kevin Perko: Absolutely. I think that’s a great summary, because you can solve a lot of problems with regressions, better than they’re being solved today, and people can understand those decisions, and can actually improve the world doing that, which is really exciting.
Kirill Eremenko: Fantastic. Well, thank you, Kevin. This brings us to the end of today’s episode. Before I let you go, what’s the best way for people to contact you, get in touch, follow your career, learn more about what you’re doing?
Kevin Perko: People can follow me on Twitter, at croatiankp. We’ve got a data science blog, Scribd data science and engineering blog on Medium. Obviously there’s LinkedIn, feel free to follow me there, although I don’t post very much material on LinkedIn. I think those are all great places.
Kirill Eremenko: Nice job. Obviously people can apply for positions that you’re looking to fill on the Scribd website, right, you said?
Kevin Perko: Right, right. You can go to Scribd.com/jobs, and we have some data science openings, you can apply there as well.
Kirill Eremenko: Fantastic. Well, we’ll share all those links in the show notes. Make sure, guys, and everybody listening to get in touch with Kevin, follow Kevin. Kevin, one more question for you before we finish up, what’s a book that you can recommend to our listeners, that will help them in their careers, or in life?
Kevin Perko: I recently read Bad Blood, which is about the Theranos founder, Elizabeth Holmes, and I think it’s a really incredible book, because it sort of shows this intersection of building a future, and how you can kind of go over the line with that. You get kind of caught up in your own, you go in your own potential too much, building the future’s actually really hard. When you’re dealing with something like health care, if you get caught up in those things, you can create very bad outcomes for people. It’s kind of a good sort of message for data scientists, of like, we can take this incredible tool we have and use it for bad, or we can kind of say, “How do we leverage this thing,” and really kind of think about how we drive new, amazing systems, and strengthen the world in a better way, using it. 
Kirill Eremenko: Yeah, I actually watched a documentary about that on the plane recently, and indeed, extremely interesting and very educational story for anybody in technology and data science, that the things that as you said, could be used for good or for bad, and even trying to use it for good you can get really caught up in the promise that it has, that technology. Sometimes we’re not there yet, like with the whole self driving cars. Right? We need to navigate our way to get there first.
Kevin Perko: Exactly, exactly.
Kirill Eremenko: Gotcha. Okay, well, Kevin, thanks so much. Looking forward to seeing you in person at DataScienceGO. All right.
Kevin Perko: Absolutely. I can’t wait either.
Kirill Eremenko: There you have it, ladies and gentlemen. That was Kevin Perko, Head of Data Science at Scribd. Thank you so much for joining us for this conversation today. I hope you enjoyed the chat that we had, and probably for me, one of the favorite parts was what Kevin mentioned about the different types of data science teams that you can have. You can have a decentralized team where all your data scientists or machine learning experts are embedded within the different divisions of your business, or you can have a centralized team of data scientists, a standalone core data science team. There are advantages and disadvantages to both, but it’s important to understand that it is a conscious decision on how a business should do that.
Kirill Eremenko: If you’re a business owner, or entrepreneur, that’s something to think about. If you’re a data scientist, that’s also something to think about in the sense of, how does your business do it at the moment, or how does the business that you’re applying for do it. That’s a question that you might want to ask at an interview, to understand better what your role is going to be about. If you enjoyed this conversation with Kevin, I am 100% sure you’re going to enjoy his keynote at DataScienceGO 2019. If you haven’t gotten your tickets yet, head on over to www.datasciencego.com, and join us this September 27th, 28th, 29th, in San Diego. Wonderful city, wonderful conference.
Kirill Eremenko: Get to network with Kevin, lots of other amazing, insightful speakers. We have over 30 speakers attending, and of course we’re going to have between 600 and 800 data scientists coming to DataScienceGO. You don’t want to miss this opportunity to expand your network. We had people fly all the way from Brazil on 27 hour flights, on 20 plus hour flights from Europe in the previous years, so distance is not an excuse. I look forward to seeing you at DataScienceGO, and networking with you personally.
Kirill Eremenko: On that note, thank you so much for being here today, and I’ll see you next time. Until then, happy analyzing.