SDS 001: Ruben Kogel on Self-Serve Analytics, R vs Python, and Mentors in Data Science

Podcast Guest: Ruben Kogel

September 10, 2016

Welcome to the very first episode of the SDS Podcast. I’m super-pumped! Check out the episode intro to learn more about what this podcast is all about.

Today’s guest is Ruben Kogel, Data Scientist and ex-Head* of Content Analytics at Udemy, San Francisco, US.

Everybody comes into Data Science from a different background. Some studied maths, some took stats classes, some came from finance, etc.

Ruben’s story is especially unique in that way – for many years he used to be a Chemical Engineer!

What prompted him to make the transition? His MBA. Yes! Ruben found a way to structure his MBA in a way that led him to an ultra-successful career in Data. And he hasn’t looked back since.

We talked about what it’s like to be a Data Scientist in an agile start-up environment in Silicon Valley, how to set up and promote self-serve analytics systems, how to keep your workflow under control, and the importance of mentors in Data Science.

I can’t wait for you to hear his story and insights in Episode 1 of the Super Data Science podcast. Since recording this episode, Ruben has resigned from his role at Udemy and moved on to exciting new challenges! This goes to show how Data Scientists are always in high demand.

In this episode you will learn:

  • How to present insights in simple terms
  • The rise of self-serve analytics
  • PostgreSQL and Wagon
  • R vs Python
  • Managing a team of Data Scientists
  • Communication in Data Science
  • Importance of Mentors

Items mentioned in this podcast:

Follow Ruben

Episode transcript

Podcast Transcript

Kirill: This is episode number one with ex-chemical engineer and now data science wizard, Ruben Kogel.

Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. And each week, we bring inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make complex simple.

(background music plays)

Hello and welcome to the very first episode of the SuperDataScience podcast. I can’t explain how excited I am to finally get the show off the ground. I’ve had this idea for literally months now and finally today we’re kicking things off. And what is the show all going to be about?

The show is going to be about inviting the most inspiring data scientists in the world and talking to them about what they do, what their backgrounds are, what they’ve learned on their data science journeys, and what insights, tools and methodologies they can share with us — what we can learn together from them. So I’m very excited that you’re here from the very start, listening to the very first episode. Thank you so much for being part of this journey. I’m sure that together we’re going to learn a lot.

And I’m very glad that we kicked this very first episode off on a very high note: I spoke with Ruben Kogel, who’s a data scientist at Udemy. If you’re not familiar with Udemy, it’s the biggest online education platform in the world. Currently there are over eleven million students learning through Udemy, so if you haven’t checked them out, definitely check them out. There are courses on basically anything that you could imagine, and personally I’m an instructor on Udemy as well — I have over twenty courses there and nearly fifty thousand students. So it’s a great learning platform, and Ruben Kogel is one of the head data scientists in one of the divisions at Udemy — a division that works on content and content marketing. Ruben shared some very powerful insights about what he does on a daily basis at Udemy, and moreover how he transferred his chemical engineering background into a data science skill set — how, in taking an MBA, he specifically selected his subjects in such a way that he was able to learn more about data science and get into that field, making the jump from chemical engineering to data science through his MBA.

He also talked about communicating insights and how important that is in a data science role. We discussed lots of other topics, such as identifying problems when you’re a data scientist and how important that is, and how to combine data science and product strategy — that is something that Ruben does on a daily basis, and you’ll learn more about that. It’s a very powerful skill to have, especially if you’re working in the start-up space, in companies predominantly in Silicon Valley or any other location with that start-up culture. Then we also spoke about managing a team of data scientists. Ruben has quite a bit of experience around that, so if you’re a manager in data science or in the analytics space, you can pick up some good tips from there.

And also, we talked about managing the inflow of requests — and that is valuable to any data scientist, how to manage the inflow of requests — and Ruben gave us an example of his approach with the Trello board. We talked about mentors in data science and, of course, we talked about lots of different analytics tools. We talked about (inaudible) SQL and an add-on which I didn’t personally know about. We talked about Wagon, and we talked about R versus Python — the (inaudible) question of which one is better, which is more preferable. And we talked about many, many other things in this podcast, so I’m sure you’re going to enjoy it. We even touched on self-serve analytics, a growing space within the field of data science. So without further ado, I bring to you Ruben Kogel of Udemy.com. Enjoy.

(background music)

Kirill: Hey guys, welcome to the Podcast. I’ve got Ruben Kogel here from Udemy. Super excited about this very first episode.

Ruben, welcome! How are you going?

Ruben: Thank you! Thanks for having me over. I’m doing great.

Kirill: Awesome. It’s great to hear you. For those of you who don’t know, I met Ruben for the first time when I was in San Francisco a couple of months ago for the first Udemy Live conference. It was pretty exciting, and he gave some great presentations. But just so everybody gets up to speed — Ruben, tell us a bit about what you do. What’s your title in the company, and what exactly does your role involve?

Ruben: Yeah, sure. So, I’m the Senior Manager of Analytics and Strategy at Udemy, and basically what I do is help the content team — which is the team that looks at the courses and the content on Udemy — figure out what is in the catalogue, what’s in the selection of courses, what’s the quality of the courses, and how we can improve our catalogue to make our students happier. In practice, a lot of my work has to do with: are we measuring the right things, are we measuring satisfaction, is that data available so that the people who are in charge of bringing in courses know which courses are good, and how do we measure the selection of courses so that we can slowly build a better and better catalogue for our students?

Kirill: Awesome! That’s pretty cool so you apply Data Science techniques in order to measure those kinds of metrics. Is that correct?

Ruben: Totally. I mean, data science comes into different parts of my job. There’s the more basic part, which is instrumentation. Just like in any data endeavour, you wanna make sure you’re measuring the right things and that the data is available. For me that means: are we measuring student satisfaction correctly, and is that data available so that people in the company can access it and make decisions on it? There’s another level at which I use data science when broader questions come in — that’s what I call ad-hoc analysis. So someone might ask me, “What happens if we remove some of the low-quality courses from the platform? Can we predict the impact on satisfaction or revenue?” That’s the kind of question where you need to know the structure of the problem but also come up with some predictions, and use some techniques to evaluate what the impact would be, maybe with some confidence intervals. And there’s another type of question, which can be: well, we know that there’s a variety of courses and they all have different quality scores, and we wanna know what it is about these courses that drives those quality scores — is it the audio quality, the video quality, the instructor’s delivery? In that case data science comes into play both in structuring the problem and in running some sort of statistical analysis to extract the importance of the different variables, and coming back with an answer saying, “Well, I think this variable is the most important, and if you move that variable by 1%, it will have an impact on overall student ratings by that much.” So that’s another example.

Kirill: That’s pretty cool. So there are kind of two main types — one is measurement of existing metrics and how you can tweak them to improve the experience of students, but the other is more forward-looking, where you’re doing behavioural analytics and predictive analytics to see how you can change things so that the future experience is going to be better. That’s pretty cool in my view, and it’s great that you get to do both parts in your role as a data scientist at Udemy.

Ruben: Yeah, it’s very, very cool. Actually, what I really like about my role here is that it’s really the interface between data science and, sort of, strategy and product, because I get to work on data sets and do a lot of data analysis, but at the end of the day the people I talk to are the VP of Content or, you know, the Director of Course Acquisition — people who are doing the business. And I get to give them recommendations, I get to influence their decisions or influence the product with data. So it’s this really cool interface between data and business.

Kirill: Definitely, and that’s what I’ve also found — the most interesting and the most impactful roles and careers happen at the intersection of two fields. You could take, for example, physics and chemistry, or biology and chemistry — those are exaggerated examples — but even in data science, one thing is just to do analytics, and a whole other thing is to do the analytics and at the same time convey the findings and work with the people that use the analytics. And just on that, how do you find conveying such complicated analytics to your stakeholders — like, as you said, the VP of Content? Are there any particular approaches that you use, or any tips you can give our listeners on communicating these findings to senior stakeholders in the company?

Ruben: You know, I’ll start by saying that any analysis that you do is useless and irrelevant if you’re not able to communicate the findings to people. So communication is a huge part of being a successful data scientist. There are two or three rules that I try to follow when I communicate things. One is to translate technical concepts into something that people can understand. So you don’t just talk about confidence intervals — you can talk about your confidence in the data, or you can say, “I think the predicted revenue will be between these two bounds.” You don’t have to say you’re ninety-five percent confident, because that doesn’t add a lot of value. But at the same time, you also wanna convey precision in your communication and make sure you don’t throw out answers like, “Yeah, I think we should do A because…” I think it’s important that you convey, “Well, I looked at the data, and if we do A, we can lift revenue by this amount” — maybe with a bit of an interval around how much I think we can lift revenue, but this is what the analysis is showing. So it’s this balance between being precise — showing that you’ve done your homework and that this is rooted in data — and at the same time translating it into layman’s terms and stripping away any technicalities that don’t add a lot of value to your message.

Kirill: Totally agree. I’ve got a classic example of that: stakeholders, especially senior stakeholders, are often very sceptical about sample sizes. If you run some analysis on a sample size of 157 or 300, they might be prone to saying, “Actually, we want a sample size of 10,000.” But you, as a data scientist, know that the sample you ran is statistically significant. So it is important to convey these findings in a way that — when you are confident in them yourself — you don’t have to go into all the detail and explain the exact methodology behind it, but instead just convey the confidence in the way you present and position your analysis. So, totally agree on that one.
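To make the sample-size point concrete, here is a minimal sketch with entirely made-up numbers (neither speaker quotes these figures), showing why a few hundred observations can already pin down a mean quite tightly. It uses only the Python standard library:

```python
from statistics import NormalDist
import math

def mean_ci(sample_mean, sample_std, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # ~1.96 for 95%
    half_width = z * sample_std / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical numbers: 300 course ratings, mean 4.2 stars, std 0.8.
lo, hi = mean_ci(4.2, 0.8, 300)
```

With 300 ratings the 95% interval is only about ±0.09 stars wide — often a more persuasive thing to show a stakeholder than the phrase “statistically significant.”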

The interesting thing that I’d like to ask you — and probably a lot of our listeners would be curious about — is: can you tell us a bit about your background? How did you come to be a data scientist? Of course, you’ve progressed further in your career and now you’re the head of analytics in your department, but how did you originally get into data science? It’s quite a new field, and back in the day it wasn’t being taught at university — so how did you get here, and what were the steps?

Ruben: It’s a very interesting question, because I had a very meandering path. I didn’t start in data at all. My background was in applied physics, then I switched to materials science, and I was a chemical engineer for many, many years. I was dealing with data, but not the type of analysis that you do as a data scientist — it was a lot rougher and a lot less sophisticated. What happened is that I was actually looking to transition to something different. I went to business school, and you don’t typically think of business school as a place to learn data science, but I had the opportunity to take two statistics courses and a data mining course, and I suddenly fell in love with the field. The more I learned about it, the more I was intrigued. So I really learned the theory in business school, and then I had this opportunity to come work at Udemy and apply some of my knowledge, and that’s how I started my career. So it’s pretty recent, and it was also a pretty stark contrast from what I was doing before.

Kirill: Definitely. That’s a great jump, and like you say, one wouldn’t expect to learn all the necessary data skills at business school, but I guess you picked the right subjects — and that’s a great testament to how lucrative the data science field is. Tell us about the skills you developed as a chemical engineer, working in that field. Is there any way to leverage them currently? Because data scientists come from all different areas — some people come from accounting, some from economics or finance — but coming from a chemistry background, are there any skills or any particular mindsets you can share with us that you leverage in your current work as a data scientist?

Ruben: Yeah, totally. Surprisingly, it’s not the data skills that I leverage from my chemical engineering background — it’s more the problem-solving and communication skills. Any engineering work has a heavy component of troubleshooting and problem solving, and I was doing a lot of that in my job. It really forced me to come up with a very systematic approach to breaking down a problem, coming up with hypotheses, testing those hypotheses, and being extremely systematic, organised and structured in my thinking. So I definitely learned that from my engineering background. The other thing I learned is, as I mentioned, communication skills. As an engineer, especially in my position, I was doing a lot of account management. I also had to translate a lot of very complex technical experiments and results into something that the business people could understand, and the ability to summarise complex concepts and notions into a few well-prepared words — whether it’s a slide or an email — that really sum up the insights is very important. That’s something I learned in my engineering job, and something that has helped me in my current career.

Kirill: Sounds really cool. The problem-solving skills are definitely a very valuable thing to have when you’re dealing with data science challenges, but from what you’re saying, I gather that your communication skills have played a very significant role in your success. My thinking is that a lot of times, when people are starting out in the field of data science — now that it’s getting more and more popular — they can get pigeonholed into certain roles where they’re performing some analytics but don’t have the exposure to go and share their insights with stakeholders. They’re just performing certain SQL queries or certain analytical procedures, but they don’t actually have a chance to communicate insights. So what would your suggestion be, off the top of your head, for listeners who might be in that situation, to start developing those communication skills nevertheless?

Ruben: Yeah. I think there are two ways you can do that. One is in your day-to-day interactions with customers. I think of analytics and data science as having customers inside the company — whether they’re technical customers or business customers, people ask you questions or ask you to perform data analysis to give them an answer. So whenever you interact with your customers, you can always go the extra mile and structure your answer not just as the output of a regression analysis or a SQL query — you can try to explain why you think this is the right thing to do, what it means in practice, and what your recommendation would be. So always push the results out not just as the technical output, but packaged in a way that shows you’ve thought about the implications and the meaning of the analysis. That’s one thing.

The second thing — and I always recommend this, whether to people who’ve worked for me or people working in other positions — is that whenever you’re given a problem, it’s always good practice to try and dig into exactly what problem the person is trying to solve. Oftentimes someone will come and say, “Hey, can you pull this data?” or “Hey, can you build a dashboard for that?” or “Can you do an analysis around this?” Really, they’re trying to solve a problem, and they may not tell you what problem they’re trying to solve. So if you engage with that person — your customer — and try to really get to the bottom of what problem they’re trying to solve, all of a sudden you’re starting a conversation around what they’re trying to achieve, what you can bring beyond just the requested analysis, and how you can help them frame the problem. All of a sudden you’re engaged in communication — expressing a problem, breaking it down into more elemental data problems, and coming back with a solution that addresses the real underlying problem. That’s one way you can push the boundaries of your current job and expand into delivering more value, built upon this communication and upon understanding the underlying problem. So that’s number one.

Number two: generally speaking, analysts have this unique ability to look into the data and come up with insights that no one else in the organisation has. So as an analyst, you also have the opportunity to start addressing problems that other people may not have thought about, and you can create value by being a little proactive about what you think you should be looking into — what you think would be a useful analysis or some useful insights for other teams. There’s also that opportunity, because no one else can really look into the data: you’re the only person who has access to the data and who can bring out the insights.

Kirill: Beautiful. Love it. Especially the “what exactly is your problem?” — that’s one of the best skills to have as a data scientist, helping people identify the problem, because they often come just asking for data instead of actually identifying the problem. Regardless of what level you’re at as a data scientist or data analyst, that skill is the one that will eventually push you forward. Combining it with what you said about the proactive approach, that’s where you become the doctor for the organisation: you walk around, diagnose what’s going wrong, and work out how you can help fix it. So yeah, I definitely agree with those two — some very powerful skills to have. And you mentioned that you’re in charge of a few people. Can you tell us more about that? How many people are you in charge of, and how do you find managing a team of analysts or data scientists? What are the challenges that you face on a day-to-day basis?

Ruben: Right now my team is down to one person. At its heyday we had a bigger team, but you know, things move quite quickly in Silicon Valley. So I have one person reporting to me, and I’m hiring at least one more person at the moment. The thing I find most challenging in terms of managing and growing analysts or data scientists is that oftentimes there’s this tension between trying to please your internal customers — trying to make as many people happy as possible in the shortest amount of time — and trying to build long-term value and work on longer-term projects. The way I think of it, as a data scientist you’re at the end of the chain. There are a lot of people in the organisation who manage projects and ask other people to do things, and eventually, when you put all the contributions together, you come up with a product. But when you’re in data science, you rarely ask other people to do things — usually you’re the last person that gets asked. So there are a lot of people coming to you, you end up with a ton of requests, and managing that inflow and the different things you’re working on can be very challenging. Oftentimes new analysts tend to gravitate towards the more short-term projects — being the last person in the chain, they think, “Oh yeah, I’ll do it right away” — to the detriment of working on the bigger, longer-term, more impactful projects. So that’s one of the challenges.

Kirill: That’s definitely true, and that kind of flows into the art of saying no to people who come to you as a data scientist — especially when you’re running a department, and when you’ve had a few successes you’ll find even more departments, people and stakeholders from across the company coming to you with requests. So how do you say no? How do you tell people, “Hey, your project is really cool and I’d love to work on it, but at the same time I’ve got other commitments”? What are your tips around that?

Ruben: You know, you don’t really say no. The truth is (inaudible) you make it very clear what your priorities are, and one of the tools that we use here at Udemy is a Trello board. Trello is a productivity tool — it’s like virtual post-its. Essentially it enables you to show a public board of what you’re working on, what stage the different projects are at, and who the stakeholders of each project are. If someone asks you to do an analysis, you can say, “OK, no problem — can you just add a card on my board?” And they quickly realise that their card is one of 50 other cards, and that there’s actually a prioritisation process before the team starts working on it. So that’s one thing we’re doing.

The second part is: really, what you wanna do is not work on each individual request and serve them one by one. What you wanna build is a scalable infrastructure — you wanna build scalable analytics. What that means is that instead of answering the same question over and over again, or pulling data for everybody in the company, you build self-serve tools. You build tables, dashboards, web UIs that enable people to access the data they need, so that they don’t ask you to pull the data and do the analysis anymore. Then you can free up some of your bandwidth to work on the more interesting projects.

Kirill: Awesome. Love it. Self-serve analytics is becoming a more and more popular concept in the world of data science, for exactly that reason — to free up data scientists and to empower the end users to do their own analysis. So, just on that, what are the tools that you use at Udemy for the self-serve analytics side of things, if you can disclose this?

Ruben: The basis of our self-serve analytics is creating a set of summary tables that have all the relevant information and cover ninety percent of the use cases. What I mean by that is, for example, typical questions that come up at Udemy are: “Do you have a list of courses that were published this month?”, or “What is the total revenue of courses that were published last month?”, or “Can you look up a particular instructor and see how many enrolments or reviews she has on her courses?” So there’s a limited set of questions that come up over and over again, and instead of building queries every time to extract the information, you just build one or two tables that have all that information summarised, so people can access the data directly — either by querying the table with SQL or by looking it up on a dashboard. In practice, we have all our data flow into Redshift by Amazon, which lets you build tables on top of the raw tables. So we have what we call summary tables built upon the raw data, and they can power dashboards — we use the Chartio UI, and people can use Chartio to extract whatever information they need.
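The summary-table idea Ruben describes can be sketched with a toy example. The schema and numbers below are invented for illustration, and SQLite stands in for Redshift (whose SQL dialect differs): the point is simply that one pre-aggregated table answers the recurring “enrolments and revenue per course” questions without querying the raw data each time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
cur = conn.cursor()

# Raw event-level table (hypothetical schema).
cur.execute("CREATE TABLE enrollments (course_id INT, price REAL, enrolled_at TEXT)")
cur.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?)",
    [(1, 10.0, "2016-08-01"), (1, 10.0, "2016-08-15"), (2, 25.0, "2016-08-20")],
)

# Summary table answering the recurring questions in one place:
# enrolments and revenue per course.
cur.execute("""
    CREATE TABLE course_summary AS
    SELECT course_id,
           COUNT(*)   AS enrollments,
           SUM(price) AS revenue
    FROM enrollments
    GROUP BY course_id
""")

rows = cur.execute("SELECT * FROM course_summary ORDER BY course_id").fetchall()
```

In a real setup a scheduled job would rebuild such summary tables, and the dashboard tool would read only from them.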

Kirill: Okay, so Redshift and Chartio — those are the two tools. Very nice. And you use Amazon AWS for the storage of the data, correct?

Ruben: Uhuh!

Kirill: And how have you found that? Has that been a recent transition for your organisation, or has that always been the case?

Ruben: No, it’s been a transition. About a year and a half ago we had only our traditional MySQL on our own server, and then we started exploring Redshift. (inaudible) Redshift, and we saw that it was much more powerful for analytics, because a typical MySQL database is optimised for writing, but not necessarily for creating new tables. Redshift is optimised for doing a lot of joins, a lot of analysis, and reading data. So we’ve been using Redshift for the last year and a half, we’ve been scaling the size of our cluster as our data and our analytics team grow, and it’s been serving us pretty well, actually.
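The read-versus-write trade-off Ruben mentions comes down to data layout. The toy sketch below (illustrative only — real engines add indexes, compression, and much more) contrasts a row-oriented layout, where each record lives together as in MySQL’s default engines, with a column-oriented layout, where each column is stored contiguously as in Redshift, which is what makes single-column aggregates cheap:

```python
# Row-oriented: each record stored together (like a typical OLTP database).
rows = [
    {"course_id": 1, "price": 10.0, "rating": 4.5},
    {"course_id": 2, "price": 25.0, "rating": 4.0},
    {"course_id": 3, "price": 15.0, "rating": 3.5},
]

# Column-oriented: each column stored contiguously (like Redshift).
columns = {
    "course_id": [1, 2, 3],
    "price": [10.0, 25.0, 15.0],
    "rating": [4.5, 4.0, 3.5],
}

# An analytical aggregate touches one column. The row layout must walk
# every whole record; the column layout scans a single contiguous array.
total_row = sum(r["price"] for r in rows)
total_col = sum(columns["price"])
```

Conversely, appending one new record is cheap in the row layout (one dict) but touches every column list in the columnar layout — mirroring why transactional databases favour rows and analytics warehouses favour columns.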

Kirill: I’ve heard a lot of comments about that — in AWS you can scale with your needs, and that’s one of the biggest advantages of AWS: as your organisation grows and needs more capacity, you don’t have to purchase it in advance, and there are month-to-month and other types of plans. Very convenient. Usually the only concern — and organisations that deal with customer data, client-facing data, such as Udemy, have commented on this as well — is that they may have certain regulatory issues with outsourcing the storage of that data to the cloud or to external systems like AWS. Did you face any of that when you were making the transition?

Ruben: Not really, in the sense that the data is still secure. Actually, in the United States, the only data that is heavily regulated is health data, so if you’re subject to the health regulations, it’s more difficult to work with the cloud — you have to work with certified vendors that can handle the regulations around health data. For the rest, AWS serves a large range of start-ups, and they all have sensitive data, but they’re set up to handle sensitive data correctly, so there’s not much concern around that. The only thing is that sometimes you want the data to be used for web apps, in which case reading from Redshift is not the best way to populate fields on your website or web app, so you might need a second representation of the data that is faster to read than columnar storage.

Kirill: So we’ve started delving into the tools — we can move on now from the management tools like Trello, which I found pretty interesting (how you get somebody to post a card on your Trello board and then they realise, “Hey, there are projects that are going to be prioritised, and mine might not be done very quickly”), and we’ve moved on to Redshift. What are the other tools that you use on a day-to-day basis in your analytics role?

Ruben: The two technologies that I use all the time are SQL — in this case PostgreSQL, which is the basis for Redshift — and R, so I do much of my analysis in R. Basically, that’s the language that I learned, and I’m very familiar with it. Some other people at Udemy use Python; it really varies. But both tools are very convenient and flexible, and those are the tools of data scientists nowadays. Both Python and R have lots of packages, they’re open source, and they have very active communities, so they’re typical for data analysis. In terms of the tool where I actually write my SQL, there’s a really neat company called Wagon. They have a beta product, but right now they have the best SQL editor, especially for PostgreSQL. It’s really neat because you can organise your queries into different folders, everything is always saved in the cloud so nothing is lost, and it has a very neat interface that’s very easy to use.

Kirill: Awesome, so that was Wagon?

Ruben: Wagon, yeah.

Kirill: How do you spell that?

Ruben: W-A-G-O-N

Kirill: Beautiful. I hadn’t heard of that one before — something to definitely check out. And it’s very interesting how your organisation has a split between R and Python. I guess in the start-up world that’s more common, but in larger organisations that have been around for a while and have a lot of legacy behind them, analysts usually don’t have that luxury of being able to choose between the two. R is definitely something a lot of our listeners are interested in, so maybe you could share a couple of packages or techniques that you most commonly use when you’re coding in R?

Ruben: Yeah, totally — though I’m probably gonna disappoint you and your listeners. I don’t use a lot of different packages and advanced techniques. I tend to stick to basic R; I don’t even use ggplot, I just use the basic R graphics. Part of the reason is that I don’t see R as a way to produce extremely sophisticated reports or analyses. I use R mostly to extract the information that I need and run the basic analysis. So the typical thing I would do in R is import a CSV file, read it, do a little cleaning — although I usually prefer to do my cleaning in SQL, because I don’t think you should be doing the cleaning in R at all; the cleaning should really occur upstream — and then use R to do some data exploration, create some graphs and do some statistical analysis. Typically I would run regression analysis — I use lm to do lots of regressions — and I use (inaudible) to run random forest models and understand the relative importance of the different features in the model. So that’s kind of my usage of R.

Kirill: Wonderful. I totally agree with that. R is definitely a very powerful tool, and every analyst, every data scientist has the right to use it in the way they prefer — and you're using R with a very lean approach, which can be very powerful as well. The question — the million-dollar question — would be R versus Python. What are your thoughts, and why did you end up picking R?

Ruben: My answer might disappoint you again: I never really chose between the two. I started with R and it fits all my needs. I spoke to a couple of people who use Python, and I quickly realized that for the type of analysis I was doing — offline analysis, building offline models, trying to understand the drivers of a particular metric — Python would be more complex and would not add a lot of value. What I mean by that is that in my day-to-day job I don't build actual data products. I don't try to deploy a predictive model onto our infrastructure, and therefore I don't need it to be in Python. All I do is download some data, generally from Redshift, and try to build a model to understand what's driving a metric up or down. For that type of use case, from what I've read, R is the simplest and most direct way of analyzing the data — and it's also the language that I know.
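The day-to-day loop Ruben describes — query the warehouse, pull a small result set down, explore it locally — can be sketched as follows. This uses Python's built-in sqlite3 as a stand-in for Redshift (a real setup would connect with a PostgreSQL-compatible driver instead), and the table and column names are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the Redshift warehouse here;
# the query-then-explore loop is the same either way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (course TEXT, price REAL)")
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [("stats101", 20.0), ("stats101", 30.0), ("ml201", 50.0)],
)

# Aggregate upstream in SQL, then pull only the small result down
rows = conn.execute(
    "SELECT course, COUNT(*), AVG(price) "
    "FROM enrollments GROUP BY course ORDER BY course"
).fetchall()
print(rows)  # → [('ml201', 1, 50.0), ('stats101', 2, 25.0)]
```

Doing the aggregation in SQL and only shipping the summary to the analysis environment is exactly the "cleaning happens upstream" habit Ruben mentions earlier.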

Kirill: Beautiful. That's definitely the case in many situations — you start with one, and if it suits your needs you just keep going with it; what's the point of changing? Totally appreciate that. Alright, that's really cool, and we went into some detail on the tools and techniques you use — I loved that part of the conversation. Let's move on to some of the softer stuff. For instance, I can see that you made this transition to data science from chemical engineering and never looked back — you powered through, progressed in your career, grew other data scientists on your team and acted as a mentor. But along the way, did you have any influences that helped you become, and persevere as, a data scientist? Maybe some mentors you had, or hobbies, or life-changing events, or even some articles — something that really influenced you and helped you along this career path?

Ruben: There was a big learning curve for me. I'd never worked as a data scientist before, and in fact I didn't really have a mentor at Udemy, so I had to figure out a lot of things by myself or ask people outside the company. So I really encourage people to seek out mentors. In my case, I had a good friend who had a bit of a head start on me — he started doing data science five years before I did. He's someone who has strong opinions, but he's also very thoughtful about the different approaches and choices he makes, and I often had conversations with him. We had a regular coffee where we exchanged views on technologies and techniques, and I found those interactions very helpful. I also went to different conferences — I wouldn't say all of them were helpful, but the one I really liked was Airbnb's conference in San Francisco, called OpenAir, which I really appreciated. And I follow some blogs and newsletters. There's a newsletter I really like called datascienceweekly.org; it's a collection of about 10-12 interesting articles every week. I don't read all of them — I just pick the ones that seem interesting, and if I get past the first paragraph and it's really interesting, then I read the whole thing. Slowly I've built up a catalogue of tips and thoughts that I find very useful. In particular — I can't remember if it was through this newsletter or maybe somewhere like LinkedIn — I found someone posting an old article by Leo Breiman, the guy who created random forests, which he published in 2001, and it really resonated with me. I would encourage anyone who is considering data science, or starting out in data science, or even someone more advanced in their career, to read this article, because for me it expresses exactly the way I feel about the tension between statistics and machine learning — the tension I feel between building explanatory models and predictive models — and it does that in a very compelling and neat way. The article is called "Statistical Modeling: The Two Cultures" by Leo Breiman, published in Statistical Science in 2001. I find it the best read to really get some grounding in data science.

Kirill: Wow, wonderful. I haven't heard of that article, but I will definitely check it out — it sounds like a great read. So that's by Leo Breiman; we'll include that in the show notes. The next question I have is: could you share with us any recent data science wins that you've had in your department at Udemy?

Ruben: Two of them come to mind. My team has been responsible for building the new spam filter on Udemy — that's the filter that determines whether a review is trustworthy or not. The old spam filter was built on a set of rules. It wasn't even a naive Bayes, just a set of rules that made total sense at the time, but over time people learned how to circumvent them. They figured out the logic of the filter, and there was a lot of untrustworthy, spammy content on Udemy. My team tackled this problem, and we were able to improve the accuracy of the spam filter by a factor of eight — a big win for the team, but mostly for the company as a whole. That was one of them. The second one that comes to mind was an ad-hoc analysis we did a couple of weeks ago. You're an instructor on Udemy, so you know that we changed the pricing strategy, and there was a resulting change in student behavior. My team looked at final price, list price, discount — how all of these influence the purchase decision — and whether we could build a model of how students would react to different pricing strategies. It was a very simple model, nothing really sophisticated or complex, and it was able to explain the data we had observed in the past. For that reason it was powerful: it had explanatory power, but it was also simple enough that people could understand what it meant.
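Ruben notes the old filter was a hand-written rule set — "not even a naive Bayes." For context, naive Bayes is roughly the simplest learned alternative to such rules; here is a toy version in plain Python. The training reviews are invented, and this is an illustration of the technique only, not Udemy's actual filter.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Count words per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing over the shared vocabulary."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(label_counts.values())
    def score(label):
        s = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            s += math.log((word_counts[label][w] + 1) / denom)
        return s
    return max(label_counts, key=score)

# Invented training reviews, for illustration only
docs = [
    ("free money win prize now", "spam"),
    ("win free cash now", "spam"),
    ("great course loved the examples", "ham"),
    ("clear lectures and helpful examples", "ham"),
]
wc, lc = train(docs)
print(classify("free prize now", wc, lc))             # → spam
print(classify("loved the course lectures", wc, lc))  # → ham
```

Unlike a fixed rule list, a model like this re-weights itself whenever it is retrained on fresh labeled examples, which is what makes it harder to game over time.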

Kirill: That's very cool — an 8x improvement in spam filter accuracy is a significant improvement, and a model of student behavior under different pricing is also very interesting. As you say, I'm an instructor on Udemy, so I can see the backend of these things and how they run. I can definitely see the changes coming into play and how the platform is growing, so it's very exciting to know that you're behind all of those changes. That's really cool!

What's your favorite thing about being a data scientist? What's the one thing that excites you to get up and go to work in the morning, and what drives you to keep going?

Ruben: I think the most exciting thing is the intellectual challenge. It's the fact that you're always facing new and unsolved problems, and you're the person being asked to solve them. Sometimes it's just a data problem; sometimes it's more complicated than that, and you have to structure the problem, come up with the right data questions, solve them and deliver answers. That constant intellectual challenge is really what drives me.

Kirill: And for our listeners — from your perspective, from what you've seen and what you see currently in the field of data science, and how you've seen it evolve since you joined the ranks: where do you think this field is going? What should our listeners prepare for in the future? What should they focus on, what skills should they develop, what techniques should they think about — generally, where do you think this whole field is going?

Ruben: Yeah, that's a good question. It's hard to really predict where this is going, and I don't want to over-generalize, but I can mention a few trends, and also a few areas where I think people can really make a difference and add value. Right now there's a lot of talk around data platforms. It's not just that you have a data infrastructure, you're mining the data and you're building some models to extract insights — there's also this idea that in order to operate an efficient data team and an efficient analytics team, you need a data platform that enables data scientists to deploy and run experiments and to quickly build and validate models. So there's that aspect of building the pipelines and the workflows that let you scale analytics — that's one direction things are going in. And this also goes in the direction of a bit more specialization. It used to be that the same engineer who built the database would also extract the data, run some statistical analysis and show the results to the head of marketing. Now these roles are becoming more specialized: you have people who specialize in data warehousing and data infrastructure, people who specialize in data platforms, people who specialize in algorithms, and people who specialize in the analytics part of data science. So I think it's important to understand all of these roles — that's definitely one of the trends in the industry.

In parallel, I think it's very important for people to develop technological expertise, because that has a lot of value. If you're able to code in Python and R, and maybe even throw in a little bit of Java, and you understand all of these technologies, that's extremely powerful. But at the same time, I would warn people against becoming too attached to particular statistical or machine learning techniques. There are always going to be people who specialize in, say, deep learning and recurrent neural networks, and unless you're one of those people, you don't really need to go in that direction. You also don't necessarily need to learn every different algorithm. I think it's more important to understand what the different techniques are doing — to have a deep understanding of statistics and of how you use different things in different cases — and to have the ability to learn and then apply the right model or technique to the right problem. So it's more important to be able to map the right technique to the right problem than to know every possible technique and algorithm under the sun.

Kirill: Very, very powerful advice there. I'll just sum it up for our listeners, and for myself as well. You're observing a trend of data science maturing as an industry and a type of work, with some roles becoming more specialized. So I guess it's a good idea for analysts and aspiring data scientists to start looking out for what they're most interested in, and eventually end up doing something they're passionate about in this field. Also, developing deep technological expertise across a broad range of tools and techniques is very important, because you don't want to get stuck using just one technique — this field is constantly evolving, and you want to be able to adapt and learn new skills on the fly, as they say. I think that's some very powerful advice. Now, you've already recommended a great article, by the sounds of it, by the creator of random forests. Is there one book you could recommend to our listeners — if they had the time to pick something up to improve their data science careers and skills, what would be the one book you think they should read?

Ruben: That's a good question, because I never actually learned data science from a book. I learned it in school, by doing, and by looking at websites and forums. However, there is one book that influenced my thinking around data analysis and really cemented my ideas around causation and correlation — how you can torture data to see certain things, how those things might be wrong, and how different people can look at the same data sets and come up with different conclusions. That book is The Signal and the Noise by Nate Silver. It's a popular book, and it's not an overly technical one, but I think the ideas in it are extremely powerful. Again, it's this idea that data science is not just a set of techniques and tools — it's also a way of thinking about a problem, and if you don't think correctly about the problem, it doesn't matter what techniques you have: you're going to end up with wrong insights and conclusions. It's important to have a deep understanding of what you're trying to achieve, how you approach a problem and how you look at it correctly, and only then to use the right techniques and tools. So for that reason, I would recommend The Signal and the Noise by Nate Silver.

Kirill: Lovely. I haven't read the book myself, but it's definitely going onto my list of books to pick up in the near future — The Signal and the Noise by Nate Silver.

It has been a pleasure, Ruben. For our listeners: where can they find you? How can they contact you or follow you — any social media, any websites, where can they get in touch with you?

Ruben: I think the easiest is to get in touch with me on LinkedIn. I'm working on a blog and a website — it's not out there yet, but the link will definitely be on my LinkedIn profile. So that's the best way.

Kirill: Definitely, and I will also include the link in the show notes, along with any updates. Thank you very much, Ruben. I really appreciate you coming on the show and sharing your insights — this has been a wonderful and insightful conversation. Thank you so much for coming along.

Ruben: Yeah. It was my pleasure. Thanks for inviting me.

Kirill: So there you go, guys. That was Ruben Kogel. I hope you enjoyed the show. You can get the show notes at www.superdatascience.com/1, where you can also leave me or Ruben a comment in the comments section at the very bottom — ask questions or tell us what you thought. Also, if you enjoyed the show, make sure to share it with your friends and work colleagues to help us spread the word about the SuperDataScience podcast. I look forward to seeing you next time. Until then, happy analyzing!
