
SDS 215: Integrating Data Science as a Developer

Welcome to the SuperDataScience Podcast!

In this episode, I chat with Brian Dowe about how he structured his career and how he went about integrating data science as a developer. You'll also learn great tips about developing, deploying, and maintaining models, so make sure to tune in!

Subscribe on iTunes, Stitcher Radio, or TuneIn

About Brian Dowe

Brian Dowe is a Full Stack Developer for Education.com who also has a background in teaching Physics and Biology. He’s in charge of Education.com’s website, with both front-end and back-end responsibilities.

Overview

Education.com is a well-known platform where parents and teachers can get resources for teaching their kids or students. So, it is a must that the website offers a state-of-the-art experience for its users. This is where Brian’s responsibilities come in: maintaining the site’s A+ performance.

A Full Stack Developer’s role combines front-end and back-end technologies. Brian builds dynamic templates and user interfaces as part of his front-end responsibilities. On the back-end side, he’s in charge of APIs and other things that happen behind the scenes.

Aside from spending his time bringing value to his company, he is also studying machine learning on his own. His interest grew when he stumbled upon the Machine Learning A to Z course and discovered that there’s another way to look at data and gain insights from it. What’s more interesting is that, despite not having any prior experience, learning data science and machine learning is very doable as long as you have the motivation to start and finish. Don’t miss out on the opportunities to learn something new and add to your skill set.

So, during his company’s Hack Week, an event where everyone can build whatever project they want, he and a colleague developed a recommender system using machine learning. The model gives Education.com’s users an improved experience. For example, when a user clicks on a worksheet, five recommended worksheets show up on the page for them to choose from.

We also discussed developing and maintaining models. Brian emphasizes that to build the best model, you should start with the problem and find the solution that best fits it, not the other way around. And as a model ages, it may fail to keep up with its environment. It’s always a challenge to find a way to update the algorithm so it keeps meeting the needs of the users, which makes it vital to observe trends and track user behavior.

Acknowledgement

“I’d like to acknowledge Graham McNicoll, the CTO at Education.com. He brought me on when I was just a self-taught programmer and has empowered me every step of the way in my development as a software engineer and as an aspiring data scientist. Without Graham, none of the data science accomplishments we’ve made as a company would have been possible.” – Brian Dowe

In this episode you will learn:

  • DSGO 2018 let Brian push his knowledge a bit further. (03:36)
  • What are the responsibilities of a full stack developer? (08:53)
  • What attracted Brian to data science? (10:18)
  • Integrating data science as a developer. (14:05)
  • Developing the recommendation model for their platform. (20:10)
  • Operationalising the algorithm and making it work for the company. (33:58)
  • Model Maintenance. (39:30)
  • Kirill’s recommendation to improve Brian’s model. (43:35)
  • How’s the learning curve since he started exploring data science? (47:40)
  • Understanding the models deeper. (52:45)

Episode Transcript


Kirill Eremenko: This is episode number 215 with the full-stack web developer Brian Dowe.

Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name is Kirill Eremenko, Data Science Coach and lifestyle entrepreneur, and each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let's make the complex simple.

Kirill Eremenko: Welcome back to the SuperDataScience Podcast, ladies and gentlemen, very excited to have you on the show today. Today we've got an aspiring data scientist and full-stack web developer Brian Dowe joining us, and I literally just got off the call with Brian a few hours ago and what I can say about this episode is it's very inspiring, especially if you are a web developer yourself or a developer of any kind yourself. You will find a lot of useful tips and insights in this episode.

Kirill Eremenko: If you're not a developer, you will also find a lot of useful tips. But I personally found Brian's example of how he structured his career very insightful and something to admire and kind of like dissect, and that's exactly what we did in the podcast. So you'll find out ... like we touched on three main things. So first of all we talked about what it's like to go from a developer to data scientist, or in fact how to integrate data science into your career if you are a developer, if you work on web development or any kind of development. In fact, I think anybody looking to integrate data science will find a lot of these tips useful.

Kirill Eremenko: Then we talked about models. We talked about developing models, deploying models in business and maintaining models. In fact, we're going to dissect the whole case study of a recent model that Brian has worked on using that Apriori algorithm, and you will hear some back and forth between us about the development, the deployment, and the maintenance life cycle. We'll actually come up with some ideas on the podcast, which you might find quite interesting in terms of brainstorming and how to think about modeling. And in general you'll get a lot of takeaways about modeling.

Kirill Eremenko: And finally, we talked about getting into the space of data science, or what it's like to learn data science, what the challenges are in the learning curve that Brian has been facing. He's been learning data science since the start of this year, so for almost a year, plus or minus a couple of months. And you'll also hear about the tools, some of the tool recommendations that Brian has for you if you're just starting out in data science.

Kirill Eremenko: So there we go. This podcast is quite interesting in terms of the three pillars that we discussed. You'll get a lot of value from it, and without further ado, I bring to you Brian Dowe, full-stack developer and aspiring data scientist.

Kirill Eremenko: Welcome to the SuperDataScience Podcast, ladies and gentlemen, today we've got a very exciting guest on the show. Brian Dowe calling in from San Mateo, California. Brian, how are you going today?

Brian Dowe: I'm doing good Kirill, how are you?

Kirill Eremenko: I'm doing very well. Thank you very much. It was such a pleasure to catch up at DataScienceGo mate, it was exciting to hear your story, I can't wait to share it with our audience today. But first off, how did you feel at the event and how do you feel after it?

Brian Dowe: It was really an incredible experience. There were so many fantastic presenters. There were a lot of people that I sort of networked with and talked to outside of the sessions as well, and got a lot just all around out of everyone I interacted with. It was an amazing experience, and I learned so much and it's given me a lot of great jumping off points moving forward.

Kirill Eremenko: Yeah, thanks. That's really great to hear. Just before the podcast, you mentioned you were already making this tilt or shift in your trajectory, and you started doing stuff on Kaggle after DataScienceGo, you started the Andrew Ng machine learning course. Quite a few things have happened for you. What would you say has been the biggest shift after attending DataScienceGo 2018?

Brian Dowe: I think that before I attended I had had some experience with applying machine learning models in sort of an easier setting where the dataset is pre-prepared for you and it's pretty clean, and you just get to go right to the modeling. And so one thing I wanted to do following up from DataScienceGo is push my knowledge a little bit further. So I actually went through and did the derivations and the calculus for gradient descent, and that made a huge difference for me in just understanding what's going on under the hood.

Brian Dowe: And so even if ... it can be more convenient to use libraries like Scikit-learn or TensorFlow to do projects, it's still really helpful to understand what's going on behind the scenes so that you can work with those tools more efficiently. So that was really ... yeah, that was really huge for me.

Kirill Eremenko: Awesome.

Brian Dowe: So with Kaggle, I've been going through just some of the datasets that they have posted and trying to make some like predictions even with just like basic models, but that's given me some good practice in data pre-processing and organizing a sort of not clean dataset into something that can be fed into a model and that's been some really valuable experience for me.

Kirill Eremenko: Fantastic. Great to hear. Amazing. One more thing on DataScienceGo, I was curious about this, what was your favorite talk? Who was the speaker who gave you a top talk?

Brian Dowe: I think one of my favorite ones was that Gabriela de Queiroz from IBM. Her talk on a deploying or I think it was deploying machine learning models in five minutes or something like that-

Kirill Eremenko: Deep learning. I think it was deep learning models in five minutes.

Brian Dowe: Yeah, exactly. And just she showed us the model asset exchange that you could use to sort of find a bunch of prebuilt and pre-trained models and start deploying them very quickly. And I thought that was really interesting and walked away from that with some items on my to do list to go through a lot of resources. I really enjoyed her talk a great deal.

Kirill Eremenko: Fantastic. Well that's very exciting. Well Brian, very cool to have in you the show and one of the reasons is because I am excited, super excited about your career path. I think you have a very inspiring journey that you've created for yourself, and I would love to share it with our listeners. And what I mean here for our listeners is that what you need to know about Brian is that he is a full-stack developer, web developer, and we'll talk more about that in a second. And Brian sees the value of data science, sees the value of machine learning and deep learning and is actively applying that in his career.

Kirill Eremenko: And I know that across our audience, across the listeners of our podcast, a very large percentage of you guys, I don't know, maybe 30, 40 percent, that's my rough estimate, it might be even more, are people who are developers who are also looking to get into data science, or are already in data science, have transitioned into data science or have seen the value of data science. And I think this is going to be an inspiring story for you guys to model in your careers. But even if you're not a developer, the steps that Brian has taken to integrate data science in his career, without fully jumping straight into it and quitting his job and just going data science, data science, are quite inspiring. So I think that's going to be cool. So I'm very excited to dig into it. How are you feeling about this Brian?

Brian Dowe: I'm really excited. Yeah. I'm excited to dig into this as well.

Kirill Eremenko: Awesome. Some good self reflection opportunity for you, I guess.

Brian Dowe: Yeah.

Kirill Eremenko: Awesome. Okay. So tell us a bit about you Brian. You are a web developer, a full-stack web developer. If somebody off the street were to ask you, Brian, what is it that you do? What does a full-stack web developer do? How would you answer that question?

Brian Dowe: Sure. So generally when people ask me that, because it's come up a decent amount in conversation, I'll say that full-stack is a combination of front end and back end technologies. And in my specific case, I work for a company called Education.com, it's a web application. And so for web applications, front end usually has to do with building templates. So all the HTML and CSS, the parts of the site that users see visually, and also some JavaScript for interactive components. And the backend side is like the database and APIs that interact with your database and grab data to display to the end user. So the front end would be like what you see when you look at a webpage, and the back end would be like what happens behind the scenes when you click a button to submit a form, like where's that data going? What is it doing? Most of that is handled by back end technology.

Kirill Eremenko: Gotcha. How long have you been with education.com?

Brian Dowe: It's been about nine months now, so not very long. I interned for four months, and I've been full-time for about roughly the last like five and a half months.

Kirill Eremenko: Okay. Gotcha. Tell us what attracted you to data science. Like why are you on this podcast, how did you get into this, how did you hear about data science, and what next steps did you take from there?

Brian Dowe: I actually first learned about data science, Kirill, through your course on Machine Learning A to Z. And I started doing it maybe about two or three months after I had started working for Education.com. And prior to that I had built some applications just to sort of teach myself that had, like, very simple data components, like maybe just a user database and maybe like the ability to make blog posts or write reviews, etc.

Brian Dowe: And when I stumbled across Machine Learning A to Z, I had heard of the field, I didn't really know too much about it. And I remember watching your introductory video where it kinda goes over a lot of the applications and use cases of machine learning. And I thought, oh, this is really cool, this provides a really interesting way to look at data, gain insights from it and improve the actions that you can take on the basis of that. And so that was really interesting to me, and I think as I progressed through the course, I found it to be more accessible and something that I could jump into even though I didn't have too much experience. And yeah, it just kind of progressed from there. And my interest in it has only grown over time.

Kirill Eremenko: Interesting. Let's rewind a little bit. Tell me this, how did you stumble across the Machine Learning A to Z course, what were you searching online for? Obviously there was some kind of a need that you were trying to fulfill when you saw it. Like people don't normally stumble upon Machine Learning A to Z unless they're actually looking for something related to data science, like what was the initial trigger for that to happen?

Brian Dowe: I had been using Udemy for a long time before that for projects that were [inaudible 00:12:25] specifically to data science. So I mentioned before how I was trying to get started by building a simple application to develop my development skills basically. I had taken some courses at that time on building a clone of Yelp with Ruby on Rails, and then I found one on building a price alerts app with Python and Flask. And so throughout the course of my learning, my Udemy feed, or the courses that popped to the top, had to do a lot with app development and with programming in general, and with Python also, specifically over time. And then I think it was through that that your course just sort of popped up on my list.

Kirill Eremenko: Okay. Like you got into machine learning and data science because machine learning and data science got you in there in the first place. That's what it sounds like.

Brian Dowe: Yeah. Sort of, yeah.

Kirill Eremenko: The circle has closed, right? We've gone full circle, this is so cool, right? Like you're studying exactly the stuff that has influenced your career to study that stuff. This is like Inception level type of thinking, right? Have you ever thought of it that way?

Brian Dowe: I haven't, but that's really interesting now that you put it that way.

Kirill Eremenko: Oh wow. That's so cool. That's so cool. One of the craziest stories. Okay. So you got into machine learning and you started taking the course and at the same time, how were you able to apply this at work? Like you're a full-stack developer, and this is where the interesting stuff starts to come in because I know your story a little bit already. How were you able to take that into ... as a developer, if I'm a full-stack developer, I might be a bit like shy, or it might not even come to mind that I can bring this stuff, machine learning, to work. It doesn't apply. It's completely unrelated to my role. So how did you go about that?

Brian Dowe: That's a really interesting question and I definitely did feel that, like not knowing exactly where to start, where the right opportunity was, and it happened in a very roundabout and sort of like by chance way. Our company, we try to hold at least one hack week every year. And for those who might not be familiar with what that is, it's generally where everyone can sort of come up with their own projects that they want to build outside of the development pipeline, and then people can team up and just try to build whatever they want, and then we all present to each other at the end of the week. And by the time hack week came about, which was about like May of this year, I had been studying machine learning maybe for like two to three months.

Brian Dowe: I had talked ... like briefly mentioned it to some people at work in passing. And then when hack week came about, one of my colleagues approached me and he wanted to build a recommendation system using machine learning. And that was basically the extent of the background for it. And, yeah. So I remember sitting down with him sort of trying to brainstorm what to do ... before I get too much deeper into that, this hack week was really the catalyst for a lot of the professional application of machine learning in my workplace to happen, just because it provided an opportunity to just build a project that you're passionate about without any other restrictions. I think looking back, that really empowered me to start bringing up ideas and trying to make things work outside of that context and just do it as part of our normal pipeline of projects.

Brian Dowe: But if I could, looking back, make a recommendation to someone in a similar position, it would be to not hesitate to just dive in and start trying to find problems to solve. Because I think a lot of full-stack developers have access to a wide breadth of data because you have to work with it when you build your applications. And so I think just sort of diving into the data that your company has and then trying to figure out, okay, what's a way that I could use this? Like what's something that I could predict based on certain features or what value can I extract from this that could be presented to the end user in some way.

Brian Dowe: I think my recommendation would be to just dive in, start thinking about problems to solve, and when you have something, or an idea of how to go, odds are that the other developers you work with will be intrigued and curious if you have an idea that could potentially bring value to the company. I think trying to empower yourself and just dive in and start looking for problems to solve is one of the best ways to get started.

Kirill Eremenko: I totally agree with that. And I just wanted to mention two points here. First one is that you also obviously need to be careful as a developer. It depends from company to company, but often you do have potential access to data, but for instance, at Facebook, if you use it for the wrong reasons, you'll probably get fired. So kind of like, be sensible about that and maybe confirm with your boss if you can use that data. Or maybe, if that's a bit too early to do, create some dummy datasets that are similar, or that have similar columns in terms of structure to the company dataset, but like play around, or download some datasets in your free time to play around with, before you actually play around with real customer data, especially if it's sensitive data. That's one comment I just wanted to make.

Kirill Eremenko: But in general, indeed, even if your company doesn't have these hack weeks or hackathons, which I think are very useful, they usually take place in large organizations that want to inspire innovation. So if your company doesn't have that, maybe you can talk to the management to start introducing it. But even if that's not the case, you can still play around with data, or dummy data, or similar data and see how you could potentially bring value to the business. And you can still bring those results and those suggestions to your management or to the company leadership and present it to them. Every company in this day and age wants to be data driven or model driven. They will jump on top of it if you can add value to the bottom line of the company. It is a very rare instance that management turns down those ideas, turns down those innovations, especially when it's concerning technology and data science, where the investment isn't that high but the return on investment can be massive.

Kirill Eremenko: So as long as you're proactive, don't let the absence of a hack week or a hackathon get in your way. Or if, for instance, your hack week is 11 months away, it happens once a year. If it's 11 months away, don't wait, you can already do it now. Those are just some of my thoughts on that. Brian, you were talking about how you were preparing for this hack week and a colleague of yours came up to you and then you got into this project. What was this whole idea? You said you were going to dive deeper into that topic.

Brian Dowe: Sure. So the project was a recommendation system for our content. So just maybe to give a little bit of background information on what we do in education.com, we're a platform for parents and teachers to come and find resources and teaching tools to help their kids. And so when I say like a recommendation system for our content, I mean we have just a bunch of worksheets and games and lesson plans and all of this stuff that parents and teachers can come and use.

Brian Dowe: So what we wanted to do is create a way or improve the way that we recommend other resources that you can use based on what you're viewing. So if you like go to our website and click on a worksheet, you'll see at the bottom, here are five other worksheets that are related to this one that we would recommend for you. And so, yeah, so just to give a little bit of background on that.

Brian Dowe: But anyway, so a colleague approached me, his name is Yon Burke, a brilliant engineer. So he approached me wanting to build this recommender, and as we were sitting down and sort of talking about what some different angles are that we could go about this with, like what data would we use, how we'd rearrange it, what model would be needed, he brought up something. He said, "What if we're looking at this problem the wrong way? What if this problem is 'people who downloaded this also downloaded this'?" Like, what if that's the way that we're going to look at it.

Brian Dowe: And as soon as he said that, a light bulb in my brain went off and I thought, I know that I learned of an algorithm in Machine Learning A to Z that's perfect for solving this exact type of problem. So I went through and I looked through all my notes and it turned out that that algorithm was the Apriori algorithm. I know that I made that connection because the example given was using that algorithm for a grocery store to try and figure out where to place their products in relation to each other based on what products most customers would buy together. And so I remember it just ... like a light bulb moment went off and I made that connection, and then I knew that was the place where we had to start based on the way that we had scoped the problem now.

Brian Dowe: So what I did was I went back and I looked at the slides from Machine Learning A to Z from that section, and in those slides, Kirill, you went over the equations for support, confidence, and lift for that algorithm. And then using those equations, I scripted, or I just converted that into code basically, in Python, and then set up the structure to loop through all of our data and gather the information that we needed.
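To make that concrete, here is a minimal sketch of that kind of script, assuming the pre-processed input is one list of downloaded worksheet IDs per user; the sample data and variable names are hypothetical, not Education.com's actual pipeline:

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one list of downloaded worksheet IDs per user.
download_histories = [
    ["mult_2digit", "div_2digit", "mult_facts"],
    ["mult_2digit", "mult_facts"],
    ["div_2digit", "fractions_intro"],
]

n_users = len(download_histories)
item_counts = Counter(item for h in download_histories for item in set(h))
pair_counts = Counter(
    pair for h in download_histories for pair in combinations(sorted(set(h)), 2)
)

for (a, b), together in pair_counts.items():
    support = together / n_users                    # fraction of users with both A and B
    confidence = together / item_counts[a]          # P(B | A)
    lift = confidence / (item_counts[b] / n_users)  # how much A boosts B over chance
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```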

Brian Dowe: And Yon, the engineer I was working on this with, did the pre-processing portion and then got me the dataset. And then we started testing and trying to figure out if it was going to work, if it was gonna give back good predictions, because one of the attributes of this algorithm that made it a little challenging to figure out whether it was going to produce good results is that we didn't get immediate feedback on whether it's a correct or incorrect estimate, for example, like you would with supervised machine learning algorithms.

Brian Dowe: So we had to look at some of the results by hand and figure out, okay, is this producing the results that we wanted, and what we were specifically looking for was to have sort of more loosely defined recommendation than we currently had for a given worksheet because let's say you looked up like a two digit multiplication worksheet on our site, what we had before, it would have just given you like three or five [inaudible 00:24:28] multiplication worksheets.

Brian Dowe: And we were thinking, okay, like how useful is that really for someone who is coming to our site and looking for another jumping off point. Like if they find a worksheet, are they just going to want to find more worksheets that are exactly the same? Probably not. What might be more useful is like maybe they're looking for two digit division, something that's related to what they're on, but not exactly the same. And so we were hoping that by using user download histories, instead of just content tagging or what the exact subject was, we would get some better results that would actually be more beneficial to users.

Brian Dowe: And as we went through, that turned out to be the case and we saw that we were getting the recommendations that we were looking for. And then after we presented at the end of the week, it started picking up steam and enthusiasm and then we decided to run it on the site as an AB test. And after running it for a while, it turned out that it won against our former recommendation system, and then we pushed it to production and it became our first machine learning model that was deployed on the live site. So that was a huge win and a really exciting first step in implementing machine learning. But the first of many is what I'm very much hoping for, and the case I'm striving to make is that this is just the first step on a longer journey to integrate more machine learning into our operations. And so yeah, it's very exciting stuff.

Kirill Eremenko: Wow, that's awesome. Congratulations. That's a huge project and most importantly that it actually got implemented into the business and added value, that's just really cool to hear. And just to recap, so let me know if I'm getting this right. So before, the recommender system was recommending more content to your users based on the tagging. So for instance, you have these multiplication sheets and maybe your tutorials, maybe some videos, some other content that you have, all sorts of content, they're all tagged, like this is multiplication, this is for first grade, this is for fourth grade, this is math, this is this other topic and so on. And based on the tagging, the similarity of tags, it would recommend something.

Kirill Eremenko: Whereas your new recommender system, which is through the Apriori algorithm, it would look at not just the tagging, but it would look at what users are actually downloading. So if people on average download X, what did they on average download after that, or what did most of them download after that? So you're looking at the behavior of your users and based on that you're saying, okay, well even though it's not tagged identically, looks like that's what people want, that's what people are after, and we're gonna use that as a suggestion. And so that Apriori algorithm approach worked better than your previous recommender method. Is that about right?

Brian Dowe: Yeah, that's essentially the gist of it.

Kirill Eremenko: Gotcha. Sounds a lot like Amazon. Like you go on Amazon and you buy something and then out of the blue they recommend something else, it might not be related at all, but that's because most people, kind of like, after they bought that one thing, they usually go searching for that other thing; the people that are similar to you, I guess.

Brian Dowe: Yeah.

Kirill Eremenko: Gotcha. And then after that you did a qualitative assessment, not a quantitative assessment, as you mentioned, because unlike with supervised learning algorithms, you couldn't say yes, no, correct or incorrect. In this case, you did a qualitative assessment where you just went in and you tested out a few of those recommendations to see if the recommendation actually made sense to you. Like the example you gave, for a three by three multiplication table, instead of getting a four by four one, you get a division table, which kinda like makes sense. Is that what you did next?

Brian Dowe: Yeah, that's essentially what we did. We were looking for things that were like not exactly the same but still would likely to be within the same subject area that might be closely related, for example, yeah, like that example that I gave hoping that users would be maybe studying those two subjects at the same time and they could find things that would be more directly related to like where they would go next from a given subject or something that they might be also practicing at the same time.

Kirill Eremenko: And so how long has this system been in place now? This recommender algorithm?

Brian Dowe: It's been a couple of months now that it's been up. The hack week was in May and afterwards there were some additional steps that needed to be taken to set it up and get it running with our whole dataset, and that took some time. But yeah, for a couple of months now it's been up and running and live.

Kirill Eremenko: How are the results? Is management seeing some positive impacts? Are they happy with what this change has brought?

Brian Dowe: Yeah, it has brought a positive change, but we're definitely also looking to see how we can improve it because this was just sort of like a first run like it did win the AB test, but we were still looking for ways to improve and to provide more valuable recommendations to our users. And I think, yeah, that this was a good first step to at least identify resources that are being used and consumed most often. But I think there's so many places that this could go and there's so much that we could still do to improve our recommendation capabilities and that's something that we're diving into right now.

Kirill Eremenko: That's so cool. And I can just imagine the boost of morale that you got from that, like when your algorithm got implemented, how did that feel?

Brian Dowe: It was incredible. I think one big takeaway that I had from that is that when I was first studying machine learning, sort of leading up to that, and I think still a little bit following from the implementation of that algorithm, my focus was I wanted to learn all these cool, cutting edge tools like neural nets and GANs and stuff like that. Like all the more advanced modeling tools. But the process of actually sitting down, looking at what data we had available and trying to choose a model that fits best with that sort of showed me that you have to start from the problem and then try and find the model that best fits the solution, rather than just picking a cool model that you have in mind that you really want to work with, like a CNN, and then searching for a problem that fits that solution.

Brian Dowe: I think that can be a great way to learn about a new technology, like if you're focused on learning about CNNs and you look for problems that specifically fit that use case, that's great for learning. But in practice, you don't always get to choose the problems that come to you, sometimes there's just a problem at hand that needs to be solved and you need to explore your toolkit and just choose whatever is best fitted for that solution. Like sometimes I think I've seen cases, or at least vaguely read about cases, where a bunch of models were tried and a simple linear regression or a simple logistic regression was the best outcome. And sometimes that's the case.

Brian Dowe: Sometimes the model that you initially had in mind that you were focused on, isn't the best tool for the problem and something that maybe is a little bit simpler and scoped is. And so I think that's something that I thought about a lot is it doesn't have to be like the craziest algorithm out there to make a really big impact. And so that's something that I think about going forward is just starting from the problem first and then searching for the tool that solves that after the fact.

Kirill Eremenko: Very, very wise words. Totally agree with that. It's very easy to get carried away with machine learning and all of these new shiny cool things. And that's awesome to learn them and they're very good to inspire you to learn and to grow and to find these different nonstandard applications, but at the same time, sometimes less is more. You just go for what solves the problem and does that efficiently. And in your case it was the Apriori, which is awesome to hear.

Kirill Eremenko: Another thing I wanted to ask you on this topic was that the integration of the algorithm into your company services. I think this is a very cool area that a lot of the time is missed by data scientists that you not only have to develop an algorithm, but you have to also operationalize it. You have to make it work in the company so you have to somehow put it into production, it has to integrate with the website and all those things. So I think that'll be really cool to talk about that. Are you able to share some details with us on how you went about it and what challenges you faced along the way?

Brian Dowe: Yeah, I could share some details about that. A lot of the bulk of the work to actually set this up on production was handled by my colleague who has a lot more familiarity with the way that these systems work. A big challenge was just being able to run all of this data, because we have so much data in user download histories from all of our users and all of the things they downloaded. It gets pretty big pretty quickly, and so knowledge of cloud computing services became really helpful in this case. And that's what we were able to use to do this. And so one thing, a challenge that came along with that, is trying to decide when and how often the model would need to be run. Because I think that's a question that is really important, like do you need to make actual calculations with this model on the fly, or can you run it every so often and use the results, or store the results from that, to display to the end user?

Brian Dowe: And so we ended up going with the second approach of just running our algorithm every so often to gather this information, like for a given worksheet or game that a user downloaded or played, figuring out the top however many worksheets that are associated with it, which is mainly the task that the algorithm was accomplishing. We decided that it would be much more efficient for us to just run this every so often and store the results so that when a user visited the worksheet, we can just pull it directly from our database without having to actually run the calculation on the fly.

Brian Dowe: And so I think that's something that can vary from use case to use case and model to model depending on how you're trying to apply it, maybe you will need to actually run a prediction or a calculation on the fly when a user clicks a button for example. And that can be ... whether or not that's a feasible place to implement your model, I think depends on the weight of the calculation that it needs to make. And so if it needed to make a calculation that required a lot of processing time but the user needs the result immediately, then that might not be a feasible way to go versus if it's something that might not change drastically from day to day or week to week, then you could just run your algorithm every so often and then store the results within a structure that you already have set up to store data for example.

Brian Dowe: So I think it depends on the use case, it depends on what you're trying to do and how often the results of the model need to be updated. And I think yeah, there are definitely ways to think strategically to integrate systems like this without running massive programs like every time a user clicks a button for example.
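As a rough illustration of that second approach, a periodic batch job might precompute and store the top recommendations so that page loads only do a cheap lookup; the table layout, function names, and use of SQLite here are hypothetical stand-ins for whatever storage the site actually uses:

```python
import sqlite3  # stand-in for the production database

def refresh_recommendations(rules, top_n=5, db_path="recs.db"):
    """Batch job run every so often: store the top-N related items per item."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS recs (item TEXT, rank INTEGER, recommended TEXT)")
    conn.execute("DELETE FROM recs")
    for item, related in rules.items():          # rules: {item: [(other_item, lift), ...]}
        best = sorted(related, key=lambda r: r[1], reverse=True)[:top_n]
        for rank, (other, _score) in enumerate(best, start=1):
            conn.execute("INSERT INTO recs VALUES (?, ?, ?)", (item, rank, other))
    conn.commit()
    conn.close()

def get_recommendations(item, db_path="recs.db"):
    """Cheap lookup done at request time when a user views a worksheet."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT recommended FROM recs WHERE item = ? ORDER BY rank", (item,)
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]
```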

Kirill Eremenko: Okay, gotcha. Very, very cool advice as well, that there are different types of models, and I hope people are taking notice of this, that sometimes you might need to run them on the fly and get results all the time, and sometimes it's enough to run them occasionally. And especially big companies with lots of data, they have time slots for running things on the servers. So usually they run ... like I was at a company after leaving Deloitte where they would run things at night, and every single hour, sometimes in 15 minute chunks or maybe even shorter chunks, the whole night is split into these periods where you need to apply to get server time locally. Or ... now more and more things are going to the cloud, but still a lot of companies do things on premise, do calculations on premise.

Kirill Eremenko: And so, in order to run all these models and do all the calculations, update all the data and stuff like that, especially in, for example, financial services, you gotta turn over all the things that happened during the day, make sure everything is accurate, reconcile a lot of stuff. That all happens at night, and you want to be very mindful of how you're using server time in your company. But even if it's in the cloud, every time you rerun the model, it's still going to be a cost, it's still going to take some kind of budget or something. So being conscious about whether you need it right away or not is quite an important thing. Now I'm going to ask you a question in a bit of a different space, have you taken the Data Science A to Z course?

Brian Dowe: No, I don't think I have actually.

Kirill Eremenko: Okay. So the reason I ask is because in the data science A to Z course we talk about model maintenance. And I just wanted to see like ... obviously this has been a big success and breakthrough for both you and the company in terms of developing and deploying this Apriori model. What are your thoughts on how you're going to maintain it? And the reason I ask is because I've seen situations where models deteriorate. In fact, I've seen a situation where a model used to be very effective, like it would bring like 80 percent accuracy, but then over time, over a period of 18 months it deteriorated to a level where the accuracy was less than 50 percent, was like 42 percent or something. Meaning that it would have been more efficient for the companies to just flip a coin and do the recommendations based on that rather than using a model. And so I'm just curious, what are your thoughts on how to maintain this model?

Brian Dowe: Yeah, that's a really interesting question. I think to a certain extent, this can depend on the domain that you're in. And what I mean by that is I think like different domains have a different pace to how their data shifts structurally over time. And so for us, who are in the education space, at education.com and that kind of means that we follow the cycle of a school year. So for example, given parent or teacher, based on the time of year and based on when they signed up, like maybe they sign up in like August and then from August through June, they're using a bunch of second grade resources. And maybe for a parent like they then move onto the next grade and then they're going to be consuming a bunch of third grade resources.

Brian Dowe: So, if we were to just store and continually update the model based on new data coming in and not deal with eliminating old data, it would undoubtedly begin to grow stale over time, in the sense that you'd now be making predictions based on users who had been with the platform for several years. And so you might not get the best recommendation for a second grade worksheet because you'd be pooling from download histories from a bunch of different grades as well. And so, I think for us, something that's already implemented as part of the algorithm is to keep it seasonal and to try and look at data that's relevant to the time period that we're in relative to the school year.

Brian Dowe: And so I think thinking about things like that, like how often ... like for us right now, grade is the main way that we look at this, but I think it could be interesting to dive in and see how subjects covered might change over time. And, yeah. But for now, grade and school year is the main data that we have to go on, because there are a lot of different curriculums out there and our site is used very widely, and so it's a challenge to find a way to update your algorithm in a way that meets the needs of all the users on your platform. School year and grade just happen to be one way that's pretty standardized across our users, but I think trying to push that further and make it even more relevant to what a given user might be looking for is definitely a challenge that we're going to have to rise up to and meet over time. A really challenging problem.
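A minimal sketch of that kind of seasonal filtering, assuming each download record carries a date; the August-to-June window and the field names are assumptions for illustration:

```python
from datetime import date

def school_year_window(today=None):
    """Rough August-to-June school year containing `today` (assumed calendar)."""
    today = today or date.today()
    start_year = today.year if today.month >= 8 else today.year - 1
    return date(start_year, 8, 1), date(start_year + 1, 6, 30)

def seasonal_subset(downloads, today=None):
    """Keep only download records from the current school year, so the model
    isn't trained on activity from grades users have already moved past."""
    start, end = school_year_window(today)
    return [d for d in downloads if start <= d["downloaded_at"] <= end]
```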

Kirill Eremenko: Okay. Gotcha. That's a very cool consideration about the seasonality. If I may, I'd like to give you another suggestion, is that okay?

Brian Dowe: Sure. Yeah.

Kirill Eremenko: So one thing I was thinking about is, with your model, you could measure how well it makes those recommendations. So you could say, all right, we made all these recommendations; how many of those recommendations were actually used, and in what percentage of cases was the recommended content actually consumed by the user to whom it was being recommended, or consumed within a relevant timeframe? It could be that maybe they didn't consume it right away, maybe they had to think about it, but within a week or a month they did indeed consume the content. Because if you have the right data points in place, you can actually collect that information, like we recommended content Z and they didn't use content Z in the week, or they did use content Z. And kind of like just have a yes, no type of approach and see that, currently at this stage, your model has, I don't know, a 35 percent accuracy, or, that's a bit low, maybe a 75 percent accuracy rate, that 75 percent of the content you recommend is actually indeed consumed by the users.

Kirill Eremenko: And then I would set up a system that would track that, that even autonomously it could track that and say, all right, so how is our model going month to month? Maybe you'll see some seasonality in that, which you could possibly be able to explain, but maybe with time you will see that, oh that's interesting, before it was 75 percent, now it's dropped to 74, now it's 69, now it's down to 65. And if you see a consistent trend going downwards, that means something's going on with your model. And possibly that could be a shift, not just a seasonal shift, but that could be a shift in the demographics of your population. Maybe a change in some kind of legislation around what students have to learn, what they shouldn't learn. Maybe a change in the available content on your platform or available content outside or some influence from competitors or advertising agencies that people are looking to other stuff that your model couldn't have possibly taken into account at the time you created it because that event or that environment was not there at the time.
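A minimal sketch of that kind of tracking, assuming each shown recommendation is logged with when it was shown and when (if ever) the recommended content was consumed; the field names are hypothetical:

```python
from collections import defaultdict
from datetime import timedelta

def monthly_hit_rate(recommendation_log, window_days=7):
    """recommendation_log: dicts with 'shown_at' (date) and 'consumed_at' (date or None).
    A hit means the recommended content was consumed within the window."""
    shown = defaultdict(int)
    hits = defaultdict(int)
    for rec in recommendation_log:
        month = rec["shown_at"].strftime("%Y-%m")
        shown[month] += 1
        consumed = rec["consumed_at"]
        if consumed is not None and consumed - rec["shown_at"] <= timedelta(days=window_days):
            hits[month] += 1
    return {m: hits[m] / shown[m] for m in sorted(shown)}
```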

Kirill Eremenko: And so I feel that like tracking, that dynamic tracking or constant tracking through time, is important because it allows you to fish out these changes that you cannot otherwise see, right? So you need a flag, you need kind of an indicator that something is going on in the industry, or in our platform, in the business, in our audience, that we need to address, and we need to either retrain the model, rebuild it or so on. Because if that's not tracked, then it actually can be a bit too late. What do you think on that?

Brian Dowe: Thank you so much for all those insights. I think those are some really good ideas. I know for me, this was a big first step for me into this space. And yeah, it's really helpful to hear about other ways to push this further from someone such as yourself who has more experience in this field. And so, yeah, I think I'm going to be walking away from this with a lot of things to think about and a lot of interesting ideas to bring up to my colleagues tomorrow.

Kirill Eremenko: That's awesome. That's awesome. Well thanks for sharing. I think this has been a good case study for our listeners in terms of the whole process, not just creating an algorithm or a model that solves a problem, which is awesome, but actually then deploying it and then considering maintaining it. All three are extremely important in the whole life cycle of a data scientist or a data science project.

Kirill Eremenko: All right. Well, now let's shift gears a little bit. Let's talk about ... you mentioned learning data science, you mentioned how you got interested in the field and how actually the Udemy recommender system brought machine learning to you and then you got into it. Tell us a bit about the learning curve, how difficult is it to learn data science for somebody who's coming from a developer background?

Brian Dowe: I think it depends a little bit on what you want to do with it and specifically like how deep you want to dive into, like for example, the underlying mathematics and stuff that drives a lot of these concepts. And to that end, I will say I think it's easy or easier to get to a point where you can apply these models quickly as opposed to if you were to like say try and build a certain model yourself from scratch, like that would take a little bit more time.

Brian Dowe: But I will say coming back a little bit to machine learning A to Z, I think that course did a really good job of empowering you to get started with applying algorithms quickly. Like after the data pre-processing section, it basically gets right into here's how you can apply a linear regression model, here's how you can apply a support vector machine. And you see from doing a bunch of models like that, that the actual code to apply the model if you're using a library like Scikit-learn, is not that extensive.

Brian Dowe: And so I think getting to the point where you can at least start applying models and start working with them, you can do that very quickly. I think once you enter that phase, transitioning from that into applying these models in the real world can be a bit steeper of a learning curve because of the data pre-processing component. I think in a lot of courses, you'll see data arranged in a fairly neat format relative to how you would actually find it in the real world. And I would even go so far as to categorize data pre-processing as a separate skill from modeling that you need to bring together with modeling in order to apply data science.

Brian Dowe: And you could spend a ton of time sort of as like I've been trying to do more recently with Kaggle data sets. Just looking at a random data set that you find and thinking, okay, how can I arrange this and get it ready to be fed into a model just where I have all the features that I need, they are scaled or converted appropriately into what they need to be like for categorical data like one hot encoding it to make sure it can be fed into a model. You can spend a lot of time in just diving into the data pre-processing aspect. And I think there can be a learning curve there, but I think if you just push yourself to start tackling these problems and just dive in, just look at a random data set and just start playing around with it, like sort of look up what you need to as you go for like how you manipulate a dataset with a given tool like Pandas for example.

Brian Dowe: And if you do that and just practice with different datasets in different contexts, you can gain a lot of skills fairly quickly, to at least not be paralyzed by the idea of seeing a dataset that, when you feed it into a model right away, gives you a bunch of errors back. That can be very discouraging to someone who is just starting out and doesn't really know what to do, when previously they've just run a command that worked perfectly because they had a dataset that was pre-prepared.
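As a minimal sketch of that pre-processing-plus-modeling workflow with Pandas and scikit-learn (the file name and column choices are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical Kaggle-style dataset with mixed numeric and categorical columns.
df = pd.read_csv("some_dataset.csv")
X = df.drop(columns=["target"])
y = df["target"]

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Scale numeric features, one-hot encode categorical ones, then fit a simple model.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```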

Brian Dowe: So to kind of synthesize a little bit, I think that it can be easier to start learning about how the models work and apply them on clean datasets, and a little bit harder to get to the point where you can really use these effectively in practice. But it's still very much attainable, it's very much feasible if you just push yourself, apply yourself and dive in, start tackling problems. Even though it can be a bit nerve wracking at first, it's still a really great learning opportunity, especially when you're struggling with all these challenges and pushing yourself to figure it out. I think that's when some of the deepest learning happens.

Kirill Eremenko: Totally agree. Through those challenges. And what would you say is ... like you mentioned some of the difficult things, especially about data preparation. Totally agree on that. What are some of the things that actually keep you going? Naturally, your successful implementation of the Apriori and how it was used in the business, that was something really cool that definitely gave you a massive push. Was there anything else along your journey that you noticed, like once you get to these milestones, that gives you the inspiration to keep going? Because that could be helpful for some of our listeners who are considering taking this approach.

Brian Dowe: Yeah. So yeah, getting the Apriori live was definitely a big one, to actually see that it was possible to take something through to production. I think for me I sort of felt frustrated by the fact that I knew how to apply a lot of models without diving into the deeper mathematics. And I really wanted to understand, okay, what does gradient descent actually mean? Like I've seen it covered, I understand the general idea of what it's doing, but how does it actually work?

Brian Dowe: And something that I did actually very recently is I went through and actually wrote out by hand all of the calculus to derive gradient descent for a linear regression problem. And then I did the same thing to derive the back-propagation algorithm for neural networks. And I think when I did both of those things and realized the connection between the two, and sort of saw that the back-propagation algorithm is an application of gradient descent on neural nets in the same way that gradient descent works on a linear regression, or at least with the same conceptual grounding, that was really a big breakthrough in understanding for me where I actually ... it's like, what does it actually mean to train a model? It's like, okay, you feed it data and it learns how to predict based on that data, but what does that learning actually mean?

Brian Dowe: And then getting on top of gradient descent, which I think is at the core of how many algorithms learn, for me, that was huge, and just ... I think so many things sort of clicked and came together and made more sense after I did that. I would definitely recommend that, maybe not as a first starter activity to do, but after you've played around with some models, have a general idea of what a model is trying to accomplish and what training is trying to do, diving in and trying to understand what's going on under the hood can make it a lot ... I guess it boosts your confidence when looking at ... it can be sort of disorienting to look at one line of code and know that that line of code is the entire training for a machine learning algorithm. It can be hard to look at that and understand what's really going on if you haven't, for example, done some of these derivations or deep dives into the underlying mechanics of how everything works.
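A minimal NumPy sketch of what that one-line training call is doing under the hood for linear regression, using the gradient of the mean squared error that the derivation produces:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    """Batch gradient descent for linear regression with mean squared error.
    The gradient of (1/(2m)) * ||Xb @ w - y||^2 w.r.t. w is (1/m) * Xb.T @ (Xb @ w - y)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a bias column
    w = np.zeros(n + 1)
    for _ in range(epochs):
        error = Xb @ w - y                 # predictions minus targets
        w -= lr * (Xb.T @ error) / m       # step downhill along the gradient
    return w
```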

Brian Dowe: And so, I would definitely recommend doing that at least like once or a couple times to help yourself understand those things and it gives you a lot more understanding when applying the models in practice, even if you end up using the single line of code training, it's like you'll understand in a more deep way what's going on there.

Kirill Eremenko: Gotcha. Very, very interesting advice. Thank you for sharing that. And yeah, interesting. I think everybody has ways to get inspiration for pushing themselves forward and for learning more. That's definitely a valid suggestion for our listeners as well. One more thing I wanted to ask you is what kind of tools would you recommend starting with for somebody who's just getting into this space or considering getting into the space?

Brian Dowe: For me, Python was definitely my language of choice. I know that R is also very, very big in the space. But for me, I had already done some work in Python so I was a little bit familiar with it. And I do think it's a good place to start. Scikit-learn, which I've mentioned a couple of times, is, I think, a really good library to start with as well, because there are a lot of different machine learning models that it accounts for. So it can be really easy to swap out different models, like try different things and sort of see what predicts better.

Brian Dowe: Pandas is really important for Python, for working with your dataset. And I think a lot of times I've seen it used in tutorials, it's usually just used for importing the dataset, but there is a lot more that you can do with it. Like you can do joins in a similar way that you would do in SQL. I think SQL is also really important to know on a practical level, that's something that I'm working to build my skills in, because a lot of the time your data can be stored in different databases, maybe something in a SQL database, maybe something in a NoSQL database like MongoDB, and knowing how to run queries in any given situation where your data might be.

Brian Dowe: So being able to pull data from different sources and sort of arrange it in a way that's comprehensible is definitely valuable. And I think SQL is definitely something that you would see a lot in practice, and Mongo or NoSQL databases are picking up a lot of steam as well. So understanding basic queries to pull your data. Once you get your data into, for example in my case, a Python script, Pandas can be really useful to manipulate the data in any way that you couldn't or didn't with the previous methods before it gets into your script.
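For example, a Pandas merge behaves much like a SQL join once the data is in your script (the tables here are made up for illustration):

```python
import pandas as pd

# Hypothetical tables pulled from two different sources.
users = pd.DataFrame({"user_id": [1, 2, 3], "grade": ["2nd", "3rd", "2nd"]})
downloads = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "worksheet": ["mult_2digit", "div_2digit", "fractions_intro", "mult_facts"],
})

# Equivalent of a SQL inner join: one row per download with the user's grade attached.
merged = downloads.merge(users, on="user_id", how="inner")
print(merged)
```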

Brian Dowe: Scikit-learn is great for modeling. NumPy is sort of just a basic building block in Python that ends up working really well with a lot of the other tools that I've mentioned, so that is very valuable. If you are looking to get beyond, or get into the spectrum of neural nets, Keras [00:59:28] is really useful. That's something that I've done a decent amount of work with in my spare time. And you can create a neural net in just a few lines of code and swap out different techniques or different loss functions or activation functions to change the structure that you're working with really easily. So, it's a great tool for people who are starting out with neural networks to sort of try out different things and see how different structures of your network can achieve different results. So yeah, those are probably some of the tools that I've worked with the most while trying to teach myself, and I'm still very much a novice with a lot of these things. But it's been a very useful way for me to get started and dive in.
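And here is roughly what "a neural net in just a few lines" looks like in Keras; the layer sizes, input dimension, and loss choice are placeholder assumptions, and each is a one-line change to swap out:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny fully connected network for a binary classification problem with 20 input features.
model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(20,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```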

Kirill Eremenko: Gotcha. Well, thanks so much. That's some great advice and a great list of tools. And Brian, we've come to the end of the podcast in terms of time. Thank you so much for coming on the show and sharing all these insights. I think it was really cool, we discussed your journey from software development to machine learning, the different stages in building, deploying and maintaining a model, and what we just talked about, getting into the space of data science, how easy it is or what challenges people face. Before we wrap up, I did want to ask you, what is the best way for our listeners to get in touch with you, to follow your career and maybe learn more things from you?

Brian Dowe: The best way to get in touch with me would be LinkedIn, just linkedin.com/in/Brian-Dowe. Yeah, that's the best way to get in touch with me. If you message me on there, I see and respond to all my messages on LinkedIn.

Kirill Eremenko: Awesome. All right, cool. Well that's where you guys can find him, Brian. And before we finish up today, one last question, what is your favorite book that you can recommend to our listeners to empower their careers?

Brian Dowe: One thing I've already talked about a lot is Machine learning A to Z and I just want to ... Well, first off, thank you so much Kirill and also to Hadelin for putting that out there. It was the course that got me started on this journey and I can't thank you enough for that. But specifically why I'd recommend this, it gives you a lot of tools to work from. You learn about not only how to code a lot of algorithms, but in Kirill's intuition tutorials, he talks a lot about good use cases for the algorithm, he gives an overview of how the algorithm works and what it's doing under the hood.

Brian Dowe: And in my case, being able to identify the Apriori algorithm as the best tool to solve the problem that I was working on. It was directly connected to Kirill's explanation on what that algorithm is used for. And so I think even if you are just starting out, what this gives you is it gives you a roadmap of different tools that you can use, different algorithms that you can work with and the use cases that they apply to. So it empowers you to be able to at least look at a problem and think, oh, I can think of some algorithms that would fit this. Even if it's as simple as, oh, this is a regression problem versus a classification problem. It's like, okay, well now I've narrowed down the list of things I need to look into that could be a potential solution. And I was very pleased with how quickly I was able to do that after taking that course. And so, especially if you're just starting out in the field, I would highly recommend it.

Kirill Eremenko: Thank you so much. That's so nice to hear. And you actually gave me an idea, because usually for this question I expect a book and you recommended a course, and I actually thought we should release this course as a book. Like it should be a book, Machine Learning A to Z. I think that would be pretty cool for people to get their hands on as well, like a supplementary thing. Thanks for the idea anyway, I'll talk to Hadelin about it.

Brian Dowe: Sounds good.

Kirill Eremenko: Yeah. Well, thanks so much for coming on the show. It's been a huge pleasure and I'm sure a lot of listeners got so much value out of it. Once again, thank you so much.

Brian Dowe: Yeah. thank you so much for having me.

Kirill Eremenko: So there you have it, ladies and gentlemen, I hope you enjoyed this podcast. Quite a lot of different topics that we discussed, and a very inspiring career path, I'm sure you'll agree, that Brian is creating for himself: how he's leveraging the skills he has to be better at data science, how he's leveraging data science to help with the work that he's doing, how he's bringing data science and machine learning into the company where he's working and helping them create better products and services, serve their customer base even better, and derive value out of their data even more efficiently.

Kirill Eremenko: My personal favorite part of this podcast was the discussion we had about modeling, the development, deployment and maintenance stages of any model's lifecycle. And personally, I really enjoyed that whole brainstorming part we had about maintenance, how you need to think about maintenance and how you can actually go about maintaining a model. I think we came up with some interesting ideas. And just in general, the whole process, you can see how discussing things with your colleagues and peers can be very helpful in terms of sharing ideas with others, but also coming up with ideas in the process. And I think that's what we saw in this podcast.

Kirill Eremenko: On that note, you can find all of the links to this podcast at www.superdatascience.com/215, and that's where you can also find the URL for Brian's Linkedin and connect with him there and make sure to follow his career and see what he gets up to and maybe share a message if you have any questions about modeling, if you have any questions about integrating data science and machine learning into your careers in a similar way that he did.

Kirill Eremenko: And there we go. Hope you enjoyed this podcast. If you did, make sure to leave us a review on iTunes. That would be very helpful for us to spread the word and get more people involved so that they know that there are valuable insights that they can pick up from this podcast and from our guests. Apart from that, thank you so much for being here today and sharing this hour with us. I can't wait to see you back here next time. And until then, happy analyzing.

Kirill Eremenko

I’m a Data Scientist and Entrepreneur. I also teach Data Science Online and host the SDS podcast where I interview some of the most inspiring Data Scientists from all around the world. I am passionate about bringing Data Science and Analytics to the world!
