Welcome to episode #109 of the Super Data Science Podcast. Here we go!
Today's guest is CEO and Founder of Business Science LLC, Matt Dancho.
How does a mechanical engineer with a management job in sales become the author of multiple R packages that are used by large companies to solve management problems?
Find out by tuning in to hear how Matt Dancho once used Excel as a data science tool before he discovered Python and R, and eventually left mechanical engineering to start a consultancy.
You will hear us talk about tidyquant, timetk, sweep, and tibbletime among others, and why Matt prefers R to Python. He explains how he developed a solution in HR analytics to land a job with a Fortune 500 company, and what other strategies you can use to create awareness for the work that you are doing.
Let's get started!
In this episode you will learn:
- The demand for consultancies in data science (04:25)
- The beginning, and how frustration with Excel led to Python and R (09:03)
- Why Matt prefers R over Python (15:26)
- R is evolving to embrace advanced machine learning (19:11)
- A case study of HR analytics: the most popular article on Matt’s blog (26:03)
- H2O World (37:22)
- Use diagrams and infographics to make blogs accessible and interesting (41:04)
- Matt’s advice on how to start a consultancy in data science (45:19)
- Biggest challenge is being able to communicate insights to clients (50:01)
- The tools are going to get easier to use, but what's really going to be very powerful is being able to communicate data in an interactive way. (55:46)
Kirill: This is episode number 109 with founder of Business Science LLC, Matt Dancho.
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur, and each week we bring inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let's make the complex simple.
Welcome everybody to the SuperDataScience podcast. Today we’ve got a very inspiring guest on the show. Matt Dancho is the founder of Business Science, a consulting firm in the space of data science, which works with companies ranging from start-ups to Fortune 500 companies. Not only is Matt the founder of Business Science, but he's also the author of multiple R packages such as tidyquant, timetk, sweep, tibbletime, and others. So if you're an R fan, this is definitely an episode for you. Even if you're not, you will learn a lot about the journey of a data scientist through his career: how he started out in a completely unrelated field, mechanical engineering, how he progressed into data science, and how he even started his own consulting firm.
We talked about a lot of different things on this podcast, including his background and the packages, and of course we even went through some case studies. Matt has some very popular blog posts on data science techniques and analytics that he's done and proposed, and he walked us through a couple of those. All in all, a very exciting podcast, can't wait for you to hear it. Without further ado, I bring to you Matt Dancho, the founder of Business Science.
[Background music plays]
Welcome everybody to the SuperDataScience podcast. Today I've got a very exciting and inspiring guest, Matt Dancho who's the founder of Business Science. Matt, welcome to the show, how are you going today?
Matt: I'm doing great, thanks for having me Kirill.
Kirill: The pleasure’s all mine, and you are calling in from Pennsylvania. What an interesting location. I have never been to it, I don’t think we've had guests from Pennsylvania. Are you originally from there?
Matt: Yeah. I grew up in Williamsport Pennsylvania. I'm actually located in State College PA, right next to Penn State University.
Kirill: I'm just looking at your LinkedIn and you’ve got a proud photo of the PSU football stadium, is that right?
Matt: Yes. Penn State undergrad, graduated in mechanical engineering from Penn State back in 2006.
Kirill: Nice. And now you've got your own, is it a consulting company, Business Science?
Matt: Yes. That's the primary focus of us. What we do at Business Science is really applying data science to business and finance, so we work with organizations anywhere from start-up all the way to Fortune 500, just helping them apply machine learning and data science within their organization.
Kirill: Okay, that's really cool. How long have you guys been around for?
Matt: We actually just LLC’d back in February. It seems like a long time ago just because we've been very, very, busy, and very active, but it's really a pretty young company at this point.
Kirill: Yeah. I know exactly the feeling. It feels, on paper it's recent, but you've probably done so many different projects. That’s so cool.
Matt: Yeah, a lot of projects, a lot of speaking events, and then also we're pretty heavy on the open source software side so a lot of work that's been going into that.
Kirill: Awesome. How would you say … Is it hard to start a consulting company in data science these days? Did you right away experience a lot of demand for your services or did you have to look for clients for some time?
Matt: We are kind of unique in that, I basically LLC’d the company because of demand after I put out my first R package, it's called tidyquant. We started to see demand from people who wanted the finance services and really creating like Shiny web applications, which just allowed people to have interactive data science on their websites. So, in order to basically be able to do that work and not have liability on my end, that's kind of how it started. Then it's just kind of grown from that.
Kirill: Awesome. So you developed a couple of packages for R.
Matt: Yeah. The main one is tidyquant but we have four core packages. Tidyquant is like financial analysis, it’s starting to be adopted more in industry right now. They all kind of centre around time series analysis which is something that I'm particularly pretty passionate about. We have timetk for time series machine learning and working with time series in R. We have sweep which is for tidying the forecast workflow from Rob Hyndman’s forecast package. Then we have tibbletime which is something that my software manager, Davis Vaughn, he’s the author of that, or the creator of that and co-author of it. Really that's a super cool package, I’m very excited about it. It's for analysing time series actually in the tidyverse and it's really what we wanted to do from day one with tidyquant. We're now able to do it with this new package called tibbletime.
Kirill: Tibbletime. Okay that’s really cool. These are packages that anybody can go to the R repositories and just download?
Matt: Yes. If they want to go on CRAN which is the R repository, they can download tibbletime or tidyquant, it’s all free and open source. We’re really excited about them so we definitely encourage people to download it.
Kirill: That’s awesome. So guys listening to this show, make sure to go download tidyquant, tibbletime, sweep, timetk. Even just to see what kind of innovation goes into somebody becoming so popular or so in demand, their resources becoming so in demand, that they have to go and start a consulting company. Matt, how many people, if that's not sensitive information, how many work in your company right now?
Matt: It's actually just three of us right now. It's me, Davis our software manager, and my wife. We all kind of have our own roles. Me, I’m the founder and CEO, so the face of the company in the data science circuit.
Kirill: And the podcasts.
Matt: And the podcast circuit, yes. This is actually my first podcast, super cool.
Kirill: Congrats. I’m super excited, that’s awesome.
Matt: Thank you very much. Yeah, doing that, also working on software with Davis, although that's kind of his area of responsibility. Then what my wife does is kind of the stuff that I honestly can't stand. It's like the contracts, making sure that we're buttoned up from a legal end and all that sort of thing.
Kirill: Oh wow. You’re very lucky to have her on your team if she's taking care of all that.
Matt: She's awesome so I'm super happy.
Kirill: That's fantastic. All right, so that's your business right now. There's so many things that I wanted to ask you, you have such a diverse background and I'm sure we won't even have time to go into all of them. But probably, like the most interesting thing, or the couple of things that I would love to talk about is a bit about your background, what inspired you to get into data science, how this journey came to be, and what inspires you to keep going. The other thing that our listeners are always very interested in is case studies. You run a consulting company, you have some very interesting blog posts which we'll talk about in a bit. And maybe you can also run them past us and tell us how they came to be or how those case studies unravelled. Do those two sound fine, and where would you like to start?
Matt: Yeah. If you want to do background, that's a good place so people can kind of understand where I'm coming from. I’m good with both of those. I think we've got a lot to talk about.
Kirill: Sounds good. Let's get started on the background. Let's start all the way back. We know you were born in Pennsylvania and probably you’ll skip the high school part and then you went to university. You studied Bachelor of Mechanical Engineering, is that correct?
Matt: Yeah, that's correct.
Kirill: You also even did Master of Industrial Engineering. Sorry, I am just opening your LinkedIn right now. How did that lead you to doing data science?
Matt: Okay. You have to understand, data science being a relatively new field, it wasn't around back then. They had computer science which really wasn't my area of expertise coming out of school so going from high school into university, I really was strong with the maths and engineering just seemed like a common natural fit for what my skills were. I kind of just fell into Mechanical Engineering, didn't even really do a whole lot with data, to be honest, at least on a professional level, up until my first management role. I graduated in ’06, I immediately went to work for a company out in Pittsburgh, Pennsylvania, that was like a Navy subcontractor and my particular responsibility was valves.
Matt: Yeah, valves. Like actual, the mechanical, they open, they close, and fluids go through them. There wasn’t a whole lot of data science being done there except for maybe on the engineering side. Being math-driven, heavy on statistics but it was primarily in Excel, and that's where I first got my exposure to using my first, we'll call it a data science tool, but it was Excel.
Kirill: I’m sure a lot of our listeners will disagree with you on that one.
Matt: Yeah, I know.
Kirill: But I totally agree. I totally respect Excel, I've used it myself, I agree with that. Okay.
Matt: On the professional side, what I did was I eventually got a job with another company, and I was thrust into management. At that point in time, I knew nothing about management and eventually I actually started to oversee part of the sales department, like technical sales, because of my engineering and technical background. And at that point in time, I was somebody who was totally unfamiliar, couldn't be further outside of my element, trying to manage a sales group, to help drive the organization where, obviously, the hope is to make a profit. The one thing that you kind of get when you put somebody with a technical mindset into these roles is that they want to understand it, and for me that was really data. So, I started collecting data on anything I could get my hands on. I was measuring everything. Whether it was how many quotes we had, what customers we were quoting, whether or not they converted into orders, even stuff that probably had no relationship to anything. I was trying to collect as much as I could. So my relationship with data came more out of necessity, and then really I just started to see some value. Back in 2011, I think, is when I got my first management role. I quickly found out that Excel was not a good tool for the amount of data that I was collecting. I was getting it from what's called an ERP system, which stands for Enterprise Resource Planning, but basically it’s a database where you can download sales histories from.
Kirill: Something like Salesforce, right?
Matt: Yeah. Sort of like Salesforce except very crude, it’s just orders data, it's nothing with quotations or anything like that. But basically, you dump that into Excel and after a certain limit it just totally shuts down, you can't even open the files. That's when I recognized there's got to be a better way, right. That led me down the path of Python and R. Getting involved with those two, initially I actually started out with Python. The issue for me was more, I'm coming from a statistics perspective, being in engineering, so Python was a little bit too foreign for me because the code is a computer programming language, which is great, but for a statistics person, R had some benefits there that I kind of gravitated to. Eventually got started in R and really haven't looked back. I mean, you get that wow factor. I think you do like one single project, and for me it was actually analysing our customers and being able to group them into customer segments, just based on their ordering patterns. That's just something simple nowadays with K-Means clustering, but for me that was the wow factor. Then from there I was hooked. I was like, there's so much that we can do in business with these tools.
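For readers who want to see what grouping customers by ordering patterns can look like, here is a minimal K-Means sketch in plain Python. It is not Matt's code (he worked in R), and the customer figures, the two features, and the choice of two segments are all made up for illustration. In practice the features would usually be scaled first so that no single one dominates the distance.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each customer to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical customers: (orders per year, average order value in $1000s)
customers = [
    (2, 1.0), (40, 0.9), (3, 1.2), (38, 1.1), (1, 0.8), (42, 1.0),
]

labels, centroids = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> segment", label)
```

With this toy data the occasional buyers and the frequent buyers end up in separate segments, which is the kind of result Matt describes as his "wow factor" project.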
Kirill: Do you use RStudio for your R programming?
Matt: Yeah. Absolutely. RStudio, in fact we're an RStudio partner, Business Science is. And really the RStudio IDE which I think is what everyone is most familiar with, it's probably the best IDE out there. I know there's Visual Studio and some other really good IDEs. On the Python end, I use Jupyter Notebooks a little bit but I'm really 99.9% what I do is typically in R and RStudio.
Kirill: Okay. Very interesting. Before we continue with your background, could you tell us a bit of why, except for back then, I understand the reasoning why R back then. But now let's say somebody’s starting out and they have the choice between R and Python, why would you give preference to R over Python?
Matt: Okay, super question. I know there is probably going to be a lot of debate on this, but my specific needs are not only being able to have a super accurate algorithm to be able to do prediction and typically with machine learning, with programs like H2O, XGBoost, some of those high-end algorithms that are known for their accuracy. But in business, you also have to be able to explain the data. What that means is you really want to be able to understand what the drivers are in your models. Just telling an executive that, hey you know I've got this shiny model out there, it's really strong, it's got great AUC, area under the curve, it's high performance …
Kirill: Sorry, by Shiny do you mean the R Shiny or just ...
Matt: No. You got this really cool, I just mean cool algorithm that gives you high performance, right. When you're in business and also in finance and if you're working in economics, those types of areas, you really need to be able to explain that data and what's driving your models. What's nice about R is it's got a very robust set of tools. The one that I use the most for data visualization is ggplot. It also has the interactive libraries such as plotly, leaflet, all of the really great visualisation libraries. I think that’s something that R is pretty strong in. Also, we just talked about Shiny, the actual application, so powerful in business. If you go to these conferences that talk about R in the enterprise, most of the conversations are going to involve Shiny. It's a web application framework, but what it's really good for is rapid prototyping of web apps, building them very quickly with very few lines of code, and making them interactive so the people that you're interfacing with can not only see the results, but actually interact with the data science.
Kirill: Yeah, the interactivity of Shiny is one of the most powerful elements and it helps it become a rival to Tableau, Power BI and so on, and I think it's very important for people to have that in their toolkit if they're following the R path.
I’m going to play the devil’s advocate here and say, that’s very good reasoning, and I really appreciate you going into the detail here. However, there will be people listening to this podcast who will give you one counterargument, and I'm just interested to see how you will react to that. There's one aspect of data science which is evolving rapidly, which is deep learning and artificial intelligence, which is much easier in Python. Which is possible and is actually happening in Python. You can do some deep learning in R, some basic deep learning, but I haven't heard of more advanced artificial intelligence or computer vision happening in R. What are your views on that? Is that a concern for you or is that something that you don't see your business evolving into in the future?
Matt: Actually, I see our business evolving into that right now. I've actually been using some of the new packages. There's the Keras package in R, it's built by RStudio, the folks there. They have, I think it's called Keras, they have tfestimators and they have TensorFlow that are now ported into R. They're actually pretty easy to use. It kind of follows the same type of approach that Keras in Python uses, except you can do it with the pipes in R, so if you're familiar with the tidyverse and the magrittr package, you can do piping with it. It's pretty cool. For your broader point, Python is very well established for deep learning and I agree with you 100% that right now probably, I would say, 95% of people who are doing deep learning, they’re doing that in Python. That’s where it is right now. These packages that I just spoke about are ones that RStudio has been developing, and they have a whole website on it; if you just Google Keras RStudio it should pop right up. It looks to be pretty robust. In fact, I think one of my next blog posts is going to be using Keras in business applications because of the deep learning and some of the benefits you get with being able to really easily get interactions, when you have feature interactions from the deep learning. That's something that I'm super interested in. I think Python probably is going to be a leader for a while in that. But certainly, if you're interested in doing, like, business analysis, I wouldn't let that turn you away from R, because as we see these two languages evolve, they're kind of going down very parallel paths. There's not going to be a whole lot of these libraries that are only available in Python or only available in R. There’s going to be a lot of mixing and matching.
Kirill: Gotcha. All right. Well thank you. There is some great encouragement for those who are passionate about R but afraid a bit about getting left out in terms of deep learning and AI. You’re hearing it straight from firsthand experience, Matt.
Matt: Yes, stay tuned with Business Science because we are going to be putting a blog post out. I don't know how soon that's going to be, maybe not this month but next month I expect to have a post on Keras and R.
Kirill: Awesome. What's the website where they can read about these blog posts?
Matt: It's just our Business Science website. You can either Google business science and it should be the first hit, or you can just type in www.business-science.io.
Kirill: Gotcha. Is there anything else we need to know about your background?
Matt: One other area and it's kind of interesting, a lot of people don't understand or don't know about this. The tidyquant package. I do business analysis, that was always my professional use for these data science tools, but personally I also have been an investor, just with my own and now my wife's since we've been married, managing our own investments. I think that's important to understand. There's these tools out there, again it's with the R programming language. If people are interested in investing and learning how to create portfolios and optimize portfolios, there's a lot of tools out there, and R is a really good language. That's what I ended up using for my personal investments after I got my butt whooped in the financial crisis.
2008 came and I lost probably about 60 or 70% of my net worth of investments, just the investments portion at that point in time. So I started using Excel at first but quickly switched to R once I figured out that there were some good tools out there for that.
Kirill: Interesting so with tidyquant you're literally putting your money where your code is.
Matt: Yeah. And really it was just to make my life easier because I was doing these same types of processes over and over and over, but I really liked the tidyverse and the tidyverse just didn't have the tools for financial analysis. So this is literally the package that I use when I do my own personal portfolio analysis and it's being used in industry right now. I don't want to mention names but we're talking with very prominent financial institutions, actually working with one right now to get the Bloomberg package in, which is R-B-L-P-A-P-I, Rblpapi. It's for the Bloomberg terminal, which every financial institution uses. We’re getting that integrated into tidyquant right now, we're going through the beta testing on it.
Kirill: That’s awesome. Congrats, it's a big step and it will open up lots of opportunities for you guys. If you don't mind me asking, what kind of markets do you trade for your personal investing?
Matt: Mainly domestic so S&P 500, Russell for the small caps. I don't really get into too much overseas just because of the forex, it’s additional risk. But my whole basis is I try to maximize return, minimize risk, and minimize correlation between the investments that I select.
Kirill: Yep, get that portfolio effect going.
Matt: Yeah, so you got tidyquant which helps you do all of that.
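To make the "minimize correlation" idea concrete, here is a small sketch in plain Python. It does not use tidyquant (an R package) or reproduce Matt's workflow; the two return series are invented, and the function names are only illustrative. It shows why combining two assets with low or negative correlation can lower the volatility of the blended portfolio.

```python
import math
import statistics


def correlation(xs, ys):
    """Pearson correlation between two equally long return series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs)
    spread_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)


def portfolio_volatility(weight_a, returns_a, returns_b):
    """Standard deviation of a two-asset portfolio with weight_a in asset A."""
    blended = [weight_a * a + (1 - weight_a) * b
               for a, b in zip(returns_a, returns_b)]
    return statistics.pstdev(blended)


# Made-up monthly returns for two hypothetical holdings.
large_cap = [0.02, -0.01, 0.03, 0.01, -0.02, 0.02]
small_cap = [-0.01, 0.03, -0.02, 0.02, 0.03, -0.01]

print("correlation:", round(correlation(large_cap, small_cap), 3))
print("large cap alone:", round(statistics.pstdev(large_cap), 4))
print("50/50 blend:", round(portfolio_volatility(0.5, large_cap, small_cap), 4))
```

Because the two invented series tend to move in opposite directions, the 50/50 blend is less volatile than either holding on its own, which is the portfolio effect Kirill mentions.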
Kirill: Awesome. Well, we definitely have lots of listeners among our audience who are interested in both data science and trading, so it's actually really cool that you bring this up because a lot of them ask about time series analysis and so on, and it seems like these are exactly the right tools to look into, so guys, make sure to check those out. Okay, cool. How about we move on to the case studies now.
You have a couple of very interesting blog posts, the ones I've had a look at are HR analytics, using machine learning to predict employee turnover, and sales analytics, how to use machine learning to predict and optimize product back orders. Just before the podcast you mentioned that your HR analytics blog post has gained some traction, becoming quite popular. Let's maybe start with that one.
Matt: Sure. Just so everyone understands, I’ve always been a blogger. I’ve been doing blogs for probably two or three years now. Really, the process that I go through with having to do the research just helps kind of push me as a learning process. These articles, Kirill, that you mention are really no different. For me, the HR analytics article, that's the one that's the most popular, and really the cool thing about that is we're taking a problem like employee turnover, which every organization out there has to varying degrees. And what we're really trying to do is show that you can actually use these tools. Machine learning tools are always evolving, but the two tools that we use in that particular article are H2O and LIME. H2O is really neat because it's automated machine learning, and this was something that I think for a lot of users was just their first exposure to it. I think that's one reason that the article did very well. And then there is this other package called lime, which is actually used for explaining the deep learning models and the neural networks, and the stacked ensembles, and these types of models that traditionally are very … You just don't have mechanisms to show feature importance, which you need for business. Anyways, that employee turnover problem, we wanted to showcase how you could use these tools to solve that. What we did was we worked … One of the problems is there's just not a whole lot of companies that are willing to give up their HR data, right. They're not just going to show you if they're having an employee attrition problem.
But the folks at IBM, IBM Watson, they actually put a data set out and so as you can imagine, they do a lot of consulting and while they came up with a dataset that was not necessarily real-life data, it was actually artificially generated, but it's based on, or at least I believe this to be the case, based on their own consulting experiences and their data scientists’ experiences, actually working with that type of HR data.
It had all sorts of features to it, I think it had 35 or so features. The target was attrition, whether or not … Each row, or observation, was an employee over a certain time period, and one of the columns, which was the target, was attrition: whether or not that employee left the company during that particular time frame.
Kirill: You had a bunch of independent variables?
Matt: Yeah. Then you have a bunch of features, independent variables that hopefully you can use to describe that data, whether or not they left the company.
Kirill: Give us some examples of the independent variables just so we can build a mental model.
Matt: Yeah. It had all sorts of things in it. Like their wage, what their job role was, whether or not they lived close to the company or how close they lived to the particular plant or location that they commuted to. Just like all sorts of crazy HR data that you can actually collect in companies, a lot of companies do collect, just to really characterize their employee. Some of it could be performance measures, things like their most recent performance appraisal, how did they do on a scale of 1-5. It’s really a lot of different data that you can hopefully use, not all of it’s going to be important for this problem, but some of it, it turned out, was.
That article kind of blew up, we've got the H2O piece and you can actually, if you Google “predict employee turnover,” that will be the first article that pops up. In that article, you can see all of the code, we actually walk through that whole data science workflow from importing the data to if we have to clean it and then pre-processing to actually utilizing the H2O package in automated machine learning. I think for a lot of people it had a lot of the points that they were really looking for in an article because it's a pertinent problem, it's got some advanced tools, and it shows people how to use them. Literally, this is the type of stuff that I use when I'm consulting with my own firms. That's kind of a strategic choice that I took or tactical choice to let people see some of the code, but I think it's a good decision in the long run just because we're helping progress the community, and give people insight that they might not normally have.
Kirill: That’s very admirable. Since that article, have you had a chance to do any actual HR consulting with real clients, or are you still looking to get into that space?
Matt: No, we actually did that to land a Fortune 500 client that we were really trying to impress.
Kirill: Wow, what a way to impress a client, that's awesome.
Matt: Yeah. We got the job, we work with them on another problem that they have, not that they have but they're trying to solve. They were actually trying to target people for, I think it was executive promotion or executive succession planning. We utilized some more advanced tools to really showcase what insights we could gain from their particular data, but we were able to use a similar process and actually identify, I think, 16 or so people that they did not currently have pegged as executive potential.
Kirill: Okay, that’s really cool. That’s a very cool project then. All right, so what about the other article that you have that's quite popular as well.
Matt: Sales analytics, it's a very similar approach. What we're trying to do there is predict product back orders. For those that might be listening that don't necessarily know what product back orders are, a back order happens when, say you’re Apple and you have just released your new iPhone X. Immediately that product pretty much goes into a back-order situation because Apple just doesn't have the supply to meet all of the demand. That's one kind of a case, that's a special case, though, where it's a new product that you're rolling out and you kind of know going into it that you're not going to have enough product, so you can elect whether or not to take a back order. Typically, back orders aren’t good because it just means that your customers have to wait until you are able to get more product into inventory to be able to supply that demand. What a lot of companies are really concerned with, though, is peak seasons. For example, Walmart has so many different SKUs. A SKU, again for people that are maybe unfamiliar, is a product number, a unique identifier for a particular product. Say, like, cleaning supplies: a cleaning product might have a certain SKU that identifies it and separates it from another cleaning product. What a company like Walmart or Home Depot is typically interested in is seasonality, and what we can do within the data to predict whether or not a product's going to be back ordered.
Another example, and I actually write about this in the article, is snow blowers at Home Depot. It's something that you can actually utilize external data like weather data. If we're projected to have a snow storm, they better be stocked up on snow blowers and other supplies. You can actually predict those back orders with that data quite easily. Basically, what we looked at was, and I’m trying to think where the dataset … I think that one actually came from Kaggle. It's to identify this for a particular manufacturer, whether or not products were projected or could be predicted to go on back order. It's just a true false, it's a binary classification problem, and what we're trying to do is we ended up using the same type of pattern with H2O except we kind of took it to the next level. It’s actually a little bit more of an advanced article because we then tie it to the business case and try and optimize profitability using precision and recall. I think it's a really good case study for those that want to be able to tie these machine learning models to actual business and financial success, in this case profitability.
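As a rough illustration of tying precision and recall to profitability, the sketch below picks the classification threshold that maximizes a simple profit estimate. This is not the code from Matt's article (which uses H2O in R); the predicted probabilities, the outcomes, and the dollar figures for a correctly anticipated back order and for unnecessarily stocked inventory are all hypothetical.

```python
def confusion_counts(probs, actual, threshold):
    """Count true positives, false positives and false negatives at a threshold."""
    tp = sum(1 for p, y in zip(probs, actual) if p >= threshold and y)
    fp = sum(1 for p, y in zip(probs, actual) if p >= threshold and not y)
    fn = sum(1 for p, y in zip(probs, actual) if p < threshold and y)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def expected_profit(tp, fp, benefit_tp, cost_fp):
    """Profit from back orders caught minus cost of stock held needlessly."""
    return tp * benefit_tp - fp * cost_fp


def best_threshold(probs, actual, thresholds, benefit_tp, cost_fp):
    def profit_at(threshold):
        tp, fp, _ = confusion_counts(probs, actual, threshold)
        return expected_profit(tp, fp, benefit_tp, cost_fp)
    return max(thresholds, key=profit_at)


# Hypothetical model output: predicted back-order probability per product,
# and whether the product actually went on back order (1) or not (0).
probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
actual = [1, 1, 0, 1, 0, 0]
thresholds = [0.2, 0.5, 0.7, 0.9]

for t in thresholds:
    tp, fp, fn = confusion_counts(probs, actual, t)
    precision, recall = precision_recall(tp, fp, fn)
    profit = expected_profit(tp, fp, benefit_tp=400, cost_fp=100)
    print(t, round(precision, 2), round(recall, 2), profit)
```

When a caught back order is worth much more than an unneeded unit costs, a low threshold (high recall) wins; if the costs are reversed, a stricter threshold (high precision) becomes the better choice. That trade-off is the link between model metrics and the business case.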
Kirill: Okay. Well that's a very cool way to approach it and that’s kind of like what a lot of data scientists miss. They forget to analyse how whatever they're doing, their results, their insights, affect the bottom line of the business. Not just five snow blowers or five hundred snow blowers, but actually in numbers in forecasts, which is that element that executives and managers are after at the end of the day.
Matt: Yeah. Because we're in consulting, that's really what we need to be able to show our clients. At the end of the day, they need to see that ROI, what the effect is on the bottom line or the top line, and really see how it impacts the business. And if you can do that, you're going to be successful.
Kirill: Yeah, exactly. Being in consulting you definitely have to take that into account. You mention H2O twice now, and also before the podcast you told me about H2O World. You just came back from H2O World, is that right?
Matt: No. H2O World is in December, so I'll be doing a presentation.
Kirill: That's a whole event dedicated to this one package?
Matt: Yeah. H2O, I don't even know all they do, it's a very interesting company. You have H2O.ai which is their website, and if you go on there, they offer all sorts of tools in Python and R, and then they actually offer online tools so that way people that maybe don't necessarily work in Python or R but still want to be able to do machine learning on their data, they offer tools for that as well. I've actually never used that, I've only used the R package, H2O. You’re able to run H2O locally on your computer and it's pretty slick.
Kirill: That’s simple. Has this event been on before, H2O World?
Matt: H2O World, I’m not exactly sure how long. I'm pretty sure they did it last year, I'm not sure about the year before, how long …
Kirill: I was just wondering if you had attended before.
Matt: No. This is my first opportunity. I've been doing a lot of blogging primarily with H2O so I've actually gotten to know some of those individuals just through social media. Yeah it ended up that we could work it out that I'd be able to get out there this year and do a presentation for them.
Kirill: Nice, congrats. I hope you might meet some of our listeners there. Guys, if you're listening to this and you are going to H2O World, make sure to check out Matt Dancho’s session and come up to him and say hi.
Matt: Yeah. We’ll definitely be talking a little bit about Business Science but primarily about that HR analytics article, and really kind of walking people through what we learned through that process and how we decided to attack the problem. You’ll get a few insights that you won't be able to get through the blog post.
Kirill: Yeah. You've got to leave something special for those who put in the extra effort, right. Gotcha.
You definitely have more articles there such as the customer segmentation of K-Means and more. But what I wanted to talk about now, unless there's another case that you want to share with us. Is there anything interesting that's happened in the past couple of months that you'd like to share?
Matt: The other stuff that we do. We're really focused on the open source community and developing our own in-house packages primarily related to time series analysis, so if you go onto our website, there's categories. Go to the blog section and then you can get to categories and they have ones on time series analysis, we also do code tools, which is primarily for updating people on the new, as our packages evolve and we come out with new versions. But really, I think there's these two cohorts of people that would be interested. People that are interested in solving business problems, there's a bunch of case studies on there, related to business, and then there's also the financial investment side and time series side. If you go on there and you're interested in time series, definitely check it out, there's a ton of articles on there, I don't even know how many we're up to now but it's a lot.
Kirill: Gotcha. Okay, that’s really cool. Thanks so much, I'm sure these will be useful for our listeners. What I like about your articles is that they've got a lot of pictures, you actually have these charts and code snippets, so not just text that some people can get bored, but actually extracts, insights along the way of the analysis. I think that's pretty valuable as well in itself.
Matt: It’s a good tool for education. We walk people through, they're all walk-throughs. They’re not just like, hey, hypothetically you could do this. No, we're showing you how you can do it and we do a lot of that with the visuals and you see the charts that are created and how we go through our analytical process.
Kirill: How long does it take you to write one of these articles?
Matt: It varies. Some of them we can just get done in probably a day or two especially if it's just recapping a software update, like hey, this is the new functionality in tidyquant or tibbletime. But if it's like the sales analytics article which is probably our longest article, that took probably about three weeks to put together. You’ve got to think through, there's a lot that goes into it in terms of research. And also, we implemented different frameworks, for example like the one framework that’s the expected value framework, we had to research that and make sure that we were implementing it properly. The source for that is how to use data science for business, I think. There is a book that's out there on that. Yeah it was a pretty heavy process, at least it can be.
Kirill: Gotcha. I checked the code updates and you have quite a few there. From a person who's developed a package, and multiple actually, what goes into maintaining an R package and making sure it's not out of date, and adding more functionality over time?
Matt: GitHub has been a critical tool for us. We have our own GitHub account and our users, the users of our packages, our R libraries, they can go on there and they can actually submit an issue, and they can actually even work with us on particular requests, so they can make a pull request. What is really good about the open source community is that you put this product out there and it's just your vision of what you think it should be, but literally in three months or six months, as it becomes used more in the open source, that community aspect of saying, hey, when people have issues, they will submit them and then if you're lucky, some of them even want to work with you on them. For example, Davis, my software manager, he was one of the initial super users. I call them super users because they adopt them really quickly, he was one of the first super users of the tidyquant package. Really got where we were going with it and had a couple of ideas of his own so I ended up actually bringing him on the team.
Kirill: Awesome. Do you have any stats around how many people use your packages?
Matt: Yeah. Actually, there's a package in R, I think it's called cranlogs, you can download it off of CRAN and it has the logs of how many downloads you get. We do track that, I mean it's not a statistic that … it doesn't mean a whole lot to Business Science, but personally, I like to see. Obviously, you want to create software that's being used. I think we're up to like 23 or 24 thousand downloads. It was just released the last day of the year last year, so it's been out less than a year. I think we are around 23 or 24 thousand downloads. It's actually growing too, that's what's nice to see. You don't want to see it just kind of like flat line.
Kirill: Yeah, that’s very cool. Okay, I have an interesting question for you. Now jumping from the technical to the business side of things. I love these podcasts when we can talk about some business, some technical, these are the best. What would you say to a person who's listening to this podcast and who wants to start a consulting company in the space of data science but is not really interested or equipped to create a package and become popular or in demand that way? What's another way that you could recommend somebody to go about this?
Matt: The biggest thing is honestly, people. You have to be able to market yourself. One of the biggest misconceptions with a consulting company is that you know how to do something very well, so you've got supply, right. Well there's also another aspect of it, and that's the marketing aspect or in general terms, the awareness. You have to have some way to make people aware that you exist and you want to work with companies or whoever your target audience might be. Obviously, one way that you could go about is to make an R package and then hopefully it becomes popular. I would say that's kind of a unique circumstance because really, I wasn't going after that, it just kind of happened that way. I think the big thing though for me is GitHub having a solid, whatever you want to call it, collection of code samples where you can show people what you've been doing. Then also blogging. Blogging is huge and I think David Robinson, he just had an article on why new data scientists should all create a blog.
Kirill: That’s really interesting. We should link that in the podcast.
Matt: Yeah. That’s something that's really interesting to me, personally, because I found that beneficial also to gain awareness. You really just need to get yourself out there. I wouldn’t spend a whole lot of money like trying to use Facebook or whatever. Personally, I actually even spent money before and it hasn’t panned out for me specifically. What I've found really beneficial though is just getting your name out there either through blogs or through creating packages, being involved in the community.
Kirill: Okay, that's some great advice and I totally agree on that blogging one. Any kind of exposure you can get, we've had speakers on the show who were just blogging for fun and then all of a sudden, a company approaches them and asks them to perform some sort of an analytics or visualization that they've showcased on their blog before, and that's how their career started. Definitely just following your passion, but also telling people about it, getting out there in the world otherwise they will never find you. Okay, let's move on to some rapid-fire questions. We are coming to the end of the podcast, are you ready for some rapid-fire questions?
Matt: Okay, yeah. Let's fire away.
Kirill: All right. What tools do you use on a daily basis? I think we've covered that pretty much, but in summary.
Matt: In R, I use primarily the tidyverse. For those that aren't aware of that, it's a collection of packages, most of them developed by RStudio, and it just makes the data science work flow like everything from importing data, connecting to databases to modelling, to iterating, visualizing, makes it really easy. Some of the packages within the tidyverse, the most popular I think is dplyr and ggplot2. Those are the two most popular packages that I use all the time. Then there's obviously for time series, I've been developing these packages and so has Davis, tidyquant, timetk, tibbletime, those are really good tools. In machine learning, H2O.
Kirill: What's your favourite machine learning technique?
Matt: My favourite technique was XGBoost but now that I'm starting to get into deep learning, that's going to be, I think, the big area. Because of being able to … with deep learning being good for identifying interactions between variables, which happens quite frequently in business or even financial data, that's kind of where I see myself going. H2O is great though they had that automated machine learning, which I think is probably going to scare some people because they're going to be fearful for their jobs, some data scientists, because it’s automated now. But it's just so good for being able to get a higher accuracy algorithm very quickly and efficiently, which I view as a huge benefit. You don't have to spend a whole lot of time trying to tune or do grid search or hyperparameter tuning, all that fun stuff.
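For readers who have not met the term, the grid search that automated machine learning saves you from writing by hand looks roughly like the sketch below: try every combination of hyperparameter values and keep the best-scoring one. This is a generic illustration, not H2O's API. The `toy_validation_score` function and the parameter names `learning_rate` and `max_depth` are hypothetical stand-ins for training a model and scoring it on validation data.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid; return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for 'train a model, return validation accuracy'."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(max_depth - 4)


best_params, best_score = grid_search(
    {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6]},
    toy_validation_score,
)
print(best_params, best_score)
```

The number of combinations multiplies with every hyperparameter added to the grid, which is one reason an automated search is attractive when each evaluation means training a full model.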
Kirill: Awesome. Okay, next question. What's the biggest challenge you’ve ever had as a data scientist?
Matt: The biggest challenge honestly, it's not really on the data science end, it's actually more on the communication end. One thing that can happen, and it's very difficult when you're talking, when you as the data scientist are interfacing with either a client or might even be somebody, it could be within your own organization, just being able to communicate the results and being open minded with them and being able to kind of educate them in the process, without being like, hey this is the result, this is what you know. Because oftentimes they have a belief of the way things work and that may not necessarily be the reality, but you have to be open minded and see from their point of view too. Then the last thing too, that we get into trouble sometimes, and actually I’ve iterated my business model because of this, is really understanding that business side of it, the business problem. We spend a lot of time up front really analysing and asking a lot of questions, just listening, because we want to understand totally 100% how to dissect their business and be able to implement predictive analytics and get the desired result.
Kirill: Gotcha. Okay, that's communication. At the start, at the end, I totally agree. A lot of people miss the first part at the start completely, and then they can’t even do the last part I think.
Matt: Yeah, I think one thing too, I just want to bring up. You've got these competitions like Kaggle out there where, almost to me, it trains the data scientist to really focus only on creating the best model. But that's really such a small part of what being a data scientist, in my mind, really is. Understanding the business or the problem at hand, whatever it is, just really framing that problem and developing interesting questions to be able to understand and answer that and come up with a solution.
Kirill: Gotcha. That's definitely some really good words of wisdom in terms of, yeah data science is not just about modelling. It's a whole end-to-end process. People have got to master all of it. All right, next question. What is a recent win you can share with us that you've had in your work?
Matt: Recent win.
Kirill: You mentioned you landed a Fortune 500 company, that sounds pretty big.
Matt: Yeah that was huge, that was actually very huge. I think one of the big things too, one of the metrics that I follow … The Fortune 500, that to me is super huge. But the thing that I'm most focused on really right now is like my KPIs for the website and actually being able to grow our network. It's not necessarily attacking business straight on but actually seeing the network grow. One of the biggest things, a key win for me was the CEO of a very prominent financial institution actually came up to me at the EARL Conference that I just attended in Boston a couple weeks ago, and he had been using our R package, tidyquant, and he introduced himself actually like a fan of the package and really liked what we were doing, and he wanted to work with us on growing that package.
Kirill: That's really cool.
Matt: Yeah, that was awesome. To me that was just like I was kind of awestruck of him because he's just you know the CEO of this very prominent financial institution, that was just very humbling.
Kirill: Awesome, okay. Congrats, you got some very big wins, looks like you're doing the right thing. What is your one most favourite thing about being a data scientist?
Matt: Being a data scientist is really cool. I appreciate the learning aspect of it. I have always been somebody that never takes conventional wisdom, I always have to kind of figure it out for myself and sometimes to a fault. But for me, what cooler thing can you do? I'm a data guy so analytics, I kind of grew up and you know my story now with that, but really having the combination of code, the combination of statistics and machine learning now out, and now these machine learning algorithms that are evolving. Being able to tinker with those and just try and figure things out and help answer the questions that I have as just a curious person, I now have this super robust set of tools to help me answer just about any question.
Kirill: Fantastic. You make data science sound like Superman, or like Batman more like, you have a robust set of tools. An interesting philosophical question. From what you've seen in data science and from the way it's evolved, where do you think this field is going, and what should our listeners look into to prepare for the future?
Matt: In terms of the field of data science, again, I think we’re still just in the beginning with it. We've got a lot of tools that are evolving and I think we're going to be going more towards automated tools. I love what H2O is doing right now with actually making it easier to get these high accuracy models now. Where I really see the benefit is in a different part of data science which is actually communication.
We talked a little bit about Shiny web applications, I really see that as being a huge benefit for businesses, that's something that at Business Science we're actively investing a lot of time into. In some of our first projects, I think our first five projects with the client were Shiny web apps. It's being able to distribute analytics interactively. So you've got a machine learning algorithm, you're trying to forecast something or trying to predict whether you know an employee is going to churn, you have to have a way to distribute that information to non-technical people in a wide audience and I think Shiny is the way to do that, at least one of the ways. You've got Power BI and Tableau as others that, Kirill, you had mentioned. I really see that as kind of the future of data science. As being both good at the tools, you need to know the tools inside and out and understand them, I think the tools are going to get easier to use, but I think down the road what's really going to become very powerful is being able to communicate that data in an interactive way.
Kirill: Gotcha. Awesome. Thank you, that's a very apt answer. Sums it up very nicely and I totally agree with what people should be looking into to become super powerful data scientists and change the world.
All right, so thank you so much for coming on the show. Can you share with us some of the places, of the many places where people can contact you or get in touch or even just follow you and your company to see what you guys get up to in the coming future?
Matt: Yeah, absolutely. Like I said, just Google Business Science and the first hit should be our website. Again, it's www.business-science.io. We are also on Twitter and LinkedIn, those are our two primary social media platforms. Facebook, I'm just kind of dipping the toes in the water, but we are on Facebook as well. Twitter and LinkedIn definitely. If you want to contact me specifically, my e-mail is [email protected]. Or you can also contact me through the website, just let me know what your thoughts are on stuff. I love feedback, I love hearing from the community, positive or negative, if there’s things that I should be doing differently or Business Science could be doing, definitely get in touch.
Kirill: Thanks. And is it okay for listeners to connect with you on LinkedIn directly?
Matt: Yeah, absolutely. Matt Dancho on LinkedIn, just hit me up.
Kirill: Awesome. Yeah, get in touch with Matt, guys. I have one final question for you today. What is your one favourite book that you can recommend to our listeners for them to become better data scientists?
Matt: Okay. R For Data Science. It's a free online book, it's by two gentlemen at RStudio, Hadley Wickham and Garrett Grolemund. If you're just starting out in R, if you're just starting out in data science, read that book. It's all online, it's very easy to navigate and it will really walk you through some of the robust but new tools in R that really make it super easy to learn R. R used to be very difficult to learn, it's not anymore with the tidyverse so you'll enjoy that book.
Kirill: Awesome. Thank you so much, and once again, thanks so much for coming on the show and spending some time here to share all your experiences and insights into the world of not just R but data science in general.
Matt: Thanks, my pleasure, Kirill. I really love what you guys are doing at SuperDataScience and I'm also really interested in your conference that you have now, DataScienceGO.
Kirill: Thanks. Will we see you at the next DataScienceGO?
Matt: Absolutely. I will be there.
Kirill: Fantastic. All right, thanks Matt.
There you have it, that was Matt Dancho, founder of Business Science. I hope you enjoyed this podcast episode and picked up quite a few interesting things. What I liked about this podcast is of course how it combines the space of data science with business, those are always interesting to see how people go about creating businesses in consulting for data science because it is such an in-demand space. If you're thinking of creating a data science consulting firm or a data science business or even being a freelancer data science consultant, then I'm confident that you found Matt’s journey interesting. And another thing that I found of value in this podcast was Matt's comment on deep learning in R. I wasn’t up to date with how things are going in that space, but it's good to hear that the developers of R and RStudio are looking into accommodating the space of deep learning more, because that is definitely something that Python has over R at the moment. That gives us hope that if you're an R fan then, with time, these packages or libraries such as Keras will be implemented in R as well and that will open up the world for deep learning. I’m not saying that deep learning is a must, that you have to study deep learning in order to be successful. No, but it's always good to have that option, good to know that the tool that you’re studying will allow you to progress into that space which is such a booming area, and as Matt pointed out even his business is slowly growing into the space with deep learning more.
We really hope you enjoyed this episode. Make sure to head on over to www.superdatascience.com/109 and there you will find all of the links to the resources that Matt mentioned, including his LinkedIn, so make sure to hit him up and follow him on LinkedIn as well as other places like Twitter. Also check out his blog, I had a look at some of the articles, very well written, very interesting especially that article that took him three weeks to write about HR analytics. I think that’s a very interesting case study as well. There you go, hope you enjoyed this episode. If you know anybody in the space of data science that might be thinking of starting a consultancy or going out as a freelancer, send them the link to this episode, this might help them out. On that note, thanks so much for being here today, I look forward to seeing you next and until then, happy analysing.
[Background music plays]