SDS 475: The 20% of Analytics Driving 80% of ROI

Podcast Guest: David Langer

June 1, 2021

We have the incredibly witty and entertaining David Langer to discuss powerful modeling approaches in spreadsheet tools, the SQL databases you need, why R should be our default language, circumstances where ML drives value for businesses, and more!

About David Langer
Dave Langer is the founder of Dave on Data where he crafts and delivers training designed for ANY professional to develop valuable data analysis skills. Dave’s vision is a world where data analysis skills are as common as skills with Microsoft Office. Dave has successfully trained 100s of professionals in a live classroom setting and 1000s more via his online courses and tutorials. Dave is a hands-on analytics artisan, having used the combination of Excel, SQL, and R to deliver insights that drove business strategy at companies like Schedulicity, Data Science Dojo, and Microsoft.
Overview
We kicked things off by discussing Dave’s work through Dave on Data which teaches, according to his tagline, the 20% of analytics that drives 80% of return on investment. Over the course of his career, he found himself returning to the same few techniques again and again, whatever the analysis was for. Two stood out: linear regression and a technique from statistical process control called an XMR chart (also known as a process behavior chart). This was surprising to me as I’d never even heard of this statistical technique. Many statistical techniques assume the data will fit a bell curve, but with this technique you don’t have to worry as much about the shape of the data or about normalizing it.
We then discussed Dave’s courses on SQL content for Excel users. Excel can work for a lot of folks, but at a certain point, you may need to switch to SQL (structured query language) to handle larger datasets. Most database storage platforms speak SQL. From there you can work with the data and format it or even run analyses within SQL. This is an awesome thing for people to know, even if you’re not a coder, because your experience and skills in Excel translate smoothly into the use of SQL. Personally, I think the basics can be learned in an hour by most users.
Dave, who has a course called “R Programming is Easy”, is a big proponent of using R. He’s coded in multiple languages over the course of his career and found R to be his preferred language. First, because you write less code to get things done, and second, because R is an easier programming language to learn for people coming from non-technical backgrounds. He is also a proponent of recognizing that most people are coders without realizing it if they’re using Excel, and because of this, the R language translates more easily for those who code in Excel. R can and will do any analysis you will need, even if it’s not the most popular for ML or deep learning. Regardless, we both agree the Python vs R debate is a little ridiculous; both are useful for different folks with different needs.
The vast majority of business analytics in the world is done simply with Excel, SQL, and R without the need for intense software. But Dave also does education work in machine learning. So how do you move from a statistical approach to machine learning? It starts with knowing what kind of question you’re answering. Machine learning allows you to answer questions around classification and patterns. Many people move into ML because they want to answer business questions that have yes-or-no answers or need labels.
In this episode you will learn: 

  • Intro to Dave on Data [6:50]
  • 20% analytics that drives 80% of ROI [11:04]
  • The benefits of SQL [19:15] 
  • The uses of R [24:50]
  • Machine learning [34:15]
 

Podcast Transcript

Jon Krohn: 00:00
This is episode number 475 with Dave Langer, the founder of Dave On Data. 
Jon Krohn: 00:12
Welcome to the SuperDataScience Podcast. My name is Jon Krohn, a chief data scientist and bestselling author on deep learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let’s make the complex simple. 
Jon Krohn: 00:42
Welcome back to the SuperDataScience Podcast. We are very lucky to have the witty and entertaining Dave Langer as our guest on the show today. Dave is a data scientist and software engineer. He’s held quite a few senior roles over his career, including at Microsoft’s global headquarters in Redmond, Washington, where he spent nearly a decade. At Microsoft he was a principal software architect, a principal data scientist and director of analytics. Today, Dave leverages his rich industry experience as the founder of Dave On Data, a unified brand for his fun and popular courses, webinars, and videos on data analytics, statistics, and machine learning. 
Jon Krohn: 01:22
You don’t have to take my word for it. Some of his YouTube videos have over a million views. Today’s episode is focused on the 20% of analytics that drives 80% of a company’s return on investment. Dave fills us in on powerful modeling approaches we can deploy in spreadsheet tools like Microsoft Excel alone, the SQL databases we’ll need if the data sets we’re working with are too big for Excel, why R programming is easy and should be our default language choice for moderate to advanced statistical analysis, and circumstances where machine learning really does drive value in a business setting. This episode is ideal for listeners who are early on in their data science journey, but Dave does introduce some ideas and practical techniques that may be new to even the most experienced data scientists. I, for one, learned a lot from him today. 
Jon Krohn: 02:20
Dave Langer, welcome to the show. I can’t believe you’re here in the flesh, well, in the pixels, at least. I have wanted to have you on the show for a couple of months. So about two months ago, episode 457, we had Harpreet Sahota on the show and I couldn’t help noticing that he runs these Happy Hours, the Artists of Data Science Happy Hours that anyone can drop into on a Zoom call. And I see your face there all the time. And you’re such a big player in the data space. I’ll just embarrass you by saying it. And so I’ve been like working up the courage to ask you to be on the show and yeah, now you’re here. It’s wonderful. So how are you doing today, Dave? 
David Langer: 03:05
Jon first, thanks for inviting me to be on the show. I don’t know why you waited so long. I am not that big of a deal and I’m doing great. I’m happy to be here just to talk about data with you today. 
Jon Krohn: 03:18
Nice. So where are you in the world, Dave? 
David Langer: 03:22
So I live in beautiful Bozeman, Montana in the United States. For those folks that might be outside the US not familiar, it’s on the north part of the US towards Canada, but on the west part. So not too far from the Pacific ocean, generally speaking. 
Jon Krohn: 03:40
Nice. And then I imagine you must have like a pet bison or something like that, probably that. 
David Langer: 03:45
No, I have a Doberman Pinscher dog, but no bison. Yeah, no bison. 
Jon Krohn: 03:52
But it’s kind of Montana is famous for well, I guess a lot of beautiful outdoor landscapes, mountains, plains, I imagine rivers. 
David Langer: 04:00
Oh yeah. So if you’re a fan of Brad Pitt, he’s been in a couple of movies that take place in Montana. 
Jon Krohn: 04:07
Oh, wasn’t he in a Michael Lewis one? 
David Langer: 04:11
Legends of the Fall and A River Runs Through It. Those are all Montana and yeah, it’s pretty epic. I actually spent most of my life in the Seattle area, the Puget Sound area which is in the Northwest portion of the United States. And I moved here about three years ago. And it is quite epic. Yes. Very nice. 
Jon Krohn: 04:33
I think Brad Pitt was in The Big Short, based on that Michael Lewis book about the financial crisis, the 2008, 2009 financial crisis. I think even in that movie, Brad Pitt plays a character that’s like an old hedge fund guy and he ends up in Montana, I think. I don’t know. 
David Langer: 04:52
Oh, that one surprised me. There’s a lot of wealthy people that own like large tracts of land here in Montana, Ted Turner, for example. 
Jon Krohn: 05:04
Do you know how many bisons Ted Turner has? 
David Langer: 05:08
Dude, I’ve heard he has, and I’m not an expert. Okay. So I’m not an expert. So if anybody from Montana watches this and they’re like, no, you’re wrong Dave. I apologize for that. But I think he’s got like 30,000 acres or more in Montana. I mean, just like huge amounts. So I would say thousands of Bison and Buffalo. 
Jon Krohn: 05:28
Wow. 
David Langer: 05:29
It wouldn’t surprise me. But then again, I’m a computer data guy, not necessarily a rancher. So take that with a grain of salt. 
Jon Krohn: 05:38
Well, life is long. You still have time to become that data rancher you’ve always dreamed of being. 
David Langer: 05:44
Yeah, that’s right, maybe I’ll bring data science to ranching. 
Jon Krohn: 05:47
Nice. 
David Langer: 05:47
Who knows. 
Jon Krohn: 05:49
So how has the pandemic been out in Montana? Were things ever really locked down in Montana or things are opening back up or did that never need to? 
David Langer: 06:00
So socially, Montana’s a conservative state. So the measures that were in place were never as severe as they were like where I’m from in Washington State or most certainly in California, for example. But where I live in Bozeman there were some restrictions that were in place, mask wearing, not too many people in restaurants and that sort of thing. But I kind of just hunkered down and just stayed in my house for the most part. So it didn’t really affect me too much because I was just like, I’m just going to stay in my four walls and call it good. 
Jon Krohn: 06:34
I’m just going to stay in and make some more awesome YouTube videos. That’s what I’ve been doing. We’ve both been doing that the last year. 
David Langer: 06:43
Oh, yeah. 
Jon Krohn: 06:45
No weddings to go to, just another weekend to make another sweet YouTube video. So you have been making YouTube videos for a while. You’ve got over 40,000 subscribers. You’ve got at least one video with over a million views and some awesome content that I think our audience is going to love to hear about in fact so much so that I don’t even really know where to start, but I think we could start with kind of the overall concept of what you’re trying to obtain with what you teach. So the tagline is you’re offering 20% of the analytics, you teach the 20% of analytics that drives 80% of the return on investment. And yeah, I love that idea. I think it makes so much sense to me. When I saw that as your tagline on LinkedIn. Maybe that’s why I wanted to have you on the show, right from that moment. I just knew, like, it just makes so much sense to me. On the show I talk, guests talk a lot about the most advanced machine learning techniques and all that stuff is cool, but it isn’t actually typically the thing that’s that useful day to day for most people in a given organization. So anyway, I’ll stop talking. Tell us about your wonderful tagline. 
David Langer: 08:05
Yeah. So that came from essentially my own experience, which mimics exactly what you were talking about, Jon, which is when I first got into machine learning, which was while I was getting my master’s degree which was probably, I don’t know, 10 years or so ago now. And I got fascinated with it and I wanted to learn all the things, neural networks and inverted indexes, and a whole bunch of different types of topics that were generally in that data science umbrella. But over the years when I actually got a job and I was actually doing real-world analytics work, I ended up using very little of any of that stuff. And I ended up going to a few tried and true methods over and over again because they were just disproportionately valuable whether I was doing an analysis for product management or marketing or finance or whatever it might be. And then that’s when eventually I said, look you know what? If you focus on just a small percentage of machine learning and data science and analytical techniques, those are actually disproportionately valuable for a wide audience. Now, don’t get me wrong. Like if you’re going to go work for Uber and you’re going to build self-driving cars, you need computer vision, you need deep neural networks. That’s what you need. 
Jon Krohn: 09:25
You don’t think you can build a self-driving car in Excel? 
David Langer: 09:32
With all due respect to the Excel PM community at Microsoft. No, I don’t think so. Although I did work with a guy once who claimed that he built a neural network in worksheets in a workbook by doing all the calculations. 
Jon Krohn: 09:47
No kidding. 
David Langer: 09:47
Yeah. 
Jon Krohn: 09:48
I guess it’s possible. It’d be hard to scale up probably.
David Langer: 09:52
Well, technically it’s Turing complete, right? The collection of worksheets and then the formulas that you can put in the worksheets, it’s Turing complete. 
Jon Krohn: 09:59
Right. 
David Langer: 09:59
So technically you need- 
Jon Krohn: 10:01
I have heard that before. Not because I am not totally abreast on what Turing complete means, because I totally am. Would you mind just letting us know, let the audience know what exactly that means? 
David Langer: 10:13
So Turing complete essentially is a base level computer science test to see if a programming language allows you to do general purpose computing essentially. And it’s a very low bar, to be honest with you, right? I mean, anyone who’s coded in C# or Java or Python, or even R for that matter, is going to look at this section of the video and go, why would I ever want to program in Excel? Just because it’s theoretically Turing complete doesn’t mean that you should do- 
Jon Krohn: 10:42
Right. That is, vaguely, actually, the idea that I had in my mind. But I don’t think I could have explained it so eloquently. So thank you. So yes, Excel is Turing complete, but we don’t recommend making neural networks in it. So what kinds of things do you teach in Excel? What are some examples of this 20% of analytics that drives the 80% of return on investment? 
David Langer: 11:11
That’s a great question. So there are two primary things that I teach and advocate for in Excel that most people might not necessarily be aware of. One is linear regression. You can perform linear regression analysis in Excel. I mean, it’s not necessarily the best tool if you’re doing something sophisticated. So if you’re doing some econometrics types of things, it’s not going to be very useful for you. That’s when you’re going to go to a different tool like R for example. But for many, many professionals in many, many scenarios, interested in analyzing data, linear regression in Excel is just fine. It works great. I’ve used it myself in hands-on real-world scenarios, it works just great in those cases. 
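As a rough illustration of the kind of simple linear regression Dave describes, here is a minimal sketch of a one-variable least squares fit. It is written in Python purely for illustration (the episode does not show any code), the function name `fit_line` and the sample numbers are made up, and the result is the same slope and intercept that Excel’s built-in SLOPE and INTERCEPT functions compute. How Dave actually sets this up in Excel is not specified in the episode.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (one predictor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up illustration data: monthly ad spend (x) and sales (y)
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales ~ {intercept:.2f} + {slope:.2f} * ad_spend")
```

The slope is read as the average change in the outcome for a one-unit change in the input, which is usually the number a business audience cares about.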
David Langer: 11:56
So that’s one, the second one is a technique from the area of statistical process control. And that sounds real fancy, but it is actually wildly easy, which is known as an XMR or process behavior chart. And essentially what it allows you to do is robust statistical analysis over measures over time, which is perfect for many business professionals. Because for example, they’re looking at sales over time or credit card chargebacks. Or if you’re in the HR space, we’re looking at attrition over time. I mean, so much of business analytics is looking at business measurements over time that this is a really powerful and easy statistical technique for analyzing those types of scenarios. And those are wildly useful. 
Jon Krohn: 12:44
I haven’t heard of that before. SPC, statistical process control. And you said XTR? 
David Langer: 12:52
XMR chart. 
Jon Krohn: 12:53
XMR chart. 
David Langer: 12:56
It’s also referred to as a process behavior chart as well, but they’re essentially the same thing. XMR chart or process behavior chart, same thing. 
Jon Krohn: 13:04
Wow. I mean, that goes to show, I didn’t actually expect when I asked that question, you to introduce me to a new statistical technique that I hadn’t even heard of. I thought you were going to say like histogram. 
David Langer: 13:20
I’ve got content on how to build histograms in Excel. It’s on my YouTube channel. 
Jon Krohn: 13:26
They’re super useful. I mean, I do use histograms basically every day. But this is awesome. So an XMR chart, so I guess this is kind of a time series analysis technique. 
David Langer: 13:38
It is. But it’s not like the traditional sense of the word, like where I’m using Exponential Smoothing or ARIMA to do a univariate time series forecast for the future. It comes from the area of statistical process control of manufacturing. And it was actually invented back in the 1920s, the 1930s for controlling and monitoring and optimizing manufacturing processes to get the quality within certain fault tolerances. And it turns out that the calculations that they use in XMR charts are essentially distribution independent. So like, for example, if you use linear regression in Excel, you need to worry about some of the typical OLS assumptions when you use it. None of those assumptions, none of those assumptions apply to the XMR chart because the calculations are done such that they’re essentially distributionally independent. 
David Langer: 14:36
Now, you do get a certain amount of loss of fidelity. If your data is normally distributed, it works better. But if it’s some other type of distribution, the calculations are good enough in the business analytics space. You wouldn’t use this in big pharma or doing a drug trial. You’re going to use a standard statistical test for that. But if you’re in HR and you want to compare the attrition rate between two organizations, the chart is plenty good to say, look, I’ve got data over time for both organizations. Let me compare them statistically. 
Jon Krohn: 15:09
Nice. That’s really cool. So basically to kind of rephrase something that you were saying there. So when we use a lot of statistical techniques, we make parametric assumptions which is that typically like in a nutshell to describe that it’s basically that your data that you’re modeling fit a normal distribution. So like what we colloquially often call a bell curve. And so what you’re saying is that with the XMR chart, we don’t need to worry about that assumption. So you don’t need to worry as much about what your data look like. You don’t need to worry about reshaping them. If you do do a histogram of your data and you notice that it’s not a bell curve, you don’t need to do a Box-Cox procedure or something and get it to normal. You can just use the XMR chart, feel confident about it, because I guess it’s a non-parametric approach. 
David Langer: 16:03
Yeah, the calculations that they use are derived from the standard normal distribution. However, folks in the community, the statistical process control community, have done empirical research where they like do computer generated tests over thousands of randomly generated distributions. And what they typically find is, yeah, maybe with a normal, the cutoff threshold is 99.7%. But with this distribution, it’s 91.2%. But for most business analytics, that difference doesn’t really matter, right. That level of fidelity doesn’t matter. It still allows you to conduct an analysis and come to a conclusion, to an insight that’s actionable. That’s what the charts are really for. And then basically the only restriction that you have to have is that the data points that you put, the time series that you put on the chart, they have to be logically comparable. So one example is in brick and mortar retail, typically they have higher sales on the weekends than they do during the week traditionally. 
Jon Krohn: 17:04
That makes sense. Yeah. 
David Langer: 17:04
So you couldn’t probably put daily sales for a brick and mortar store on the chart because they’re not logically comparable, but if you roll them up to the week level, the [inaudible 00:17:14], then you’re good to go. 
Jon Krohn: 17:17
Nice. All right. That’s really cool. You may already have heard of DataScienceGO, which is the conference run in California by SuperDataScience. And you may also have heard of DataScienceGO Virtual, the online conference we run several times per year. In order to help the SuperDataScience community stay connected throughout the year from wherever you happen to be on this wacky giant rock called planet earth, we’ve now started running these virtual events every single month. You can find them at DataScienceGO.com/connect. They’re absolutely free. You can sign up at any time. And then once a month, we run an event where you will get to hear from a speaker, engage in a panel discussion, or an industry expert Q&A session. And critically, there are also speed networking sessions where you can meet like-minded data scientists from around the globe. This is a great way to stay up to date with industry trends, hear the latest from amazing speakers, meet peers, exchange details, and stay in touch with the community. So once again, these events run monthly. You can sign up at datasciencego.com/connect. I’d love to connect with you there. 
Jon Krohn: 18:33
So super, I love learning, not just now, I imagine these kinds of things. I tend to use Google Sheets myself instead of Microsoft Office Suite, but I suspect they’ve done a really good job. The product managers at Google have done a great job of taking all the functionality that you’d want in Excel. So I’m sure I could be doing XMR there too. That’s cool, I’m looking forward to checking that out. 
David Langer: 18:57
Yeah. And actually just to cap off on that, to use XMR in either Google Sheets or Excel, you’re basically just doing very simple calculations, so they will work in either. You’re not even using anything fancy, like you would be when you’re using linear regression. 
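The “very simple calculations” behind an XMR chart can be sketched in a few lines. The sketch below is in Python rather than a spreadsheet, purely for illustration; the function names and sample numbers are made up, and the constants 2.66 and 3.268 are the conventional XmR scaling factors rather than anything quoted in the episode. In a spreadsheet the same steps need only an average and an absolute difference between neighboring rows.

```python
from statistics import mean


def xmr_limits(values):
    """Compute the center line and limits of an XmR (individuals / moving range) chart."""
    # Moving range: absolute difference between each point and the one before it
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "center": center,
        "lower_limit": center - 2.66 * mr_bar,  # lower natural process limit
        "upper_limit": center + 2.66 * mr_bar,  # upper natural process limit
        "mr_bar": mr_bar,
        "mr_upper_limit": 3.268 * mr_bar,       # upper limit for the moving ranges
    }


def out_of_limits(values, limits):
    """Return (index, value) pairs that fall outside the natural process limits."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v < limits["lower_limit"] or v > limits["upper_limit"]
    ]


# Made-up baseline of weekly measurements, then three new weeks to check
baseline = [10, 12, 10, 12, 10, 12]
limits = xmr_limits(baseline)
print(limits)
print(out_of_limits([11, 20, 9], limits))
```

Computing the limits from a baseline period and then checking new points against them is one common way to use the chart; it is an assumption here, not a description of Dave’s own workflow.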
Jon Krohn: 19:12
Nice. Cool. All right. So then another topic area that I’ve noticed that you do a lot of teaching on, a lot of YouTube videos on, is directly related to Excel. It’s teaching content on SQL targeted at Excel users. So I guess tell us a bit about that. I guess you can even introduce us to what the structured query language is, to what SQL is, and why you’re seeing this particular need. So my guess is that you’re observing a fair number of circumstances where Excel breaks down, where people get to maybe a data set size where they need to start working with a real database tool like SQL. 
David Langer: 20:03
Yeah. So the approach I take is this kind of choose your own adventure kind of thing. So for many professionals that are interested in using data analysis to have more impact at work, maybe Excel is all they need. Maybe they’re not dealing with that much data, maybe a combination of exploratory data analysis, process behavior charts, and linear regression is all they need. Great. Cool. I’m totally fine with that. I’m not going to tell you to do any more than that, if that gets what you need, great. But if you need more power, the logical next step is SQL, SQL, structured query language. And if you’re not familiar, SQL is essentially a programming language that you use to talk to databases. And it turns out that SQL is wildly useful and basically any database, any data storage technology that’s in common use has the ability to understand SQL. 
David Langer: 20:59
So it becomes kind of this lingua franca. If you want to work with databases, SQL is the way to go. SQL Server from Microsoft, Oracle, Hive on the big data platform, Spark SQL, Snowflake, you name it, Teradata. So many database platforms, so many data storage technologies, they all speak SQL. So it’s a logical next step for any business professional, any professional that’s interested in upping their analytics game to move into SQL because it allows you to communicate directly with the data storage platforms that are in use like a database. And you can pull the data, you can massage the data, you can get it into a format that you want, put it in your Excel spreadsheet if that’s what you’re doing. Or you can even conduct analysis directly in the database, because you just code them up in your SQL and then you can run them over millions and millions or even billions of rows depending on what you’re working with. 
David Langer: 21:51
So SQL is like, it’s this logical next step. And not to mention too, if you’re actually interested in like moving out of a business role into like a formal data analytics role, like a data analyst or data scientist, you’re basically going to need to learn SQL anyway. Because it is, like I said, the lingua franca. I’ve analyzed job descriptions at Facebook and Amazon and they all want SQL, all of them. So you can’t go wrong with learning SQL. The only problem might be is that some roles in some organizations, depending on what you do, your IT department might not allow you to run queries directly against the database. 
Jon Krohn: 22:32
Oh, no kidding. 
David Langer: 22:33
Yeah. That’s not uncommon. So in this age of self-service BI that’s becoming less and less a thing, but in the old days it was definitely a thing, right? Your DBA would be your database administrator. Your DBA would say, no, no, no, no, we’re not going to let random business people fire queries at the production database, no way we’re going to let that happen. But regardless SQL is something that is awesome for people to know. And it turns out that even if you’ve never coded before, which I would actually argue that’s a bit of a misnomer, but we can put that on the side for right now. Even if you’ve never even coded before, your knowledge of Excel helps you learn SQL because at base Excel is all about working with tables of data and that’s what SQL is all about. So I have a series of tutorials on my YouTube channel that maps your Excel knowledge to a SQL concept, and then teaches you gradually how to conduct data analysis with SQL. 
Jon Krohn: 23:32
Nice. That makes perfect sense to me. And SQL is hugely useful. If people aren’t already familiar with it, if you’re a listener that isn’t already familiar with it, it is definitely worth getting familiar with. And the basics you really can learn them in an hour. The general idea, it almost is English. It’s barely a code. Like it’s like select these, you’re specifying exactly what rows and columns you want. You just need to not make any spelling mistakes. 
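To make the “it almost is English” point concrete, here is a minimal sketch of a query that picks out specific columns and rows. It uses Python’s built-in sqlite3 module and an in-memory database so it can run anywhere; the `orders` table, its columns, and its contents are all hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite database with a made-up orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "West", 120.0), (2, "East", 80.0), (3, "West", 45.5), (4, "North", 200.0)],
)

# SELECT names the columns you want; WHERE names the rows you want
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE region = 'West' ORDER BY order_id"
).fetchall()
print(rows)
conn.close()
```

Read aloud, the query is close to a plain sentence: select the order ID and amount from orders where the region is West.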
David Langer: 24:13
Yeah, exactly right. And it’s like, oh, and if you’re not familiar, there’s this thing in SQL called GROUP BY, which is very much akin to creating a pivot table in Excel. Right. Because in Excel, you’re pivoting by grouping something and calculating something, same idea. So there’s a lot of conceptual overlap. 
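The pivot table comparison can be sketched with a small GROUP BY query: group the rows by one column, then calculate something for each group. As before, this is an illustrative sketch using Python’s built-in sqlite3 module, and the `orders` table and its values are made up.

```python
import sqlite3

# In-memory SQLite database with a made-up orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("West", 120.0), ("East", 80.0), ("West", 45.5), ("East", 20.0)],
)

# Like a pivot table with region as the row field and
# a count and a sum of amount as the values
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
print(summary)
conn.close()
```

Each output row corresponds to one row of the pivot table: the grouping value followed by the calculated fields.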
Jon Krohn: 24:31
Yeah. Cool. So SQL definitely recommend that. I love the natural progression from Excel to SQL that table structure. I can see how it could be so easy to pick up if you already are familiar with Excel. Let’s move on to another topic area that I think flows to kind of the next level of complexity. So maybe you’re already familiar with handling two dimensional tables of data in Excel. And I guess you could actually go right from there to R in some circumstances. Particularly, I guess if you’re not so making the move. So I guess this goes into the choose your own adventure piece. So depends on what you’re doing at your business, what problems you need to be solving. So going from Excel to SQL might be a great path to go down, if you’re going from thousands of rows to millions or billions of rows of data. Now R, that becomes useful if you want to be applying more advanced statistical techniques. 
Jon Krohn: 25:35
So we talked about some of these already, so things like you could have neural networks in R, you could have the kinds of time series analysis that you were talking about, like the ARIMA approaches. So I guess that’s the kind of natural progression there. So if you want to be using slightly more advanced statistical techniques than you can do in Excel, R programming, as you say is easy. I think that’s the name of one of your courses. 
David Langer: 26:12
Yes, that is absolutely true. And just to give your audience some background, I’m a former software engineer. So I’ve been paid to code in many languages over time. And in the data space, there’s this classic Python versus R debate, and I’ve written code in Python. Python’s a fine programming language. I just prefer R even though I could write code in any language, because I’ve used many in my time. I use R for two reasons, one, you just write less code to actually get stuff done as compared to using Python or Java or another object oriented programming language. And then two and this is most important, I think to your point, Jon, which is R is just easier for people to learn if they’ve never used a formal programming language before. 
David Langer: 27:03
And I know that because I’ve taught hundreds of working professionals over my career, R programming from all kinds of backgrounds, and it’s probably a good spot in the episode here to talk about what I mentioned earlier, which is actually, if you are an Excel user, you write code, you write code. If you go into cell G2 on a worksheet in Excel, and hit equals, you are now opening up an interpreted programming experience with the Excel engine. That’s what you’re doing. So if you type in equals, average, open parentheses, a table name, square bracket, a column name, and then close that all off and hit enter. You’ve just told Excel to interpret a piece of programming code that you’ve written and give you a result. 
Jon Krohn: 27:48
Okay. If you’re a programmer [crosstalk 00:27:51] and you didn’t even know. 
David Langer: 27:52
You didn’t even know, oh my gosh. Yes, it’s absolutely true. And here’s the kicker. If you look at that code that you type into Excel cells and compare it to the code that you would type in R they look remarkably similar, remarkably similar, very, very sort of similar. And I’ve got content. I mean, you don’t have to take my word for it, I got content, you can go check it out, that shows this. But then because- 
Jon Krohn: 28:19
If you don’t take my word for it on the podcast, you can take my word for it in a video. 
David Langer: 28:23
Yes. Right. Which is just a video of a different flavor. That’s a good point, Jon. 
David Langer: 28:29
But once again, it’s kind of like what we were just talking about with SQL, right? Okay, you say, okay, great. We’re working with tables of data in Excel. Generally speaking, when you’re using R for data analysis, you’re just working with tables of data in R, and the R code for working with those tables of data looks an awful lot like the Excel code. Kind of just makes some sense that if you’re a professional interested in getting into analytics and upping your analytics game, R is a great place to start, because it’s just easier for you to learn. It doesn’t mean you have to stick with R, maybe you learn Python later. I don’t know, but maybe not, maybe R is all you will need because R will do basically any analysis that you need. You can do deep neural networks with it, even though it’s not the most popular scenario for it, but linear regression, machine learning, any sort of statistical analysis can all be done in R just fine. And you can do it with over millions and millions of rows of data just on a laptop usually, which allows you to scale past Excel. 
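The formula Dave spells out earlier (equals, average, a table name, a column name in square brackets) would look something like `=AVERAGE(Sales[Amount])` in Excel, and roughly `mean(sales$Amount)` in R for a data frame. The table name `Sales`, the column name `Amount`, and the numbers below are hypothetical. The same one-liner is sketched here in Python only so that it can be run and checked; the point is the shared shape, a function applied to a named column of a named table, not the choice of language.

```python
from statistics import mean

# Hypothetical stand-in for an Excel table named Sales with a column named Amount
sales = {"Amount": [120.0, 80.0, 45.5, 200.0]}

# Excel: =AVERAGE(Sales[Amount])
# R:     mean(sales$Amount)
average_amount = mean(sales["Amount"])
print(average_amount)
```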
Jon Krohn: 29:37
Yeah. And I also have used a lot of R, I’ve used a lot of Python, and it does also seem ridiculous to me that there is an argument, they’re both valuable tools. One of the ones that you mentioned there, for statistics in particular, R actually tends to have better developed packages out of the box. So if you want to dig into the results of some statistical analysis, using the model summary method in R is pretty much always going to give you or give me at least exactly what I was looking for. And then in Python, I really have to dig around, or I might have to write something custom to dig out the information that I wanted. I think you end up needing Python, I think in scenarios where either your code is going to go into production somewhere. So it’s going to be powering a web app. I don’t know anyone that writes R code that powers a web app. It probably happens somewhere. But I think it’s an unconventional choice. Python is a really nice glue language. And then I guess the other thing, although I could be wrong with this. You might know more about this than me, but how easily can you distribute some R process over a bunch of servers, over a bunch of different cores?
David Langer: 31:04
Yeah. So there are R packages that allow you to do that. Typically R is used within a machine. So the easiest way to scale R, quite frankly, is usually to just go to the cloud, spin up a really big VM with lots of RAM and lots of virtual CPUs, and just run it there. And R is very good. It’s very easy and very good, too, in that sort of scenario to say, look, I’ve got 16 virtual CPUs, use them all please, and it will do it just fine. You can also distribute R across multiple nodes as well, but then it gets a little more complicated. So what I would say is based on my experience, and by the way, I have built and maintained production systems in R, so it can be done.
Jon Krohn: 31:56
Oh, no kidding, all right.
David Langer: 31:57
Yeah. But they were small scale; granted, they were relatively small scale. So if you’re talking about large-scale software engineering, like if you’ve got a team of like 20 people, then you’re probably going to want to graduate to a language like Python. That makes a lot of sense to me as a software engineer and as a former software architect. But what people fail to realize, I think, is that if you look at all the analytics that are done in businesses around the world every day, the vast majority of it does not require any sort of hardcore software engineering. Excel, SQL, and a little bit of R is probably going to be sufficient to cover a huge majority of the cases.
Jon Krohn: 32:43
Yep. That all makes perfect sense to me. And so there’s one kind of last big topic that I want to go over that I think we’ve kind of progressed to naturally. So we’ve got kind of our Excel base for doing this kind of, I don’t know, can I say like what-you-see-is-what-you-get programming? In Excel it’s kind of there, although actually, in a way, it’s almost kind of the opposite. Now that I’m thinking about it, that analogy really breaks down, because I was thinking you see all the manipulations there in the cells. But actually what’s happening in the cell is kind of hidden after you press enter. So it’s almost the least visible programming.
David Langer: 33:25
Exactly right. A lot of people say, hey, you want to graduate from Excel to Python and R for reproducibility, because you want a script file essentially, or a notebook because exactly what you said, Excel kind of hides that stuff behind the scenes. 
Jon Krohn: 33:38
Right. But for whatever reason, Excel ends up being the way that our brains are configured. It seems like, for the vast majority of people, it’s kind of easier to understand and start to poke around with data analysis using a table-based structure like Excel. And then from there, if we have really big datasets, millions or billions of rows of data, we might need to jump to SQL. If we want to start doing some relatively advanced or very advanced statistical approaches, then we might need to jump from Excel to R. So the last kind of topic that I want to talk about, I think, progresses along that statistical complexity, that programming complexity line, which is machine learning. So you have a course called Machine Learning in a Nutshell. So when would somebody want to move from a statistical approach, like an ordinary least squares regression, maybe t-tests for comparing distributions? Why, and in what kinds of circumstances, do we need to make a jump to machine learning?
David Langer: 34:48
That’s a great question. The way I tend to phrase it is: what kind of business questions are you interested in answering, first and foremost? So for example, take the process behavior chart we talked about earlier, the XMR chart. You can compare the attrition rates between two organizations over time if you’re an HR professional. So that’s the tool that answers that kind of question. Machine learning, in my experience, is most often used to answer questions in the classification space, because in a regression problem you’ll probably just use linear regression, because it’s quite good, right? There’s a reason why it’s so popular: it’s useful. But it’s more the classification space. So things like, do I want to approve this loan? Yes or no. Do I want to be able to understand, for example, what are the features and patterns in my data that are highly associated with paid conversion? That sort of thing. Those sorts of questions are typically where I see most professionals graduating to using machine learning.
David Langer: 36:00
Then you could use a technique called logistic regression for this, absolutely. The only problem with logistic regression is that to interpret the models you have to learn about odds ratios and the natural logarithm and things like that. And sometimes people’s eyes just kind of glaze over. And this is where machine learning… Yeah. You can’t interpret the logistic regression coefficients at face value like you can with linear regression; they’re not the same, you have to transform them. You can lose a lot of people. Anytime you say “odds ratio,” figure you lose half your audience. Okay. So, which is nice, because then you can use machine learning techniques, where it’s much easier for everybody to understand what’s going on. And in particular, if you focus on the 20% of machine learning, noticing a theme here?
Jon Krohn: 36:49
Nice. 
David Langer: 36:51
It’s actually quite easy. So for example, if you’re using something like a decision tree or a decision-tree-based algorithm, the mighty random forest, not only is it easy for any professional to learn how they work, because the mathematics are very, very simple, relatively speaking, but also the results are very easy to interpret and to communicate to other people, as opposed to saying, yeah, I’ve got a logistic regression model and I’ve got to take some natural logarithm and this is the odds ratio, and we’ve got to think about it like that. So that’s where I typically see people moving into machine learning. It’s like, I want to answer these kinds of business questions that are yes or no, or I need a particular kind of label, like bronze, silver, or gold. I’m trying to predict Olympic outcomes or something like that. That’s typically when you see people graduate, because as I said, logistic regression is certainly a fine tool to use in this space. Even though technically it’s not a classification algorithm; a statistician will yell at you if you say that. But it helps you answer those kinds of business questions. It’s just more complicated to work with than machine learning, generally speaking.
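Editor’s note: the transformation Dave refers to can be sketched in a few lines. This is a minimal illustration, not something from the episode; the function names and the coefficient values are hypothetical, chosen only to show why a logistic regression coefficient cannot be read directly the way a linear regression coefficient can. A coefficient lives on the log-odds scale, so exponentiating it gives an odds ratio, and turning log-odds into a probability takes a further step.

```python
import math


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds scale) to an odds ratio."""
    return math.exp(coefficient)


def probability(log_odds):
    """Convert log-odds to a probability with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted model for "paid conversion": values are made up for illustration.
intercept = -1.0       # log-odds of conversion when the feature is 0
coef_used_trial = 0.7  # change in log-odds when the customer used the free trial

# The coefficient is NOT "0.7 more conversions"; it multiplies the odds by exp(0.7).
print("odds ratio:", round(odds_ratio(coef_used_trial), 3))
print("P(convert | no trial):", round(probability(intercept), 3))
print("P(convert | trial):   ", round(probability(intercept + coef_used_trial), 3))
```

The extra exponentiation step is the part that, as Dave says, tends to lose an audience, which is one reason tree-based results can be easier to communicate.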
Jon Krohn: 37:57
Nice. That is a wonderful tour of the 20% of analytics that deliver 80% of the ROI. And that’s the Pareto principle, right? 
David Langer: 38:07
Yeah, exactly. I invent nothing. I stand on the shoulders of giants, that’s all I do. 
Jon Krohn: 38:16
Well, you’ve invented some really marvelous YouTube thumbnails. I highly recommend that people at least go check those out on Dave’s YouTube channel; they are really funny. And so I’m sure we’ll get lots and lots of views. So I love everything that you’ve said to us here today. I think all of this makes perfect sense to me. If there’s been one statistical technique that surprised me when it came up, but has also come up many, many times on the podcast, it’s this statistical process control idea, the XMR chart. I don’t suppose you have a book recommendation related to that concept, do you?
David Langer: 39:00
I do. It’s the book I always recommend for anybody who says, “Hey, Dave, I want to get started in data analysis and I don’t really have a technical background. What book should I get?” And the title is Making Sense of Data, that’s the name of the book, and it’s by Dr. Donald J. Wheeler. And that is an awesome, awesome book. Right now, at the time of this recording, unfortunately it seems to be in short supply on Amazon. So it’s a little bit expensive, but you can get it at a company called vitalsource.com; they sell electronic versions of it.
Jon Krohn: 39:35
Nice. 
David Langer: 39:36
So that’s a very, very good book. If you’re like, I’ve got only enough money for one book, Dave, and I’ve never done any data analysis before, what book should I buy? That’s the one I always recommend.
Jon Krohn: 39:47
Nice. That’s a cool recommendation because I suspect that it isn’t a common recommendation. It’s the first time that I’ve heard of the book, for example. And I like to think I’m aware of a fair few books in the analytics and data science space. And so that’s a cool one. I look forward to checking it out. 
David Langer: 40:03
That’s great. It was an eye opener for me. It came in handy. Just a quick backstory if I may. 
Jon Krohn: 40:12
Oh yeah, please. 
David Langer: 40:13
How I found out about this book was I was working at Microsoft at the time and I had sold my general manager on building a brand new data-intensive system to help run Microsoft’s supply chain more efficiently. And he said, okay, cool, Dave, we’ll give you X amount of dollars to build the system and blah, blah, blah, blah, blah. And then he asked me a question. He said, how will I know it’s working? You sold me on the idea that if we build this thing and we use all this data, it’s going to help us run the supply chain more efficiently. How will I know that in fact it’s working and you haven’t just sold me a bill of goods? And for all of my knowledge of SVMs and machine learning algorithms, and decision trees and all, and cluster algorithms and K-means and all that kind of stuff, I didn’t know how to answer that question in a rigorous way. I didn’t know how to answer it. And that led me to Dr. Wheeler’s book eventually, because I couldn’t answer that question. I didn’t know how I could put something up on my general manager’s dashboard over time and then show him, yep, see that right there, this collection of points? Yep, it’s working now, in a statistically valid way. That’s how I found the book and it is wildly, wildly useful.
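Editor’s note: for readers who have not met the XMR (process behavior) chart, here is a minimal sketch of the calculation behind the kind of dashboard Dave describes. It is an illustration, not Dave’s implementation: the data, the function names, and the choice to flag only points outside the limits are assumptions made for the example. The scaling constants 2.66 and 3.268 are the standard ones for charts built on two-point moving ranges, and no bell-curve assumption about the data is required.

```python
from statistics import mean


def xmr_limits(values):
    """Compute XMR chart limits from a baseline series of individual values."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    # Moving range: absolute difference between each pair of consecutive values.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "center": center,
        "lower": center - 2.66 * mr_bar,  # lower natural process limit
        "upper": center + 2.66 * mr_bar,  # upper natural process limit
        "mr_bar": mr_bar,
        "mr_upper": 3.268 * mr_bar,       # upper limit for the moving ranges
    }


def signals(values, limits):
    """Return indexes of points falling outside the natural process limits."""
    return [
        i for i, v in enumerate(values)
        if v < limits["lower"] or v > limits["upper"]
    ]


# Hypothetical data: weekly late shipments before and after a new system goes live.
baseline_weeks = [12, 14, 11, 13, 12, 15, 13, 12, 14, 13]
after_launch = [6, 5, 7]

baseline = xmr_limits(baseline_weeks)
print("limits:", round(baseline["lower"], 2), "to", round(baseline["upper"], 2))
print("signal weeks after launch:", signals(after_launch, baseline))
```

Points inside the limits are treated as routine variation; points outside them are evidence that something really changed, which is the kind of answer to "how will I know it’s working?" that Dave was looking for. Full process behavior charts also use run-based detection rules, which this sketch omits.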
Jon Krohn: 41:28
Cool. So you have this problem that you needed to solve. You didn’t know how to solve it. And so like Google searches or something led you to this particular book, it ended up being the right fit. 
David Langer: 41:38
Yeah. So interestingly enough, I did Google searches and that led me to a consultant in Australia. And she is a performance measurement consultant. So she’s like someone who goes around and helps people create KPIs for their organization and measure performance and that sort of thing. And she had a webinar saying your KPIs are lying to you. And so I watched that in earnest. And then in the end, she said, I learned all of these things I’m talking about in my webinar from Dr. Wheeler. And that’s how I found out about it.
Jon Krohn: 42:10
Cool. 
David Langer: 42:11
Yeah. 
Jon Krohn: 42:12
Great. All right. That’s an awesome recommendation. Dave, I’ve absolutely loved having you on the show. I’ve learned a ton and I’m sure our audience did too. I’m looking forward to having you again on the program sometime soon. 
David Langer: 42:27
Okay, that’d be awesome. I’d be totally down for that. 
Jon Krohn: 42:35
Dave is so cool, isn’t he? I loved filming today’s episode with him. We had a blast. In this episode, we learned about the 20% of analytics that drive 80% of ROI. In particular, Dave filled us in on how spreadsheet software is a straightforward, surprisingly powerful tool for a wide range of modeling approaches, including linear regression and XMR charts. We also learned how SQL is great for managing data when we have too many rows for spreadsheet tools like Excel to handle, such as when we have millions of rows or more. We learned how R becomes necessary when we’d like to have reproducible data processing and modeling, and how easy and intuitive R programming can be, particularly if you have experience with spreadsheet tools already.
Jon Krohn: 43:17
And then finally we learned how machine learning, for example in R, can be very useful in business settings where we’d like to solve a classification problem. For more awesome content from Dave, I recommend following him on LinkedIn and subscribing to his YouTube channel. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Dave’s LinkedIn profile and YouTube channel, as well as my own social media profiles. That’s at www.superdatascience.com/475, www.superdatascience.com/475. If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel, where we have a video version of this episode. To let me know your thoughts on the episode, please do feel welcome to add me on LinkedIn or Twitter and then tag me in a post. Your feedback is invaluable for figuring out what topics we should cover next.
Jon Krohn: 44:17
All right, thanks to Ivana, Jaime, Mario, and JP on the SuperDataScience team for managing and producing another amazing episode today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 