SDS 463: Time Series Analysis

Podcast Guest: Matt Dancho

April 20, 2021

I learned a ton from Matt on time series analysis in this episode. We discussed Matt’s work on his Modeltime ecosystem for time series analysis, state-of-the-art tools and techniques, the most powerful models for time series forecasting, and more!

About Matt Dancho
Matt Dancho is the founder and CEO of Business Science, an educational platform and school dedicated to applying data science to business problems. With over 2000 students, Business Science has impacted diverse companies and industries and has most importantly helped students accelerate their careers. Matt is an avid member of the data science community and has contributed software in time series and finance.
Overview
Matt, the founder of Business Science (which has thousands of students), has been developing time series packages that have formed a small ecosystem of R packages. He started this in January of last year while developing a course on time series, when he realized the existing tooling in R was giving him enough trouble to make him want to quit the project several times. Part of the issue is that much of that tooling was developed by different people who did not coordinate with each other. So, he began developing his own libraries of models.
He spoke about a conversation he had with Hadley Wickham, who wanted to talk to him about his work. It’s indicative of the R community’s openness and friendliness, according to Matt. One toolkit he developed in that work was timetk (short for “time series toolkit,” in keeping with Matt’s straightforward naming convention), a tidyverse-style toolkit that helps prepare data for time series analysis and visualize and wrangle time series data. He describes it as a productivity booster. Another tool is step_timeseries_signature, a pre-processing “recipe” step that expands a timestamp into roughly 30 calendar features. When I first learned R, none of the tidyverse tools existed, but I have to say life is way better in the tidyverse than without.
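To make that concrete, here is a minimal, hypothetical sketch of the timetk workflow described in the episode; it assumes a tibble `sales` with `date` and `value` columns:

```r
library(tidyverse)
library(timetk)   # Matt's time series toolkit

# One-line visualization: an interactive plotly chart by default, a static ggplot if FALSE
sales %>% plot_time_series(date, value, .interactive = FALSE)

# step_timeseries_signature() expands the timestamp into ~30 calendar
# features (day of week, month of year, and so on) for feature engineering
rec <- recipes::recipe(value ~ date, data = sales) %>%
  step_timeseries_signature(date)
```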
Matt was reviewing forecasting competitions on Kaggle, which followed a pattern: a big time series data set spanning thousands of customers and products. The solutions people traditionally came up with, he noticed, utilized ARIMA and didn’t solve the scalability piece. So he set out to build tools. He observed two winning practices: first, machine learning — using algorithms that process all the products together to forecast, paired with a framework for running through algorithms quickly to select the winner; and second, deep learning, which is having a huge impact on time series analysis, with recurrent neural network models winning multiple competitions in the space. Both observations fed into Modeltime, a time series forecasting package that can really speed up the process. His Modeltime Ensemble involves three levels of stacking: a first level of sub-models, a second level of ensembling algorithms, and a third of weighted stacks; it handles the averaging and model management for you. We also discussed two more tools: forecasting using AutoML, and backtesting tools for building a resampling workflow with time series cross-validation to determine model stability over time.
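As a rough, hypothetical sketch of that multi-level stacking, assuming `models_tbl` is a modeltime_table() of three fitted sub-models (the first level):

```r
library(tidymodels)
library(modeltime.ensemble)

# Level 2: combine sub-model predictions by simple or weighted averaging
ens_avg <- models_tbl %>% ensemble_average(type = "mean")
ens_wtd <- models_tbl %>% ensemble_weighted(loadings = c(3, 2, 1))  # one weight per sub-model

# Level 3: a meta-learner stacked on top of the sub-model predictions
ens_stack <- models_tbl %>%
  ensemble_model_spec(model_spec = linear_reg() %>% set_engine("lm"), kfolds = 5)
```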
All of this just scratches the surface of the tools and practices growing in time series analysis. Matt has a hugely useful set of R packages out there that I encourage everyone to check out. 
In this episode you will learn:
  • How Matt got into time series library development [4:22]
  • Business Science [7:00]
  • R Shiny [9:36]
  • Matt’s 6 time series models [14:11]
  • Timetk [15:02]
  • Modeltime [29:32]
  • Gluon package [36:04]
  • Modeltime Ensemble [43:12]
  • Modeltime H2O [45:22]
  • Modeltime Resample [48:10]
Podcast Transcript

Jon Krohn: 00:00

This is episode number 463 with Matt Dancho, founder and CEO of Business Science. 
Jon Krohn: 00:12
Welcome to the SuperDataScience Podcast. My name is Jon Krohn, the chief data scientist and bestselling author on Deep Learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let’s make the complex simple. 
Jon Krohn: 00:42
Welcome back to the SuperDataScience Podcast. I’m your host, Jon Krohn. There’s an absolute ton to be learned from Matt Dancho in this episode on time series analysis. Matt’s the founder and CEO of Business Science, an applied data science education company. That is not, however, our focus this episode. Instead, we focus on Matt’s work as the lead developer of the Modeltime open source ecosystem for Time Series analysis. This episode will largely be of interest to hands-on data scientists, or other technical folks who are keen to learn about the state-of-the-art tools and techniques for handling time series data such as financial data or any quantitative information that varies over time. Specifically, we cover time series data pre-processing, tidy Time Series objects, and the most powerful models for Time Series forecasting, including deep recurrent neural networks, ensemble models, automated machine learning, and re-sampled back tests. 
Jon Krohn: 01:43
Before we dig into the episode, I have a quick announcement, that starting with episode 465 next week, we will begin releasing guest episodes on Tuesday mornings, New York Time. Historically, we’ve released on Wednesday evenings, but by releasing 36 hours earlier, we’ll be giving you two more morning commutes in your week to enjoy the episode. I can’t imagine any downsides to this change, but I didn’t want to catch you off guard when it happens. All right, let’s get to Matt. 
Jon Krohn: 02:18
Matt, welcome to the program. It is an honor to have you on. How’s it going in your part of the world? 
Matt Dancho: 02:23
Oh, it’s going well. Thanks for having me. I really appreciate it, Jon. 
Jon Krohn: 02:28
Where are you calling in from today? 
Matt Dancho: 02:31
Sunny State College, Pennsylvania. It’s finally starting to warm up. So if you’re on the East Coast near New York, I’m about four hours away from New York City. 
Jon Krohn: 02:41
Nice. Four hours South, I guess? 
Matt Dancho: 02:43
It’s almost directly West. 
Jon Krohn: 02:47
Oh, yeah. 
Matt Dancho: 02:48
Yeah, you just take 80 to the West. 
Jon Krohn: 02:51
Pennsylvania wraps around New York and New Jersey. Nice. Well, I haven’t been there, but I’m sure it’s lovely. And we are enjoying, it’s starting to warm up here, so we’re filming right at the end of March and we’re starting to get some nice sunny days, getting shorts out. It’s nice. 
Matt Dancho: 03:11
Yep. I’ve got my shorts on right now. I went for a walk today. I’m feeling good. 
Jon Krohn: 03:18
Great. So you’ve been on the episode twice before, but it’s been a while. I looked it up, you’re on episode 109 in November 2017. And then 165, not long thereafter in June 2018. I think it’s safe to say that a lot has evolved since then. I can’t wait to fill in our listeners on it. We’re delighted to have you back. Given your specialization in creating Time Series libraries for R, I think we should have an episode dedicated to the wondrous aspects of Time Series analysis. 
Matt Dancho: 03:54
Absolutely. And it’s definitely something that’s so helpful for companies. I used to do it all the time for my last company and it’s something that I really want to be able to help other data scientists be able to learn how to do it for their companies. Literally, it affects revenue and affects every part of the business. If you can forecast better, you’re going to be in high demand. 
Jon Krohn: 04:22
Nice. Tell us how you got into developing Time Series libraries for R. 
Matt Dancho: 04:29
There’s been probably the past year and a half, I’ve been developing a series of Time Series packages that is forming a small ecosystem. It’s like a small army of R packages. Really the story behind it is I was actually developing a course January of last year and I almost quit several times, it was a Time Series course, because I was trying to use all of the existing system in R and I was just having a heck of a time putting it together. Number one, I was running thousands and thousands of lines of code to be able to do simple things, like to be able to visualize Time Series, to be able to clean the data, to be able to come up with engineered features. I was constantly having to go through and back and forth through different systems. One’s called XTS. One’s called the tidyverse. 
Jon Krohn: 05:35
Yeah, I remember that. It’s been a while since I was proficient in R, but I remember XTS; the tidyverse was just starting to come along near the end of when I was using R a lot. 
Matt Dancho: 05:50
Tidyverse has helped out a ton. That’s actually the system that I centralized on. The problem I was running into is, you’ve got 15 different time series packages that do different things. One does anomalies, one does forecasting, one does sliding calculations. Those types of things are all developed by different people, so these people aren’t talking to each other. They aren’t standardizing on, hey, this function should operate this way and then you can pipe it right into this… No, there was none of that. So it ended up driving me nuts, to be honest. Literally, I’m trying to put together this awesome course teaching people how to do all the stuff that I’ve been researching, and yeah, it wasn’t going well. That was the start of the ecosystem. 
Jon Krohn: 06:50
I guess for the sake of our listeners, we should also mention that you create a lot of courses. That’s how you’re making a living these days, right? 
Matt Dancho: 06:59
Yes. 
Jon Krohn: 07:00
You run the Business Science platform? 
Matt Dancho: 07:02
Yes. I’m the founder and CEO of Business Science. To give you a little bit of background, I’ve been developing courses over the course of about three or four years now, have about 2000 students. I focus on basically taking somebody who knows nothing about data science, who has very limited experience with data science and getting them their careers accelerated as fast as possible by learning the most in-demand technologies like Time Series being one of them, but also how to make web applications, how to you use AWS, how to do all of the things that no one ever teaches data scientists how to do. 
Jon Krohn: 07:45
Then web applications, I guess, would be R Shiny? 
Matt Dancho: 07:47
Yeah. Shiny web apps. I have two courses on that, intro one and then a developer/advanced one. Then also the whole consulting, how do you take an organization, integrate data science into it? What frameworks do you use? How do you go from A to B and then build out a large scale project with them? So I teach all of that stuff. It’s all based on my real experience too. The time series stuff, that’s all based on things I’m either actually doing at my current company, which is business science or at previous companies or engagements that I’ve had. Obviously nothing proprietary, no proprietary data or anything, but the techniques are all solid. They’re all the ones that I’ve been using for the past seven or eight years now. 
Jon Krohn: 08:42
Yeah. Business Science then also does some consulting on top of the teaching? 
Matt Dancho: 08:49
Not really. We used to do consulting, actually Business Science used to be a consulting company for the very, very first beginnings. But after about six months to a year, we quickly transitioned into education. That was just basically, I was seeing a ton of, every company that I was working with, it was all an education problem. They all had competent people that just didn’t know how to apply data science to business. That’s really where I got the idea. All right, this is the real problem. This is the real market. This is what I need to be doing, is helping to educate the future workforce. That’s how I got into it. 
Jon Krohn: 09:27
That sounds hugely valuable. And we are, I promise you, listener, we’re going to dig into time series analysis to an incredible amount of depth in a moment. But before we do that, I want to talk about R Shiny for a second again, because I don’t think we have [inaudible 00:09:41] it up enough yet. R Shiny is cool. It allows you to very easily build your own web applications around some… I don’t know why you would use it if there wasn’t some kind of data involved; I think that’s the main use case. You wouldn’t use it to, I don’t know, create a website for selling something, but it’s for making web apps that have something to do with data. So you can imagine the same kinds of things you might use Tableau for, but it allows you way more customizability. You have a huge amount of control over the way that you’re storing data in the backend, how you’re calling it up, how you’re visualizing it. It all gets rendered in HTML, and you can port it over to a server in a couple of lines of code. Right? 
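To illustrate the pattern Jon describes — a data-driven web app in a handful of lines of R — here is a minimal, generic Shiny sketch (all names are illustrative):

```r
library(shiny)

# UI: one input control and one output slot, rendered as HTML
ui <- fluidPage(
  sliderInput("n", "Observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: recomputes the plot whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)  # run locally, or deploy to a Shiny server
```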
Matt Dancho: 10:32
Right. Yeah. RStudio Connect really makes it super easy now — it’s just push-button publishing of apps. R Shiny, that’s honestly where the first gigs I got as a data science consultant came from, and they weren’t really even data science-y. I wasn’t doing a whole lot of machine learning. It was just people who wanted a way to take the data that they had and put it into some application that didn’t involve Excel and them having to open up a file — just being able to pull it up through their web browser, like Chrome. Fast forward to now, though: Shiny, the ecosystem has evolved so much. It’s insane. It’s just so powerful. 
Matt Dancho: 11:26
In fact, I’ve done a lot with integrating time series into Shiny apps. That’s one of the things… I built this app called, I call Nostradamus. But it’s basically an automatic forecasting app that uses some of the Times Series packages that we’ll talk about Modeltime, timetk, under the hood, but in a user-friendly way where if somebody just clicks one button run, builds like 28 or 30 plus models, ensembles them together, makes the forecast, makes visualizations so people understand what’s going on. And it does it pretty quick, so R Shiny. 
Jon Krohn: 12:07
Nostradamus. 
Matt Dancho: 12:08
Yeah. 
Jon Krohn: 12:16
Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criteria is the Data Science Insider. If you weren’t aware of it already, the Data Science Insider is a 100% free newsletter that the SuperDataScience team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning and artificial intelligence. The top five — simply five news items — are handpicked: the items that we’re confident will be most relevant to your personal and professional growth. Each of the five articles is summarized into a standardized, easy-to-read format, and then packed gently into a single email. 
Jon Krohn: 13:11
This means that you don’t have to go and read the whole article. You can read our summary and be up to speed on the latest and greatest data innovations in no time at all. That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do. I skim the Data Science Insider newsletter every week. Those items that are relevant to me, I read the summary in full. And if that signals to me that I should be digging into the full original piece, for example, to pour over figures, equations, code, or experimental methodology, I click through and dig deep. So, if you’d like to get the best signal to noise ratio out there in data science, machine learning and AI news, subscribe to the Data Science Insider, which is completely free, no strings attached at www.superdatascience.com/dsi. That’s www.superdatascience.com/dsi. Now let’s return to our amazing episode. 
Jon Krohn: 14:10
All right, that’s a great segue. So we can go from Nostradamus, your R Shiny app for making forecasts, to talk about the underlying Time Series models. You mentioned them there briefly. There are one, two, three, four, five, six that we’re going to talk about in turn. The two big ones that we need to talk about upfront are timetk and Modeltime. So timetk allows you to prepare data for time series analysis, and then Modeltime is for applying machine learning on those time series data once they are properly formatted. Quickly trying out models. So tell us about why you created these libraries. I’ve now pretty much said everything that I know about timetk and Modeltime. So please fill us in with more detail on why this [inaudible 00:15:01]. 
Matt Dancho: 15:02
Yeah. Let me start with timetk. timetk is where you do the data wrangling and the visualization — the cool things, just at a high level. If I wanted to pitch you on why you should use timetk versus whatever else you’re using: one line of code pretty much does everything. So if you want to plot something, you just use plot_time_series, and we centralize on the data frame. So if you’re comfortable with the tidyverse, all you do is take your data frame and pipe it into plot_time_series, and it spits out either an interactive visualization if you select interactive, or, if you turn interactive to false, it produces a ggplot, a static plot. So either a Plotly or a GGPlot object. That’s the plotting utilities. 
Matt Dancho: 15:53
There’s probably six or seven plotting utilities, everything from anomaly detection to seasonalities, all the normal stuff, ACF. You have auto correlation and those sorts of things. That’s just the plotting though. Then you have data wrangling. So there’s all these dplyr functions. So dplyr, that library has a function called summarize. You use it often, like group by and et cetera, et cetera. 
Jon Krohn: 16:22
Yeah, I’m familiar with it. Yeah. 
Matt Dancho: 16:24
Okay. Well, then you’re immediately familiar with timetk then, because instead of group by and summarize, group by and summarize by time, and you just tell it what element if you want to summarize it weekly or daily, and it automatically does all that stuff for you. It speeds you up. It’s a productivity booster. Then it has a lot of stuff for, as you take your data science through the workflow, so you’ve visualized it, you’ve wrangled it a little bit, and then you’re getting it to a point where you want to do some forecasts. So there’s all sorts of, they’re called pre-processing recipes. This recipes thing is a big package, is being developed by the people at our studio. They call it a tidy models ecosystem. 
Matt Dancho: 17:13
Recipes is a package in that ecosystem where you do pre-processing recipes, and I have a bunch of time series pre-processing recipes. The biggest one is step_timeseries_signature. It’s got a funky name to it, but that is the number one function you need to learn if you’re getting ready to do some forecasting and feature engineering. It creates like 30 time series features out of your timestamp — things like day of the week, month of the year. It just automatically creates all that stuff for you, and then you can normally run your model right off of it. Big productivity booster. 
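A small, hypothetical sketch of the wrangling and feature-engineering steps Matt describes, assuming a tibble `sales` with `product`, `date` and `value` columns (tk_augment_timeseries_signature() is the data-frame counterpart of the recipe step):

```r
library(dplyr)
library(timetk)

# group_by() + summarise_by_time(): aggregate each product to weekly totals
weekly <- sales %>%
  group_by(product) %>%
  summarise_by_time(date, .by = "week", value = sum(value))

# Expand the timestamp into ~30 calendar features, ready for modeling
features <- weekly %>% tk_augment_timeseries_signature(date)
```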
Jon Krohn: 17:56
Nice. So let’s give you a little bit of context on some of the items that you’ve mentioned. So RStudio, huge commercial player in the R space, probably the biggest. They have created an IDE. The development environment for the building and running R models, including R Shiny, web apps and it all functions very nicely, that’s available for free. I think they, I guess they monetize by providing support, commercial support if you need it, helping you get things up on servers, that kind of thing. 
Matt Dancho: 18:30
Yeah. RStudio, and they’re actually getting pretty big into Python now too, to be honest. And I know that’s probably just the data science market tends to be very heavily, a lot more typically Python users than R users. But I will tell you this, they have noticed that and they’re expanding the support, but their big thing is the RStudio IDE, integrated development environment. That’s where I learned to do data science. It’s very nice, very user-friendly, it’s probably the best data science IDE out there. I’ve tried them all, Jupiter notebooks, Labs, VS Code. I’m creating a Python course right now with VS. RStudio just makes things so easy. They do. Big commercial player. They also spend probably a good bit of their resources on developing open source. Without them, R probably wouldn’t exist. It’d be all Python to be honest. But- 
Jon Krohn: 19:39
Hadley Wickham? [crosstalk 00:19:40] 
Matt Dancho: 19:40
Hadley Wickham works for RStudio. 
Jon Krohn: 19:41
Yeah, he was a guest on the SuperDataScience Podcast [crosstalk 00:19:45]. 
Matt Dancho: 19:45
Oh, you guys had Hadley. All right. 
Jon Krohn: 19:46
Yeah. 
Matt Dancho: 19:48
He’s a personal friend of mine. Just small story about him back when I was first starting to attend some conferences, back when I first started Business Science, he actually reached out to me and we had lunch at one of the conferences. I couldn’t believe, I was star struck the whole time because this guy, literally he’s the person who developed dplyr, and he also developed GGPlot [inaudible 00:20:14]. 
Jon Krohn: 20:15
Yeah. We’d mentioned both of those. Maybe we should talk about them quickly. So dplyr allows you to pipe together a series of functions. If you’re familiar with piping things together in a Unix terminal, it allows data processing to be so easy and clean and intuitive in a way that prior to dplyr, you had to have either separate lines for each step in the process, or really ugly nested parentheses of functions. So dplyr, beautiful for that. GGPlot, GG is grammar of graphics. It’s based on a book called Grammar of Graphics. Hadley Wickham used that book as the inspiration for coming up with a really comprehensive way of plotting. 
Matt Dancho: 21:07
Yes. It’s really those two packages alone, they’re two of the most, they’re definitely in the top 10 most downloaded packages, probably the top three or five which shows you how popular they are, but really it’s just a new way of doing data science. He basically created a structure that makes it almost like writing a book, like writing a paragraph of text, Hey, start with my data, group it on this particular category, select these columns, summarize this part of that data. So it’s all like, it’s words. It’s not like you don’t feel like coding. You feel like you’re writing down something. Then the GGPlot syntax, like you said, extended the grammar of graphics, but that it’s the way I learned to plot GGPlot to create a GGPlot at a scatter plot or at a smoother to it and type for Time Series. It’s wonderful. 
Jon Krohn: 22:20
All right. So Hadley Wickham, inventor of these brilliant R packages. He invites you out to lunch at a conference, and then… 
Matt Dancho: 22:30
Yeah. This is a guy who’s basically the face of R, the programming language that I have grown to love. He invites me out to lunch. We have a great conversation. I got to learn a lot more, but he really just wanted to learn about what I was doing. I just couldn’t believe that somebody so prolific was willing to take time out of his day to sit down with me, little old me, who made one small R package that he saw and that he knew about. That package is called tidyquant — a financial package. That package has actually now been downloaded, I think, around 600,000 or 700,000 times. So it’s grown too, but- 
Jon Krohn: 23:20
What did it do? We’re not going to [crosstalk 00:23:22] too much. 
Matt Dancho: 23:23
It’s related to Time Series, but it’s more for financial analysis. It integrates some of the stuff that’s out there, like the XTS, the quantmod. It makes it tidy. So ports that functionality over to the tidyverse. 
Jon Krohn: 23:38
Yeah. I guess we haven’t mentioned this yet for people that aren’t aware of the tidyverse, the whole point of it, it’s a collection of packages that makes data tidy. A specific format makes it easy to read, makes it easy to work with with other packages from the tidyverse. 
Matt Dancho: 23:56
Yes. Yeah. It’s a whole ecosystem and it’s a movement too. It’s really like people, you’re either based R or you’re tidyverse. I’m more tidyverse because that’s just how I learned dplyr came out right when I was starting to learn R and that package, it changed me. It changed me as the whole way I thought about data. I was just like, yes, this is how data is meant to be cleaned and- 
Jon Krohn: 24:26
I’m so old that when I learned R, none of these tidy things existed, but I can confidently say having made the switch, that life is way better in the tidyverse than outside of it. So you don’t have to convince me. All right. So anyway, Hadley Wickham invites you to lunch on the basis of your tidyquant package. I guess that’s the story? 
Matt Dancho: 24:49
That’s pretty much it. At the end of the day, he’s a very popular guy. He’s somebody who I looked up to, and I still look up to. He’s just willing to take his time out of his day just to take 30, 45 minutes and learn about what I’m doing and why I’m doing it. That really, I think it’s that sense of community. If you look at the broader R community, that’s like a representation of what I love about it, is that people are so willing to help you. All you got to do is reach out. He reached out to me through Twitter, said, “Hey, what are you doing? DM me.” I messaged him and said, “Holy cow, you’re Hadley Wickham, thanks for reaching out.” And then he says, “Hey, let’s do [crosstalk 00:25:43].” 
Jon Krohn: 25:43
Holy cow. [inaudible 00:25:43]. I don’t know if you knew. 
Matt Dancho: 25:47
I can’t believe it. I got pretty excited. I was actually out in San Francisco getting ready for the event. I was with some of my buddies and I was actually at a pub and I pretty much screamed. I’m like, “Holy crap. My buddies, you all got to get over here. Check out this.” And they’re like, “Who the heck is this guy?” They’re not data scientists, so they don’t know. But- 
Jon Krohn: 26:17
I understand, that would be a big deal. Hadley Wickham, if you’re listening, if you ever want to reach out to me, I will scream with delight. So there you go. Hadley Wickham — I’ve actually seen him a few times. I saw him at a conference, the American Statistical Association’s main statistical conference; like 80,000 people go, it’s something crazy. It was at this huge convention center in Chicago. He was at an RStudio booth and I was presenting a poster not too far away. I saw that he was standing at the RStudio booth and we made eye contact, and I got too shy and just walked away. 
Matt Dancho: 27:04
Oh, he’s very friendly and he’s always willing to chat some more. 
Jon Krohn: 27:08
But I was just like, what am I going to say to a celebrity like that? The other time, he came and spoke at a meetup in New York, the Open Statistical Programming meetup, and it was amazing. I can highly recommend, if you have the opportunity, seeing Hadley Wickham talk in person — or even just looking at some of his talks online. Really great talks, amazing pacing, lots of interactivity with the audience. He’s someone to look up to in so many ways: productivity as an R package developer, software developer in general, but also his presenting skills. What can’t he do? 
Matt Dancho: 27:50
Yes. What always struck me is he just has a clear vision of what the future looks like for R. And whether or not he truly knows it, it’s like a lot of that stuff seems to come to fruition and it’s crazy. He’s a visionary. All right [crosstalk 00:28:11]. 
Jon Krohn: 28:11
Yeah. All right. [crosstalk 00:28:12]. We’ve got to get back to Time Series analysis. Anyway, so I guess we were talking about timetk. What does timetk stand for? What does the TK stand for? 
Matt Dancho: 28:21
It’s toolkit. Time Series Toolkit. I like to keep my R packages, their names pretty short or pretty obvious. Including the time, so you get the sense of what it’s used for, and then TK to give it the toolkit. 
Jon Krohn: 28:44
Yeah, go ahead. 
Matt Dancho: 28:45
Oh, I was just going to say, just like dplyr is like pliers — data pliers — this is the Time Series Toolkit. 
Jon Krohn: 28:53
Nice. We’ve already talked about what it does a bit, so it allows us to prepare data for Time Series analysis, but also lots of other cool things. Visualization, which could be interactive or not in plotly or- 
Matt Dancho: 29:07
GGPlot. 
Jon Krohn: 29:07
GGPlot. It would be non-interactive. Data wrangling, and also pre-processing recipes. That’s where we got to, and then I completely sidetracked us with RStudio and how we all look up to them and all that. But where you were probably wanting to go was to then talk about Modeltime: once you have your data prepared with timetk, you can then put it into Modeltime and actually apply machine learning to the Time Series data. 
Matt Dancho: 29:32
Yeah. The fundamental principle around Modeltime — and I spent about a year or so developing the framework for it. The reason is that as I was going through my course, I was doing a ton of research, and what I was finding is that the methods being used to win some of these Kaggle competitions were often not what was being taught in academia, and certainly not what was available through some of the forecasting frameworks that had already been developed in R. There are these forecasting competitions that I was reviewing, many of them on Kaggle and whatnot. They all developed around problems that are pretty similar to what businesses are now facing, which is: they’ve got a big time series data set. 
Matt Dancho: 30:34
This could be anything with a timestamp in it, but you’ve got lots of products, you’ve got lots of customers, you’ve got just tons and tons of data. The problem is that everyone assumes you’ve got a data set and you’re just going to run one model. Well, that doesn’t solve this problem that businesses are facing, where they’ve got thousands and thousands of customers or products — there’s a scalability piece. I did a ton of research, and I found that Kaggle competitions were the closest thing out there, and what was winning them was really two different methodologies. 
Matt Dancho: 31:15
The first one was related to machine learning: basically utilizing machine learning algorithms to build not one model for each product, but rather to process all the products together and then develop a forecast based on all of the products at once. So you have one model for many Time Series. That’s what I set out to do with Modeltime — to come up with a framework that made it easier to do this machine learning side of forecasting and integrate it into a framework that allowed us to visualize, and to walk us through a best-in-class workflow where you actually test on out-of-sample data. A lot of these packages that were written in R utilize in-sample metrics. To me, that’s just not how it’s done. That’s not how you really measure and test and get true confidence about your forecast. You’ve got to test on out-of-sample data and you’ve got to walk through a solid — well, I call it best practices. I guess that’s the right word for it. 
Matt Dancho: 32:27
Basically, a structured approach for how you should test these models. Where Modeltime came in was: I could develop one algorithm or whatever, but really the strategies that were winning these competitions weren’t just one algorithm. They were running a bunch of different algorithms and experimenting, rapidly trying to figure out which algorithms they should include and which they shouldn’t, which showed promise. And then the idea hit me. I’m like, oh my gosh, this is what we need in R: a straightforward framework where it doesn’t take a whole lot of code to run through all these different algorithms. And then I started getting ideas for combining the algorithms together — ensembling — because, as I was doing more and more research, that ended up being a winning strategy, and so on. That’s how Modeltime came to be. So it’s really just a forecasting framework that enables us to use machine learning models; it also does ARIMA and Prophet and some of those other types of models, but like… 
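As a hedged sketch of that workflow — fit several models quickly, then judge them on out-of-sample data — assuming a tibble `sales` with `date` and `value` columns:

```r
library(tidymodels)
library(modeltime)
library(timetk)

# Hold out the last three months as out-of-sample test data
splits <- time_series_split(sales, assess = "3 months", cumulative = TRUE)

m1 <- arima_reg()   %>% set_engine("auto_arima") %>% fit(value ~ date, training(splits))
m2 <- prophet_reg() %>% set_engine("prophet")    %>% fit(value ~ date, training(splits))

modeltime_table(m1, m2) %>%
  modeltime_calibrate(testing(splits)) %>%  # score on the held-out window
  modeltime_accuracy()                      # compare RMSE, MAE, etc. per model
```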
Jon Krohn: 33:37
What is the, for the audience, ARIMA, A-R-I-M-A? What does that stand for? 
Matt Dancho: 33:44
Yeah. It’s Auto Regressive Integrated Moving Average. And it’s basically, think of it like a linear regression, but that uses only lags of your Time Series. That’s how that algorithm works. The problem with that is when you bundle it in a forecasting software, it becomes a univariate analysis because it’s got a time dependency and you can’t scale that up. But Modeltime O5O, which was just released yesterday on CRAN includes a recursive function that does auto regression. It allows us to do basically the same technique that ARIMA does, but you can use any different algorithm. So you can use extra boost, you can use GLMNET, which is an elastic net. You can use anything. It’s awesome. A lot of best-in-class techniques, a lot of things that are built based on what the needs are and the demands are nowadays and how forecasting is done. 
Matt Dancho: 34:57
The other piece of it — so we talked about machine learning — deep learning is the other one from competitions. Deep learning is a big deal. In most business data, deep learning hasn’t really made an impact, but Time Series is one area where it’s having a huge impact. In the Wikipedia web traffic forecasting challenge, deep learning won by like a mile. Also, the M4 competition, which is another big forecasting competition — deep learning ended up winning that, and that’s also where this newer algorithm called N-BEATS got some fame. So deep learning is the other thing that I started to integrate into Modeltime.
Jon Krohn: 35:50
So, in addition to timetk for pre-processing and Modeltime, the main library for applying machine learning to time series data, there are four additional packages in the Modeltime ecosystem. The one that I think you’re going to want to talk about right now is Modeltime GluonTS, which uses deep learning. I’m really interested to hear about this. So these Time Series competitions — what kinds of deep learning models are they using? Like recurrent neural networks? 
Matt Dancho: 36:18
Yes. Basically, what they’re doing is using variants of LSTM, which is a recurrent neural network. In the Time Series competitions, they’re building them from scratch, but there’s a little company called AWS, or Amazon. They decided to take these deep learning algorithms that normally take a lot of code to put together, and they started making variations. They have this system — have you ever heard of MXNet? 
Jon Krohn: 37:00
Sure. 
Matt Dancho: 37:01
Okay. So MX, and that’s one of the big three, we’ll call them deep learning frameworks. But it’s on another level. 
Jon Krohn: 37:09
Yeah. Well, [crosstalk 00:37:11]. It would depend. And then if you’re saying that, then you’re considering Keras not to be a separate. 
Matt Dancho: 37:18
Right. TensorFlow is the other one. And then PyTorch. 
Jon Krohn: 37:21
TensorFlow, Keras and PyTorch are head and shoulders above in terms of popularity. I do a talk, maybe once a quarter at a conference, on PyTorch, Keras and TensorFlow — which should you be using? Every time I do that, I go and look at Google search popularity for the big libraries, and it’s crazy: TensorFlow, Keras and PyTorch are head and shoulders above MXNet. And then I guess CNTK would be something I’d throw in there as well. 
Matt Dancho: 37:50
Yes, MXNet really doesn’t get a whole lot of usage. But I will say this: GluonTS is something that they’ve developed, whereas PyTorch, TensorFlow and some of these others haven’t really developed anything specific to time series. 
Jon Krohn: 38:11
Nice. 
Matt Dancho: 38:11
So I found out about GluonTS, and what GluonTS does is it integrates N-BEATS, which is one of the bigger algorithms that has shown promise in forecasting competitions, and they have DeepAR, which is surprisingly good. 
Jon Krohn: 38:30
All right. So first, spell the… What’s the deep, I hadn’t heard that. 
Matt Dancho: 38:35
Okay. Yeah. So N-BEATS. 
Jon Krohn: 38:38
N-BEATS. Okay.
Matt Dancho: 38:39
So, the letter N — neural — and then BEATS. I’ll be honest, I don’t remember what each of the letters stands for. I’m sure it’s something, but it’s like- 
Jon Krohn: 38:51
N-BEATS. 
Matt Dancho: 38:53
Yeah. It’s like an ensemble of deep learning LSTMs — they’ve got some special concoction that they’ve implemented and termed N-BEATS. And then there’s DeepAR; that’s the other algorithm in Modeltime GluonTS. So I have those two big ones. There are more algorithms in the GluonTS package, but those are probably the two most popular. In terms of results: I have several consultants in my courses. I have a community where I’ve actually started getting pretty close with some of my students — the ones that show promise, I’ve started to groom some of them for software development. 
Matt Dancho: 39:41
But the consultants in there who are using DeepAR for these heavy-duty time series projects — when I say heavy-duty, I mean 10,000-plus time series — they have to use Azure, and they’ve got cloud resources and compute and GPUs that they can put behind it. They’re using my Modeltime software and the linkages to GluonTS, and they’re having really good results there; they’re also having good results with XGBoost. It depends on the Time Series, but they’re having some pretty good successes with these strategies, and they’re the same strategies that are winning the Kaggle competitions. 
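A hypothetical sketch of how the modeltime.gluonts wrapper exposes DeepAR (the Python GluonTS backend must be installed first, e.g. via install_gluonts(); argument values here are illustrative):

```r
library(modeltime.gluonts)
library(tidymodels)

# DeepAR fits one global model across many series, identified by an `id` column
fit_deepar <- deep_ar(
  id                = "id",   # column naming each time series
  freq              = "D",    # daily observations
  prediction_length = 30,     # forecast horizon
  epochs            = 5
) %>%
  set_engine("gluonts_deepar") %>%
  fit(value ~ date + id, data = training(splits))
```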
Jon Krohn: 40:24
That is super cool. Can I take a quick second to try to summarize what you’ve been saying here, just to make sure I’m getting it right, which will probably help the listener too? 
Matt Dancho: 40:33
Absolutely. 
Jon Krohn: 40:34
N-BEATS and DeepAR — these are two approaches to building time series neural network models that are part of the Amazon Web Services (AWS) GluonTS package. 
Matt Dancho: 40:52
Yes. 
Jon Krohn: 40:54
I’m getting head nods. So far I’m on track. So included within these deep learning time series approaches, we’re typically using a variant of recurrent neural networks called LSTM. So we set LSTM a few times, but I just want to spell out, spell it as the opposite of what I want to say. I want to read out what LSTM stands for, which is Long Short Term Memory units. The U doesn’t get a letter. 
Matt Dancho: 41:22
Yeah. 
Jon Krohn: 41:24
But LSTMs are a specialized recurrent neural network that allows these networks to take in information from many time steps back in a relatively efficient way. I just wanted to read all that back in case a listener is listening who isn’t aware of RNNs and LSTMs. Cool. I’m also glad that I understood all the linkages you’ve been talking about. 
Matt Dancho: 41:53
Yeah. The cool thing about LSTMs is they’re basically, I read somewhere that they’re like an ARIMA on steroids. An ARIMA uses lags as your features. LSTMs do something similar where they’re just taking a look backwards in time and using that sequence to decide how that path should look into the future. They’re just a sequence algorithm, but they do a really good job on Time Series. They also do a pretty good job at predicting future text. 
Jon Krohn: 42:29
Yeah. Which is a Time Series — we don’t call natural language processing “time series analysis,” but it’s hugely similar. Whether we’re talking about the sound of a voice or words on a page, they occur in a sequence, so RNNs are often very popular in that space as well. 
Matt Dancho: 42:46
Yeah. 
Jon Krohn: 42:48
Okay, awesome. We’ve got timetk for pre-processing data, Modeltime for quickly trying out machine learning models, applying machine learning models to the Time Series data, the MT Gluon package that you made specifically allows us to apply deep learning models, including these super powerful GluonTS provided approaches like N-BEATS and DeepAR. I’d love to talk about your MT ensemble package next, because I know that ensembles are another great modeling technique for getting top results. So do you want to tell us quickly what an ensemble is and what the MT ensemble package allows us to do? 
Matt Dancho: 43:28
Sure. So Modeltime Ensemble — basically, what we’re trying to do is figure out a way to combine different forecasts. In terms of results, ensembling was used in the most recent M5 competition, which analyzed Walmart store sales. It was a hierarchical dataset. What that means is that a store has many products, which fall into different departments, and each store has multiple departments, so there are different dependencies. The thing that competition showcased was: you don’t just pick one algorithm and run it — or if you do pick one algorithm, you’ve got to try it in different situations, with different features and whatnot. And then you average the results, or you combine the results somehow. That’s what Modeltime Ensemble does. 
Matt Dancho: 44:26
In its simplest form, there’s one function — I believe it’s called ensemble_average — and all it does is take, if you’ve got five different models, those five models and their predictions, and average them for you. But what’s cool about it is it handles all of that — the averaging process, managing the models — beautifully for you. So you don’t have to worry about thinking, oh, am I putting this into a table now, do I have to write some code to summarize it? No, it takes care of it for you, and out pops a nice, pretty plot. That’s basically Modeltime Ensemble in a nutshell. 
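A hedged sketch of that flow, reusing the hypothetical `models_tbl`, `splits` and `sales` objects from the earlier sketches:

```r
library(modeltime.ensemble)

models_tbl %>%
  ensemble_average(type = "mean") %>%   # or type = "median"
  modeltime_table() %>%
  modeltime_calibrate(testing(splits)) %>%
  modeltime_forecast(new_data = testing(splits), actual_data = sales) %>%
  plot_modeltime_forecast()             # "out pops a nice, pretty plot"
```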
Jon Krohn: 45:14
And then, to round things out — we’ve gone over four of your six Modeltime ecosystem packages. The next of the remaining two is MT H2O, which integrates AutoML, another great way to get top-performing results. Tell us a little bit about AutoML and how you implement it. 
Matt Dancho: 45:36
Sure. H2O is this company out there that has this amazing algorithm called AutoML. What it does is automated machine learning, and there are a lot of automated machine learning packages — you might’ve heard of TPOT, that’s one of them. I think even Google came out with its own AutoML, but that’s more for deep learning types of projects. What H2O does is work really well on tabular data, meaning spreadsheet-style, normal business data. Most of it comes out of a SQL database, so you’re going to be in a 2D structure. That’s essentially how we forecast in Modeltime. What’s cool is, since they’ve already taken care of the automated part, all we have to do is integrate it as a backend. So it’s actually pretty simple. 
Matt Dancho: 46:32
We just take their algorithm and add it as a new engine, so to speak, and fit it into our framework. Once that happens, you can forecast with it. So you can actually forecast with automated machine learning. To give you a little background on the nuts and bolts behind H2O’s AutoML: they implement XGBoost and random forest, they implement elastic net, which is the GLM, and then deep learning as well. Now, the thing is that H2O is designed for high performance. It’s actually written in Java, which is very fast, and that’s the nice thing about it — it’s super scalable. 
Matt Dancho: 47:22
When we’re talking about 10,000 plus different time series and millions and millions and millions of rows, it’s going to be able to handle it and it can be run in the cloud. It doesn’t need to be run on your computer. You just change your cluster and you point it to your H2O cluster that’s running in the cloud. That’s really the massive benefit that you get there, is productivity-wise, it’s automated, just great. You just run it overnight or you run it for 30 minutes or whatever. It comes up with some good models. And then it also does some automated stacking too, which is an ensemble approach. It just gives us another backend that we can integrate into the Modeltime system. 
Jon Krohn: 48:10
Beautiful. That leads us to the sixth and final package in the Modeltime ecosystem, which is MT — Modeltime Resample. I’ve been saying MT as a short form, but… 
Matt Dancho: 48:24
No, no, it’s fine. Yeah, Modeltime Resample is designed for your back testing. Basically, what you want to do in machine learning, not necessarily Time Series, we don’t normally implement this resampling process, but with machine models, you’re always doing cross validation. You’re always trying to say, Hey, I’ve done my five-fold cross validation on average. My RMSE is XYZ, some number. That’s what this does now, is takes your time series and splits it up and then you don’t lose track of that sequence, but it helps us resample it and just put it through a workflow where you’re doing that same. It’s not called five-fold. It’s actually a different technique because you have to keep the samples altogether. But it does what’s called time series cross validation. 
Matt Dancho: 49:28
The goal is to understand how stable your models are. Do they stand the test of time when you take different sections and predict the next 12 observations, or 24, or whatever it is? It helps manage some of that stuff. But at the end of the day, it’s about giving you the tools — not just the machine learning and the cool algorithms, but also the tools to make sure your predictions are robust, that you’re going to get good results, and that you can be confident in the forecasts you’re making. 
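A hedged sketch of that backtesting workflow, with timetk’s time_series_cv() supplying the rolling time-series splits and the hypothetical `models_tbl` from before:

```r
library(modeltime.resample)
library(timetk)

# Six rolling splits, each assessed on the three months after its training window
resamples <- sales %>%
  time_series_cv(assess = "3 months", skip = "3 months", slice_limit = 6)

models_tbl %>%
  modeltime_fit_resamples(resamples = resamples) %>%
  modeltime_resample_accuracy()   # average accuracy across slices = model stability
```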
Jon Krohn: 50:05
Yeah, which is hugely important — obviously of utmost importance if you were building a financial prediction, but in any kind of time series prediction, you’re going to want to make sure that you’re not overtrained on — overfit to — the training data that you have, and that your model is likely to perform well on data it hasn’t seen before, like data in the future. 
Matt Dancho: 50:25
Right. 
Jon Krohn: 50:27
Amazing. Thank you so much, Matt, for this tour of the Modeltime ecosystem, as well as some backstory behind the evolution of it. This is a hugely useful set of R packages and I hope to be able to get my hands on it to tackle a time series problem soon myself. One last question before I let you go: do you have a book recommendation for us?
Matt Dancho: 50:49
Yes. The book that I would recommend — honestly, one of the more influential books that I’ve read — is the Steve Jobs biography. I think it’s just cool. I’ve always been interested in entrepreneurship, and I’ve built my own data science business. I’m also looking at other businesses to build as well. For anybody who’s interested or fascinated by technology, just understanding how Steve grew with Apple, even in the pre-Apple days, and understanding more about him — I view him as a once-in-a-lifetime entrepreneur. It was a pretty impactful book for me. That’s my recommendation: the Steve Jobs biography.
Jon Krohn: 51:41
That’s a great one. If people want to stay in touch with you, if they want to follow you to learn about the latest in Time Series libraries, Time Series models, maybe even some R Shiny, what’s going on with Hadley Wickham, how can they follow you and keep track of what you’re doing? 
Matt Dancho: 52:01
I’d recommend LinkedIn, just reach out to me there. You can follow me on LinkedIn. It’s just Matt Dancho. Also, I’m pretty active on Twitter too. So if you’d like staying in touch with, getting the latest R tips, that’s my big thing. I’m always tweeting about what I’m learning. You’ll get that my Twitter handle is @M, M my initial, my last name, Dancho and then 84. The Time Series course is also another way, if you really want to learn Time Series, definitely I would recommend checking out my High Performance Time Series course. It’s gotten rave reviews and it’s also been my fastest growing course just in the short span that it’s been out. 
Jon Krohn: 52:55
Beautiful. I can’t imagine how valuable that would be to check out. All right. Thank you so much, Matt. It’s been a pleasure to have you on, I hope we will have you on the SuperDataScience program again very soon. 
Matt Dancho: 53:06
Absolutely. It’s been a pleasure and thanks for having me, Jon. I hope to be back sometime and until then, have a great day and keep on doing what you’re doing. 
Jon Krohn: 53:18
Nice. Thanks, Matt, appreciate it. 
Jon Krohn: 53:25
Holy smokes, did I ever enjoy that conversation with Matt! And boy, did I ever learn a lot about modern approaches to Time Series analysis — I wish I had known everything we discussed today back when I was a trader at a hedge fund. In today’s episode, we covered R Shiny for quickly creating web apps involving data presentation, the timetk package for pre-processing data for Time Series analysis, the Modeltime package for quickly trying out machine learning models on your Time Series data, and the other packages in Matt’s Modeltime ecosystem for cutting-edge approaches like long short-term memory units with AWS Gluon, ensembles of models, H2O.ai’s AutoML and back-testing models via resampling. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show and the URLs for Matt’s LinkedIn profile and Twitter profile at www.superdatascience.com/463. That’s www.superdatascience.com/463. 
Jon Krohn: 54:30
If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on YouTube, where we have a high-fidelity video version of this episode. I also encourage you to follow or tag me in a post on LinkedIn or Twitter, where my Twitter handle is @JonKrohnLearns, to let me know your thoughts on this episode. I’d love to respond to your comments or questions in public and get a conversation going. You’re also welcome to add me on LinkedIn, but it might be a good idea to mention you were listening to the SuperDataScience Podcast so that I know you’re not a random salesperson. 
Jon Krohn: 55:05
A reminder that starting with episode 465 next week, we will begin releasing guest episodes on Tuesday mornings, New York time. Historically, we’ve released on Wednesday evenings, but by releasing 36 hours earlier, we’ll be giving you two more morning commutes in your week to enjoy the episode. I can’t think of any downsides to this change, but I didn’t want you to be caught off guard. All right, thanks to Ivana, Jaime, Mario and JP on the SuperDataScience team for managing and producing another incredible episode today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 