SDS 582: Model Speed vs Model Accuracy

Podcast Guest: Jon Krohn

June 8, 2022

Welcome back to the Five-Minute Friday episode of the SuperDataScience Podcast!

This week, our three-part series on strategies for extracting business value out of machine learning comes to an end. In this episode, Jon recommends starting with simple models and reminds us that model speed could be more important to your users than accuracy.

For Five-Minute Friday a fortnight ago, Jon covered his first strategy, which involved identifying a commercial problem before starting data collection or ML model development. Then, last week we dug into the data collection process that should follow.
Today’s episode is all about the steps that come after data collection. If you’re collecting more and more labeled data gradually, then you’re better off using a simple model to start. This also enables you to confirm whether there’s any valuable signal in the data you’ve collected so far. So, to start, you might use a simple logistic regression model with a handful of manually-curated features you’ve come up with as your model inputs.
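As a rough illustration of that starting point, the sketch below fits a logistic regression on two hand-made features and checks whether it beats always guessing the majority class on held-out data. Everything in it is an assumption for illustration rather than something from the episode: the synthetic data, the function names, and the learning rate and epoch settings are all hypothetical. In practice a library such as scikit-learn would normally replace the hand-written training loop.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    # Probability of the positive class for one feature vector.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    preds = [1 if predict_proba(w, b, xi) >= 0.5 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def make_synthetic(n, rng):
    # Hypothetical data: two hand-picked features, only the first carries signal.
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([x1, x2])
        y.append(1 if x1 + rng.gauss(0, 0.5) > 0 else 0)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(400, rng)
X_test, y_test = make_synthetic(200, rng)

w, b = fit_logistic_regression(X_train, y_train)
model_acc = accuracy(w, b, X_test, y_test)

# Baseline: always predict whichever class is more common in the test set.
majority_acc = max(sum(y_test), len(y_test) - sum(y_test)) / len(y_test)

print(f"logistic regression accuracy: {model_acc:.3f}")
print(f"majority-class baseline:      {majority_acc:.3f}")
```

If the simple model cannot clearly beat the majority-class baseline on held-out data, that is a hint the features collected so far may carry little signal, which is worth knowing before investing in anything more complex.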
As you collect larger amounts of data — the exact amount depends on the particular problem you’re solving and how much signal there is in the data relative to noise — you can start experimenting with more complex ML models, such as deep learning models, which can automatically learn the most important features in the data and can outperform simpler models with respect to accuracy. Generally speaking, you might need at least tens of thousands of data points to meaningfully make use of deep learning. For some problems, you might need millions — or, in extreme cases, even billions — of data points for a large deep learning model to demonstrate its worth and capability.
One final point that Jon makes is that when you deploy your ML model into a production system, speed is almost always more important than accuracy. Users have become accustomed to receiving the results of their query in seconds or less. The model needs to be accurate enough that users are satisfied with the results they get, and a small accuracy gain from a much larger model may be imperceptible to them in practice. Waiting ten seconds instead of one second for a result, however, will definitely be perceptible.
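One way to make that trade-off concrete is to time each candidate model and only accept a slower one when its quality gain is large enough to matter. The sketch below is a minimal example under stated assumptions: `median_latency`, `choose_model`, the two stand-in "models", the quality scores (0.80 and 0.82), the latency budget, and the minimum lift are all hypothetical placeholders, not measurements or a method from the episode.

```python
import statistics
import time


def median_latency(predict_fn, example, repeats=5):
    """Median wall-clock seconds for a single prediction call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(example)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def choose_model(candidates, latency_budget_s, min_lift):
    """Prefer the fastest model within the latency budget, unless another
    in-budget model beats it by at least min_lift on the quality metric."""
    within_budget = [c for c in candidates if c["latency_s"] <= latency_budget_s]
    if not within_budget:
        # Nothing meets the budget: fall back to the fastest option overall.
        return min(candidates, key=lambda c: c["latency_s"])
    fastest = min(within_budget, key=lambda c: c["latency_s"])
    best = max(within_budget, key=lambda c: c["quality"])
    return best if best["quality"] - fastest["quality"] >= min_lift else fastest


# Hypothetical stand-ins for a simple model and a much larger one.
def simple_model(x):
    return 0.5


def large_model(x):
    time.sleep(0.05)  # simulated extra inference time, not a real measurement
    return 0.5


candidates = [
    # "quality" values are made-up placeholders for a metric such as ROC AUC.
    {"name": "simple", "quality": 0.80, "latency_s": median_latency(simple_model, None)},
    {"name": "large", "quality": 0.82, "latency_s": median_latency(large_model, None)},
]

chosen = choose_model(candidates, latency_budget_s=1.0, min_lift=0.05)
print("chosen model:", chosen["name"])
```

The right budget and minimum lift depend on the product. The point is only that latency is measured and weighed alongside the accuracy metric rather than ignored.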
DID YOU ENJOY THE PODCAST?
  • Do you know what’s most important to your users? Speed or accuracy? Have you been prioritizing model accuracy over speed when building your models?

Podcast Transcript

(00:03):
This is Five-Minute Friday on Model Speed vs Model Accuracy.

(00:19):
This episode is the final episode in a three-part series on strategies for getting business value out of machine learning. For Five-Minute Friday a fortnight ago, I covered my first strategy, which is being confident that there’s a commercial problem to solve before starting data collection or ML model development. Then, for Five-Minute Friday last week we dug into the data collection process after you’ve decided on a commercial problem to solve.
(00:43):
Today’s episode is all about what you do once you’ve collected some data. If you’re collecting more and more labeled data gradually, then to start you’re likely best off using a simple model. This also enables you to confirm whether there’s any valuable signal in the data you’ve been collecting so far. So, to start, you might use a simple logistic regression model with a handful of manually-curated features you’ve come up with as your model inputs.
(01:07):
Later, as you start to collect lots of data (the exact amount depends on the particular problem you’re solving as well as how much signal there is in the data relative to noise), it could eventually start to make sense to experiment with more complex ML models, such as deep learning models, which can automatically learn the most important features in the data and can outperform simpler models with respect to accuracy. Speaking generally, you might need at least tens of thousands of data points to meaningfully make use of deep learning. For some problems, you might need millions, or in extreme cases even billions, of data points for a large deep learning model to demonstrate its worth and capability.
(01:48):
This brings me to my next point, which is that when you deploy your ML model into a production system for users to use, speed is almost always more important than accuracy. Users have become accustomed to receiving the results of their query in seconds or less. If you decide to use a massive deep learning model over a regression model because it provides a small lift in model accuracy, but it takes ten seconds to provide an output to your user, then you might be better off using the regression model. The model needs to be accurate enough that users are satisfied with the results they get. But, to use a touch of technical jargon, if your massive deep learning model only offers a few percent more of the area under the ROC curve relative to your regression model, that difference may be imperceptible to your users in practice. Waiting ten seconds instead of one second for their result, however, will definitely be perceptible.
(02:43):
All right, so there you have it. That is my third and final part of this three-part series on strategies for getting business value out of ML: start with simple models and don’t get too hung up on model accuracy in production — model speed could be more important to your users.
(03:01):
Cool, well I hope you enjoyed this three-part series, I had a fun time making it. Let me know what you think and keep on rockin’ it out there, folks, I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 