(00:05):
This is Five-Minute Friday on Collecting Valuable Data.
(00:19):
This episode is the second in a three-part series on strategies for getting business value out of machine learning. For Five-Minute Friday last week, I covered my first strategy, which is being confident that there’s a commercial problem to solve before starting data collection or ML model development. In today’s episode, we dig into the data collection process that comes next, once you’ve decided on the commercial problem you’d like to solve with a machine learning model.
(00:43):
Generally speaking, labeled data are going to be much more valuable commercially than unlabeled data. If you have only unlabeled data, such as a collection of images or a body of natural language, you can still carry out exploratory data analyses on them or run them through what we call unsupervised machine learning models, which may enable you to uncover some hidden structure in the data. Labeled data, on the other hand, are data where the outcome of interest is already known: where particular images are known to be dogs and others are known to be cats, say, or where particular passages of natural language are known to express positive sentiment and others negative sentiment. Having labeled data like these enables you to train an ML model that predicts an outcome that could be commercially useful to your target user.
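To make that distinction concrete, here is a minimal sketch in Python using scikit-learn on a tiny invented dataset: with no labels we can only look for structure, for example by clustering, whereas with labels we can train a model that predicts an outcome for new examples. The feature values and labels below are made up purely for illustration.

```python
# A minimal sketch contrasting unlabeled and labeled data with scikit-learn.
# The feature values and labels are invented purely for illustration.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy features: imagine two numeric measurements extracted from each image.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]

# Unlabeled data: unsupervised learning can still uncover hidden structure,
# e.g. grouping the examples into clusters.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("Cluster assignments:", clusters)

# Labeled data: with labels (say, 0 = cat, 1 = dog) we can train a supervised
# model that predicts the label for new, unseen examples.
y = [0, 0, 1, 1]
classifier = LogisticRegression().fit(X, y)
print("Prediction for a new example:", classifier.predict([[0.85, 0.9]]))
```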
(01:31):
The catch is that while labeled data tend to be more valuable than unlabeled data, they are also typically trickier to obtain. It can be relatively straightforward to write a scraping algorithm that crawls the web and downloads images or stores text for you; devising an automated way to assign labels to the data you scrape takes more ingenuity, but it can often be done. For example, let’s say you wanted to build an ML model that could predict whether a movie review is positive or negative. In that case, you could scrape the natural language of movie reviews from IMDB, the Internet Movie Database, and use the star ratings on IMDB to infer labels. When people provide movie reviews to IMDB, they also provide a star rating out of ten. You could decide to label any review with eight or more stars as a positive review, while any review with three or fewer stars you could label as a negative review.
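As a rough sketch of that labeling rule, the snippet below maps hypothetical scraped reviews with star ratings to sentiment labels: eight or more stars becomes positive, three or fewer becomes negative, and anything in between is discarded. The review text here is invented for illustration and stands in for whatever you would actually scrape.

```python
# A minimal sketch of inferring sentiment labels from star ratings.
# The reviews below are invented stand-ins for scraped data; the thresholds
# follow the 8-or-more / 3-or-fewer rule of thumb described above.
scraped_reviews = [
    {"text": "An absolute triumph from start to finish.", "stars": 9},
    {"text": "I walked out halfway through.", "stars": 2},
    {"text": "Fine, but forgettable.", "stars": 6},  # ambiguous: discarded
]

def infer_label(stars):
    """Map a star rating (out of 10) to a sentiment label, or None if ambiguous."""
    if stars >= 8:
        return "positive"
    if stars <= 3:
        return "negative"
    return None  # middling ratings are dropped rather than guessed at

labeled_data = [
    (review["text"], infer_label(review["stars"]))
    for review in scraped_reviews
    if infer_label(review["stars"]) is not None
]
print(labeled_data)
```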
(02:27):
So there you go. You can sometimes automate the addition of labels to your data. However, if you can’t think of a way to automatically apply labels to your data, you still have options. These options could be more time-consuming or expensive, but they could also result in the creation of a unique, proprietary dataset that differentiates your ML model, and the commercial product it’s a part of, from your competitors’. You could add labels to the data manually yourself, or you could find an outsourcing service to do it for you; there are probably thousands of these labeling firms out there, so do a quick web search and you’ll have no trouble finding one.
(03:04):
My final suggestion for obtaining labeled data is both automated and allows you to develop a unique, proprietary dataset. Sounds perfect, right? Well, the catch is that this typically is the most time-consuming and expensive option of all: You create a platform — typically a website — that enables you to collect labeled data from your users. This is expensive because it requires design, infrastructure, and software development time. But, ultimately, if you can pull it off, having platforms like this where users provide you with labeled data is what leads to the most valuable ML models and the most valuable companies. Many of the world’s biggest tech firms — such as Google, Amazon, and Facebook — accrued their collective trillions of dollars in value following this approach.
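As a very rough sketch of what the data-collection side of such a platform could look like, here is a minimal web endpoint, assuming Flask, that records a label a user provides for an item. The route name, payload fields, and CSV storage are hypothetical simplifications; a real product would need authentication, validation, and durable storage.

```python
# A minimal sketch of a platform endpoint that collects labels from users.
# Endpoint name, payload fields, and CSV storage are hypothetical.
import csv
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/label", methods=["POST"])
def collect_label():
    """Record a label that a user has provided for one item in our dataset."""
    payload = request.get_json()
    with open("user_labels.csv", "a", newline="") as f:
        csv.writer(f).writerow([payload["item_id"], payload["label"]])
    return jsonify({"status": "recorded"})

if __name__ == "__main__":
    app.run(port=5000)
```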
(03:53):
All right, so that’s this week’s strategy for getting business value out of ML. Once you’ve identified the commercial problem you’d like to solve, start collecting data for your ML model. Labeled data are more likely to be valuable, and if you can somehow obtain or collect labeled data that nobody else has, in an automated way, you could be on to a terrifically valuable commercial idea indeed.
(04:15):
Finally, if you live in the New York area and would like to experience a SuperDataScience episode being filmed live, then come to the New York R Conference, which will be held June 8th through 10th. That’s the New York R Conference, June 8th through 10th. Huge names in data science will be presenting there, such as Andrew Gelman and Wes McKinney, and to close out the conference on the afternoon of Friday, June 10th, I’ll be interviewing Hilary Mason, one of the world’s best-known data scientists, live on stage, so you can react or ask questions in real time. It should be tons of fun, and I hope to meet you there, or if not at this conference, then somewhere else soon. If you want tickets to the New York R Conference, you can get them 30% off with the code SDS30. That’s SDS30.
(05:06):
For Five-Minute Friday next week, we’ll have the third and final part of this three-part series on strategies for getting business value out of ML, in which I’ll provide insight into how to get started with modeling as well as trade-offs between model speed and model accuracy.
(05:21):
In the meantime, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.