(00:05):
This is Five-Minute Friday on Collecting Valuable Data.
(00:19):
This episode is the second in a three-part series on strategies for getting business value out of machine learning. For Five-Minute Friday last week, I covered my first strategy, which is being confident that there’s a commercial problem to solve before starting data collection or ML model development. In today’s episode, we dig into the data collection process that comes next, once you’ve decided on the commercial problem you’d like to solve with a machine learning model.
(00:43):
Generally speaking, labeled data are going to be much more valuable commercially than unlabeled data. If you have only unlabeled data, such as a collection of images or a body of natural language, you can still carry out exploratory data analyses on them or run them through what we call unsupervised machine learning models, which may enable you to uncover some hidden structure in the data. Labeled data, on the other hand, are data where the outcome of interest is already known: where particular images are known to be dogs and others are known to be cats, say, or where particular passages of natural language are known to express positive sentiment and others negative sentiment. Having labeled data like these enables you to train an ML model that predicts an outcome that could be commercially useful to your target user.
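To make that distinction concrete, here is a minimal sketch in Python using scikit-learn on a tiny invented dataset: with no labels we can only look for structure, for example by clustering, whereas with labels we can train a model that predicts an outcome for new examples. The feature values and labels below are made up purely for illustration.

```python
# A minimal sketch contrasting unlabeled and labeled data with scikit-learn.
# The feature values and labels are invented purely for illustration.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy features: imagine two numeric measurements extracted from each image.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]

# Unlabeled data: unsupervised learning can still uncover hidden structure,
# e.g. grouping the examples into clusters.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("Cluster assignments:", clusters)

# Labeled data: with labels (say, 0 = cat, 1 = dog) we can train a supervised
# model that predicts the label for new, unseen examples.
y = [0, 0, 1, 1]
classifier = LogisticRegression().fit(X, y)
print("Prediction for a new example:", classifier.predict([[0.85, 0.9]]))
```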
(01:31):
The catch is that while labeled data tend to be more valuable than unlabeled data, they are also typically trickier to obtain. It can be relatively straightforward to write a scraping algorithm that crawls the web and downloads images or stores text for you; devising an automated way to assign labels to the data you scrape takes more ingenuity, but it can often be done. For example, let’s say you wanted to build an ML model that could predict whether a movie review is positive or negative. In that case, you could scrape the natural language of movie reviews from IMDB, the Internet Movie Database, and use the star ratings on IMDB to infer labels. When people provide movie reviews to IMDB, they also provide a star rating out of ten. You could decide to label any review with eight or more stars as a positive review, while any review with three or fewer stars you could label as a negative review.
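As a rough sketch of that labeling rule, the snippet below maps hypothetical scraped reviews with star ratings to sentiment labels: eight or more stars becomes positive, three or fewer becomes negative, and anything in between is discarded. The review text here is invented for illustration and stands in for whatever you would actually scrape.

```python
# A minimal sketch of inferring sentiment labels from star ratings.
# The reviews below are invented stand-ins for scraped data; the thresholds
# follow the 8-or-more / 3-or-fewer rule of thumb described above.
scraped_reviews = [
    {"text": "An absolute triumph from start to finish.", "stars": 9},
    {"text": "I walked out halfway through.", "stars": 2},
    {"text": "Fine, but forgettable.", "stars": 6},  # ambiguous: discarded
]

def infer_label(stars):
    """Map a star rating (out of 10) to a sentiment label, or None if ambiguous."""
    if stars >= 8:
        return "positive"
    if stars <= 3:
        return "negative"
    return None  # middling ratings are dropped rather than guessed at

labeled_data = [
    (review["text"], infer_label(review["stars"]))
    for review in scraped_reviews
    if infer_label(review["stars"]) is not None
]
print(labeled_data)
```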
(02:27):
So there you go. You can sometimes automate the addition of labels to your data. However, if you can’t think of a way to automatically apply labels to your data, you still have options. These options could be more time-consuming or expensive, but they could also result in the creation of a unique, proprietary dataset that differentiates your ML model, and the commercial product it’s a part of, from your competitors’. You could add labels to the data manually yourself, or you could find an outsourcing service to do it for you; there are probably thousands of these labeling firms out there, so do a quick web search and you’ll have no trouble finding one.
(03:04):
My final suggestion for obtaining labeled data is both automated and allows you to develop a unique, proprietary dataset. Sounds perfect, right? Well, the catch is that this typically is the most time-consuming and expensive option of all: You create a platform — typically a website — that enables you to collect labeled data from your users. This is expensive because it requires design, infrastructure, and software development time. But, ultimately, if you can pull it off, having platforms like this where users provide you with labeled data is what leads to the most valuable ML models and the most valuable companies. Many of the world’s biggest tech firms — such as Google, Amazon, and Facebook — accrued their collective trillions of dollars in value following this approach.
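As a very rough sketch of what the data-collection side of such a platform could look like, here is a minimal web endpoint, assuming Flask, that records a label a user provides for an item. The route name, payload fields, and CSV storage are hypothetical simplifications; a real product would need authentication, validation, and durable storage.

```python
# A minimal sketch of a platform endpoint that collects labels from users.
# Endpoint name, payload fields, and CSV storage are hypothetical.
import csv
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/label", methods=["POST"])
def collect_label():
    """Record a label that a user has provided for one item in our dataset."""
    payload = request.get_json()
    with open("user_labels.csv", "a", newline="") as f:
        csv.writer(f).writerow([payload["item_id"], payload["label"]])
    return jsonify({"status": "recorded"})

if __name__ == "__main__":
    app.run(port=5000)
```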
(03:53):
All right, so that’s this week’s strategy for getting business value out of ML. Once you’ve identified the commercial problem you’d like to solve, start collecting data for your ML model. Labeled data are more likely to be valuable, and if you can somehow obtain or collect labeled data that nobody else has, in an automated way, you could be on to a terrifically valuable commercial idea indeed.
(04:15):
Finally, if you live in the New York area and would like to experience a SuperDataScience episode being filmed live, then come to the New York R Conference, which will be held June 8th through 10th. That’s the New York R Conference, June 8th through 10th. Huge names in data science will be presenting there, such as Andrew Gelman and Wes McKinney, and to close out the conference on the afternoon of Friday, June 10th, I’ll be interviewing Hilary Mason, one of the world’s best-known data scientists, live on stage, so you can react or ask questions in real time. It should be tons of fun, and I hope to meet you there, or if not at this conference, then somewhere else soon. If you want tickets to the New York R Conference, you can get them 30% off with the code SDS30. That’s SDS30.
(05:06):
For Five-Minute Friday next week, we’ll have the third and final part of this three-part series on strategies for getting business value out of ML, in which I’ll provide insight into how to get started with modeling as well as trade-offs between model speed and model accuracy.
(05:21):
In the meantime, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.