Welcome back to the FiveMinuteFriday episode of the SuperDataScience Podcast!
This week we continue the theme of last week and discuss supervised and unsupervised learning.
Last week we talked about classification problems and regression problems. The interesting thing about both of those is that they fall into a class of problems called supervised learning problems, which we also discussed last week. A supervised learning problem involves learning some function that takes an input x and infers an output y. All supervised learning problems have both inputs and outputs, and they require some process to label the output data. You can see this with movie reviews, where you might input the text of a review and output a positive or negative label for it.
An unsupervised learning model is used in situations where we have only inputs and no existing output labels. The good news is that finding unlabeled data for problems like these is very easy; there's plenty of data out there without labels. Through unsupervised learning, you learn some hidden, underlying structure of the data. This could be word vectors, also known as word embeddings, learned from lines of text. Another example is GANs (generative adversarial networks), which take in high-resolution images and simulate completely new photos, a technology often misused for deepfakes.
DID YOU ENJOY THE PODCAST?
- Can you think of examples in your work where you’ve tackled unsupervised learning problems?
Podcast Transcript
(00:05):
This is FiveMinuteFriday on Supervised versus Unsupervised Learning.
(00:27):
Last week, in episode 504, we talked about classification problems and regression problems. Classification problems are ones where we have some specific buckets, some specific categories, that we can put all of the inputs to our model into. So for example, we talked about having movie reviews and being able to classify those as either positive or negative movie reviews; that's a binary classification task. We also talked about regression problems, where instead of predicting a particular bucket that some data belong to, you predict a particular continuous value. So based on some data, you could predict a house price, for example. All right, so those are classification and regression problems, and if you want to learn more about those, you can refer back to episode 504. One of the interesting things about both of those types of machine learning problems, whether they're classification problems or regression problems, is that they fall into a class of machine learning problems called supervised learning problems.
(01:47):
We talked about that a little bit in episode 504 as well, but the idea with supervised learning problems is that you have some input into your model, which we typically call X, as well as some output from your model, which we typically call Y. Our goal with these supervised learning problems, whether they're classification problems or regression problems, is to learn some function that can take in the input X in order to approximate the output Y. Going back to that movie review example, the input X could be the natural language of the movie review, and the output that we're trying to guess is whether the review is a positive one or a negative one. So the key thing with any of these supervised learning problems, and what makes them supervised learning problems, is that we have both the input data X as well as these outputs Y that we'd like to estimate.
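To make that concrete, here's a minimal sketch of a supervised learning model in Python, assuming scikit-learn is installed; the handful of labeled reviews is made up purely for illustration.

```python
# A minimal supervised learning sketch: learn a function from review text (X)
# to a sentiment label (Y). These training reviews are invented; a real model
# would need thousands of labeled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = [
    "this is the best movie ever",
    "i cannot wait to tell all my friends about it",
    "the movie is really bad",
    "i would never watch it again",
]
Y = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review

# Learn a function that approximates Y from X.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X, Y)

# Estimate the label of a review the model has never seen.
print(model.predict(["the movie is bad, never again"]))  # should print [0]
```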
(02:46):
These outputs we typically call labels, and that means we need to have some process, often a human process, to take a bunch of input data and assign a label to it. So, for example, with the movie reviews, the natural language could come from a movie review website like the Internet Movie Database. Then we could use human-annotated labels: a movie review where they say the movie is really bad and they'd never watch it again, clearly that's a negative review, and a human could assign a label to it. Say everything that's a negative review, we're going to label with zero, and everything that's a positive review, so a movie review like "this is the best movie ever, I can't wait to tell all my friends about it," that's going to be a positive movie review, and we could label that with a one.
(03:44):
So we have this output, this Y label, that our model's going to try to learn. When it sees language like "this is a terrible movie," it's going to try to output, hey, I predict that this has a very high probability of being a negative review, of being an output of the label zero. Or when it's a movie review with natural language like "this is the best movie ever," then our supervised learning model is going to try to learn that that should be assigned a label of one. Now, there are shortcuts; you don't always have to label data for supervised learning problems manually. For example, with that movie review example, you could use people's star ratings from the website. So you could say, okay, the Internet Movie Database, let's say it has star ratings out of ten, then any reviews with eight, nine, or ten stars, we say that's a positive review.
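As a quick sketch of that shortcut, here's how you might turn star ratings into binary labels in Python; the reviews and ratings below are invented for illustration.

```python
# Derive sentiment labels from star ratings instead of hand-labeling.
# Reviews are (text, stars out of 10) pairs; middling ratings are dropped.
reviews = [
    ("this is the best movie ever", 10),
    ("it was fine, i guess", 5),
    ("i would never watch it again", 1),
]

labeled = []
for text, stars in reviews:
    if stars >= 8:
        labeled.append((text, 1))  # 8, 9, or 10 stars: positive
    elif stars <= 2:
        labeled.append((text, 0))  # 0, 1, or 2 stars: negative
    # 3-7 stars are ambiguous, so we leave those reviews unlabeled

print(labeled)
# [('this is the best movie ever', 1), ('i would never watch it again', 0)]
```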
(04:45):
And anything with zero, one, or two stars, we say, well, that's clearly a negative review. Okay, so hopefully it's now clear what supervised learning problems are: they need this kind of labeled data in order to train a supervised learning model, and then we try to predict those labels. Sometimes it can be difficult to get labels; it could be very expensive or time-consuming to have humans label data. And so in many common situations, instead of applying a supervised learning model, we might want to apply an unsupervised learning model, which will still allow us to learn something about our data or do something useful with it. We use unsupervised learning in situations where we have input data only, so some X only; there are no labels, no outputs, what we typically denote as Y, available.
(05:41):
The big advantage with unsupervised learning is that finding unlabeled data like this is very easy to do. You can scrape lots of natural language off the internet, for example: you could scrape Wikipedia articles or use a crawl of the entire internet. You can have these massive, massive datasets where all we have is the data; we don't have labels for the data. Our goal with unsupervised learning, then, is to take these typically very large datasets and learn some hidden, underlying structure of the input data. Examples of unsupervised learning techniques include word vectors, also known as word embeddings. These are a natural language processing technique where we take typically big bodies of natural language, kind of like the Wikipedia example I just gave, or lots of news articles or tweets, maybe ones relevant to your specific industry. You gather all of these documents with all of this natural language, and we can convert that natural language into word vectors using techniques like word2vec and GloVe, which can then assign a particular meaning to each of the words.
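Here's a minimal sketch of learning word vectors from unlabeled text, assuming the gensim library's word2vec implementation; the four-sentence corpus is far too small to learn anything meaningful and is only there to show the shape of the workflow.

```python
# Unsupervised learning of word vectors: no labels anywhere. The model learns
# purely from which words appear near which other words.
from gensim.models import Word2Vec

corpus = [  # in practice: millions of sentences, e.g. a Wikipedia dump
    ["the", "movie", "was", "fantastic"],
    ["the", "film", "was", "fantastic"],
    ["the", "movie", "was", "terrible"],
    ["the", "film", "was", "terrible"],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

# Every word is now a point in a 50-dimensional space.
print(model.wv["movie"])                     # the raw 50-dimensional vector
print(model.wv.similarity("movie", "film"))  # cosine similarity of two words
```

With a real corpus, words used in similar contexts, like "movie" and "film," end up close together in that space.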
(07:10):
So the meaning of words is represented by a position in a high-dimensional space. The details of that are beyond the scope of this FiveMinuteFriday episode, but the idea is that these word vectors don't require labeled natural language data, and we can do something meaningful with the data anyway. Another fun example is generative adversarial networks, or GANs for short, which use a bunch of images. So if you have a whole bunch of high-resolution photographs of celebrities' faces, then you can feed those into a generative adversarial network, a GAN, and those don't need to be labeled at all. You can just collect a whole bunch of images, say from the internet, and then you can use your GAN to create, to simulate, to generate completely new high-resolution photos of celebrities that don't exist. This is the technology that sometimes gets misused for deepfakes.
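For a feel of how that works, here's a highly simplified GAN skeleton in PyTorch; real face-generating GANs use convolutional architectures and many more tricks, so treat this purely as a sketch of the adversarial idea.

```python
# A toy GAN: a generator turns random noise into fake images, while a
# discriminator learns to tell real images from fakes. Neither network ever
# sees a label; the only training signal is real vs. generated.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28  # e.g. small flattened grayscale images

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

def train_step(real_images, opt_g, opt_d, loss=nn.BCELoss()):
    """One adversarial update on a batch of (unlabeled!) real images."""
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, latent_dim))

    # Discriminator: push real images toward 1 and fakes toward 0.
    opt_d.zero_grad()
    d_loss = (loss(discriminator(real_images), torch.ones(batch, 1))
              + loss(discriminator(fake_images.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator call the fakes real (1).
    opt_g.zero_grad()
    g_loss = loss(discriminator(fake_images), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```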
(08:17):
All right, so I hope that gives you a clear idea of what supervised versus unsupervised learning models are. The key point being: with supervised models, we have a label for our data, some output that our model can learn to predict. With unsupervised learning, we don't have that label, and so the objective is simply to learn some hidden, underlying structure of the data. Cool. So supervised and unsupervised learning approaches, these are two of the biggest categories of machine learning problems, but there's another really big one called reinforcement learning, and we'll tackle that one in a forthcoming FiveMinuteFriday episode. All right, that's it for today's episode. Keep on rocking it out there, folks, and catch you on another round of SuperDataScience very soon.