SDS 504: Classification vs Regression

Podcast Guest: Jon Krohn

September 9, 2021

Welcome back to the FiveMinuteFriday episode of the SuperDataScience Podcast! 

This week I give quick hits on classification vs. regression in supervised learning problems.

 

When working with data, the problems we tackle fall into different categories. One of the biggest is supervised learning. With supervised learning, we have some input and some output: the input is raw or pre-processed data, and the output is the outcome we want the model to learn. The goal of supervised learning is to find a function that uses the input to approximate the output.
These problems fall into subcategories, such as classification. For example, given a movie review as input, we might want the model to assign the label positive or negative to the review. When there are two possible outcomes, that's a binary classification problem; with more than two, it's multiclass. An example of multiclass is handwritten digits, where the pixels go in as the input and the output is which digit the image shows. With regression, by contrast, we're predicting a particular value. Classification outcomes can be coded as integers, while regression outcomes are float values (something with a decimal point). This might be predicting next quarter's sales, the future value of a house, or tomorrow's stock price.
I hope that gives you a nice introduction to these types of supervised learning problems. 
DID YOU ENJOY THE PODCAST?
  • Can you think of examples of data problems you tackle that fit into the supervised learning subcategories?

Podcast Transcript

(00:05):
This is FiveMinuteFriday on classification versus regression. 

(00:19):
When we’re working with data, the kinds of problems that we’re working with can fall into different kinds of categories. One of the biggest of those categories, whether we are applying a statistical model or a machine-learning model to our data, one of the biggest buckets of problems we can be tackling, is supervised learning problems. So, with supervised learning problems, we have some input as well as some output. Inputs are typically denoted with X and outcomes are typically denoted with Y. This means that we have what we call a labeled dataset: X is the input data, the raw data, or it could even be pre-processed data that flows into our model, and the outcome that we would like our model to learn is this Y, this output. And that means that we have to have labels.
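The X-and-Y setup above can be sketched in a few lines of Python. The data and the model function here are invented for illustration, and the "model" is hand-picked rather than learned, just to show the shape of the problem:

```python
# Toy labeled dataset: each input x is paired with a label y.
X = [0.5, 1.5, 2.5, 3.5]   # inputs (raw or pre-processed data)
Y = [1.0, 3.0, 5.0, 7.0]   # labels: the outcomes the model should learn

# A candidate model: a function that maps an input to a predicted output.
def model(x):
    return 2 * x           # hand-picked guess: y is roughly 2x

# Supervised learning judges how well the function approximates Y from X,
# here with mean squared error over the labeled pairs.
error = sum((model(x) - y) ** 2 for x, y in zip(X, Y)) / len(X)
```

For this made-up data the guess happens to fit exactly, so the error is zero; a learning algorithm's job is to find such a function automatically.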
(01:21):
So as an example, the inputs into our model, the X, could be the natural language of movie reviews, and then the label is this chunk of text, a positive movie review or a negative movie review. So that label is this positive or negative, that is the outcome that we would like our model to be able to approximate. That is the Y that we would like our model to figure out given some input X, the natural language. So, our goal in supervised learning is to have some function that uses the input X to approximate the outcome Y. These supervised learning problems fall into particular kinds of sub-categories. So, one of the big sub-categories is classification problems. So that movie-review example that I just gave you is an example of a classification problem. So, in that case, we have the movie review, the natural language, as an input to our model, and then with the outcome, we’re trying to predict a particular class. We’re trying to predict, is this movie review, is this natural language, a positive movie review, or a negative movie review? There are two classes.
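As an illustration of that binary classification setup, here is a deliberately naive Python sketch that labels a review by counting cue words. The word lists and the scoring rule are invented for this example; a real sentiment model would be trained from labeled reviews rather than written by hand:

```python
# Toy binary sentiment classifier (illustration only, not a trained model):
# count positive vs. negative cue words and output one of two class labels.
POSITIVE = {"great", "wonderful", "loved"}
NEGATIVE = {"boring", "awful", "hated"}

def classify_review(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    # Two possible outputs, hence "binary" classification.
    return "positive" if score >= 0 else "negative"
```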
(02:46):
And when there are two classes like that, we call it a binary classification problem. If we have more than two outcomes that we are predicting, more than two classes, then we call that a multi-class classification problem. So, an example here would be if we had a bunch of handwritten digits. There’s a famous dataset called the MNIST dataset, and this consists of tens of thousands of examples of digits, so 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9, written by hand by somebody. And with these MNIST digits, a common multi-class classification problem is to have the pixels of those handwritten digits go in as the input X, and then the outcome Y is, what digit is that? Is it a 4? Is it a 6? Is it a 7? So that’s a multi-class problem because we have 10 possible digits that they could belong to. So binary classification, two classes, like movie review sentiment; multi-class problem, more than two classes.
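A minimal Python sketch of the multi-class idea, using a 1-nearest-neighbour rule on made-up four-pixel "digits" (real MNIST images are 28x28 pixels with labels 0 through 9; these tiny vectors and labels are assumptions for illustration):

```python
# Toy multi-class classifier: 1-nearest-neighbour on made-up 4-pixel "digits".
# Each training example is (pixel values, digit label).
training_data = [
    ([0.0, 0.0, 1.0, 1.0], 0),
    ([1.0, 1.0, 0.0, 0.0], 1),
    ([1.0, 0.0, 1.0, 0.0], 2),
]

def classify_digit(pixels):
    # Predict the label of the closest training example (squared distance).
    def distance(example):
        xs, _label = example
        return sum((a - b) ** 2 for a, b in zip(xs, pixels))
    _xs, label = min(training_data, key=distance)
    return label
```

The same input-pixels-to-digit-label shape scales up to the real MNIST problem, just with far more pixels, classes, and training examples.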
(03:49):
To contrast with classification problems, there are regression problems. So, these are still supervised learning problems, but instead of trying to predict a particular class like we were with classification problems, with regression we’re predicting a particular value. With classification problems, the outcome could be an integer. We’re just trying to predict, is this a negative movie review, which could be coded with a zero, or a positive movie review, which could be coded with a one? So, with classification problems, the outcomes can be coded with integers. With a regression problem, we would typically use float values, so something with a decimal point, because we have a continuous outcome that we’re predicting as opposed to a discrete outcome. And so, with a regression problem, the number that we might predict would be the sales of a product next quarter, or the future value of a house given some characteristics that that house has, or the price that a stock could be tomorrow. So, with classification problems, we’re predicting a particular discrete class, and with regression problems, we’re predicting a continuous outcome: a continuous numeric value.
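The house-value example can be sketched as a small regression in Python, fitting a straight line by ordinary least squares. The (size, price) pairs are invented for illustration; the point is that the prediction is a continuous float, not a discrete class:

```python
# Toy regression: fit price = w * size + b by ordinary least squares
# on made-up (house size in square metres, price in $k) pairs.
sizes  = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 190.0, 230.0, 270.0]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Closed-form least-squares slope and intercept for a single feature.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
    sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - w * mean_x

def predict_price(size):
    return w * size + b   # a continuous (float) prediction, not a class
```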
(05:07):
All right. So, I hope that gives you a nice quick introduction to what classification problems are versus regression problems, bearing in mind that both of those are examples of supervised learning problems. In a forthcoming FiveMinuteFriday, I’ll dig into the differences between supervised learning problems and other classes of machine-learning problems like unsupervised learning problems.
(05:30):
All right, that’s it for today’s episode. Keep on rocking out there, folks, and catch you on another round of SuperDataScience very soon. 