Jon Krohn: 00:00:00
This is episode number 635 with Shayan Mohanty, CEO of Watchful. This episode is brought to you by Iterative, your mission control center for machine learning.
00:00:14
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:45
Welcome back to the SuperDataScience Podcast. Today we’ve got the absurdly intelligent and eloquent Shayan Mohanty on the show. Shayan is the CEO of Watchful, a Bay Area startup that he co-founded to automate the process of distilling, scaling, and injecting subject matter expertise into machine learning models. He’s also a guest scientist at Los Alamos, a renowned national security lab in New Mexico. Previously, he worked as a data engineer at Facebook, and he was co-founder and CEO of a pair of other tech startups. He holds a degree in economics from the University of Texas at Austin. Today’s episode will be of interest to technical data science experts and non-technical folks alike as it addresses critical issues associated with creating data sets for machine learning models. Issues we should be aware of regardless of whether we’re more technically or commercially oriented.
00:01:33
In this episode, Shayan details why bias in general is good; why degenerative bias in particular is bad; arguments against using the supposed gold standard of manual labeling for creating machine learning data sets, including its capacity to introduce degenerative bias; and how his company Watchful has devised a better alternative to manual labeling, including its fascinating technical underpinnings, such as weakly supervised machine learning, the Noam Chomsky hierarchy of languages, and their high-performance Monte Carlo simulation engine. All right, you ready for this enthralling episode? Let’s go.
00:02:16
Shayan, welcome to the SuperDataScience Podcast. It’s great to have you here. Where in the world are you calling in from?
Shayan Mohanty: 00:02:22
Thanks so much, man. I’m calling from San Francisco. When we started the conversation, it was a little rainy and now it is sunny, so it’s very San Francisco outside.
Jon Krohn: 00:02:32
Yeah, so I saw you in person for the first time last week at the time of recording. Last week we were together at the Open Data Science Conference West, ODSC West. I met you in person, and we had horrific weather the entire week. In Celsius it was like five degrees; in Fahrenheit it was low forties, and rainy. Meanwhile, I live in New York, and the entire time that I was in San Francisco, it was beautiful there. So Celsius, 22 degrees; Fahrenheit, like mid-seventies and sunny. Everybody was messaging being like, “What are you up to? It’s so nice, let’s go out,” and I’m like, “I’m freezing in San Francisco.” But what’s that Hemingway quote? That the coldest winter of my life was the summer in San Francisco.
Shayan Mohanty: 00:03:21
That is correct, yes. I will say that if you came one week before, it was gorgeous outside. We were all in shorts. We were just vibing out in Dolores. But somehow you came the one weekend where it was really, really bad outside. So I apologize to you on behalf of all of San Francisco.
Jon Krohn: 00:03:41
So I had one really, really nice day in New York. And then the very next day, it’s freezing cold again. It’s just like, it wasn’t San Francisco at all-
Shayan Mohanty: 00:03:48
Rough. I’m just going to blame you then. You brought the weather.
Jon Krohn: 00:03:52
Yep, that’s how it works. This is a science program. You heard it here first. I just bring sadness and dreary weather everywhere I go. That’s my thing. So yeah. So we met in person. That was really nice. You were giving a talk at ODSC West. It was about bias. Can you fill us in more about what you covered in the talk?
Shayan Mohanty: 00:04:14
Yeah, yeah. So the title of the talk was kind of grabby: “Bias Is Good: Arguments for Automated Labeling.” So kind of the core argument, so to speak, is that the word bias is overloaded in data science. Not only does it mean the societal impact of models that are trained based on historical biases, historical stereotypes and stuff like that. It also has the connotation of weights and biases, a constant [inaudible 00:04:55] factor. So there are several different uses of this term, and there’s sort of a middle ground term as well, which is, go ahead.
Jon Krohn: 00:05:04
Oh yeah, I was just going to say, from a technical perspective, in that sense bias isn’t bad. Bias is just a part of it; in a neural network you have weight parameters, you have bias parameters. But you could even think of it this way: you could describe the y-intercept offset in a simple regression model as the bias.
Shayan Mohanty: 00:05:22
Exactly. 100%
Jon Krohn: 00:05:23
That’s completely harmless. I like the idea of some really naive data scientist just starting out with their first regression model or their first neural network or something and they’re like, “I should remove the bias term,” and then just never having a model that fits the data very well.
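[Editor’s note: to make the harmless kind of bias concrete, here is a minimal sketch, ours rather than anything from the episode, using scikit-learn. The “bias” here is simply the regression intercept; forcing it to zero pushes the fit through the origin and typically hurts it.]

```python
# Minimal illustration: the "bias" is just the intercept of a linear model.
# Forcing it to zero (fit_intercept=False) pushes the line through the origin,
# which usually fits the data worse.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=100)  # true slope 3, true bias (intercept) 5

with_bias = LinearRegression(fit_intercept=True).fit(X, y)
without_bias = LinearRegression(fit_intercept=False).fit(X, y)

print(with_bias.coef_, with_bias.intercept_)             # roughly [3.0] and 5.0
print(without_bias.coef_, without_bias.intercept_)       # slope distorted to absorb the missing offset
print(with_bias.score(X, y), without_bias.score(X, y))   # in-sample R^2 drops without the bias term
```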
Shayan Mohanty: 00:05:45
Yeah. So our whole thing was that it’s important to lens the word bias with all of these things. But it’s not all that helpful to have a super overloaded term that’s used pretty widely across machine learning.
Jon Krohn: 00:06:02
You were going to give us a third one, but then I spoke over you.
Shayan Mohanty: 00:06:09
Sorry. The first one is the societal impact stuff, like a model trained on stereotypes, essentially. The second was your bias factor, your weights and biases, your bias parameter really. The third is this middle ground where you actually are biasing a model towards a true representation of reality. So it’s still sort of an uplevel of the bias concept in terms of a bias parameter, but it has no negative connotation to it. So our whole thing, at least at the beginning, was we should just define two types of bias real quick, just so that we have a basis to talk about. So basically we proposed bias is just an act of shifting the effect of a model to one side or another, or to a specific area. So limiting the space within which a model has acceptable outputs.
00:07:11
So an example of this is I want to train a model to predict spam. I could bias it towards predicting spam correctly. I could bias it by saying if the word Viagra shows up, it’s very likely spam. And that’s probably true to some extent. But it’s still very important for us to be able to talk about the societal impact of bias. So we propose another terminology, which is degenerative bias, where in fact when you bias a model using these outdated stereotypes or whatever they happen to be, your personal biases that you bring to the table, what you’re doing is actually biasing the model away from an accurate representation of reality. So it’s important that we lens the discussion with those two things.
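[Editor’s note: Shayan’s spam example can be written down as a tiny, explicit heuristic. This is an illustrative sketch only, not Watchful’s interface; the point is that the bias (“Viagra suggests spam”) is checked in as code you can inspect, audit, and revise, rather than living implicitly in a labeler’s head.]

```python
# Illustrative sketch: an explicit, auditable heuristic. It deliberately biases
# labels toward "spam" when a noisy-but-useful keyword appears, and abstains otherwise.
SPAM, ABSTAIN = 1, -1

def lf_contains_viagra(email_text: str) -> int:
    """Noisy heuristic: 'viagra' is strong (but not perfect) evidence of spam."""
    return SPAM if "viagra" in email_text.lower() else ABSTAIN

print(lf_contains_viagra("Cheap Viagra, click now!"))  # 1  -> votes spam
print(lf_contains_viagra("Lunch at noon?"))            # -1 -> abstains, no opinion
```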
00:07:59
And the core point of that whole talk is basically to go on to say that the way supervised machine learning has evolved over time has made supervised machine learning very reliant on hand-labeled data, which is one of many different ways this bias can creep in, specifically degenerative bias. You can have hand labelers who bring their own biases to the table. You’ve got the way you’re sampling data, that could be biased inherently, the way you’re stratifying it. All sorts of things can affect the way your model’s performing. These days, most of the time, I’m not going to say all the time, but most of the time your model’s not biased because you trained an inherently biased architecture. It’s usually because your data’s biased in some way, degeneratively biased.
00:08:54
So our whole argument was, “Look, we can’t really get away from labeling,” so you have to label data. But maybe there’s a better way to be more explicit about the biases we’re checking in. And if you’re explicit about them, then it’s possible to go back and change them. So now the question is what techniques do we have to be able to do that sort of thing? And weakly supervised learning is one of them. Active learning is another one of them. I had a whole slide of just various techniques and how they kind of play together. But it was meant to just be like a, “Look, you can’t really get away from bias. The whole point of labeling is to bias the model. But what you do want to do is limit the amount of degenerative bias you’re introducing. And in order to do that, it has to be explicit. It can’t be implicit.” So that was sort of the core of the talk.
Jon Krohn: 00:09:45
Cool. And so something I’d love to dig into more is the arguments against hand labeling. So you mentioned how hand labeling is one way that we can introduce degenerative bias, as you termed it. But what are the other arguments against hand labeling? I know that that’s something that you’re an expert in.
Shayan Mohanty: 00:10:04
Yeah, so there are lots. So starting with bias obviously, if you have an army of humans who are labeling data. One of the issues with labeling fundamentally is that it is a non-deterministic process. Now as we move towards more enterprise-ready machine learning, not the wild west machine learning where it’s just like let’s throw a bunch of data into a pot and stir it a couple times, and see what pops out the other end. Now we’re talking about how do we get reproducible results? How do we make sure that we have pipelines? When we start using the word pipeline, it implies some amount of determinism. But the problem is you lose that when you have hand labeling. And that amounts to a whole bunch of stuff.
00:10:52
One is the potential introduction of bias. The second is inconsistent results when trying to re-sample fresh data and retrain. The third is just that it’s very slow to deal with drift. Especially in adversarial problems where, for instance, if you’re doing spam classification or fraud classification, you’re fighting an adversary. And that adversary is incentivized to change their tactics if they know that there’s a modeling process that is catching them. So if you have just an army of humans who are labeling data and your data volumes are huge, and you have to keep retraining them on new ways to catch these adversaries, you’re fighting a losing battle. It’s just an unsustainable problem.
00:11:43
And then you go into some of the issues around the cost of hand labeling. There’s several types of costs there. There’s capital. If I want to build a model that’s not just hot dog, not hot dog, if I’m trying to do something more interesting like cancer, not cancer or something like that. I can’t just pull a random person off the side of the street or even 500 random people off the side of the street and be like, “Hey, is this cancer? Is it not?” I need a bunch of doctors. And chances are doctors don’t have the time to sit down in a room and label a bunch of data for me for a month. They have doctor stuff to do. What I want to do is make sure that we can build those models because I know they’re going to be very impactful, but we have to do it in a way that doesn’t tie up 50 doctors for a month.
00:12:30
So that’s a capital cost, a capital and time constraint, quite frankly. But speaking also to time, right now, machine learning processes are very start stop. And what I mean by that is software engineering has for a very long time now benefited from what’s colloquially called flow states. So back in the day, compilation times were super, super long. Developer tools were not where they are today. Now, engineers see a huge benefit in being able to stay in the state of focus. And that’s because the time between them trying something and seeing a result is very, very small.
00:13:16
Now in a world of supervised machine learning, if one of your blocking factors is frankly a human labeling process, if you discover that one of the things you might need to tweak is your hand label data, you’re not going to get that result fast. You’re going to get it at minimum a week later, which means that you have to pause what you’re doing, wait a week, and then pick that context back up. It is impossible to calculate how much that costs us as an industry, as an AI industry overall. But I have to imagine it’s a lot because we’re coming off two decades of innovation on the software engineering side to make compilation times faster, to make partial compilation fast, to make developer tools way more powerful and so on. And I guess the final cost, there’s so many costs that I can list. But the final cost I’ll mention is there is actually a major societal cost to maintaining hand labeling structures as we do right now. There’s this book called Ghost Work, which I recommend folks read if they’re interested in this topic.
00:14:22
But it details this idea of a ghost worker or really a second class citizen in the internet age. And the idea is that to power all of this hand labeling machinery, companies have had to go out and hire a bunch of contract workers in low wage countries. So think Bangladesh, think various countries in Africa, that sort of thing. And while they’re providing economic opportunities for them, there’s no growth opportunity. They’re not building a career. They’re learning how to box cars and stop signs in a particular program, which is not really a transferable skill. So what ends up happening is there’s a race to the bottom on wages at a macro level because these companies, their margins are based on how much human time they have to spend money on. So they have an incentive to drive wages down.
00:15:24
Meanwhile, there’s more and more people who are able to do this work, because it requires very little training, very little context, and so on, which then means that there’s almost excess supply. Which again, leads wages to go down. So to me, it doesn’t make sense to have this second tier of humanity because, if they’re not already, which I know they are, but eventually they’ll get even more subjected to the shitty jobs that no one else wants. Things like content moderation where you have to watch a beheading and stuff like that. I saw this happen at places like Facebook.
00:16:07
So it’s stuff like that where our whole thing is, it’s better if we find just a more automated way to do a lot of this stuff so that we can empower the right users with the right tool. So you don’t need 50 doctors to sit in a room. You could have one doctor spend a couple hours using some software that will help get their thoughts out of their head and into something that’s programmatic, and then use that to do the labeling, and ideally tweak that. So there should be a nice feedback loop there. So yeah, long story short, hand labeling is hopefully on its way out. And instead we’ll have more sustainable processes.
Jon Krohn: 00:16:49
This episode is brought to you by Iterative, the open source company behind DVC, one of the most popular data and experiment versioning solutions out there. Iterative is on a mission to offer machine learning teams the best developer experience with solutions across model registry, experimentation, CI/CD training automation, and the first unstructured data catalog and query language built for machine learning. Find out more at www.iterative.ai. That’s WWW.I-T-E-R-A-T-I-V-E.AI.
00:17:24
This is a super eye-opening conversation for me already, because I take for granted that in some scenarios, I am going to need to use hand labeling. I can think of several research projects that my machine learning company is currently carrying out for our production platform where we’re like, “Well, we’d love to have a model that could do this. But we don’t have labeled data, and so how much is it going to cost for us to hand label it?” And we’re aware of some of these issues, like the bias issue. We’re like, “Okay, what can we do to minimize the bias issue?” And critically, especially because our applications are in human resources, we then test afterwards, after a model is developed, to make sure that no particular sociodemographic group is being affected differently by our human resources algorithm. So the bias thing is something we’re really aware of and we try to minimize and then we test. But that can be time consuming to get right.
00:18:26
But some of these other issues, yeah, I hadn’t thought about them at all. So they’re slow to adapt to data drift. Particularly if you’re fighting an adversary, that is really thoughtful. Maybe something that I don’t have to worry about so much with the models I’ve been developing. So something that I hadn’t thought about, but for other examples, like spam filters, for sure. There you’re definitely going to have adversaries that are coming up with new ways of spelling Viagra to get past your content moderation algorithm. And then yeah, capital cost. This one. Yeah, definitely. Especially if you’re building a specialized model like the oncology one you’re describing. In my case, it could be I might want recruiters ideally to be labeling my data. And we work with really great recruiters. They might be making half a million dollars a year billing several million a year for their firm. They can’t take time away from that to be labeling data. No chance is anybody going to let that happen.
Shayan Mohanty: 00:19:29
What we noticed is that that idea exists across every single industry. If you think about where machine learning or AI can make the greatest impact at any company, it’s generally speaking going to be in places where their subject matter experts are kind of critical bottlenecks to the organization. They have their most expensive experts, and they only have maybe one or two of those people, and they want to just multiply them. That’s the obvious place for AI to have the greatest impact.
00:19:58
But you have this catch-22 problem there where, to train a model to replicate one of these recruiters for instance, in this particular task, or a doctor, or a financial analyst or whoever, you need that person to be involved in the training process. And that often is this non-starter because of how expensive it would be to even engage them, let alone the amount of data they’d have to label, and the way they’d have to label it, and all sorts of stuff. So yeah, we think that that is the critical problem to be solved in AI, to see greater penetration into use cases that matter as opposed to the machine learning Hello World, which is Twitter sentiment analysis and stuff like that. We see it as: we’re not lacking model architectures or frankly even modeling expertise. We’re lacking data. And that’s why we haven’t seen AI penetrate as deeply as it could’ve.
Jon Krohn: 00:20:55
That was perfectly articulated and makes perfect sense to me, and that’s not something, thank you, that I’d ever thought about as a serious bottleneck in the development of AI. I mean, I’m so glad that you mentioned that. This episode is getting better and better. I love it. A few other ones that you mentioned there: time cost, obviously that is one I’m aware of. If we could have some automation of labeling where it just happens in a few seconds instead of a few weeks, obviously that would be far superior. And then the social cost thing. It kills me that that isn’t something that I was really thoughtful about before, because I was thinking about it from the perspective that you started off describing, where I’m like, “Well, I’m employing people that otherwise, maybe this is paying them double what they’d be getting if they were doing some manual labor in a factory or a field or something.”
00:21:52
So this is actually a good opportunity and this is a good thing that I’m doing. But then you went into more detail about it in ways that I, again, embarrassingly hadn’t really thought about, this idea of how they are in a race to the bottom where they are constantly going to have pressure. And following on with this idea of specialized models, if there were great labelers who did a better job than others, they’re not really going to be rewarded for that. The companies managing these processes are going to want to drive down price as much as possible. So yeah, it’s a pretty-
Shayan Mohanty: 00:22:36
There’s only downward pressure, and that’s the problem.
Jon Krohn: 00:22:39
Yeah. And then also like you said, it’s a dead-end career. They’re not going to be able to progress to labeling manager. Maybe there’s a small number of roles like that, maybe, but probably not really. It’s not a career path like data science is.
Shayan Mohanty: 00:22:57
Yeah. And you kind of have to get lucky with your location as well, because not every country has these computer labs set up. There’s this whole dark market, almost. Not quite dark, but a lot of the main brands of labeling companies out there that offer these types of services, they don’t always have a direct relationship with their labelers. So oftentimes, it is another company that is offering their services as a reseller or a subcontractor to these companies.
00:23:36
Which means that there’s also this interesting relationship between two companies where one company, the parent company, the one with the big brand, naturally wants to squeeze out as much efficiency as they possibly can. And they have plausible deniability on the means to achieve that because there’s an entire corporate firewall there. This is technically a totally different company. So when you get into the details, it gets very messy. And we’re just like, “That shouldn’t happen in today’s world.”
Jon Krohn: 00:24:07
So yeah, all great points across the board. So yeah, now I want something else. I want an alternative to hand labeling. And so you’re the CEO of Watchful. It’s a platform that describes itself as being for machine teaching. And so I know that it’s kind of a solution for this hand labeling problem. How do you solve hand labeling as a company at Watchful?
Shayan Mohanty: 00:24:34
Yeah, it’s a good question. We’ve been thinking about this for a long time now, so bear with me. I think the first thing I should probably talk about is what even is machine teaching as it relates to machine learning. It’s really just a subtle shift in mentality. Machine learning for the longest time, if you think about machine learning research, it’s very centered and very focused on the modeling and the modeling techniques. So if you think about it in terms of learner versus teacher, it’s really like a student. So machine learning research is very much about building the best student. And that’s a noble pursuit. That is worth doing.
00:25:20
The problem though is that those types of innovations come in sort of a stepwise fashion. There isn’t a nice smooth graph that you can look at where it’s like, “Here are all the incremental wins we got before we achieved the advent of CNNs,” for instance. It was just a one or a zero. It’s like suddenly there’s a new model architecture out there, and it is outperforming everything else. And great. Now most people use it for such and such tasks. So the problem there is you can’t really predict when the next new machine learning innovation is going to happen. So our thought, and the thought of a few others, so we didn’t invent the term machine teaching. That actually originated with Bonsai, which is now part of Microsoft. So all kudos to them. We just love the terminology, so we believe in the mantra.
00:26:13
But machine teaching is about shifting the focus away from the student and instead towards the teacher. So how do we make the teacher orders of magnitude more effective at teaching any student? And when you reframe the problem like that, you come up with an interesting and different set of solutions. Instead of trying to build an ultra clever model that will be able to predict all things all the time, instead, you think about different ways to make the human more effective. So how do we get your knowledge out of your head faster? How do we elicit that knowledge? How do we get you to discover things that were otherwise implicitly stored in your brain, but you now want to expose explicitly to the model?
00:26:59
So that’s sort of machine teaching as a concept. And it’s worth noting that machine teaching is more of a philosophy than it is a set of specific techniques. So the techniques that we ended up aligning on are things like weakly supervised learning, active learning, Monte Carlo simulations. We have a whole bunch of different techniques that we packed into our product. But the whole point is that as a user, you don’t have to be an expert in or even know about any of that stuff. We chose them A, because we don’t believe there’s a one-size-fits-all machine learning solution to solving all labeling problems. So really it’s like a workflow problem. And B, our goal is to provide our users with that flow state that I mentioned earlier. We think that data science sorely needs it. So our goal is to choose techniques that lend themselves to flow state, so that we can extract that knowledge from your brain as quickly as we possibly can, and that sort of thing.
Jon Krohn: 00:27:57
Wow, wow, wow, wow, wow, wow. That sounds great.
Shayan Mohanty: 00:27:59
Thank you.
Jon Krohn: 00:28:02
Mathematics forms the core of data science and machine learning. And now with my Mathematical Foundations of Machine Learning course, you can get a firm grasp of that math, particularly the essential linear algebra and calculus. You can get all the lectures for free on my YouTube channel. But if you don’t mind paying a typically small amount for the Udemy version, you get everything from YouTube plus fully worked solutions to exercises and an official course completion certificate. As countless guests on the show have emphasized, to be the best data scientist you can be, you’ve got to know the underlying math. So check out the links to my Mathematical Foundations of Machine Learning course in the show notes or at jonkrohn.com/udemy. That’s jonkrohn.com/udemy.
00:28:47
So how do you do it? What’s that like as an experience? How do I go from the world that I currently operate in personally where I want to be able to teach a machine learning algorithm to be able to do some task? So I start off by hand labeling some data myself, having data scientists on my team hand labeling data. And then if we can, finding somebody that’ll do that for us at relatively low cost and generating the amount of hand-labeled data that we need. There is no flow in that experience. It is really painful. So how do we go from that to what you’re describing with Watchful, with this nice machine teaching and flow state? Describe kind of the mechanics of what that’s like as a user experience.
Shayan Mohanty: 00:29:42
Yeah, so I’ll explain it from the perspective of two different types of users. So let’s imagine that you’re a user that is comfortable with an analytics-type experience. So you’re comfortable with some loose query languages, stuff like that. You might not be a full-on programmer, or you could be a machine learning engineer, you could be a machine learning scientist. That sort of analytics skill set is sort of the base level, let’s say. So in that world, I’ll relate it to an HR use case. Please forgive me if I butcher it. But let’s imagine that you have just a mountain of resumes, and you want to predict potentially good fits for a software engineering role, right? Senior software-
Jon Krohn: 00:30:32
Exactly what our core matching model does.
Shayan Mohanty: 00:30:34
Sweet. Okay. Well, it turns out we work with a bunch of HR companies on this as well, so I’m kind of cheating. I know what they do. So one of the things is you can kind of bring what you already know to the table using Watchful. So given a stack of resumes, and that stack could be millions high, it doesn’t really matter. Let’s say you’ve got this one person on your team who you know can identify a good match versus a bad match. So instead of going through a million resumes one by one and saying, “Yes, this is a match. Or no, it’s not. Yes, no, yes, no.” Instead what they do is they say, “Okay, if I see this experience somewhere listed…” Maybe I’m looking for Node engineers. If they have Node.js listed on their profile, it might be a good match.
00:31:23
It’s not a rule. This is meant to be noisy. This is meant to be heuristic by nature, but you know that someone who has Node.js in their profile, more often than not, they’re probably going to be some sort of fit. Then you have to blend that with how much experience they have. So you create another heuristic and you’re like, “Okay, if they say that their last role was senior software engineer, then it’s very likely that they’ll also be a good fit.” So you can come up with these heuristics off the top of your head because this is what your recruiters or your experts are doing anyway. When they scan a resume, this is broadly what they’re doing internally, even if they’re not externalizing it. They scan a resume, you have to go really quickly, so chances are your brain is picking out specific keywords that make sense. Whether you’re command-F-ing them specifically or not doesn’t really matter. But fundamentally, you are doing that work. So we start there. You can bring what you already know to the table.
00:32:26
The second layer is that as you’re doing that, Watchful can suggest things to you and be like, “Hey, you said Node is important. What about Kafka? Is that useful?” And it’s like, oh yeah, actually Kafka is predictive. Because chances are a super junior engineer wouldn’t have experience with Kafka. So it is actually a predictor of seniority. So that is useful. And Watchful will keep suggesting more stuff. Like, “Okay, what about Cassandra in the skills area?” But then in the experience area, they need to have worked for a company that has done machine learning. So it’ll look for machine learning in that particular column. And it’s like, yeah, that is actually predictive.
00:33:04
So in that way, you’re not stuck having to come up with all these different heuristics off the top of your head. Now, Watchful is aiding you through this process. And you’re sort of in this give and take with the machine. And the cool bit is that as we talk about this, we’re not talking about how much data we have to label. It doesn’t matter. You can have a million resumes, you could have 10 million, you could have 100,000. It doesn’t really matter. Each time you create one of these heuristics, it shifts all of the labels for all of your data all the time. So it will go through and it’ll be like, “Okay, these are all the resumes that have Node in it. These are all the resumes that don’t have Node in it. Here are my new probabilistically weighted labels.” And the moment you create a new heuristic and it’s like, “Okay, here I’m going to re-shift everything.”
00:33:51
Each time you do this, Watchful gets better and better at predicting concepts. And the really cool bit is that you as the user are now up-leveling your involvement from the level of individual rows to identifying whether concepts map to your class or not. Does the concept of Kafka mean something as it relates to a senior software engineer? Does the concept of Cassandra mean something as it relates to senior software engineers?
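[Editor’s note: a hand-rolled sketch of the idea being described, not Watchful’s implementation. A few keyword heuristics each cast a noisy vote (or abstain) on every resume, and the votes are combined into a probabilistic label across the whole data set; adding a new heuristic re-shifts every label at once. The per-heuristic accuracies and the naive Bayes-style combination here are invented for illustration; real weak-supervision systems estimate those quantities from the data itself.]

```python
# Sketch only, not Watchful's machinery: heuristics vote on every resume, and a
# naive Bayes-style combination turns the votes into a probabilistic label.
import math

SENIOR, ABSTAIN = 1, -1

def lf_node(resume):  return SENIOR if "node.js" in resume.lower() else ABSTAIN
def lf_kafka(resume): return SENIOR if "kafka" in resume.lower() else ABSTAIN
def lf_title(resume): return SENIOR if "senior software engineer" in resume.lower() else ABSTAIN

HEURISTICS = [(lf_node, 0.70), (lf_kafka, 0.80), (lf_title, 0.90)]  # (heuristic, assumed accuracy)

def probabilistic_label(resume: str) -> float:
    """P(senior) from whichever heuristics fire, starting from a 50/50 prior."""
    log_odds = 0.0
    for lf, accuracy in HEURISTICS:
        if lf(resume) == SENIOR:
            log_odds += math.log(accuracy / (1.0 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))

print(probabilistic_label("Senior Software Engineer: Node.js, Kafka, Cassandra"))  # ~0.99
print(probabilistic_label("Junior developer, HTML and CSS"))                       # 0.5 (all abstain)
```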
00:34:21
So that’s one type of experience. And you can imagine the whole goal here is to make that whole experience really fast, so you’re not sitting there waiting for a query to return. You’re just instantly, the moment you click something, you should see stuff change, you should see another suggestion. You’re just going that whole time. And you’re picking out the things that speak to you, that sort of thing. There’s another experience. And this is one we’re very excited about, it’s what we call Copilot. And this is when your users are potentially true domain experts. So they maybe don’t have that analytical skill set where they want to actually be able to query stuff or look at their data. They are in your case, high powered recruiters who make a ton of money every day. And they don’t want to have to learn yet another skill, which is querying an interface that they’re never ever going to care about.
00:35:14
So it’s like, okay, great. Hand label as you always have been. And the cool bit is that we have evolved our platform to the point where, from your hand labels, we can actually go backwards and predict the heuristics that would be necessary to explain those hand labels. So in that world, your users just hand label. And the outputs of that hand labeling are, yes, hand labels, but also probabilistic labels across the entire data set and a set of heuristics that you can look at and be like, “Oh yeah, you actually automatically found that Node.js, Kafka, and Cassandra are all predictive of senior software engineers,” let’s say. Great, these are all correct. Maybe some of them might not be perfect and you’ll want to go in and tweak them.
00:36:06
But in this way, going back to that whole bias conversation, this is an example of explicitly checked-in bias. Every single thing that gets labeled by Watchful is produced by a heuristic or a set of heuristics that layer on top of each other. So you can always go backwards from a label to figure out why that label is the way it is. And if any of that is wrong at a conceptual level, you go back and you change it.
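[Editor’s note: a toy illustration of the “go backwards from hand labels to heuristics” idea, not Watchful’s Copilot. It scores a few candidate keyword rules against a handful of hand-labeled resumes and keeps the ones that agree with those labels; the data, candidate list, and threshold are all invented.]

```python
# Toy sketch: infer simple keyword heuristics that would "explain" a small set
# of hand labels. Everything here (data, candidates, threshold) is invented.
hand_labeled = [
    ("Senior Software Engineer, 8 yrs, Node.js and Kafka", 1),
    ("Staff engineer, Cassandra, distributed systems",     1),
    ("Bootcamp grad, HTML, CSS, looking for first role",   0),
    ("Junior developer, some Python",                      0),
]
candidate_keywords = ["node.js", "kafka", "cassandra", "html", "python", "senior"]

def precision_and_coverage(keyword):
    """How often the hand labels agree with this keyword rule when it fires."""
    fired = [label for text, label in hand_labeled if keyword in text.lower()]
    if not fired:
        return 0.0, 0
    return sum(fired) / len(fired), len(fired)

kept = [k for k in candidate_keywords if precision_and_coverage(k)[0] >= 0.9]
print(kept)  # ['node.js', 'kafka', 'cassandra', 'senior'] -- auditable rules you can edit
```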
Jon Krohn: 00:36:32
Got it. That all makes perfect sense. So you’ve got two kinds of experiences. In one of the experiences, you’re suggesting heuristics to the user to help them develop these kinds of rules that can then be automatically labeling the data. And then with the second experience, with Copilot, you are predicting heuristics. It’s kind of the inverse.
Shayan Mohanty: 00:36:54
Yep. That is exactly right.
Jon Krohn: 00:36:57
Yeah. So you’re hand labeling, and you have heuristic suggestions. And then ultimately, these heuristics avoid degenerative biases, because we have very specific rules each time. We don’t… there’s not a mystery behind why some candidate was selected and some other candidate was not selected for a given role. We have it all in heuristics.
Shayan Mohanty: 00:37:23
Exactly. And I will say on the record that this doesn’t avoid degenerative bias, to be clear. It’s just that degenerative bias will be checked in explicitly. So to your point, there’s no mystery as to why this candidate was labeled X instead of Y. You can go in and you can look. And if any of your heuristics are wrong or they’re capturing the wrong intention or whatever, you go back and you change that. As opposed to having to go back and talk to your labelers and interrogate them and be like, “Okay, why did you label X instead of Y? Is this a symptom of a deeper systemic issue in the way you’re evaluating the data? Or is this just a one-off, you were tired and you hit the wrong button type of thing?” So here it’s like you give the same row to Watchful, you’ll get the same label as long as none of the heuristics changed. You don’t have that same guarantee with human labelers. And that goes back to that determinism problem I talked about earlier.
Jon Krohn: 00:38:23
Cool. All right. So that sounds great. I love that idea. How do you make it work behind the scenes? So I know that there are concepts like weakly supervised learning that you’ve incorporated into your platform to enable this to happen.
Shayan Mohanty: 00:38:36
Yeah. So when people think weakly supervised learning, a lot of the time they think about Python functions that you write. That is true. You can think of a labeling function, one of these heuristics as just a function that returns a possible classification. So it doesn’t really matter what it does. It could be simple, like those keyword matches that I talked about earlier, or it could be much more complex like a database lookup and that sort of thing.
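[Editor’s note: a generic weak-supervision sketch, not Watchful’s API, just to ground the point about the shared “shape” of labeling functions. Both of these take an example in and return a candidate label or abstain, even though one is a trivial pattern match and the other stands in for a database lookup; all names and data are invented.]

```python
# Generic sketch: two labeling functions with the same signature, one a simple
# pattern match and one backed by an external lookup.
import re

FRAUD, ABSTAIN = 1, -1
KNOWN_BAD_ACCOUNTS = {"acct_123", "acct_987"}  # stand-in for a real database lookup

def lf_memo_keyword(txn: dict) -> int:
    """Simple heuristic: flag transactions whose memo mentions a wire transfer."""
    return FRAUD if re.search(r"\bwire transfer\b", txn["memo"], re.I) else ABSTAIN

def lf_account_blocklist(txn: dict) -> int:
    """'Database lookup' style heuristic: flag transactions from known-bad accounts."""
    return FRAUD if txn["account_id"] in KNOWN_BAD_ACCOUNTS else ABSTAIN

txn = {"account_id": "acct_123", "memo": "urgent wire transfer request"}
print(lf_memo_keyword(txn), lf_account_blocklist(txn))  # 1 1
```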
Jon Krohn: 00:39:05
So I guess to contrast, many of our listeners will be aware of what plain old supervised learning is. So supervised learning, you already have all of the labels to work with. So with the hotdog, not hotdog example, you’ve got 1,000 images. 500 of them are labeled as hotdog, and 500 are labeled as not hotdogs. And then so the supervised learning paradigm of machine learning is to have the input, in this case, the pixels of an image go in one end of a machine learning algorithm. And then its weights adjust over the training process to accurately predict whether those pixels correspond to a hotdog label or a not hotdog label. So that’s supervised learning, and that was the simplest binary case where you have two possible outcomes. But you can have a bunch of classes. So you could have images of cats, dogs, horses, and cars or whatever. So that’s supervised learning.
00:40:05
And then, so this weakly supervised learning idea that you just described, instead of… so I guess we’re still using the labels downstream for that kind of supervised learning approach that I just outlined. But instead of having the labels be manually labeled by some process, it could be experts, it could be a labeling farm in Bangladesh. With a weakly supervised learning approach, you are instead writing functions, in the simplest sense, to run over the input data that you have and predict a label.
Shayan Mohanty: 00:40:47
Yeah, that is exactly right. Weakly supervised learning is about using programmatic interfaces to produce the supervision, where strong supervision is just a human sat down and looked at this thing and made a judgment. Here, you could have several millions of rows. Humans are not going to look at all of them. But they will “supervise” quote-unquote the creation of those functions that are then used to create those labels. So that is the idea behind weakly supervised learning. And the reason why it’s become so popular recently is because, oftentimes when you’re training the data-hungry model architectures that are in vogue these days, you actually do need quite a lot of data to avoid things like over-fitting and that sort of thing. And oftentimes to generate that much data by human means, it’s hard. It’s expensive. It takes time, for all the reasons that we talked about earlier.
00:41:48
So weakly supervised learning is very interesting because A, it can speed up your time to value, so you can move quickly from an idea to an implementation. B, you have this explicitly checked-in set of sources of information, sources of signals, sources of supervision, which can be modified over time. So if people leave your labeling team that you’ve brought in-house or whatever, they’re not taking context with them. That context is instilled in the system itself. And the third point is that you’re able to output far more data at a much cheaper cost basis, even if you’re sometimes trading off a couple points of accuracy, precision, and recall over the entire data set. By providing these models with more data to learn from, they often yield better results, despite the fact that your input data has slightly lower accuracy, precision, and recall. And again, that’s sort of the extreme case. Oftentimes, we find that you’re actually able to meet or exceed the accuracy, precision, and recall that you expected from human labelers.
00:42:57
Because in practice, you are now able to have the right people in the seats labeling data, instead of trying to infuse their subject matter expertise into a bunch of people who are lower-cost, lower-wage folks basically, who might not have the context to begin with, but they’re learning it. So there’s some nuance there, basically.
Jon Krohn: 00:43:19
Cool. All right. So that all makes perfect sense to me. I love that. So we’re using a programmatic interface to automatically label large amounts of data, in contrast to a strongly supervised approach where we would be hand labeling each item. And in this weakly supervised approach, where we are automatically labeling very, very large amounts of data, that typically provides better model results, even if the accuracy of these weakly assigned labels is lower than the strongly assigned labels. But to your point, it isn’t even always the case that the accuracy is lower. So it ends up being kind of a misleading name.
Shayan Mohanty: 00:44:00
Yeah, it is.
Jon Krohn: 00:44:02
Because weakly supervised learning implies that the quality of the supervision is weaker. But you’re saying it’s just more automated. It’s not necessarily weaker.
Shayan Mohanty: 00:44:13
Yeah. And I mean, I could talk forever about this as well, but I’ll keep it relatively short. I think that there’s also this misconception overall in the data science community about hand-labeled data. And what I mean by that is the term ground truth or gold data gets thrown around a lot. And when you really dig into it, most of the time when people talk about ground truth, it’s not actually ground truth. It just means humans looked at it. So in the same way that hand labeling yields inconsistent results, unless you have layers of humans auditing layers of humans and so on, you can’t really guarantee that all of your labels are gold. All you know is that they’re probably pretty good because you had a bunch of people look at it, and you trust your systems. But more often than not, you’re not actually testing that, because that requires exponential growth of your people organization to stack all those projects and actually have the auditing and so on.
00:45:19
So our whole thing is we shouldn’t be using the term ground truth unless it’s inherent in the data. Back at Facebook, we did a lot of recommendation engines. And in those recommendation engines, the data that we were using to train them were things like what pages you liked, or what pages you interacted with, and what content you touched, and stuff like that. That’s ground truth. Because no matter how we look at that data, we can’t change what the user did. The user did it. It’s part of recorded history now. Whereas any time-
Jon Krohn: 00:45:56
Although.
Shayan Mohanty: 00:45:56
Go ahead.
Jon Krohn: 00:45:56
As a subtle twist on that, it is interesting to think that sometimes, what I interact with in say a social media feed is not the kind of content that I want to be shown.
Shayan Mohanty: 00:46:13
Totally.
Jon Krohn: 00:46:15
So you’re absolutely right that it is the ground truth in terms of it is definitely what I clicked on, but it doesn’t necessarily mean that that is the ground truth of what I want in my social media feed. So sometimes when I’m tired or I’ve had a frustrating day or whatever, I’m clicking on clickbait-y type content, so the social media platform learns that about me and it’s giving me more. And I’m like, “No.” It was a moment of weakness.
Shayan Mohanty: 00:46:46
100%. This doesn’t imply stupidity on the modeling side. In a perfect world, we’re able to identify when you are in that fugue state almost, and you’re just trying to doom scroll, versus actually trying to engage with content. In a perfect world, we’re able to identify that. But no matter what, it doesn’t change the fact that you did view that content, and that was content you did engage with. And it’s up to the modeling team or organization to figure out how and when to serve you that content, or how to even use that information.
00:47:21
The interesting bit is it didn’t really require further analysis by humans necessarily, right? I’m not making a judgment on, I don’t know, where I think you work, right? You could tell me you work at the CIA. I’ve got no idea if that’s true or not. I don’t have the context. But I would have to go through quite a lot of effort to go figure that out. And chances are, all the sources of information that I have at my disposal will be noisy in some way. They’ll be dated because they scraped your information from some other time. Or I might be looking at the wrong Jon, for instance. There are several different reasons why we might not get the right answer there.
00:48:04
But from a labeling perspective, since you lack the context, you lack fundamental ground truth to compare to. The next best thing is we have a bunch of humans that make their best judgments, and we kind of call it a day. Our whole thing is that everything about supervised machine learning has in one way or another been built up on this idea of ground truth. And we think that’s actually kind of dangerous. Because now, you’re judging your model’s performance off of an inherently flawed foundation. And it’s flawed in unknown ways. It’s not a way that you can audit. You can’t go back and check why such and such people labeled this data as X instead of Y. It’s very cumbersome to figure out where your foundation is flawed.
00:48:48
So the concept of weak supervision, yes, I agree the term weak implies some degradation of quality or something like that. But in fact, I think it requires a re-lensing of the whole situation where, to begin with, you didn’t have ground truth. Even if you throw a bunch of humans at the problem, as you increase the number of humans who review a thing, you also increase the likelihood that they’re going to disagree with one another. And the fact is that oftentimes in labeling structures, you don’t capture that nuance. If three out of five people said this thing is X instead of Y, if you go through a labeling company, oftentimes it’ll just be like, “All right, this thing is X.” When there might be information in the fact that those two people said it’s Y, right?
00:49:34
So we actually believe in probabilistic labeling, and that allows us to capture a more realistic distribution of the label space. Instead of saying things are always black and white, on or off type of things, there’s actually a distribution in the middle. And it is useful to capture that distribution. Some resumes that you review are going to be very strongly predictive of senior software engineer titles. That makes sense. Sometimes, it’ll be a little less obvious. And that’s just the real world. So you want your model downstream to pick up on that nuance. But there isn’t really a nice way to capture that nuance when you have a bunch of humans who are hand labeling. And there are lots of different variables that go in the middle there, which is why a lot of the time those labels get ceilinged or floored to zero or one. Whereas when you do this through purely programmatic means, you have more opportunity to explore the nuance in between. And you have more explicit definitions of how you’re weighting things, how you’re calculating these numbers, and that sort of thing.
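[Editor’s note: the “three out of five annotators” point can be made concrete. Instead of snapping disagreement to 0 or 1, keep the vote fraction as a soft label; the votes below are made up, and the Laplace smoothing term is just one common way to keep labels off the hard extremes.]

```python
# Sketch: keep annotator disagreement as a soft (probabilistic) label instead
# of flooring or ceiling-ing it to 0 or 1. alpha is Laplace smoothing.
def soft_label(votes_positive: int, votes_total: int, alpha: float = 1.0) -> float:
    """P(label = positive) estimated from raw annotator votes."""
    return (votes_positive + alpha) / (votes_total + 2 * alpha)

print(soft_label(3, 5))  # ~0.57 -- majority said positive, but the doubt is preserved
print(soft_label(5, 5))  # ~0.86 -- unanimous, yet still not a hard 1.0
print(soft_label(0, 5))  # ~0.14
```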
Jon Krohn: 00:50:42
That is definitely a way that I’ve rarely thought about labels, but it makes so much sense to have probabilities as opposed to discrete labels. Makes so much sense. And then, you could also be in a scenario where you could say, “Okay, I’m actually going to deliberately choose only data points where there’s a very strong opinion.”
Shayan Mohanty: 00:51:09
Totally.
Jon Krohn: 00:51:12
So if it’s some binary outcome that we’re predicting, I’m not going to take the scenarios where it had a probability of 0.51 of being a positive case, because it was almost equally likely to have been a negative case. So it was more ambiguous, that data point.
Shayan Mohanty: 00:51:29
Yeah, absolutely. Or alternatively, you can weight the things that have a lower confidence down. There’s no requirement that all of your rows of data weigh the same in the training process. So you can provide that context to your model and let your model figure it out. That’s kind of the point of deep learning. You just throw features and data at a model architecture that can hopefully find nuance and subtlety in those connections. And you sort of see what comes out the other end. This is sort of a crass view of deep learning, but that’s sort of the idea. You let it do what it’s good at.
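[Editor’s note: one straightforward way to “weight the things that have a lower confidence down,” assuming an estimator that accepts per-row sample weights, as scikit-learn’s do: weight each row by how far its probabilistic label sits from 0.5. A sketch with made-up data, not a recommendation of any particular model; frameworks that accept soft targets directly are another option.]

```python
# Sketch: down-weight uncertain probabilistic labels during training.
# The tiny data set and the weighting scheme are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2], [0.9], [0.4], [0.8], [0.5]])
p = np.array([0.05, 0.95, 0.40, 0.90, 0.51])   # probabilistic labels from weak supervision

y = (p >= 0.5).astype(int)        # hard labels for the classifier
confidence = np.abs(p - 0.5) * 2  # 0.0 at p = 0.5, 1.0 at p = 0 or 1

model = LogisticRegression().fit(X, y, sample_weight=confidence)
print(model.predict_proba([[0.85]]))  # the confident rows dominate the fit
```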
Jon Krohn: 00:52:03
Super cool. All right. So we’ve talked about this idea now of weakly supervised learning thoroughly. And I love this idea of probabilistic labels that came out of the discussion. There were some other topics that you and I touched on before starting recording that I thought would be really fascinating to talk about with respect to labeling. So there’s this computer science aspect that ties into linguistics. So language design principles, and something called the Chomsky hierarchy that you referred to. What is the Chomsky hierarchy, what are language design principles, and how do these linguistic ideas tie to computer science?
Shayan Mohanty: 00:52:47
Yeah. So let’s start as close to first principles as we can get. Which is in our product, we decided that weak supervision was going to be one of the tools that we use to solve the labeling problem. What we realized is that weak supervision at its heart, these labeling functions are really like search functions in a lot of ways. You’re returning a true false value for a particular predicate. So given some input, you’re assessing some logic and you’re returning true or false. That same function signature exists in search engines. Do these pages have the pattern that you’re looking for? Is it similar to what you’re looking for? It’s like a yes or no question. And you can rank it, obviously. You can say how similar it is and that sort of thing. So fundamentally, if you think about it from a first principle standpoint, this is a search problem.
00:53:51
So the tools that people have been using historically to create labeling functions, and I say historically, it’s really only been a thing for the last couple years, are really just programming languages. It’s Python and stuff like that, the lingua franca of the data science world. And that makes sense. However, it, number one, raises the barrier to entry for folks who want to be able to create labeling functions but may not know Python. And two, it requires a fair amount of understanding of the underlying data structure. So for instance, you could imagine a labeling function that runs a regex on some text. Great. If your input is a CSV or something, you could imagine that you’re running it over a particular field, or a set of fields, or something.
00:54:42
But if you want to index into some data, if you want to index into a PDF that’s been OCR’d and there’s this data structure that you now have to interact with, that’s a non-trivial thing for you to wire up, when really the thing you’re just trying to do is run a regex, right? So that’s really point number one. Point number two is we saw an inherent need to automate the process of weak supervision. And what I mean by that is weak supervision at its core, you can think of it as an ensembling technique. You’re creating large quantities of independent sources of signal, and you’re combining them in ways that produce labels. So really the weakly supervised process is just ensembling. And the question is, what sort of interesting signals can you introduce? And how independent are they? How many different varieties? How much of the problem do they capture?
00:55:40
So weakly supervised systems benefit from creating large quantities of these heuristics very quickly. And if you’re sitting there just coming up with things off the top of your head, it’s going to take you a long time. Even creating one labeling function will take you several minutes. But your problem might be correctly solved with 200 labeling functions. You don’t know ahead of time. So are you going to spend all that time just coming up with stuff off the top of your head? Probably not. So we need ways to generate labeling functions. And the problem is Python and pretty much every programming language today can’t easily be generated, at least not in a safe way. This lends itself to the halting problem, for instance, where given some program, you can’t predict ahead of time if that program will terminate or not without running it. Every Turing-complete language suffers from this particular problem. So if we were to generate just arbitrary Python, we don’t know ahead of time if it’s going to be safe to run.
00:56:47
So that means that we have to constrain the space that we’re generating somehow. And that lends itself to what’s called a formal grammar. And this is where Chomsky’s hierarchy comes in, where you have different types of grammars. So the extremes are, what is it, infinitely recursive? I forget what it’s called. Let me actually quickly Google this. Unrestricted grammar, that’s what it’s called. That’s basically like a Turing machine. It has an unlimited amount of context that it can pull from. You have variables that are defined somewhere else in the program that are eventually used downstream. It requires a lot of context to evaluate this thing. Versus at the very bottom layer of the hierarchy, you have what are called regular grammars. Which, as an example, you have regular expressions. And these are just things that can be processed by finite state machines.
00:57:46
And the reason why this is important is if you think about a regex, if I have the regex ABC, all that means is in some text I want to find ABC and return true if I find it. So what I can do is I can actually take the letters ABC and turn it into a finite state machine. I can just say I start. If I find the letter A, transition to the next state. If I find the letter B right after that, transition to the next state. If I find the letter C right after that, transition to the next state. I didn’t need to know anything. I didn’t need to retain any information about the previous values that were found, or anything that’s later on really. I can evaluate this linearly, and I can scan through the text from left to right, let’s say. And I can process this thing in line, which means that I can efficiently process this and I can calculate if this thing is going to terminate or not.
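[Editor’s note: here is the “ABC” example written out as the finite state machine being described: one integer of state, a single left-to-right pass, no backtracking, and termination is obvious. An illustration of regular-grammar matching only, not Watchful’s engine.]

```python
# The search for "ABC" as an explicit finite state machine: one pass, one
# integer of state, guaranteed to terminate in time proportional to the text.
def contains_abc(text: str) -> bool:
    state = 0                          # how much of "ABC" has been matched so far
    for ch in text:
        if state == 0:
            state = 1 if ch == "A" else 0
        elif state == 1:
            state = 2 if ch == "B" else (1 if ch == "A" else 0)
        else:  # state == 2, "AB" already matched
            if ch == "C":
                return True            # accepting state reached
            state = 1 if ch == "A" else 0
    return False

print(contains_abc("xxABCxx"))  # True
print(contains_abc("ABABC"))    # True
print(contains_abc("ABAC"))     # False
```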
00:58:41
Interesting side note: most regex implementations in Python, Java, that sort of thing, are actually non-regular. They don’t conform to the regular grammar anymore. And that’s because people over time have added interesting features like arbitrary lookaheads, and back-references, and that sort of thing. That’s beside the point. So what we did is we had to constrain the grammar that people were able to use to query data, so that we could then simulate it. We could generate it, and we know it’ll terminate. We know we can evaluate it efficiently. So that implied that we had to basically build a new language.
00:59:19
Now before you design a language, you have to choose not to use all the other languages that exist. So we had to have very good reason to do that. And when we talk about search queries, there are lots of search technologies out there that have done exceptional jobs. Elasticsearch is one that immediately comes to mind. So the question that we often get is, “Why not just use Elasticsearch? Why did you have to invent your own thing?” And the thing that we realize is that counterintuitively, labeling is a small data problem. You have a finite, fixed data set typically that you’re interacting with. You’re usually not labeling in an indefinite way. You typically take a sample from a larger set that you believe is representative. You put it somewhere, and you have labelers interact with that. And that data set doesn’t change for the duration of that period.
01:00:17
So what that means is that Elasticsearch, on the other hand, was actually designed for very large data problems. It was designed for problems where the data actually does need to be split out across many machines, and it can’t be fit on a single machine. So they designed clever ways to interact with that sort of data. We don’t have that problem. And in fact, we have the inverse problem. We have data that can fit on a single machine, or parts of data that can be fit on individual machines, let’s say. And we want to interact with that data as quickly as we possibly can. Again, to stay in that flow state. How do we do that?
01:00:53
So things like Elasticsearch didn’t quite work. And not to mention the language itself became cumbersome for the types of things that you’d want to do from a labeling perspective. You don’t just want to be able to do things like keyword searches. You want to be able to do things like similarity searches. How close is this thing to this other thing? You want to be able to do things like database look-ups, all sorts of stuff.
01:01:12
So we had to create our own language, which then implies that we had to create our own query engine. The thing to process the language and turn it into something that we can evaluate. Which then also implies that there is a data structure that we’ve created that allows you to efficiently index in the ways that you’d want. So whether you’re doing full text classification, whether you’re doing NER, a segmentation type problem. Even into images, video, audio, etc. We believe that all these problems share the same relative footprint, where I have bytes that are laid out in some way, that makes sense to the machine and to the users. But I have bytes fundamentally. And I have a graph that points into those bytes. So nodes in the graph could be things like hand labels that people have created and said, “Okay, this part of the text is such and such label.” It could be these labeling functions that are pointing into the data and be like, “These are the parts of the data that I affect.” It could be what we call derived labels, things like labels that are predicted. So probabilistic labels that we generate from the combination of all your different heuristics.
01:02:25
Fundamentally, these are all just nodes in a graph that are pointing into data. And to access that data, you could either access the data directly or you could go through the nodes and say, “Show me everything that has a greater than 90% likelihood of being such and such class.” Or, “Show me everything that this person has hand labeled over this period of time.” All of that is just metadata stored in a graph that indexes into this data. So we have this stack of things we had to build in order to do the really cool thing, which we’re really excited about, which is that suggestion engine. And because we’ve optimized the entire stack for that suggestion engine, because we can do things like sub-millisecond latency on queries, that means that we can run over 1,000 queries every second. A user is not going to sit there and do that. A user’s not going to type out a query in a sub-millisecond time, but the machine can.
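As a rough Python sketch of the shape he is describing: fundamentally just bytes, plus a graph of label nodes whose spans index into those bytes, and queries that go through the node metadata. The class names, fields, and example offsets are assumptions made for illustration; they are not Watchful’s actual data structure.

```python
from dataclasses import dataclass, field

@dataclass
class LabelNode:
    """A node in the label graph; (start, end) points into the underlying bytes."""
    kind: str                 # "hand_label", "labeling_function", or "derived_label"
    label: str                # e.g. "PERSON", "CITY"
    start: int                # byte offset where the span begins
    end: int                  # byte offset where the span ends (exclusive)
    source: str = ""          # the annotator, heuristic, or model that produced it
    probability: float = 1.0  # derived labels carry a probabilistic score

@dataclass
class LabeledCorpus:
    """Fundamentally just bytes, plus a graph of nodes pointing into those bytes."""
    data: bytes
    nodes: list = field(default_factory=list)

    def span_text(self, node):
        return self.data[node.start:node.end].decode("utf-8")

corpus = LabeledCorpus(b"Jon met Shayan in San Francisco.")
corpus.nodes.append(LabelNode("hand_label", "PERSON", 0, 3, source="annotator_1"))
corpus.nodes.append(LabelNode("derived_label", "CITY", 18, 31, source="model", probability=0.94))

# "Show me everything that has a greater than 90% likelihood of being such and such class."
print([corpus.span_text(n) for n in corpus.nodes
       if n.kind == "derived_label" and n.probability > 0.9])     # ['San Francisco']

# "Show me everything that this person has hand labeled."
print([(n.label, corpus.span_text(n)) for n in corpus.nodes
       if n.kind == "hand_label" and n.source == "annotator_1"])  # [('PERSON', 'Jon')]
```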
01:03:24
So while the user is doing stuff, while the user is labeling, interacting with the product or whatever, we have a simulation engine that’s running in the background constantly trying all the possible queries that a user might want to do, and then weighing out the probabilities, and choosing the things that are very likely predictive of the class and showing them to the user. Or in the case of copilot, checking them in and maintaining a set. And the really cool bit about that is that in the world of copilot, you don’t necessarily even need a human to be sitting there driving it. If you happen to have already labeled data that you might have labeled at some point in the past by hand, you can run it through this process. And Watchful will act like a human is sitting there saying, “Yes, no, yes, no, yes, no.” But in fact, it’s just your data, and the simulation engine is creating all the heuristics that it would take to basically explain those labels. So you could give it already labeled data, and you get out heuristics that you could use to label all your data, not just that sample, in exactly the same way. But now it’s all explainable. You can go back and be like, “All right, this function is not quite right. Here’s how I want to tweak it.” And that sort of thing. So yeah, we had to go really deep on this, but we ended up with a very interesting solution.
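A heavily simplified sketch of that background loop: generate candidate queries (here just single-token keyword heuristics), score each against whatever hand labels exist, and surface the candidates that look most predictive of the target class. In the copilot case, the hand labels would come from a previously labeled dataset rather than a live user. This illustrates the idea only; it is not Watchful’s simulation engine, and the corpus, scoring, and candidate generation are all invented for the example.

```python
import random
from collections import Counter

# Toy corpus of (document, hand_label or None). The unlabeled documents are the
# ones the suggested heuristics would ultimately help label.
DOCS = [
    ("please refund my order", "complaint"),
    ("i want my money back", "complaint"),
    ("refund never arrived", "complaint"),
    ("love this product", "praise"),
    ("works great, thanks", "praise"),
    ("shipping was fast", None),
]

def candidate_queries(docs):
    """Enumerate simple candidate heuristics: 'document contains this token'."""
    vocab = Counter(tok for doc, _ in docs for tok in doc.split())
    return [tok for tok, _ in vocab.most_common()]

def score(token, target, docs):
    """Precision and coverage of 'contains token => target class' on the hand labels."""
    hits = [lab for doc, lab in docs if lab is not None and token in doc.split()]
    return (sum(lab == target for lab in hits) / len(hits), len(hits)) if hits else (0.0, 0)

def suggest(target, docs, trials=1000, min_coverage=2):
    """Randomly try candidate queries in the background and keep the most predictive ones."""
    pool = candidate_queries(docs)
    best = {}
    for _ in range(trials):
        tok = random.choice(pool)
        precision, coverage = score(tok, target, docs)
        if coverage >= min_coverage:
            best[tok] = (precision, coverage)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(suggest("complaint", DOCS))  # e.g. [('refund', (1.0, 2)), ('my', (1.0, 2))]
```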
Jon Krohn: 01:04:46
Yeah, that is super interesting. Thank you. So Shayan, you talked about the simulation engine. Were there any particularly interesting aspects of creating the simulation engine in order for it to be so performant? So you’re describing how it needs to be able to run in real time. As people are doing their regular labeling work, it’s running thousands of simulations. How did you program this simulation engine to run so efficiently and provide effectively real-time results to your users?
Shayan Mohanty: 01:05:20
Yeah, yeah. So luckily it’s not just me programming this. In fact, I don’t program that often anymore. I’ve got a very smart team. The royal you. So what I will say is that we’ve tried to make certain design decisions around where complexity is introduced into the system. And this is important as you think about scaling teams and as you think about hiring new folks who don’t necessarily have the same context or background. We’ve been steeped in this problem for a very long time, so we want to make the barrier to entry as low as we possibly can.
01:05:57
So all that said, the simulation engine is intelligent. It’s designed in a smart way. But actually, most of the performance wins are not in the simulation engine itself. They’re actually in that entire stack I described earlier, where you’ve got the query language and the query engine, and you’ve got this really efficient data structure. Because all of our queries, whether they’re simulated or not, are very fast, that allows us to create a simulation engine that is inherently very fast, because it’s not waiting for results very frequently. Every so often, it’ll hit a long tail or whatever. But that’s getting rarer and rarer as we go on.
01:06:45
In fact, with the simulation engine, we want to make sure it introduces as little cognitive complexity as possible for new engineers, so that it’s maintainable. So if you look at it, it should look like most other Monte Carlo simulations you’ve seen in the past, with obviously some caveats. The things that are really clever, we’ve hidden behind library interfaces and things like that within the company. So we can have people who are just dedicated to maintaining high-performance code. And then the folks who are using it are just using APIs essentially. That’s more of an organizational directive than anything else.
Jon Krohn: 01:07:25
Awesome. All right. So that makes sense. So you’re doing Monte Carlo simulations and you have your own programmers developing really high speed implementations that can be abstracted away, hidden away from ordinary users who are just calling relatively straightforward APIs?
Shayan Mohanty: 01:07:41
Yeah. Yeah, exactly. And to be clear, when I say quote unquote “ordinary users,” I really mean other engineers within Watchful. As we think about growing the team, we want to make sure that we don’t have to specifically be looking for high performance computing folks, because the whole company doesn’t need to be like that. Ideally, data scientists can join and they know data science, and that’s all they need to know. And all the high performance computing stuff should happen in the background without them necessarily needing to reason about it.
Jon Krohn: 01:08:13
Okay. So on that note, are you doing any hiring, and what do you look for in the people that you hire? Yeah.
Shayan Mohanty: 01:08:19
Yeah, we’re always hiring. So I guess relevant to potentially your audience here, we are looking for what I’m going to call data scientists, but really we’re looking for machine learning researchers specifically. So folks with academic backgrounds in something like weak supervision, or active learning, or some combination of that. Very, very specific profile. We’re also looking for machine learning engineers, folks who have experience going from development to production with machine learning. And we’re calling them explicitly machine learning engineers because we’re excited about folks who can bring that engineering mindset to the table, as opposed to a pure data science, pure research background.
01:09:03
We’re also looking for software engineers. Our stack is kind of esoteric. We have a Rust backend with a ClojureScript frontend. Happy to talk about why we did that. But suffice it to say it’s an interesting stack. We’re looking for people both on the frontend and the backend. So if you’re familiar with React, if you’ve ever used a Lisp before, or ClojureScript especially, please reach out to us. We’re excited about you folks. And if you’ve ever touched Rust, or even if you have a background in C++ or something. Rust has a learning curve, but we’ve gotten quite good at teaching it to people. So yeah.
Jon Krohn: 01:09:44
Cool. Sounds like amazing roles, amazing technologies that you’re working with. So you clearly have a great sense of technology. Actually, before we even started recording this episode, Shayan gave me lots of great tips on audio and video setup. So you’re probably hearing that his audio quality is outstanding. And then if you’re watching the YouTube version, you’ll also see that he has a beautiful camera setup. So clearly a very technical person. Things like having a Rust backend and a ClojureScript frontend are also the kinds of things that happen when you have a really technical person like Shayan at the helm of a company. So I think another interesting question to ask you would be whether there are any day-to-day tools that you think our listeners should know about.
Shayan Mohanty: 01:10:39
To tell you the truth, I am very bad at keeping up with the latest and greatest in terms of tooling. I feel like my tooling is actually quite dated. If you look at my day-to-day, most of it is having conversations with folks. A lot of my influence is through conversation rather than through directive or through implementation. So these days, I spend a lot of my time on Zoom. So that is a tool I use that I think everyone else is using a lot these days. But actually I think more interestingly, I find that writing, or by extension programming and experimenting, is a tool of thought for me. So I like using Emacs for instance, because it allows me to have full… The impact of what I write is almost infinite. I could write things that affect my editor. I can affect Emacs directly. I could use Emacs through Org mode to jot down my thoughts and do sort of literate programming, for instance. There are a lot of interesting things that I can do, and there are a lot of interesting integrations that I’ve done with other parts of my workflow.
01:11:58
Recently, we picked up Notion as a company. And Notion is quite nice because it gives me a lot, not all, but a lot of the same capabilities that Org mode gives me. But now it’s in a way that the rest of my company can integrate with, and they don’t all have to be running an Org mode reader or something. So I would say that my tool usage is relatively limited these days. It’s mostly Emacs, Notion, Zoom. I’ve got Fantastical as my calendar. It’s stuff like that. I’m a very boring person these days. If you’d asked me a couple of years before, I might have had more interesting answers.
Jon Krohn: 01:12:37
I think the Emacs answer was great. That is a tool that we haven’t talked about much on air, but one I know is beloved as a keyboard-based editor within a terminal. It’s only through lack of time that I haven’t explored it. I have been a Vim person for my terminal editing, without having to click and point. But I would love to understand Emacs more. I know that there are lots of benefits. And I used to work with a software developer, Jake [inaudible 01:13:12], who loved Spacemacs.
Shayan Mohanty: 01:13:16
Yeah. I use Spacemacs actually, which is that middle ground between Emacs and Vim. I think one of the things that I like about Evil mode, which is the Vim key-binding integration for Emacs, is that the time between me thinking of an action and doing it is minimized as much as possible. Which sounds trivial when you’re talking about the difference between two keys versus one key. But for whatever reason, there’s just extra energy that you have to spend thinking about Control-C, Control-E versus just a simple Space-E or something like that. Sequences of keys versus combinations. So I personally like it, but I know a lot of people who are Emacs purists who would probably eviscerate me for saying that.
Jon Krohn: 01:14:10
Cool. Well yeah, so Spacemacs, definitely something to check out if you want to be the most efficient programmer that you can be. Very cool. All right, Shayan, this has been an amazing episode. I’ve learned so much. It’s been fascinating. Do you have a book recommendation to leave us with?
Shayan Mohanty: 01:14:26
Oh god, yeah. I’ve got an entire sequence. So The Three-Body Problem. So that is the name of the first book by Cixin Liu. But the entire series is phenomenal. It’s incredible. It is mentally expanding. And then in the same vein, there’s another book called Diaspora. But if I had to pick one of them, it’s got to be the Three Body Problem series.
Jon Krohn: 01:14:53
Awesome. That is a great suggestion. And perhaps unsurprisingly, given the kinds of guests that we have on the show, not the first time that it’s been recommended, but a great recommendation. And then yeah, clearly a brilliant guy. Lots to say on fascinating technical topics. How can our listeners follow you after this episode?
Shayan Mohanty: 01:15:15
Hey, come chat with me. I’m on Twitter. I’m on LinkedIn. So my Twitter handle is @shayanjm. And you can find me on LinkedIn at the same handle. These are the types of things I’m thinking about all the time. So if other folks have opinions, or want to learn more, or provide their own thoughts, I’m always open to chatting.
Jon Krohn: 01:15:37
Nice. All right. Thank you very much for opening up that line of communication with our listeners. Shayan, thank you so much for an amazing episode. And hopefully we’ll have the opportunity to have you on the show again soon.
Shayan Mohanty: 01:15:47
Thanks again, Jon. This was a lot of fun.
Jon Krohn: 01:15:48
Wow, Shayan is such a bright spark. I was deeply impressed by his ability to clearly and confidently convey complex technical content in today’s episode. In it, Shayan filled us in on how bias in machine learning is generally good, such as when we consider it as a model parameter or a tool for healthily adjusting output. But how degenerative bias is bad, wherein our data capture an unfair stereotype from society.
01:16:19
We also talked about the arguments against hand labeling, including that it can encode that degenerative bias in our models; how it’s slow to adapt to data drift, particularly in adversarial scenarios; the high capital cost; the high time cost; and its association with the exploitation of people offshore. He then went into how his company Watchful provides an alternative to hand labeling by suggesting and predicting heuristics that automatically label data and demystify degenerative biases. And then finally, he also told us about his love of the Spacemacs command-line editor that combines the best of Emacs with Vim.
01:16:55
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Shayan’s Twitter and LinkedIn profiles, as well as my own social media profiles at www.superdatascience.com/635. That’s www.superdatascience.com/635.
01:17:12
If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. And of course, subscribe if you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter, and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show.
01:17:34
Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another enthralling episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I’ve hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast.
01:18:11
Last but not least, thanks to you for listening all the way to the end of the show. Until next time my friend, keep on rocking it out there, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.