Jon Krohn: 00:00:00
This is episode number 635 with Shayan Mohanty, CEO of Watchful. This episode is brought to you by Iterative, your mission control center for machine learning.
00:00:14
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:45
Welcome back to the SuperDataScience Podcast. Today we’ve got the absurdly intelligent and eloquent Shayan Mohanty on the show. Shayan is the CEO of Watchful, a Bay Area startup that he co-founded to automate the process of distilling, scaling, and injecting subject matter expertise into machine learning models. He’s also a guest scientist at Los Alamos, a renowned national security lab in New Mexico. Previously, he worked as a data engineer at Facebook, and he was co-founder and CEO of a pair of other tech startups. He holds a degree in economics from the University of Texas at Austin. Today’s episode will be of interest to technical data science experts and non-technical folks alike as it addresses critical issues associated with creating data sets for machine learning models. Issues we should be aware of regardless of whether we’re more technically or commercially oriented.
00:01:33
In this episode, Shayan details why bias in general is good; why degenerative bias in particular is bad; arguments against using the supposed gold standard of manual labeling for creating machine learning data sets, including its capacity to introduce degenerative bias; and how his company Watchful has devised a better alternative to manual labeling, including its fascinating technical underpinnings, such as weakly supervised machine learning, the Noam Chomsky hierarchy of languages, and their high-performance Monte Carlo simulation engine. All right, you ready for this enthralling episode? Let’s go.
00:02:16
Shayan, welcome to the SuperDataScience Podcast. It’s great to have you here. Where in the world are you calling in from?
Shayan Mohanty: 00:02:22
Thanks so much, man. I’m calling from San Francisco. When we started the conversation, it was a little rainy and now it is sunny, so it’s very San Francisco outside.
Jon Krohn: 00:02:32
Yeah, so I saw you in person for the first time last week at the time of recording. Last week we were together at the Open Data Science Conference West, ODSC West. I met you in person, and we had horrific weather the entire week. In Celsius it was like five degrees; in Fahrenheit it was low forties, and rainy. Meanwhile, I live in New York, and the entire time that I was in San Francisco, it was beautiful there. So Celsius, 22 degrees; Fahrenheit, like mid-seventies and sunny. Everybody was messaging being like, “What are you up to? It’s so nice, let’s go out,” and I’m like, “I’m freezing in San Francisco.” But what’s that Hemingway quote? That the coldest winter of my life was the summer in San Francisco.
Shayan Mohanty: 00:03:21
That is correct, yes. I will say that if you came one week before, it was gorgeous outside. We were all in shorts. We were just vibing out in Dolores. But somehow you came the one weekend where it was really, really bad outside. So I apologize to you on behalf of all of San Francisco.
Jon Krohn: 00:03:41
So I had one really, really nice day in New York. And then the very next day, it’s freezing cold again. It’s just like, it wasn’t San Francisco at all-
Shayan Mohanty: 00:03:48
Rough. I’m just going to blame you then. You brought the weather.
Jon Krohn: 00:03:52
Yep, that’s how it works. This is a science program. You heard it here first. I just bring sadness and dreary weather everywhere I go. That’s my thing. So yeah. So we met in person. That was really nice. You were giving a talk at ODSC West. It was about bias. Can you fill us in more about what you covered in the talk?
Shayan Mohanty: 00:04:14
Yeah, yeah. So the title of the talk was kind of grabby: “Bias Is Good: Arguments for Automated Labeling.” So kind of the core argument, so to speak, is that the word bias is overloaded in data science. Not only does it mean the societal impact of models that are trained based on historical biases, historical stereotypes and stuff like that. It also has the connotation of weights and biases, a constant [inaudible 00:04:55] factor. So there are several different uses of this term, and there’s sort of a middle ground term as well, which is, go ahead.
Jon Krohn: 00:05:04
Oh yeah, I was just going to say, from a technical perspective, in that sense bias isn’t bad. Bias is just a part of it; in a neural network you have weight parameters, you have bias parameters. But you could even think of it this way: you could describe the y-intercept offset in a simple regression model as the bias.
Shayan Mohanty: 00:05:22
Exactly. 100%
Jon Krohn: 00:05:23
That’s completely harmless. I like the idea of some really naive data scientist just starting out with their first regression model or their first neural network or something and they’re like, “I should remove the bias term,” and then just never having a model that fits the data very well.
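[Editor’s note: to make the harmless kind of bias concrete, here is a minimal sketch, ours rather than anything from the episode, using scikit-learn. The “bias” here is simply the regression intercept; forcing it to zero pushes the fit through the origin and typically hurts it.]

```python
# Minimal illustration: the "bias" is just the intercept of a linear model.
# Forcing it to zero (fit_intercept=False) pushes the line through the origin,
# which usually fits the data worse.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=100)  # true slope 3, true bias (intercept) 5

with_bias = LinearRegression(fit_intercept=True).fit(X, y)
without_bias = LinearRegression(fit_intercept=False).fit(X, y)

print(with_bias.coef_, with_bias.intercept_)             # roughly [3.0] and 5.0
print(without_bias.coef_, without_bias.intercept_)       # slope distorted to absorb the missing offset
print(with_bias.score(X, y), without_bias.score(X, y))   # in-sample R^2 drops without the bias term
```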
Shayan Mohanty: 00:05:45
Yeah. So our whole thing was that it’s important to lens the word bias with all of these things. But it’s not all that helpful to have a super overloaded term that’s used pretty widely across machine learning.
Jon Krohn: 00:06:02
You were going to give us a third one, but then I spoke over you.
Shayan Mohanty: 00:06:09
Sorry. The first one is the societal impact stuff, like a model trained on stereotypes, essentially. The second was your bias factor, your weights and biases, your bias parameter really. The third is this middle ground where you actually are biasing a model towards a true representation of reality. So it’s still sort of an uplevel of the bias concept in terms of a bias parameter, but it has no negative connotation to it. So our whole thing, at least at the beginning, was we should just define two types of bias real quick, just so that we have a basis to talk about. So basically we proposed bias is just an act of shifting the effect of a model to one side or another, or to a specific area. So limiting the space within which a model has acceptable outputs.
00:07:11
So an example of this is I want to train a model to predict spam. I could bias it towards predicting spam correctly. I could bias it by saying if the word Viagra shows up, it’s very likely spam. And that’s probably true to some extent. But it’s still very important for us to be able to talk about the societal impact of bias. So we propose another terminology, which is degenerative bias, where in fact when you bias a model using these outdated stereotypes or whatever they happen to be, your personal biases that you bring to the table, what you’re doing is actually biasing the model away from an accurate representation of reality. So it’s important that we lens the discussion with those two things.
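[Editor’s note: Shayan’s spam example can be written down as a tiny, explicit heuristic. This is an illustrative sketch only, not Watchful’s interface; the point is that the bias (“Viagra suggests spam”) is checked in as code you can inspect, audit, and revise, rather than living implicitly in a labeler’s head.]

```python
# Illustrative sketch: an explicit, auditable heuristic. It deliberately biases
# labels toward "spam" when a noisy-but-useful keyword appears, and abstains otherwise.
SPAM, ABSTAIN = 1, -1

def lf_contains_viagra(email_text: str) -> int:
    """Noisy heuristic: 'viagra' is strong (but not perfect) evidence of spam."""
    return SPAM if "viagra" in email_text.lower() else ABSTAIN

print(lf_contains_viagra("Cheap Viagra, click now!"))  # 1  -> votes spam
print(lf_contains_viagra("Lunch at noon?"))            # -1 -> abstains, no opinion
```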
00:07:59
And the core point of that whole talk is basically to go on to say that the way supervised machine learning has evolved over time has made supervised machine learning very reliant on hand-labeled data, which is one of many different ways this bias can creep in, specifically degenerative bias. You can have hand labelers who bring their own biases to the table. You’ve got the way you’re sampling data, that could be biased inherently, the way you’re stratifying it. All sorts of things can affect the way your model’s performing. These days, most of the time, I’m not going to say all the time, but most of the time your model’s not biased because you trained an inherently biased architecture. It’s usually because your data’s biased in some way, degeneratively biased.
00:08:54
So our whole argument was, “Look, we can’t really get away from labeling,” so you have to label data. But maybe there’s a better way to be more explicit about the biases we’re checking in. And if you’re explicit about them, then it’s possible to go back and change them. So now the question is what techniques do we have to be able to do that sort of thing? And weakly supervised learning is one of them. Active learning is another one of them. I had a whole slide of just various techniques and how they kind of play together. But it was meant to just be like a, “Look, you can’t really get away from bias. The whole point of labeling is to bias the model. But what you do want to do is limit the amount of degenerative bias you’re introducing. And in order to do that, it has to be explicit. It can’t be implicit.” So that was sort of the core of the talk.
Jon Krohn: 00:09:45
Cool. And so something I’d love to dig into more is the arguments against hand labeling. So you mentioned how hand labeling is one way that we can introduce degenerative bias, as you termed it. But what are the other arguments against hand labeling? I know that that’s something that you’re an expert in.
Shayan Mohanty: 00:10:04
Yeah, so there are lots. So starting with bias obviously, if you have an army of humans who are labeling data. One of the issues with labeling fundamentally is that it is a non-deterministic process. Now as we move towards more enterprise-ready machine learning, not the wild west machine learning where it’s just like let’s throw a bunch of data into a pot and stir it a couple times, and see what pops out the other end. Now we’re talking about how do we get reproducible results? How do we make sure that we have pipelines? When we start using the word pipeline, it implies some amount of determinism. But the problem is you lose that when you have hand labeling. And that amounts to a whole bunch of stuff.
00:10:52
One is the potential introduction of bias. The second is inconsistent results when trying to re-sample fresh data and retrain. The third is just that it’s very slow to deal with drift. Especially in adversarial problems where, for instance, if you’re doing spam classification or fraud classification, you’re fighting an adversary. And that adversary is incentivized to change their tactics if they know that there’s a modeling process that is catching them. So if you have just an army of humans who are labeling data and your data volumes are huge, and you have to keep retraining them on new ways to catch these adversaries, you’re fighting a losing battle. It’s just an unsustainable problem.
00:11:43
And then you go into some of the issues around the cost of hand labeling. There’s several types of costs there. There’s capital. If I want to build a model that’s not just hot dog, not hot dog, if I’m trying to do something more interesting like cancer, not cancer or something like that. I can’t just pull a random person off the side of the street or even 500 random people off the side of the street and be like, “Hey, is this cancer? Is it not?” I need a bunch of doctors. And chances are doctors don’t have the time to sit down in a room and label a bunch of data for me for a month. They have doctor stuff to do. What I want to do is make sure that we can build those models because I know they’re going to be very impactful, but we have to do it in a way that doesn’t tie up 50 doctors for a month.
00:12:30
So that’s a capital cost, a capital and time constraint, quite frankly. But speaking also to time, right now, machine learning processes are very start stop. And what I mean by that is software engineering has for a very long time now benefited from what’s colloquially called flow states. So back in the day, compilation times were super, super long. Developer tools were not where they are today. Now, engineers see a huge benefit in being able to stay in the state of focus. And that’s because the time between them trying something and seeing a result is very, very small.
00:13:16
Now in a world of supervised machine learning, if one of your blocking factors is frankly a human labeling process, if you discover that one of the things you might need to tweak is your hand label data, you’re not going to get that result fast. You’re going to get it at minimum a week later, which means that you have to pause what you’re doing, wait a week, and then pick that context back up. It is impossible to calculate how much that costs us as an industry, as an AI industry overall. But I have to imagine it’s a lot because we’re coming off two decades of innovation on the software engineering side to make compilation times faster, to make partial compilation fast, to make developer tools way more powerful and so on. And I guess the final cost, there’s so many costs that I can list. But the final cost I’ll mention is there is actually a major societal cost to maintaining hand labeling structures as we do right now. There’s this book called Ghost Work, which I recommend folks read if they’re interested in this topic.
00:14:22
But it details this idea of a ghost worker or really a second class citizen in the internet age. And the idea is that to power all of this hand labeling machinery, companies have had to go out and hire a bunch of contract workers in low wage countries. So think Bangladesh, think various countries in Africa, that sort of thing. And while they’re providing economic opportunities for them, there’s no growth opportunity. They’re not building a career. They’re learning how to box cars and stop signs in a particular program, which is not really a transferable skill. So what ends up happening is there’s a race to the bottom on wages at a macro level because these companies, their margins are based on how much human time they have to spend money on. So they have an incentive to drive wages down.
00:15:24
Meanwhile, there’s more and more people who are able to do this work, because it requires very little training, very little context, and so on, which then means that there’s almost excess supply. Which again, leads wages to go down. So to me, it doesn’t make sense to have this second tier of humanity because, if they’re not already, which I know they are, but eventually they’ll get even more subjected to the shitty jobs that no one else wants. Things like content moderation where you have to watch a beheading and stuff like that. I saw this happen at places like Facebook.
00:16:07
So it’s stuff like that where our whole thing is, it’s better if we find just a more automated way to do a lot of this stuff so that we can empower the right users with the right tool. So you don’t need 50 doctors to sit in a room. You could have one doctor spend a couple hours using some software that will help get their thoughts out of their head and into something that’s programmatic, and then use that to do the labeling, and ideally tweak that. So there should be a nice feedback loop there. So yeah, long story short, hand labeling is hopefully on its way out. And instead we’ll have more sustainable processes.
Jon Krohn: 00:16:49
This episode is brought to you by Iterative, the open source company behind DVC, one of the most popular data and experiment versioning solutions out there. Iterative is on a mission to offer machine learning teams the best developer experience with solutions across model registry, experimentation, CI/CD training automation, and the first unstructured data catalog and query language built for machine learning. Find out more at www.iterative.ai. That’s WWW.I-T-E-R-A-T-I-V-E.AI.
00:17:24
This is a super eye-opening conversation for me already, because I take for granted that in some scenarios, I am going to need to use hand labeling. I can think of several research projects that my machine learning company is currently carrying out for our production platform where we’re like, “Well, we’d love to have a model that could do this. But we don’t have labeled data, and so how much is it going to cost for us to hand label it?” And we’re aware of some of these issues, like the bias issue. We’re like, “Okay, what can we do to minimize the bias issue?” And critically, especially because our applications are in human resources, we then test afterwards, after a model is developed, to make sure that no particular sociodemographic group is being affected differently by our human resources algorithm. So the bias thing is something we’re really aware of and we try to minimize and then we test. But that can be time consuming to get right.
00:18:26
But some of these other issues, yeah, I hadn’t thought about them at all. So they’re slow to adapt to data drift. Particularly if you’re fighting an adversary, that is really thoughtful. Maybe something that I don’t have to worry about so much with the models I’ve been developing. So something that I hadn’t thought about, but for other examples, like spam filters, for sure. There you’re definitely going to have adversaries that are coming up with new ways of spelling Viagra to get past your content moderation algorithm. And then yeah, capital cost. This one. Yeah, definitely. Especially if you’re building a specialized model like the oncology one you’re describing. In my case, it could be I might want recruiters ideally to be labeling my data. And we work with really great recruiters. They might be making half a million dollars a year billing several million a year for their firm. They can’t take time away from that to be labeling data. No chance is anybody going to let that happen.
Shayan Mohanty: 00:19:29
What we noticed is that that idea exists across every single industry. If you think about where machine learning or AI can make the greatest impact at any company, it’s generally speaking going to be in places where their subject matter experts are kind of critical bottlenecks to the organization. They have their most expensive experts, and they only have maybe one or two of those people, and they want to just multiply them. That’s the obvious place for AI to have the greatest impact.
00:19:58
But you have this catch-22 problem there where, to train a model to replicate one of these recruiters for instance, in this particular task, or a doctor, or a financial analyst or whoever, you need that person to be involved in the training process. And that often is this non-starter because of how expensive it would be to even engage them, let alone the amount of data they’d have to label, and the way they’d have to label it, and all sorts of stuff. So yeah, we think that that is the critical problem to be solved in AI, to see greater penetration into use cases that matter as opposed to the machine learning Hello World, which is Twitter sentiment analysis and stuff like that. We see it as: we’re not lacking model architectures or frankly even modeling expertise. We’re lacking data. And that’s why we haven’t seen AI penetrate as deeply as it could’ve.
Jon Krohn: 00:20:55
That was perfectly articulated and makes perfect sense to me, and that’s not something, thank you, that I’d ever thought about as a serious bottleneck in the development of AI. I mean, I’m so glad that you mentioned that. This episode is getting better and better. I love it. A few other ones that you mentioned there: time cost, obviously that is one I’m aware of. If we could have some automation of labeling where it just happens in a few seconds instead of a few weeks, obviously that would be far superior. And then the social cost thing. It kills me that that isn’t something that I was really thoughtful about before, because I was thinking about it from the perspective that you started off describing, where I’m like, “Well, I’m employing people that otherwise, maybe this is paying them double what they’d be getting if they were doing some manual labor in a factory or a field or something.”
00:21:52
So this is actually a good opportunity and this is a good thing that I’m doing. But then you went into more detail about it in ways that I, again, embarrassingly hadn’t really thought about, this idea of how they are in a race to the bottom where they are constantly going to have pressure. And following on with this idea of specialized models, if there were great labelers who did a better job than others, they’re not really going to be rewarded for that. The companies managing these processes are going to want to drive down price as much as possible. So yeah, it’s a pretty-
Shayan Mohanty: 00:22:36
There’s only downward pressure, and that’s the problem.
Jon Krohn: 00:22:39
Yeah. And then also like you said, it’s a dead-end career. They’re not going to be able to progress to labeling manager. Maybe there’s a small number of roles like that, maybe, but probably not really. It’s not a career path like data science is.
Shayan Mohanty: 00:22:57
Yeah. And you kind of have to get lucky with your location as well, because not every country has these computer labs set up. There’s this whole dark market, almost. Not quite dark, but a lot of the main brands of labeling companies out there that offer these types of services, they don’t always have a direct relationship with their labelers. So oftentimes, it is another company that is offering their services as a reseller or a subcontractor to these companies.
00:23:36
Which means that there’s also this interesting relationship between two companies where one company, the parent company, the one with the big brand, naturally wants to squeeze out as much efficiency as they possibly can. And they have plausible deniability on the means to achieve that because there’s an entire corporate firewall there. This is technically a totally different company. So when you get into the details, it gets very messy. And we’re just like, “That shouldn’t happen in today’s world.”
Jon Krohn: 00:24:07
So yeah, all great points across the board. So yeah, now I want something else. I want an alternative to hand labeling. And so you’re the CEO of Watchful. It’s a platform that describes itself as being for machine teaching. And so I know that it’s kind of a solution for this hand labeling problem. How do you solve hand labeling as a company at Watchful?
Shayan Mohanty: 00:24:34
Yeah, it’s a good question. We’ve been thinking about this for a long time now, so bear with me. I think the first thing I should probably talk about is what even is machine teaching as it relates to machine learning. It’s really just a subtle shift in mentality. Machine learning for the longest time, if you think about machine learning research, it’s very centered and very focused on the modeling and the modeling techniques. So if you think about it in terms of learner versus teacher, it’s really like a student. So machine learning research is very much about building the best student. And that’s a noble pursuit. That is worth doing.
00:25:20
The problem though is that those types of innovations come in sort of a stepwise fashion. There isn’t a nice smooth graph that you can look at where it’s like, “Here are all the incremental wins we got before we achieved the advent of CNNs,” for instance. It was just a one or a zero. It’s like suddenly there’s a new model architecture out there, and it is outperforming everything else. And great. Now most people use it for such and such tasks. So the problem there is you can’t really predict when the next new machine learning innovation is going to happen. So our thought, and the thought of a few others, so we didn’t invent the term machine teaching. That actually originated with Bonsai, which is now part of Microsoft. So all kudos to them. We just love the terminology, so we believe in the mantra.
00:26:13
But machine teaching is about shifting the focus away from the student and instead towards the teacher. So how do we make the teacher orders of magnitude more effective at teaching any student? And when you reframe the problem like that, you come up with an interesting and different set of solutions. Instead of trying to build an ultra clever model that will be able to predict all things all the time, instead, you think about different ways to make the human more effective. So how do we get your knowledge out of your head faster? How do we elicit that knowledge? How do we get you to discover things that were otherwise implicitly stored in your brain, but you now want to expose explicitly to the model?
00:26:59
So that’s sort of machine teaching as a concept. And it’s worth noting that machine teaching is more of a philosophy than it is a set of specific techniques. So the techniques that we ended up aligning on are things like weakly supervised learning, active learning, Monte Carlo simulations. We have a whole bunch of different techniques that we packed into our product. But the whole point is that as a user, you don’t have to be an expert in or even know about any of that stuff. We chose them A, because we don’t believe there’s a one-size-fits-all machine learning solution to solving all labeling problems. So really it’s like a workflow problem. And B, our goal is to provide our users with that flow state that I mentioned earlier. We think that data science sorely needs it. So our goal is to choose techniques that lend themselves to flow state, so that we can extract that knowledge from your brain as quickly as we possibly can, and that sort of thing.
Jon Krohn: 00:27:57
Wow, wow, wow, wow, wow, wow. That sounds great.
Shayan Mohanty: 00:27:59
Thank you.
Jon Krohn: 00:28:02
Mathematics forms the core of data science and machine learning. And now with my Mathematical Foundations of Machine Learning course, you can get a firm grasp of that math, particularly the essential linear algebra and calculus. You can get all the lectures for free on my YouTube channel. But if you don’t mind paying a typically small amount for the Udemy version, you get everything from YouTube plus fully worked solutions to exercises and an official course completion certificate. As countless guests on the show have emphasized, to be the best data scientist you can be, you’ve got to know the underlying math. So check out the links to my Mathematical Foundations of Machine Learning course in the show notes or at jonkrohn.com/udemy. That’s jonkrohn.com/udemy.
00:28:47
So how do you do it? What’s that like as an experience? How do I go from the world that I currently operate in personally where I want to be able to teach a machine learning algorithm to be able to do some task? So I start off by hand labeling some data myself, having data scientists on my team hand labeling data. And then if we can, finding somebody that’ll do that for us at relatively low cost and generating the amount of hand-labeled data that we need. There is no flow in that experience. It is really painful. So how do we go from that to what you’re describing with Watchful, with this nice machine teaching and flow state? Describe kind of the mechanics of what that’s like as a user experience.
Shayan Mohanty: 00:29:42
Yeah, so I’ll explain it from the perspective of two different types of users. So let’s imagine that you’re a user that is comfortable with an analytics-type experience. So you’re comfortable with some loose query languages, stuff like that. You might not be a full-on programmer, or you could be a machine learning engineer, you could be a machine learning scientist. That sort of analytics skill set is sort of the base level, let’s say. So in that world, I’ll relate it to an HR use case. Please forgive me if I butcher it. But let’s imagine that you have just a mountain of resumes, and you want to predict potentially good fits for a software engineering role, right? Senior software-
Jon Krohn: 00:30:32
Exactly what our core matching model does.
Shayan Mohanty: 00:30:34
Sweet. Okay. Well, it turns out we work with a bunch of HR companies on this as well, so I’m kind of cheating. I know what they do. So one of the things is you can kind of bring what you already know to the table using Watchful. So given a stack of resumes, and that stack could be millions high, it doesn’t really matter. Let’s say you’ve got this one person on your team who you know can identify a good match versus a bad match. So instead of going through a million resumes one by one and saying, “Yes, this is a match. Or no, it’s not. Yes, no, yes, no.” Instead what they do is they say, “Okay, if I see this experience somewhere listed…” Maybe I’m looking for Node engineers. If they have Node.js listed on their profile, it might be a good match.
00:31:23
It’s not a rule. This is meant to be noisy. This is meant to be heuristic by nature, but you know that someone who has Node.js in their profile, more often than not, they’re probably going to be some sort of fit. Then you have to blend that with how much experience they have. So you create another heuristic and you’re like, “Okay, if they say that their last role was senior software engineer, then it’s very likely that they’ll also be a good fit.” So you can come up with these heuristics off the top of your head because this is what your recruiters or your experts are doing anyway. When they scan a resume, this is broadly what they’re doing internally, even if they’re not externalizing it. They scan a resume, you have to go really quickly, so chances are your brain is picking out specific keywords that make sense. Whether you’re command-F-ing them specifically or not doesn’t really matter. But fundamentally, you are doing that work. So we start there. You can bring what you already know to the table.
00:32:26
The second layer is that as you’re doing that, Watchful can suggest things to you and be like, “Hey, you said Node is important. What about Kafka? Is that useful?” And it’s like, oh yeah, actually Kafka is predictive. Because chances are a super junior engineer wouldn’t have experience with Kafka. So it is actually a predictor of seniority. So that is useful. And Watchful will keep suggesting more stuff. Like, “Okay, what about Cassandra in the skills area?” But then in the experience area, they need to have worked for a company that has done machine learning. So it’ll look for machine learning in that particular column. And it’s like, yeah, that is actually predictive.
00:33:04
So in that way, you’re not stuck having to come up with all these different heuristics off the top of your head. Now, Watchful is aiding you through this process. And you’re sort of in this give and take with the machine. And the cool bit is that as we talk about this, we’re not talking about how much data we have to label. It doesn’t matter. You can have a million resumes, you could have 10 million, you could have 100,000. It doesn’t really matter. Each time you create one of these heuristics, it shifts all of the labels for all of your data all the time. So it will go through and it’ll be like, “Okay, these are all the resumes that have Node in it. These are all the resumes that don’t have Node in it. Here are my new probabilistically weighted labels.” And the moment you create a new heuristic and it’s like, “Okay, here I’m going to re-shift everything.”
00:33:51
Each time you do this, Watchful gets better and better at predicting concepts. And the really cool bit is that you as the user are now up-leveling your involvement from the level of individual rows to identifying whether concepts map to your class or not. Does the concept of Kafka mean something as it relates to a senior software engineer? Does the concept of Cassandra mean something as it relates to senior software engineers?
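[Editor’s note: a hand-rolled sketch of the idea being described, not Watchful’s implementation. A few keyword heuristics each cast a noisy vote (or abstain) on every resume, and the votes are combined into a probabilistic label across the whole data set; adding a new heuristic re-shifts every label at once. The per-heuristic accuracies and the naive Bayes-style combination here are invented for illustration; real weak-supervision systems estimate those quantities from the data itself.]

```python
# Sketch only, not Watchful's machinery: heuristics vote on every resume, and a
# naive Bayes-style combination turns the votes into a probabilistic label.
import math

SENIOR, ABSTAIN = 1, -1

def lf_node(resume):  return SENIOR if "node.js" in resume.lower() else ABSTAIN
def lf_kafka(resume): return SENIOR if "kafka" in resume.lower() else ABSTAIN
def lf_title(resume): return SENIOR if "senior software engineer" in resume.lower() else ABSTAIN

HEURISTICS = [(lf_node, 0.70), (lf_kafka, 0.80), (lf_title, 0.90)]  # (heuristic, assumed accuracy)

def probabilistic_label(resume: str) -> float:
    """P(senior) from whichever heuristics fire, starting from a 50/50 prior."""
    log_odds = 0.0
    for lf, accuracy in HEURISTICS:
        if lf(resume) == SENIOR:
            log_odds += math.log(accuracy / (1.0 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))

print(probabilistic_label("Senior Software Engineer: Node.js, Kafka, Cassandra"))  # ~0.99
print(probabilistic_label("Junior developer, HTML and CSS"))                       # 0.5 (all abstain)
```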
00:34:21
So that’s one type of experience. And you can imagine the whole goal here is to make that whole experience really fast, so you’re not sitting there waiting for a query to return. You’re just instantly, the moment you click something, you should see stuff change, you should see another suggestion. You’re just going that whole time. And you’re picking out the things that speak to you, that sort of thing. There’s another experience. And this is one we’re very excited about, it’s what we call Copilot. And this is when your users are potentially true domain experts. So they maybe don’t have that analytical skill set where they want to actually be able to query stuff or look at their data. They are in your case, high powered recruiters who make a ton of money every day. And they don’t want to have to learn yet another skill, which is querying an interface that they’re never ever going to care about.
00:35:14
So it’s like, okay, great. Hand label as you always have been. And the cool bit is that we have evolved our platform to the point where, from your hand labels, we can actually go backwards and predict the heuristics that would be necessary to explain those hand labels. So in that world, your users just hand label. And the outputs of that hand labeling are, yes, hand labels, but also probabilistic labels across the entire data set and a set of heuristics that you can look at and be like, “Oh yeah, you actually automatically found that Node.js, Kafka, and Cassandra are all predictive of senior software engineers,” let’s say. Great, these are all correct. Maybe some of them might not be perfect and you’ll want to go in and tweak them.
00:36:06
But in this way, going back to that whole bias conversation, this is an example of explicitly checked-in bias. Every single thing that gets labeled by Watchful is produced by a heuristic or a set of heuristics that layer on top of each other. So you can always go backwards from a label to figure out why that label is the way it is. And if any of that is wrong at a conceptual level, you go back and you change it.
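[Editor’s note: a toy illustration of the “go backwards from hand labels to heuristics” idea, not Watchful’s Copilot. It scores a few candidate keyword rules against a handful of hand-labeled resumes and keeps the ones that agree with those labels; the data, candidate list, and threshold are all invented.]

```python
# Toy sketch: infer simple keyword heuristics that would "explain" a small set
# of hand labels. Everything here (data, candidates, threshold) is invented.
hand_labeled = [
    ("Senior Software Engineer, 8 yrs, Node.js and Kafka", 1),
    ("Staff engineer, Cassandra, distributed systems",     1),
    ("Bootcamp grad, HTML, CSS, looking for first role",   0),
    ("Junior developer, some Python",                      0),
]
candidate_keywords = ["node.js", "kafka", "cassandra", "html", "python", "senior"]

def precision_and_coverage(keyword):
    """How often the hand labels agree with this keyword rule when it fires."""
    fired = [label for text, label in hand_labeled if keyword in text.lower()]
    if not fired:
        return 0.0, 0
    return sum(fired) / len(fired), len(fired)

kept = [k for k in candidate_keywords if precision_and_coverage(k)[0] >= 0.9]
print(kept)  # ['node.js', 'kafka', 'cassandra', 'senior'] -- auditable rules you can edit
```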
Jon Krohn: 00:36:32
Got it. That all makes perfect sense. So you’ve got two kinds of experiences. In one of the experiences, you’re suggesting heuristics to the user to help them develop these kinds of rules that can then be automatically labeling the data. And then with the second experience, with Copilot, you are predicting heuristics. It’s kind of the inverse.
Shayan Mohanty: 00:36:54
Yep. That is exactly right.
Jon Krohn: 00:36:57
Yeah. So you’re hand labeling, and you have heuristic suggestions. And then ultimately, these heuristics avoid degenerative biases, because we have very specific rules each time. We don’t… there’s not a mystery behind why some candidate was selected and some other candidate was not selected for a given role. We have it all in heuristics.
Shayan Mohanty: 00:37:23
Exactly. And I will say on the record that this doesn’t avoid degenerative bias, to be clear. It’s just that degenerative bias will be checked in explicitly. So to your point, there’s no mystery as to why this candidate was labeled X instead of Y. You can go in and you can look. And if any of your heuristics are wrong or they’re capturing the wrong intention or whatever, you go back and you change that. As opposed to having to go back and talk to your labelers and interrogate them and be like, “Okay, why did you label X instead of Y? Is this a symptom of a deeper systemic issue in the way you’re evaluating the data? Or is this just a one-off, you were tired and you hit the wrong button type of thing?” So here it’s like you give the same row to Watchful, you’ll get the same label as long as none of the heuristics changed. You don’t have that same guarantee with human labelers. And that goes back to that determinism problem I talked about earlier.
Jon Krohn: 00:38:23
Cool. All right. So that sounds great. I love that idea. How do you make it work behind the scenes? So I know that there are concepts like weakly supervised learning that you’ve incorporated into your platform to enable this to happen.
Shayan Mohanty: 00:38:36
Yeah. So when people think weakly supervised learning, a lot of the time they think about Python functions that you write. That is true. You can think of a labeling function, one of these heuristics as just a function that returns a possible classification. So it doesn’t really matter what it does. It could be simple, like those keyword matches that I talked about earlier, or it could be much more complex like a database lookup and that sort of thing.
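[Editor’s note: a generic weak-supervision sketch, not Watchful’s API, just to ground the point about the shared “shape” of labeling functions. Both of these take an example in and return a candidate label or abstain, even though one is a trivial pattern match and the other stands in for a database lookup; all names and data are invented.]

```python
# Generic sketch: two labeling functions with the same signature, one a simple
# pattern match and one backed by an external lookup.
import re

FRAUD, ABSTAIN = 1, -1
KNOWN_BAD_ACCOUNTS = {"acct_123", "acct_987"}  # stand-in for a real database lookup

def lf_memo_keyword(txn: dict) -> int:
    """Simple heuristic: flag transactions whose memo mentions a wire transfer."""
    return FRAUD if re.search(r"\bwire transfer\b", txn["memo"], re.I) else ABSTAIN

def lf_account_blocklist(txn: dict) -> int:
    """'Database lookup' style heuristic: flag transactions from known-bad accounts."""
    return FRAUD if txn["account_id"] in KNOWN_BAD_ACCOUNTS else ABSTAIN

txn = {"account_id": "acct_123", "memo": "urgent wire transfer request"}
print(lf_memo_keyword(txn), lf_account_blocklist(txn))  # 1 1
```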
Jon Krohn: 00:39:05
So I guess to contrast, many of our listeners will be aware of what plain old supervised learning is. So supervised learning, you already have all of the labels to work with. So with the hotdog, not hotdog example, you’ve got 1,000 images. 500 of them are labeled as hotdog, and 500 are labeled as not hotdogs. And then so the supervised learning paradigm of machine learning is to have the input, in this case, the pixels of an image go in one end of a machine learning algorithm. And then its weights adjust over the training process to accurately predict whether those pixels correspond to a hotdog label or a not hotdog label. So that’s supervised learning, and that was the simplest binary case where you have two possible outcomes. But you can have a bunch of classes. So you could have images of cats, dogs, horses, and cars or whatever. So that’s supervised learning.
00:40:05
And then, so this weakly supervised learning idea that you just described, instead of… so I guess we’re still using the labels downstream for that kind of supervised learning approach that I just outlined. But instead of having the labels be manually labeled by some process, it could be experts, it could be a labeling farm in Bangladesh. With a weakly supervised learning approach, you are instead writing functions, in the simplest sense, to run over the input data that you have and predict a label.
Shayan Mohanty: 00:40:47
Yeah, that is exactly right. Weakly supervised learning is about using programmatic interfaces to produce the supervision, where strong supervision is just a human sat down and looked at this thing and made a judgment. Here, you could have several millions of rows. Humans are not going to look at all of them. But they will “supervise” quote-unquote the creation of those functions that are then used to create those labels. So that is the idea behind weakly supervised learning. And the reason why it’s become so popular recently is because, oftentimes when you’re training the data-hungry model architectures that are in vogue these days, you actually do need quite a lot of data to avoid things like over-fitting and that sort of thing. And oftentimes to generate that much data by human means, it’s hard. It’s expensive. It takes time, for all the reasons that we talked about earlier.
00:41:48
So weakly supervised learning is very interesting because A, it can speed up your time to value, so you can move quickly from an idea to an implementation. B, you have this explicitly checked-in set of sources of information, sources of signals, sources of supervision, which can be modified over time. So if people leave your labeling team that you’ve brought in-house or whatever, they’re not taking context with them. That context is instilled in the system itself. And the third point is that you’re able to output far more data at a much cheaper cost basis, even if you’re sometimes trading off a couple points of accuracy, precision, and recall over the entire data set. By providing these models with more data to learn from, they often yield better results, despite the fact that your input data has slightly lower accuracy, precision, and recall. And again, that’s sort of the extreme case. Oftentimes, we find that you’re actually able to meet or exceed the accuracy, precision, and recall that you expected from human labelers.
00:42:57
Because in practice, you are now able to have the right people in the seats labeling data, instead of trying to infuse their subject matter expertise into a bunch of people who are lower-cost, lower-wage folks basically, who might not have the context to begin with, but they’re learning it. So there’s some nuance there, basically.
Jon Krohn: 00:43:19
Cool. All right. So that all makes perfect sense to me. I love that. So we’re using a programmatic interface to automatically label large amounts of data, in contrast to a strongly supervised approach where we would be hand labeling each item. And in this weakly supervised approach, where we are automatically labeling very, very large amounts of data, that typically provides better model results, even if the accuracy of these weakly assigned labels is lower than the strongly assigned labels. But to your point, it isn’t even always the case that the accuracy is lower. So it ends up being kind of a misleading name.
Shayan Mohanty: 00:44:00
Yeah, it is.
Jon Krohn: 00:44:02
Because weakly supervised learning implies that the quality of the supervision is weaker. But you’re saying it’s just more automated. It’s not necessarily weaker.
Shayan Mohanty: 00:44:13
Yeah. And I mean, I could talk forever about this as well, but I’ll keep it relatively short. I think that there’s also this misconception overall in the data science community about hand-labeled data. And what I mean by that is the term ground truth or gold data gets thrown around a lot. And when you really dig into it, most of the time when people talk about ground truth, it’s not actually ground truth. It just means humans looked at it. So in the same way that hand labeling yields inconsistent results, unless you have layers of humans auditing layers of humans and so on, you can’t really guarantee that all of your labels are gold. All you know is that they’re probably pretty good because you had a bunch of people look at it, and you trust your systems. But more often than not, you’re not actually testing that, because that requires exponential growth of your people organization to stack all those projects and actually have the auditing and so on.
00:45:19
So our whole thing is we shouldn’t be using the term ground truth unless it’s inherent in the data. Back at Facebook, we did a lot of recommendation engines. And in those recommendation engines, the data that we were using to train them were things like what pages you liked, or what pages you interacted with, and what content you touched, and stuff like that. That’s ground truth. Because no matter how we look at that data, we can’t change what the user did. The user did it. It’s part of recorded history now. Whereas any time-
Jon Krohn: 00:45:56
Although.
Shayan Mohanty: 00:45:56
Go ahead.
Jon Krohn: 00:45:56
As a subtle twist on that, it is interesting to think that sometimes, what I interact with in say a social media feed is not the kind of content that I want to be shown.
Shayan Mohanty: 00:46:13
Totally.
Jon Krohn: 00:46:15
So you’re absolutely right that it is the ground truth in terms of it is definitely what I clicked on, but it doesn’t necessarily mean that that is the ground truth of what I want in my social media feed. So sometimes when I’m tired or I’ve had a frustrating day or whatever, I’m clicking on clickbait-y type content, so the social media platform learns that about me and it’s giving me more. And I’m like, “No.” It was a moment of weakness.
Shayan Mohanty: 00:46:46
100%. This doesn’t imply stupidity on the modeling side. In a perfect world, we’re able to identify when you are in that fugue state almost, and you’re just trying to doom scroll, versus actually trying to engage with content. In a perfect world, we’re able to identify that. But no matter what, it doesn’t change the fact that you did view that content, and that was content you did engage with. And it’s up to the modeling team or organization to figure out how and when to serve you that content, or how to even use that information.
00:47:21
The interesting bit is it didn’t really require further analysis by humans necessarily, right? I’m not making a judgment on, I don’t know, where I think you work, right? You could tell me you work at the CIA. I’ve got no idea if that’s true or not. I don’t have the context. But I would have to go through quite a lot of effort to go figure that out. And chances are, all the sources of information that I have at my disposal will be noisy in some way. They’ll be dated because they scraped your information from some other time. Or I might be looking at the wrong Jon, for instance. There are several different reasons why we might not get the right answer there.
00:48:04
But from a labeling perspective, since you lack the context, you lack fundamental ground truth to compare to. The next best thing is we have a bunch of humans that make their best judgments, and we kind of call it a day. Our whole thing is that everything about supervised machine learning has in one way or another been built up on this idea of ground truth. And we think that’s actually kind of dangerous. Because now, you’re judging your model’s performance off of an inherently flawed foundation. And it’s flawed in unknown ways. It’s not a way that you can audit. You can’t go back and check why such and such people labeled this data as X instead of Y. It’s very cumbersome to figure out where your foundation is flawed.
00:48:48
So the concept of weak supervision, yes, I agree the term weak implies some degradation of quality or something like that. But in fact, I think it requires a re-lensing of the whole situation where, to begin with, you didn’t have ground truth. Even if you throw a bunch of humans at the problem, as you increase the number of humans who review a thing, you also increase the likelihood that they’re going to disagree with one another. And the fact is that oftentimes in labeling structures, you don’t capture that nuance. If three out of five people said this thing is X instead of Y, if you go through a labeling company, oftentimes it’ll just be like, “All right, this thing is X.” When there might be information in the fact that those two people said it’s Y, right?
00:49:34
So we actually believe in probabilistic labeling, and that allows us to capture a more realistic distribution of the label space. Instead of saying things are always black and white, on or off type of things, there’s actually a distribution in the middle. And it is useful to capture that distribution. Some resumes that you review are going to be very strongly predictive of senior software engineer titles. That makes sense. Sometimes, it’ll be a little less obvious. And that’s just the real world. So you want your model downstream to pick up on that nuance. But there isn’t really a nice way to capture that nuance when you have a bunch of humans who are hand labeling. And there are lots of different variables that go in the middle there, which is why a lot of the time those labels get ceilinged or floored to zero or one. Whereas when you do this through purely programmatic means, you have more opportunity to explore the nuance in between. And you have more explicit definitions of how you’re weighting things, how you’re calculating these numbers, and that sort of thing.
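[Editor’s note: the “three out of five annotators” point can be made concrete. Instead of snapping disagreement to 0 or 1, keep the vote fraction as a soft label; the votes below are made up, and the Laplace smoothing term is just one common way to keep labels off the hard extremes.]

```python
# Sketch: keep annotator disagreement as a soft (probabilistic) label instead
# of flooring or ceiling-ing it to 0 or 1. alpha is Laplace smoothing.
def soft_label(votes_positive: int, votes_total: int, alpha: float = 1.0) -> float:
    """P(label = positive) estimated from raw annotator votes."""
    return (votes_positive + alpha) / (votes_total + 2 * alpha)

print(soft_label(3, 5))  # ~0.57 -- majority said positive, but the doubt is preserved
print(soft_label(5, 5))  # ~0.86 -- unanimous, yet still not a hard 1.0
print(soft_label(0, 5))  # ~0.14
```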
Jon Krohn: 00:50:42
That is definitely a way that I’ve rarely thought about labels, but it makes so much sense to have probabilities as opposed to discrete labels. Makes so much sense. And then, you could also be in a scenario where you could say, “Okay, I’m actually going to deliberately choose only data points where there’s a very strong opinion.”
Shayan Mohanty: 00:51:09
Totally.
Jon Krohn: 00:51:12
So if it’s some binary outcome that we’re predicting, I’m not going to take the scenarios where it had a probability of 0.51 of being a positive case, because it was almost equally likely to have been a negative case. So it was more ambiguous, that data point.
Shayan Mohanty: 00:51:29
Yeah, absolutely. Or alternatively, you can weight the things that have a lower confidence down. There’s no requirement that all of your rows of data weigh the same in the training process. So you can provide that context to your model and let your model figure it out. That’s kind of the point of deep learning. You just throw features and data at a model architecture that can hopefully find nuance and subtlety in those connections. And you sort of see what comes out the other end. This is sort of a crass view of deep learning, but that’s sort of the idea. You let it do what it’s good at.
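[Editor’s note: one straightforward way to “weight the things that have a lower confidence down,” assuming an estimator that accepts per-row sample weights, as scikit-learn’s do: weight each row by how far its probabilistic label sits from 0.5. A sketch with made-up data, not a recommendation of any particular model; frameworks that accept soft targets directly are another option.]

```python
# Sketch: down-weight uncertain probabilistic labels during training.
# The tiny data set and the weighting scheme are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2], [0.9], [0.4], [0.8], [0.5]])
p = np.array([0.05, 0.95, 0.40, 0.90, 0.51])   # probabilistic labels from weak supervision

y = (p >= 0.5).astype(int)        # hard labels for the classifier
confidence = np.abs(p - 0.5) * 2  # 0.0 at p = 0.5, 1.0 at p = 0 or 1

model = LogisticRegression().fit(X, y, sample_weight=confidence)
print(model.predict_proba([[0.85]]))  # the confident rows dominate the fit
```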
Jon Krohn: 00:52:03
Super cool. All right. So we’ve talked about this idea now of weakly supervised learning thoroughly. And I love this idea of probabilistic labels that came out of the discussion. There were some other topics that you and I touched on before starting recording that I thought would be really fascinating to talk about with respect to labeling. So there’s this computer science aspect that ties into linguistics. So language design principles, and something called the Chomsky hierarchy that you referred to. What is the Chomsky hierarchy, what are language design principles, and how do these linguistic ideas tie to computer science?
Shayan Mohanty: 00:52:47
Yeah. So let’s start as close to first principles as we can get. Which is in our product, we decided that weak supervision was going to be one of the tools that we use to solve the labeling problem. What we realized is that weak supervision at its heart, these labeling functions are really like search functions in a lot of ways. You’re returning a true false value for a particular predicate. So given some input, you’re assessing some logic and you’re returning true or false. That same function signature exists in search engines. Do these pages have the pattern that you’re looking for? Is it similar to what you’re looking for? It’s like a yes or no question. And you can rank it, obviously. You can say how similar it is and that sort of thing. So fundamentally, if you think about it from a first principle standpoint, this is a search problem.
00:53:51
So the tools that people have been using historically to create labeling functions, and I say historically, it’s really only been a thing for the last couple years, are really just programming languages. It’s Python and stuff like that, the lingua franca of the data science world. And that makes sense. However, it, number one, raises the barrier to entry for folks who want to be able to create labeling functions but may not know Python. And two, it requires a fair amount of understanding of the underlying data structure. So for instance, you could imagine a labeling function that runs a regex on some text. Great. If your input is a CSV or something, you could imagine that you’re running it over a particular field, or a set of fields, or something.
00:54:42
But if you want to index into some data, if you want to index into a PDF that’s been OCR’d and there’s this data structure that you now have to interact with, that’s a non-trivial thing for you to wire up, when really the thing you’re just trying to do is run a regex, right? So that’s really point number one. Point number two is we saw an inherent need to automate the process of weak supervision. And what I mean by that is weak supervision at its core, you can think of it as an ensembling technique. You’re creating large quantities of independent sources of signal, and you’re combining them in ways that produce labels. So really the weakly supervised process is just ensembling. And the question is, what sort of interesting signals can you introduce? And how independent are they? How many different varieties? How much of the problem do they capture?
00:55:40
So weakly supervised systems benefit from creating large quantities of these heuristics very quickly. And if you’re sitting there just coming up with things off the top of your head, it’s going to take you a long time. Even creating one labeling function will take you several minutes. But your problem might be correctly solved with 200 labeling functions. You don’t know ahead of time. So are you going to spend all that time just coming up with stuff off the top of your head? Probably not. So we need ways to generate labeling functions. And the problem is Python and pretty much every programming language today can’t easily be generated, at least not in a safe way. This lends itself to the halting problem, for instance, where given some program, you can’t predict ahead of time if that program will terminate or not without running it. Every Turing-complete language suffers from this particular problem. So if we were to generate just arbitrary Python, we don’t know ahead of time if it’s going to be safe to run.
00:56:47
So that means that we have to constrain the space that we’re generating somehow. And that lends itself to what’s called a formal grammar. And this is where Chomsky’s hierarchy comes in, where you have different types of grammars. So the extremes are, what is it, infinitely recursive? I forget what it’s called. Let me actually quickly Google this. Unrestricted grammar, that’s what it’s called. That’s basically like a Turing machine. It has an unlimited amount of context that it can pull from. You have variables that are defined somewhere else in the program that are eventually used downstream. It requires a lot of context to evaluate this thing. Versus at the very bottom layer of the hierarchy, you have what are called regular grammars. Which, as an example, you have regular expressions. And these are just things that can be processed by finite state machines.
00:57:46
And the reason why this is important is if you think about a regex, if I have the regex ABC, all that means is in some text I want to find ABC and return true if I find it. So what I can do is I can actually take the letters ABC and turn it into a finite state machine. I can just say I start. If I find the letter A, transition to the next state. If I find the letter B right after that, transition to the next state. If I find the letter C right after that, transition to the next state. I didn’t need to know anything. I didn’t need to retain any information about the previous values that were found, or anything that’s later on really. I can evaluate this linearly, and I can scan through the text from left to right, let’s say. And I can process this thing in line, which means that I can efficiently process this and I can calculate if this thing is going to terminate or not.
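[Editor’s note: here is the “ABC” example written out as the finite state machine being described: one integer of state, a single left-to-right pass, no backtracking, and termination is obvious. An illustration of regular-grammar matching only, not Watchful’s engine.]

```python
# The search for "ABC" as an explicit finite state machine: one pass, one
# integer of state, guaranteed to terminate in time proportional to the text.
def contains_abc(text: str) -> bool:
    state = 0                          # how much of "ABC" has been matched so far
    for ch in text:
        if state == 0:
            state = 1 if ch == "A" else 0
        elif state == 1:
            state = 2 if ch == "B" else (1 if ch == "A" else 0)
        else:  # state == 2, "AB" already matched
            if ch == "C":
                return True            # accepting state reached
            state = 1 if ch == "A" else 0
    return False

print(contains_abc("xxABCxx"))  # True
print(contains_abc("ABABC"))    # True
print(contains_abc("ABAC"))     # False
```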
00:58:41
Interesting side note: most regex implementations in Python, Java, that sort of thing, are actually non-regular. They don’t conform to the regular grammar anymore. And that’s because people over time have added interesting features like arbitrary lookaheads, and back-references, and that sort of thing. That’s beside the point. So what we did is we had to constrain the grammar that people were able to use to query data, so that we could then simulate it. We could generate it, and we know it’ll terminate. We know we can evaluate it efficiently. So that implied that we had to basically build a new language.
00:59:19
Now before you design a language, you have to choose not to use all the other languages that exist. So we had to have very good reason to do that. And when we talk about search queries, there are lots of search technologies out there that have done exceptional jobs. Elasticsearch is one that immediately comes to mind. So the question that we often get is, “Why not just use Elasticsearch? Why did you have to invent your own thing?” And the thing that we realize is that counterintuitively, labeling is a small data problem. You have a finite, fixed data set typically that you’re interacting with. You’re usually not labeling in an indefinite way. You typically take a sample from a larger set that you believe is representative. You put it somewhere, and you have labelers interact with that. And that data set doesn’t change for the duration of that period.
01:00:17
So what that means is that Elasticsearch, on the other hand, was actually designed for very large data problems. It was designed for problems where the data actually does need to be split out across many machines, and it can’t be fit on a single machine. So they designed clever ways to interact with that sort of data. We don’t have that problem. And in fact, we have the inverse problem. We have data that can fit on a single machine, or parts of data that can be fit on individual machines, let’s say. And we want to interact with that data as quickly as we possibly can. Again, to stay in that flow state. How do we do that?
01:00:53
So things like Elasticsearch didn’t quite work. And not to mention the language itself became cumbersome for the types of things that you’d want to do from a labeling perspective. You don’t just want to be able to do things like keyword searches. You want to be able to do things like similarity searches. How close is this thing to this other thing? You want to be able to do things like database look-ups, all sorts of stuff.
01:01:12
So we had to create our own language, which then implies that we had to create our own query engine. The thing to process the language and turn it into something that we can evaluate. Which then also implies that there is a data structure that we’ve created that allows you to efficiently index in the ways that you’d want. So whether you’re doing full text classification, whether you’re doing NER, a segmentation type problem. Even into images, video, audio, etc. We believe that all these problems share the same relative footprint, where I have bytes that are laid out in some way, that makes sense to the machine and to the users. But I have bytes fundamentally. And I have a graph that points into those bytes. So nodes in the graph could be things like hand labels that people have created and said, “Okay, this part of the text is such and such label.” It could be these labeling functions that are pointing into the data and be like, “These are the parts of the data that I affect.” It could be what we call derived labels, things like labels that are predicted. So probabilistic labels that we generate from the combination of all your different heuristics.
01:02:25
Fundamentally, these are all just nodes in a graph that are pointing into data. And to access that data, you could either access the data directly or you could go through the nodes and say, “Show me everything that has a greater than 90% likelihood of being such and such class.” Or, “Show me everything that this person has hand labeled over this period of time.” All of that is just metadata stored in a graph that indexes into this data. So we have this stack of things we had to build in order to do the really cool thing, which we’re really excited about, which is that suggestion engine. And because we’ve optimized the entire stack for that suggestion engine, because we can do things like sub-millisecond latency on queries, that means that we can run over 1,000 queries every second. A user is not going to sit there and do that. A user’s not going to type out a query in a sub-millisecond time, but the machine can.
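As a rough Python sketch of the shape he is describing: fundamentally just bytes, plus a graph of label nodes whose spans index into those bytes, and queries that go through the node metadata. The class names, fields, and example offsets are assumptions made for illustration; they are not Watchful’s actual data structure.

```python
from dataclasses import dataclass, field

@dataclass
class LabelNode:
    """A node in the label graph; (start, end) points into the underlying bytes."""
    kind: str                 # "hand_label", "labeling_function", or "derived_label"
    label: str                # e.g. "PERSON", "CITY"
    start: int                # byte offset where the span begins
    end: int                  # byte offset where the span ends (exclusive)
    source: str = ""          # the annotator, heuristic, or model that produced it
    probability: float = 1.0  # derived labels carry a probabilistic score

@dataclass
class LabeledCorpus:
    """Fundamentally just bytes, plus a graph of nodes pointing into those bytes."""
    data: bytes
    nodes: list = field(default_factory=list)

    def span_text(self, node):
        return self.data[node.start:node.end].decode("utf-8")

corpus = LabeledCorpus(b"Jon met Shayan in San Francisco.")
corpus.nodes.append(LabelNode("hand_label", "PERSON", 0, 3, source="annotator_1"))
corpus.nodes.append(LabelNode("derived_label", "CITY", 18, 31, source="model", probability=0.94))

# "Show me everything that has a greater than 90% likelihood of being such and such class."
print([corpus.span_text(n) for n in corpus.nodes
       if n.kind == "derived_label" and n.probability > 0.9])     # ['San Francisco']

# "Show me everything that this person has hand labeled."
print([(n.label, corpus.span_text(n)) for n in corpus.nodes
       if n.kind == "hand_label" and n.source == "annotator_1"])  # [('PERSON', 'Jon')]
```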
01:03:24
So while the user is doing stuff, while the user is labeling, interacting with the product or whatever, we have a simulation engine that’s running in the background constantly trying all the possible queries that a user might want to do, and then weighing out the probabilities, and choosing the things that are very likely predictive of the class and showing them to the user. Or in the case of copilot, checking them in and maintaining a set. And the really cool bit about that is that in the world of copilot, you don’t necessarily even need a human to be sitting there driving it. If you happen to have already labeled data that you might have labeled at some point in the past by hand, you can run it through this process. And Watchful will act like a human is sitting there saying, “Yes, no, yes, no, yes, no.” But in fact, it’s just your data, and the simulation engine is creating all the heuristics that it would take to basically explain those labels. So you could give it already labeled data, and you get out heuristics that you could use to label all your data, not just that sample, in exactly the same way. But now it’s all explainable. You can go back and be like, “All right, this function is not quite right. Here’s how I want to tweak it.” And that sort of thing. So yeah, we had to go really deep on this, but we ended up with a very interesting solution.
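A heavily simplified sketch of that background loop: generate candidate queries (here just single-token keyword heuristics), score each against whatever hand labels exist, and surface the candidates that look most predictive of the target class. In the copilot case, the hand labels would come from a previously labeled dataset rather than a live user. This illustrates the idea only; it is not Watchful’s simulation engine, and the corpus, scoring, and candidate generation are all invented for the example.

```python
import random
from collections import Counter

# Toy corpus of (document, hand_label or None). The unlabeled documents are the
# ones the suggested heuristics would ultimately help label.
DOCS = [
    ("please refund my order", "complaint"),
    ("i want my money back", "complaint"),
    ("refund never arrived", "complaint"),
    ("love this product", "praise"),
    ("works great, thanks", "praise"),
    ("shipping was fast", None),
]

def candidate_queries(docs):
    """Enumerate simple candidate heuristics: 'document contains this token'."""
    vocab = Counter(tok for doc, _ in docs for tok in doc.split())
    return [tok for tok, _ in vocab.most_common()]

def score(token, target, docs):
    """Precision and coverage of 'contains token => target class' on the hand labels."""
    hits = [lab for doc, lab in docs if lab is not None and token in doc.split()]
    return (sum(lab == target for lab in hits) / len(hits), len(hits)) if hits else (0.0, 0)

def suggest(target, docs, trials=1000, min_coverage=2):
    """Randomly try candidate queries in the background and keep the most predictive ones."""
    pool = candidate_queries(docs)
    best = {}
    for _ in range(trials):
        tok = random.choice(pool)
        precision, coverage = score(tok, target, docs)
        if coverage >= min_coverage:
            best[tok] = (precision, coverage)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(suggest("complaint", DOCS))  # e.g. [('refund', (1.0, 2)), ('my', (1.0, 2))]
```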
Jon Krohn: 01:04:46
Yeah, that is super interesting. Thank you. So Shayan, you talked about the simulation engine. Were there any particularly interesting aspects of creating the simulation engine in order for it to be so performant? So you’re describing how it needs to be able to run in real time. As people are doing their regular labeling work, it’s running thousands of simulations. How did you program this simulation engine to run so efficiently and provide effectively real-time results to your users?
Shayan Mohanty: 01:05:20
Yeah, yeah. So luckily it’s not just me programming this. In fact, I don’t program that often anymore. I’ve got a very smart team. The royal you. So what I will say is that we’ve tried to make certain design decisions around where complexity is introduced into the system. And this is important as you think about scaling teams and as you think about hiring new folks who don’t necessarily have the same context or background. We’ve been steeped in this problem for a very long time, so we want to make the barrier to entry as low as we possibly can.
01:05:57
So all that said, the simulation engine is intelligent. It’s designed in a smart way. But actually, most of the performance wins are not in the simulation engine itself. They’re actually in that entire stack I described earlier, where you’ve got the query language and the query engine, and you’ve got this really efficient data structure. Because all of our queries, whether they’re simulated or not, are very fast, that allows us to create a simulation engine that is inherently very fast, because it’s not waiting for results very frequently. Every so often, it’ll hit a long tail or whatever. But that’s getting rarer and rarer as we go on.
01:06:45
In fact, with the simulation engine, we want to make sure it introduces as little cognitive complexity as possible for new engineers, so that it’s maintainable. So if you look at it, it should look like most other Monte Carlo simulations you’ve seen in the past, with obviously some caveats. The things that are really clever, we’ve hidden behind library interfaces and things like that within the company. So we can have people who are just dedicated to maintaining high-performance code. And then the folks who are using it are just using APIs essentially. That’s more of an organizational directive than anything else.
Jon Krohn: 01:07:25
Awesome. All right. So that makes sense. So you’re doing Monte Carlo simulations and you have your own programmers developing really high speed implementations that can be abstracted away, hidden away from ordinary users who are just calling relatively straightforward APIs?
Shayan Mohanty: 01:07:41
Yeah. Yeah, exactly. And to be clear, when I say quote unquote “ordinary users,” I really mean other engineers within Watchful. As we think about growing the team, we want to make sure that we don’t have to specifically be looking for high performance computing folks, because the whole company doesn’t need to be like that. Ideally, data scientists can join and they know data science, and that’s all they need to know. And all the high performance computing stuff should happen in the background without them necessarily needing to reason about it.
Jon Krohn: 01:08:13
Okay. So on that note, are you doing any hiring, and what do you look for in the people that you hire? Yeah.
Shayan Mohanty: 01:08:19
Yeah, we’re always hiring. So I guess relevant to potentially your audience here, we are looking for what I’m going to call data scientists, but really we’re looking for machine learning researchers specifically. So folks with academic backgrounds in something like weak supervision, or active learning, or some combination of that. Very, very specific profile. We’re also looking for machine learning engineers, folks who have experience going from development to production with machine learning. And we’re calling them explicitly machine learning engineers because we’re excited about folks who can bring that engineering mindset to the table, as opposed to a pure data science, pure research background.
01:09:03
We’re also looking for software engineers. Our stack is kind of esoteric. We have a Rust backend with a ClojureScript frontend. Happy to talk about why we did that. But suffice it to say it’s an interesting stack. We’re looking for people both on the frontend and the backend. So if you’re familiar with React, if you’ve ever used a Lisp before, or ClojureScript especially, please reach out to us. We’re excited about you folks. And if you’ve ever touched Rust, or even if you have a background in C++ or something. Rust has a learning curve, but we’ve gotten quite good at teaching it to people. So yeah.
Jon Krohn: 01:09:44
Cool. Sounds like amazing roles, amazing technologies that you’re working with. So you clearly have a great sense of technology. Actually, before we even started recording this episode, Shayan gave me lots of great tips on audio and video setup. So you’re probably hearing that his audio quality is outstanding. And then if you’re watching the YouTube version, you’ll also see that he has a beautiful camera setup. So clearly a very technical person. Things like having a Rust backend and a ClojureScript frontend are also the kinds of things that happen when you have a really technical person like Shayan at the helm of a company. So I think another interesting question to ask you would be whether there are any day-to-day tools that you think our listeners should know about.
Shayan Mohanty: 01:10:39
To tell you the truth, I am very bad at keeping up with the latest and greatest in terms of tooling. I feel like my tooling is actually quite dated. If you look at my day-to-day, most of it is having conversations with folks. A lot of my influence is through conversation rather than through directive or through implementation. So these days, I spend a lot of my time on Zoom. So that is a tool I use that I think everyone else is using a lot these days. But actually I think more interestingly, I find that writing, or by extension programming and experimenting, is a tool of thought for me. So I like using Emacs for instance, because it allows me to have full… The impact of what I write is almost infinite. I could write things that affect my editor. I can affect Emacs directly. I could use Emacs through Org mode to jot down my thoughts and do sort of literate programming, for instance. There are a lot of interesting things that I can do, and there are a lot of interesting integrations that I’ve done with other parts of my workflow.
01:11:58
Recently, we picked up Notion as a company. And Notion is quite nice because it gives me a lot, not all, but a lot of the same capabilities that Org mode gives me. But now it’s in a way that the rest of my company can integrate with, and they don’t all have to be running an Org mode reader or something. So I would say that my tool usage is relatively limited these days. It’s mostly Emacs, Notion, Zoom. I’ve got Fantastical as my calendar. It’s stuff like that. I’m a very boring person these days. If you’d asked me a couple of years before, I might have had more interesting answers.
Jon Krohn: 01:12:37
I think the Emacs answer was great. That is a tool that we haven’t talked about much on air, but one I know is beloved as a keyboard-based editor within a terminal. It’s only through lack of time that I haven’t explored it. I have been a Vim person for my terminal editing, without having to click and point. But I would love to understand Emacs more. I know that there are lots of benefits. And I used to work with a software developer, Jake [inaudible 01:13:12], who loved Spacemacs.
Shayan Mohanty: 01:13:16
Yeah. I use Spacemacs actually, which is that middle ground between Emacs and Vim. I think one of the things that I like about Evil mode, which is the Vim key-binding integration for Emacs, is that the time between me thinking of an action and doing it is minimized as much as possible. Which sounds trivial when you’re talking about the difference between two keys versus one key. But for whatever reason, there’s just extra energy that you have to spend thinking about Control-C, Control-E versus just a simple Space-E or something like that. Sequences of keys versus combinations. So I personally like it, but I know a lot of people who are Emacs purists who would probably eviscerate me for saying that.
Jon Krohn: 01:14:10
Cool. Well yeah, so Spacemacs, definitely something to check out if you want to be the most efficient programmer that you can be. Very cool. All right, Shayan, this has been an amazing episode. I’ve learned so much. It’s been fascinating. Do you have a book recommendation to leave us with?
Shayan Mohanty: 01:14:26
Oh god, yeah. I’ve got an entire sequence. So The Three-Body Problem. So that is the name of the first book by Cixin Liu. But the entire series is phenomenal. It’s incredible. It is mentally expanding. And then in the same vein, there’s another book called Diaspora. But if I had to pick one of them, it’s got to be the Three Body Problem series.
Jon Krohn: 01:14:53
Awesome. That is a great suggestion. And perhaps unsurprisingly, given the kinds of guests that we have on the show, not the first time that it’s been recommended, but a great recommendation. And then yeah, clearly a brilliant guy. Lots to say on fascinating technical topics. How can our listeners follow you after this episode?
Shayan Mohanty: 01:15:15
Hey, come chat with me. I’m on Twitter. I’m on LinkedIn. So my Twitter handle is @shayanjm. And you can find me on LinkedIn at the same handle. These are the types of things I’m thinking about all the time. So if other folks have opinions, or want to learn more, or provide their own thoughts, I’m always open to chatting.
Jon Krohn: 01:15:37
Nice. All right. Thank you very much for opening up that line of communication with our listeners. Shayan, thank you so much for an amazing episode. And hopefully we’ll have the opportunity to have you on the show again soon.
Shayan Mohanty: 01:15:47
Thanks again, Jon. This was a lot of fun.
Jon Krohn: 01:15:48
Wow, Shayan is such a bright spark. I was deeply impressed by his ability to clearly and confidently convey complex technical content in today’s episode. In it, Shayan filled us in on how bias in machine learning is generally good, such as when we consider it as a model parameter or a tool for healthily adjusting output. But how degenerative bias is bad, wherein our data capture an unfair stereotype from society.
01:16:19
We also talked about the arguments against hand labeling, including that it can encode that degenerative bias in our models; how it’s slow to adapt to data drift, particularly in adversarial scenarios; the high capital cost; the high time cost; and its association with the exploitation of people offshore. He then went into how his company Watchful provides an alternative to hand labeling by suggesting and predicting heuristics that automatically label data and demystify degenerative biases. And then finally, he also told us about his love of the Spacemacs command-line editor that combines the best of Emacs with Vim.
01:16:55
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Shayan’s Twitter and LinkedIn profiles, as well as my own social media profiles at www.superdatascience.com/635. That’s www.superdatascience.com/635.
01:17:12
If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. And of course, subscribe if you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter, and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show.
01:17:34
Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another enthralling episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I’ve hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast.
01:18:11
Last but not least, thanks to you for listening all the way to the end of the show. Until next time my friend, keep on rocking it out there, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.