Jon Krohn: 00:00
This is episode number 487 with Susan Walsh, founder and managing director of The Classification Guru.
Jon Krohn: 00:12
Welcome to the SuperDataScience podcast. My name is Jon Krohn, a chief data scientist and best-selling author on deep learning. Each week we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today. And now, let’s make the complex simple.
Jon Krohn: 00:42
Welcome back to the SuperDataScience podcast. I’m delighted to be joined today by The Classification Guru herself, the fixer of dirty data, the one and the only, Susan Walsh. She’s a ton of fun. We have a lot of laughs in this episode. Susan has worked for a decade as a data quality specialist for a wide range of firms across the private and public sectors. For the past four years, she’s been doing this work as the founder and managing director of her own company called The Classification Guru. She’s also the author of the forthcoming and cleverly titled book, Between The Spreadsheets, and she hosts her own video interview show called Live from the Data Den.
Jon Krohn: 01:23
The topics covered in this episode center around Susan’s expertise in cleaning, normalizing, and classifying data, with a special focus on procurement data and company-wide cost savings. Today’s episode is appropriate for a wide range of listeners from technical specialists like data scientists right through to commercial specialists like business managers. The content is targeted at anyone interested in cleaning up their data or saving their business money. All right. You ready for another awesome episode? Let’s go.
Jon Krohn: 01:57
Susan in the United Kingdom, welcome to the SuperDataScience program. I’m so excited to have you on the show. I’ve known about you for so long, and now we’re here in your data den. Welcome.
Susan Walsh: 02:12
Yay. Thank you. Thank you so much for having me on. It’s such a privilege. Such a pleasure.
Jon Krohn: 02:17
So, where in the world is your data den, this beautiful, sparkly room?
Susan Walsh: 02:23
Well, if there’s anyone in the UK it might sound like I’m from Scotland, but I’m actually living about 30 miles south of London in Surrey in Guildford. That’s where I am.
Jon Krohn: 02:35
Ah, yeah. So, when I did hear that plane earlier, was that a Heathrow plane?
Susan Walsh: 02:41
It probably was. I’m halfway between Heathrow and Gatwick, but it tends to be the Heathrow ones I think.
Jon Krohn: 02:48
Okay. Well, we might have that memorable noise come through as well. Surrey is a beautiful part of the world, and I know you’ve had really nice weather in England lately.
Susan Walsh: 03:00
Well, it was nice for a good week and a half, and then we’ve had a week of rain. It’s a bit sunny now, but then we’re looking at another week of rain, so it’s typical British weather. It’s a bit pants, as they’d say.
Jon Krohn: 03:16
It was two weeks ago that I recorded with someone else in the UK, and I guess at that point you might have just had that week and a half.
Susan Walsh: 03:22
It was amazing. It was absolutely amazing.
Jon Krohn: 03:25
And Surrey is particularly nice. You have just so much beautiful green space down there.
Susan Walsh: 03:31
Yeah. Well, I mean, it is obviously beautiful, but having come from Scotland where we have mountains and big crazy things, that’s also really beautiful.
Jon Krohn: 03:43
Right. But surely not pants.
Susan Walsh: 03:46
Not pants.
Jon Krohn: 03:49
Very good.
Susan Walsh: 03:50
You probably get that.
Jon Krohn: 03:52
I get that. I don’t know if all listeners get that. It just means kind of crummy.
Susan Walsh: 03:57
Yeah.
Jon Krohn: 04:00
And pants also, it means underpants in the UK, so I guess that’s where it comes from. Kind of poopy. Poopy underpants I guess.
Susan Walsh: 04:06
Yeah. Definitely. That is something that I’ve had to be aware of when I’ve been creating content actually-
Jon Krohn: 04:17
Oh yeah. Of course.
Susan Walsh: 04:19
… because I have said things that either people haven’t gotten or I’ve said things to people and it’s been slightly offensive and I didn’t realize.
Jon Krohn: 04:30
Yeah. There are some curse words that in the UK they are relatively light curses, but if you say them in the US… I can’t say them on air obviously, but if you say them in the US-
Susan Walsh: 04:41
Oh really? Okay.
Jon Krohn: 04:41
Yeah. We can talk about them after the show.
Susan Walsh: 04:45
Yeah. Definitely. But this wasn’t even a swear word. I just said someone was husky, as in they were sounding husky, but they were like, “Oh, that means you’re a big person.” And it’s like, “Awkward.”
Jon Krohn: 05:03
When the kids would tease me about my weight when I was a little kid my mom would say, “Don’t worry. You’re just husky.”
Susan Walsh: 05:11
Chubby. Chubby is the word we’d probably use.
Jon Krohn: 05:14
Yeah. Well, anyway.
Susan Walsh: 05:16
It’s amusing how these words translate differently into different places.
Jon Krohn: 05:21
Yeah, they evolve. So, Susan, you are the fixer of dirty data.
Susan Walsh: 05:28
I am.
Jon Krohn: 05:29
You are renowned for that. I came across you because you spoke at the DataScienceGO conference in April which is run by an affiliate of the SuperDataScience podcast, so I reached out to them. I said, “Have you had any great speakers at DataScienceGO recently that you’d like to have on?” And you were on their very short, short list.
Susan Walsh: 05:51
Everybody else was busy, and so here I am.
Jon Krohn: 05:57
Well, we make do with what we can get.
Susan Walsh: 05:59
That was awkward.
Jon Krohn: 06:03
You’re extremely highly recommended, and I’d actually known about you for a long time. I think we have a lot of the same connections on LinkedIn. You host, for example, in your data den that you’re in right now… If people are watching the YouTube version of this they can see that. In that data den you host a recurring show called Live from the Data Den, and so for example in May you had a Women in Data special, and that included some folks. Kate Strachnyi, whom I’ve known for years, she was on episode 441 of the SuperDataScience podcast. I think that was one of several episodes that she’s been on, but that was one with me as host.
Susan Walsh: 06:42
Yeah. She’s a force of nature that woman.
Jon Krohn: 06:45
She is. Christina Stathopoulos was also in that episode, and I haven’t met her in person yet, but we’ve been connected and she’s going to be on a forthcoming episode of SuperDataScience.
Susan Walsh: 06:59
Brilliant.
Jon Krohn: 06:59
Yeah. So, we move in similar kinds of circles, and so when your name showed up on this very short list from DataScienceGO, I immediately reached out to you to see if you would be interested in coming on the program, and you were, and this is-
Susan Walsh: 07:13
Like, “Yes, yes, yes.”
Jon Krohn: 07:16
So, at DataScienceGO you did a workshop on data organization. I guess that was about your COAT system, your patented COAT system.
Susan Walsh: 07:29
Yeah, it was. Not quite patented, but make sure your data has its COAT on, like a jacket. Imagine that having it on. And actually, that is where one of the lost in translation situations happened because having lived in the UK you probably know about get your coat, you’ve pulled.
Jon Krohn: 07:49
Oh. No. I don’t even know what that means.
Susan Walsh: 07:51
No? Okay. So, basically, if you were in a pub or a bar and a man came up to you to try and chat you up and he would say, “Oh, get your coat. You’ve pulled,” as in you’re coming home with me. So, I did a great blog all around that thinking I was really smart, and then had to put an explainer at the bottom because only people in the UK knew what it meant.
Jon Krohn: 08:16
Even that verb, pulled, that didn’t come over the ocean.
Susan Walsh: 08:23
You’d be out on the pull when we were younger.
Jon Krohn: 08:25
Yeah. Out on the pull, right? Like P-U-L-L.
Susan Walsh: 08:26
Yeah, so out on the hunt for a boy or a girl or whoever.
Jon Krohn: 08:33
Whatever you’re looking for. That was a term that I learned in the UK, and I think it’s a hugely versatile term because it doesn’t get into any detail of what’s happening, but you just kind of… Yeah, you pulled. You went out to the bar to try to pull.
Susan Walsh: 08:54
You literally pulled someone. Yeah.
Jon Krohn: 08:56
You succeeded. You pulled. Anyway. I wasn’t aware of this other part of it though where get your coat, you pulled.
Susan Walsh: 09:05
It’s charming men over here in the UK. Really are. No wining and dining. Just, “Come on. Get your coat. You’re coming with me.”
Jon Krohn: 09:16
It inspired this data COAT, so what do the four letters in COAT stand for?
Susan Walsh: 09:22
When your data has its COAT on, it’s consistent, so that means you’ve got not different but consistent terminology. Same units of measure across the globe. We quite often see the UK versus US letters problems, or even dates, what date format are you going with. Then, it has to be organized. I think that’s a wardrobe behind you. I’m hoping that it’s a nice organized wardrobe, but it might be a messy wardrobe and there might just be a few things thrown in there, and you know it’s in there.
Jon Krohn: 10:01
No one will ever know.
Susan Walsh: 10:04
No one. Like Monica’s closet from Friends.
Jon Krohn: 10:10
I just painted that on the wall. No, that is a closet. That is a clothes closet, and it is pretty organized because through COVID my-
Susan Walsh: 10:18
But I’ll never know.
Jon Krohn: 10:19
Well, I guess I could go back there and open it up. It’d be a bit weird on the show. Maybe someday. Someday people will get to see what’s in my closet.
Susan Walsh: 10:26
That could be a whole new kind of format for the show. What’s in the closet?
Jon Krohn: 10:29
Hope you’re enjoying this episode. We’ve got a quick announcement and then we’ll get straight back to it. The fourth iteration of our DataScienceGO virtual conference is coming up quickly at the end of July. This time it’s for three days running from July 23rd until the 25th. You can get your free tickets today at datasciencego.com/virtual. This iteration of DataScienceGO has an extra special agenda. We’ve got a standalone career day on Friday the 23rd where you can meet hiring companies and discover exciting job opportunities. On Saturday the 24th, you will hear from world-class speakers like Ben Taylor from DataRobot and episode number 433, Jamie Fan from TikTok, Erica Greene from Etsy and episode number 435, and you’ll also be hearing from me. I will be providing a session on the pros and cons of PyTorch and TensorFlow, the two most popular deep learning libraries, with a conclusion that may surprise you, as well as lots of time for audience questions.
Jon Krohn: 11:37
Finally, on Sunday the 25th you can attend a full day bootcamp taught by seasoned instructors like Andrew Jones, who’s also in episode number 483, Harpreet Sahota, who’s also in episode number 457, and Joe Reis. These bootcamp certifications are included in the premium ticket, which is available now for a limited time at $49. On top of all that, over the course of the conference there will be several networking sessions in which you’ll have the opportunity to connect one-on-one with data scientists from all over the world. You can get free tickets for days one and two, or the full three-day premium experience for $49 at datasciencego.com/virtual, and I’ll see you there. All right. Let’s get back to the episode.
Jon Krohn: 12:26
So, organized.
Susan Walsh: 12:29
Yeah. If you throw some things in your closet, you know they’re in there but they’re going to be hard to find the next time you need them. But if you organize everything by style of clothing, by color, then you can just go in the next time and take out what you want. So, data is very much like that. You can organize things by country, by division, by region, by anything you want, but you have to categorize it first to make it easily found.
Susan Walsh: 12:58
And then, of course, it has to be accurate. There’s no point in using the data if it’s not accurate. Depending on where you work and what you’re doing, that will be a different definition for different people or different tasks. But once you have all those three things you then have trustworthy data, so you can start to make decisions on that data. You can grow your business, change your marketing plans, increase manufacturing of a specific product or discontinue another one. Data gives you all that when it has its COAT on. But I think the biggest problem that I see is people will get me in to put that COAT on, and then I leave and nobody helps keep that COAT on. Data maintenance is really important as well.
Jon Krohn: 13:52
Really good point. So, to review, C, consistent, O, organized, A, accurate, T, trustworthy. The data COAT.
Susan Walsh: 14:04
It is.
Jon Krohn: 14:05
So, what you’re saying is you often get pulled into an engagement but-
Susan Walsh: 14:11
I like it.
Jon Krohn: 14:12
… it doesn’t have long-term staying power.
Susan Walsh: 14:15
Yeah. But actually, I’m more of a fly by night kind of service, and I purposefully set myself up as that because as an organization… This is going really sideways, isn’t it? Organizations should be responsible for their own data. They should know it and own it and manage it. It’s fine to get me in to fix it, and sometimes you have to use third parties if you don’t have the resources, but if you can, you should really be looking after your own data. I can help people train people up to manage that, but it really is better for them to do it. Your data is such a precious asset. To just give it away to a third party you have to really trust them and make sure they’re doing the right thing by you.
Jon Krohn: 15:13
I agree 100%. You pretty much need to have those data experts in-house these days if you’re going to invest in data and automation.
Susan Walsh: 15:23
And it is an investment. It’s not a cost. It’s an investment.
Jon Krohn: 15:28
Yeah. Good point. All right, so that gives us a good overview of the kinds of specialization that you have, but let’s talk about The Classification Guru, which is the name of your company.
Susan Walsh: 15:46
It is.
Jon Krohn: 15:47
So, through your company you are able to put COATs on various people’s datas. There was a double plural on data there. Other than the COAT concept, what kinds of tools, techniques, approaches do you end up applying or using as a part of your practice? I guess if you have interesting case studies we could dig into those.
Susan Walsh: 16:14
Yeah. Well, first of all, I tend to work a lot in the procurement data space, so that’s where everything started. That’s where The Guru was born.
Jon Krohn: 16:27
Procurement.
Susan Walsh: 16:27
Yeah. And they have some messy, messy data.
Jon Krohn: 16:31
Just in case people don’t know, the procurement department of a company is charged with figuring out what to buy and to try to-
Susan Walsh: 16:41
It’s everything that’s bought for the company to function as a company, so your HR services, your employee benefits…
Jon Krohn: 16:49
Oh. Even HR. Right. I kind of figured that it was goods.
Susan Walsh: 16:50
Your facilities. Yeah. Cleaning products, rent, utilities, all the IT. Professional services, so legal services, accounting services. All the travel that all those employees do. I tell you, if you get hold of company credit card spend it tells you so much. A lot of people put things on credit cards claiming it’s travel and it’s not really travel. They’ve just been having a nice time while they’ve been away on business. I can go in and I will take all that messy information. The number of different ways you can name a flight is quite extraordinary, because you’ve got air, you’ve got flight, you’ve got misspellings, you’ve got all kinds of things. You might have the airline name instead. So, you have to go through, and I will categorize that into buckets. Level one might be travel. Level two would be air. Or, level one could be travel, level two could be road, and then level three could be taxi, bus, something like that.
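The level one / level two / level three buckets Susan describes can be pictured as a small lookup. The following is a minimal Python sketch, not her method (her team classifies by hand): the taxonomy labels come from her travel example, while the keyword table, the `classify` function, and the "Unclassified" fallback are all illustrative assumptions.

```python
import re

# Hypothetical keyword table: word -> (level 1, level 2, level 3).
# Labels follow the travel example in the conversation; the keywords are assumed.
KEYWORDS = {
    "AIR": ("Travel", "Air", None),
    "FLIGHT": ("Travel", "Air", None),
    "FLGHT": ("Travel", "Air", None),    # a known misspelling gets its own entry
    "AIRWAYS": ("Travel", "Air", None),  # airline name used instead of "flight"
    "TAXI": ("Travel", "Road", "Taxi"),
    "BUS": ("Travel", "Road", "Bus"),
}


def classify(description):
    """Return (level1, level2, level3) for a free-text spend description."""
    # Compare whole words only, so "BUS" does not fire on "BUSINESS".
    words = re.findall(r"[A-Z]+", description.upper())
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    # Anything the table does not recognise is left for a person to review.
    return ("Unclassified", None, None)
```

For example, `classify("taxi to airport")` lands in Travel / Road / Taxi, while `classify("Business lunch")` stays unclassified rather than being mistaken for a bus fare.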
Jon Krohn: 18:03
And each of these categories in these levels can be classified by your classification guru-ness.
Susan Walsh: 18:12
Exactly. And the difference with myself and my team is we’re 100% non-automated. We are AI, machine learning, algorithm free. The reason that we are is because I have been doing this for a decade now, and me and my team can take the data, normalize it, classify it fully 100%, and get it back to the client in the same time that they would probably go to a third party who says they’re using AI, machine learning, et cetera.
Jon Krohn: 18:48
You say you don’t use AI and machine learning, but to get that classification, how do you get the classification?
Susan Walsh: 18:55
Aha. Well, I mean, I have tools obviously. I use some software called Omniscope, which is really unique in that it’s data modeling, ETL, and visualization all in the one tool. I guess it would be like Alteryx and Tableau together in one software, but the difference between having the Alteryx and the Tableau and Omniscope is I can have my visualizations at the top and I can have the raw data table at the bottom and live edit the data so that it changes the visualizations at the top. So, it saves an immense amount of time. I’ve developed a methodology over the years to classify and normalize efficiently, so it’s super geeky now, but I love it.
Jon Krohn: 19:47
But there are data models in the Omniscope tool. Some people might say that’s AI.
Susan Walsh: 19:53
Yeah. I mean, I call it semi-automation because we’re making sure it’s right before it’s semi-automated, and I am absolutely not against automation in any way. However, it’s so important to get the base data, the training set, 100% correct or as close as you can, because so many people just buy some new software, put what they’ve got in there already, and then it just multiplies the disaster.
Jon Krohn: 20:25
I might even dare to reframe what you’re saying quite a bit to say that you do use modeling techniques, but you’re using them very wisely. You’re not using them without deeply getting involved in the data and making sure that any of these semi-automated processes are effectively doing the job that you think they’re doing.
Susan Walsh: 20:53
Especially when normalizing. So, we will remove all the suffixes so that it limits the INCs, the PLCs, the LLCs. You can’t just do a search and replace or write some code and just remove the INC, because some of them are legitimately there or in the middle of a word, and so we’ve developed ways to work around that. Also, when you actually work with the data you start to see patterns. I can tell where people might need trainings in certain areas because it’s been classified wrong or entered wrong or whether it’s missing information. The data can really speak to you, and I think there’s such a reliance now on, “Let’s just run some code or some automation over it. We don’t need to worry about it,” but actually, you can find out so much from that data if you spend some time with it. [crosstalk 00:21:47]
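To illustrate why a blanket search and replace on "INC" goes wrong, here is a minimal Python sketch. It is an assumption about one possible approach, not the workaround Susan's team developed: the suffix list, the regular expression, and the `strip_suffix` name are hypothetical.

```python
import re

# Assumed list of legal suffixes; a real project would extend it per country.
SUFFIXES = ("INC", "PLC", "LLC", "LTD")

# Match a suffix only as a separate word at the very end of the name,
# preceded by whitespace or a comma and optionally followed by a full stop.
_SUFFIX_RE = re.compile(r"[\s,]+(?:" + "|".join(SUFFIXES) + r")\.?\s*$")


def strip_suffix(name):
    """Uppercase a supplier name and remove one trailing legal suffix."""
    cleaned = name.upper().strip()
    return _SUFFIX_RE.sub("", cleaned)
```

A naive `"LINCOLN ELECTRIC".replace("INC", "")` would mangle the name into "LOLN ELECTRIC", whereas `strip_suffix("Lincoln Electric")` leaves it intact and `strip_suffix("Acme Inc.")` returns "ACME". Names where the suffix is legitimately part of the brand would still need a human decision.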
Jon Krohn: 21:50
You are preaching to the choir. It’s interesting to me. Sometimes I meet people who are either early on in their data science careers or aren’t in data science at all, and they’ll say things like, “Well, but AutoML. Isn’t AutoML going to get rid of these data science jobs?” No, because of exactly the kind of situation that you’re describing. If you spend the time to understand your data well, you’re going to have night and day improvements in model quality.
Susan Walsh: 22:25
It’s not even that. You save so much time in the long run because you’re not spending so much time fixing things and figuring out where it’s gone wrong. It’s the smart option.
Jon Krohn: 22:37
So, fixing those dirty data. All right. So, now I have some sense of kind of how your business operates. Do you have a couple of interesting case studies for us?
Susan Walsh: 22:50
Yes. So, I normalize a lot of suppliers for my clients. Last year there was 43,000 suppliers, I think. Got it down to about 34,000 suppliers. That’s a lot of-
Jon Krohn: 23:08
Is it misnamed names?
Susan Walsh: 23:10
Yeah. IBM. I.B.M. I’ve seen supplier names even misspelled in data sets. Sometimes you even still see International Business Machines as a supplier.
Jon Krohn: 23:23
Right. Yeah. I almost made that joke. Turns out it’s not a joke.
Susan Walsh: 23:25
No, it’s true. So, what does that mean for them? Well, suddenly they are spending a lot more money with certain companies than they thought they were. That means they could be… Maybe they’re spending company’s money with companies they shouldn’t be spending it with. Maybe they could be negotiating better deals with the ones that they are working with because they’re spending a lot more money. Maybe they now realize that they’ve got 50 suppliers for office supplies and stationery and they only need two, one and a backup or something. And then you combine all those different suppliers together to get the volume on stationery spend and suddenly it’s hugely different, much bigger, and suddenly you are in a whole different price band and you can get a much better deal. It sounds really easy, but I deal with data from all over the world. It can be very messy. It took about five days to normalize those 40,000 suppliers.
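The effect of normalization on spend totals can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Susan's process: the alias table stands in for decisions a person has already made while reviewing the data, and the function names and the sample amounts are invented for the example.

```python
from collections import defaultdict

# Hypothetical alias table, built by a person reviewing the supplier list.
ALIASES = {
    "IBM": "IBM",
    "I.B.M.": "IBM",
    "INTERNATIONAL BUSINESS MACHINES": "IBM",
}


def canonical_name(raw):
    """Uppercase, collapse repeated whitespace, then apply the alias table."""
    key = " ".join(raw.upper().split())
    return ALIASES.get(key, key)


def spend_by_supplier(rows):
    """Sum (supplier, amount) rows under each supplier's canonical name."""
    totals = defaultdict(float)
    for supplier, amount in rows:
        totals[canonical_name(supplier)] += amount
    return dict(totals)


# Invented sample rows, purely for illustration.
rows = [
    ("IBM", 1000.0),
    ("I.B.M.", 250.0),
    ("International Business  Machines", 500.0),
    ("Acme Stationery", 75.0),
]
totals = spend_by_supplier(rows)
```

With these sample rows, three apparent suppliers collapse into a single IBM total, which is the kind of consolidated figure that changes a pricing negotiation.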
Jon Krohn: 24:35
I mean, that sounds really impressive. Five days doesn’t sound like a huge amount of time given the complexity of that task, so you guys obviously have your system.
Susan Walsh: 24:44
And no fancy AI. Just me.
Jon Krohn: 24:52
So, they start with this list. They say, “Look, if we look at the unique strings in our supplier list we’ve got 43,000 of them,” but then you come in and through using Omniscope, through investigating the data thoroughly, you clean information in a way that allows you to see, “No. Actually, you don’t have 43,000 suppliers. You have 34,000.” And then a next step beyond that is you can add classifications, so this kind of level one, level two classifications that you were describing that allow you to see-
Susan Walsh: 25:28
Yeah. But now I have 34,000 suppliers to classify, not 43, so I’ve saved myself time as well because they’re normalized, so there’s less of them. And it also goes back to COAT, because you then get more consistency within the suppliers. Because if you had all the IBMs separately some of them might be under IT, some of them might be under facilities, but if you’re looking at one IBM supplier you know it’s going to all be under IT.
Jon Krohn: 26:04
Yeah. It’s brilliant. You tied it right to an immediate, tangible business benefit, which is that if you all of a sudden realize you’re spending way more money with one supplier you can renegotiate contracts or get better volume deals, which the supplier often isn’t going to come out and tell you.
Susan Walsh: 26:24
It can also help with things like fraud as well, because if you’re expecting to spend X amount with a supplier and then you normalize it and suddenly it’s double, it raises questions. Hang on. What’s going on here? Or if you see suppliers in there that shouldn’t be, then it really helps to flag all those issues.
Jon Krohn: 26:48
Nice. All right, so that’s one really good case study. I don’t suppose you have another for us.
Susan Walsh: 26:52
Oh, got a cracker for you. The normalization story is great. Everyone in procurement and probably most people in data know the value, but to the business they’re like, “Well, so what?” They want to see numbers. They want to see hard figures. So, I was building a customized taxonomy for a client. I needed their spend data to build that taxonomy. While I was going through that I could see a lot of misclassified data. It had already been classified by somebody else, but a lot of it was wrong. So, I flagged it all and added up the value at the end and we were looking at about 31.7 million dollars or pounds of spend sitting in the wrong area. In this case it was spend. In another instance that could be sales or marketing or production. How many spare parts are being made. It’s huge, and again, it’s because there’s probably no maintenance going on. Everyone’s focusing on their next refresh, but actually, what about the data that’s already there?
Jon Krohn: 28:09
Right.
Susan Walsh: 28:14
What would you do with that 31 million?
Jon Krohn: 28:17
What would I do? Finally buy a closet.
Susan Walsh: 28:22
Hide the bodies.
Jon Krohn: 28:26
Well, I mean, I’d be investing it in my company. If I had 31 million more dollars to invest in, say, R&D, I mean that’d be a really nice thing to be able to do.
Susan Walsh: 28:39
Yeah. I mean, it’s a bit misleading there because the 31 million has still been spent. It’s just not been spent in the areas that they think it’s in. But the savings that they could make from that could be significant. Maybe it might be a million instead. Sorry. You have to deal with that.
Jon Krohn: 28:59
Still, more than pays for the cost of working with The Classification Guru.
Susan Walsh: 29:02
Of course.
Jon Krohn: 29:04
Many times over, so that’s another good use of that word investment when you think about these data quality issues. Investing in having high quality data can literally save you huge amounts of money. It seems reasonable to assume a million quid.
Susan Walsh: 29:22
They have to be engaged. Yeah. They have to be engaged though. I have spoken to businesses who know that they have a problem, that their data is unreliable, and yet the business won’t invest in fixing it. So, you kind of just have to sit back and wait until the time comes when it’s really hit the fan and they need some help and then it’s like, “Yeah, I’ll help you now. It’s going to cost you more because you let it get so bad.”
Jon Krohn: 29:55
I think maybe to people who are less data-centric managers who see this kind of cost and they say, “Well, I’m not interested in spending that extra money here,” they can’t see the value because I guess they don’t understand the value of that investment and the huge savings that ultimately could be made.
Susan Walsh: 30:16
Well, that’s kind of why I’ve written my book.
Jon Krohn: 30:21
Yes. What a nice segue, Susan.
Susan Walsh: 30:23
I know. Isn’t it? Slipped that right in there.
Jon Krohn: 30:26
So, your book is called Between the Spreadsheets.
Susan Walsh: 30:30
Yes.
Jon Krohn: 30:31
I’m sure our audience is shocked that we have a cheeky title from what has been otherwise a very uncheeky episode.
Susan Walsh: 30:39
Yeah. Very sensible.
Jon Krohn: 30:41
Between the Spreadsheets, it’s coming out in September. It’s being published by Facet, and it’s all about what we’ve been talking about in the episode so far, which is classifying and fixing dirty data.
Susan Walsh: 30:55
Yeah. The whole reason I wrote the book was because we’ve talked about cleaning data, but even if you go on a course and do data science, there’s no sections on data cleansing. You’re just expected to know how to do it. So, this book is for people who want to get started, people who are already in data or procurement data and want to learn some more tips on how to be more efficient, or it’s for the decision makers who need to understand why they need to invest in their data.
Susan Walsh: 31:35
There’s great examples in there of what happens when it’s wrong. There’s even a chapter called Data Horror Stories that people have anonymously shared their stories with me and then I’ve made some commentary around them. This isn’t a made up kind of, “Buy my services.” This is a real problem that isn’t really addressed. Everybody knows about it, but nobody really wants to deal with it. So, I’m really trying to bring it to the fore, but a lot of the people who pay the bills or invest or pay for data services are not data people, and so I think we really have to engage them at a different level. So, something like Between the Spreadsheets might get them interested.
Jon Krohn: 32:23
Nice. I love this idea for a book, and as you say, it’s an under-serviced part of the whole data service industry given that cleaning data and having high quality data is tantamount, is absolutely essential to have any kinds of data analytics, data models, data visualizations be effective and allow any organization or person to make the right decision. It’s crazy how much more time, if indeed like you say maybe in a lot of programs all of the time, gets spent on data modeling approaches when it’s data cleaning that is even more essential.
Susan Walsh: 33:07
And it’s back to knowing that data again, knowing where the commas and the full stops or the dashes and quotes should be or shouldn’t be. Putting everything in uppercase helps massively. If you want a tip and you need to check data, having it all in uppercase is the best way to check quickly, because anything that’s not the same will stand out quite obviously.
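Susan's uppercase tip is about checking by eye, but the same idea can be sketched in Python. The helper names below are hypothetical, and this is only one assumed way to apply the tip: group raw values by their uppercase form to surface casing variants, then sort the uppercase values so punctuation differences sit next to each other.

```python
from collections import defaultdict


def case_variants(values):
    """Group raw values by uppercase form; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[value.strip().upper()].add(value.strip())
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


def uppercase_review_list(values):
    """Sorted, de-duplicated uppercase values for spotting punctuation differences."""
    return sorted({value.strip().upper() for value in values})
```

Given values such as "Ibm", "IBM", and "I.B.M.", the first helper shows which entries differ only by case, and the second puts "I.B.M." directly beside "IBM" in the review list so the stray full stops stand out.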
Jon Krohn: 33:32
That’s funny. I always put everything into lowercase, but you’re right that if I put it into uppercase you would notice things like punctuation more.
Susan Walsh: 33:41
Yeah, or even the change in different letters. Because it’s bigger, you can scroll down more. Because lowercase, I mean it’s still effective, but I prefer uppercase.
Jon Krohn: 33:59
It’s interesting. One of my first steps in creating a data model is putting everything into lowercase when I’m working with natural language data.
Susan Walsh: 34:10
Interesting.
Jon Krohn: 34:11
But yeah, there’s no reason why it couldn’t be uppercase, and you’re right that you can spot more things.
Susan Walsh: 34:14
I’m going to do a poll. I’m going to do a poll on that now.
Jon Krohn: 34:18
Yeah. I’d be interested to see what those results are. So, in the book you’ve got these data horror stories. Are there any particular horror stories that stand out to you that you’d like to share with the audience?
Susan Walsh: 34:30
Oh no. You’ve got to wait for the book. I’m not giving anything away. But what I will tell you is that I have built a dirty data maturity model, and every month I’ve been releasing steps of the model. Obviously, step one is dirty data. The next one is declass data, I think. If not, I’ve just given you an exclusive. And then I haven’t released any of the others yet, but there’s five parts to the model, and so far we’ve had for dirty data just a pile of clothes on the floor. The next phase is those piles of clothes on the floor now in a laundry basket. You’ll have to wait and see what the next phase is. But only July. July you’ll get to know the next part.
Jon Krohn: 35:31
Well, this episode will air July 13th, so listeners may be able to, by this release date, even go and get that third step.
Susan Walsh: 35:42
Should be out by then. Again, it’s about getting people who are not necessarily data people engaged as well as the data people.
Jon Krohn: 35:49
Nice. And everybody understands messy wardrobes. Okay. Brilliant. So, we’ve got your book coming out in September. Do you have any book recommendations for us other than your own book?
Susan Walsh: 36:07
Well, obviously there’s only really one book that I would recommend, but if I had to, if you had to put a gun to my head, then obviously I would recommend my good friend Scott Taylor’s Telling Your Data Story.
Jon Krohn: 36:24
Nice. And that book recently made a splash at the time of filming. We’re filming on June 25th, and just the other day, it was June 22nd, Kate Strachnyi, whom we already mentioned earlier on in the show, who was on your Live from the Data Den program on that Women in Data special… So, Kate Strachnyi and Harpreet Sahota, who was also on the SuperDataScience podcast. He was on episode number 457.
Susan Walsh: 36:55
He owes me a lip sync. I’m just putting that out there to the world.
Jon Krohn: 37:00
Does he? That’s something we haven’t even talked about yet. So, something that everyone needs to know about Susan is that every Sunday, since the pandemic started I guess…
Susan Walsh: 37:07
Just over a year it’s been going now. Yeah. It’s crazy.
Jon Krohn: 37:12
#LipSyncSunday is the hashtag that you can check out on LinkedIn.
Susan Walsh: 37:15
Yeah. There’s a lot going on there.
Jon Krohn: 37:19
Lip syncs of some hits such as No Scrubs and Single Ladies.
Susan Walsh: 37:24
Yeah. I have done Jump Around as well. That’s a good one. Just before this I recorded Sunday’s one. No spoilers, but it’s an older song. It’s more of a classic. I’ve done some Queen before as well, and a bit of Elton John, so covering the whole range of songs.
Jon Krohn: 37:52
Beautiful. So, you can check those out on LinkedIn or on YouTube. I didn’t know. So, Harpreet owes you a lip sync.
Susan Walsh: 38:00
Well, so George Firican from LightsOnData, he challenged me to a lip sync battle. I can’t remember which song he did, but I responded with Rihanna and Heart, and then I challenged Harpreet and then it’s just gone quiet. I’m like no. No. I’m not going to let him forget.
Jon Krohn: 38:19
Okay. Well, Harpreet, if you’re listening, it’s overdue. We’re waiting for your lip sync, Harpreet. So, Kate Strachnyi and Harpreet, they co-hosted the Data Community Content Creator Awards, which was a brilliant program. So much fun, as you’d expect from them.
Susan Walsh: 38:39
And Scott obviously graciously accepted his award. If you haven’t seen it I would definitely recommend you go and check that out.
Jon Krohn: 38:52
He goes by The Data Whisperer, but if I understand correctly his award acceptance was anything but whispering.
Susan Walsh: 38:59
There’s no whispering. There’s never any whispering.
Jon Krohn: 39:01
So, Scott Taylor’s book, it won in the Data Community Content Creator Awards this year for Best Popular Book. So, the fan favorite. Sounds like a great recommendation, Susan. So, if people are looking for more great content from you, so if they want to get the next steps in the dirty data maturity model, if they want more lip sync content, how should they follow you?
Susan Walsh: 39:34
There’s a number of ways. Where I’m most active is LinkedIn, and actually, I haven’t even mentioned that I have a number of animations as well. I have loads of videos around COAT, explaining that. There’s one where I turn into The Hulk because I see dirty data and get angry. I have superpowers, and my sidekick is Scott Taylor in another one. I’ve now become Susan the Tailor as well, because I’m like a tailor. I build custom fit data for your organization. And there’s one on tail spend as well, so I’ve been pretty busy.
Jon Krohn: 40:20
And so all of that that you’ve just mentioned would be on YouTube, right?
Susan Walsh: 40:21
It’s all on The Classification Guru YouTube channel, all on various posts on LinkedIn. I’m on Twitter and Instagram, but you’re going to get the best content on LinkedIn.
Jon Krohn: 40:34
Nice. We’ll be sure to provide your LinkedIn details in the show notes, as well as the link to your Classification Guru YouTube channel. Those sound like the ones for people to be checking out. All right, Susan. Well, it’s been so much fun hanging out with you today.
Susan Walsh: 40:50
Yeah. It’s been brilliant.
Jon Krohn: 40:51
Such a funny episode. Thank you so much for joining us.
Susan Walsh: 40:54
Brilliant. Thank you.
Jon Krohn: 41:01
I told you that was a fun episode, didn’t I? In it, Susan led coverage of her COAT system for high quality data, COAT being an acronym for consistent, organized, accurate, and trustworthy. She talked about what procurement is and how having clean procurement data can save businesses money and enable better decision making. She also talked about the Omniscope tool for cleaning, classifying, and visualizing data all in one place.
Jon Krohn: 41:29
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, and the URL for Susan’s LinkedIn and YouTube profiles as well as my own social media profiles at www.superdatascience.com/487. That’s www.superdatascience.com/487. If you enjoyed this episode I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel where we have a video version of this episode. To let me know your thoughts on the episode, please do feel welcome to add me on LinkedIn or Twitter and then tag me in a post to let me know your thoughts on this episode. Your feedback is invaluable for figuring out what topics we should cover next.
Jon Krohn: 42:11
I’d like to give a special mention to those of you listeners who nominated my work for a Data Community Content Creator Award. Thanks to you, at the award ceremony on June 22nd my YouTube channel was recognized as the favorite for learning about machine learning and artificial intelligence. In addition, my book Deep Learning Illustrated was one of the three finalists in the technical book category. Apparently, the races were extremely close, especially in the YouTube category that I won, so your individual vote may have made the difference and tipped the scales in my favor. Thank you so much.
Jon Krohn: 42:47
All right. Thanks are as well of course due to Ivana, Jaime, Mario and JP on the SuperDataScience team for managing and producing another amazing episode today. Keep on rocking it out there folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.