Kirill Eremenko: 00:00:00
This is episode number 393 with Principal Data Scientist at Oracle, John Peach.
Kirill Eremenko: 00:00:12
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.
Kirill Eremenko: 00:00:44
Welcome back to the SuperDataScience podcast everybody. Super pumped to have you back here on the show. I just got off a Zoom call with John Peach, and we've got an exciting episode coming up for you. What you need to know about John is that he was previously a senior data scientist at Amazon, and currently he is a principal data scientist at Oracle. In addition to all of that, John has a strong background in research, which he brings to the world of data science. And that's why the slogan for this podcast is putting the science into data science.
Kirill Eremenko: 00:01:22
The thing is that, as you'll hear in this podcast, data science evolved in such a way that we've jumped into it, we're bringing business value through it, and we've got lots of tools in data science. However, the science component of data science never actually got developed the way it has been in physics, or chemistry, or mathematics and other scientific fields.
Kirill Eremenko: 00:01:51
And so, in this podcast, we're taking a step back as John shows us how we can bring in that scientific mindset and what kind of frameworks we can use in order to bring in hypothesis testing, defining and iterating on ideas, empathizing with our customers, understanding the problems that we need to solve, and lots of other very important concepts, and how that world differs from the way we normally do data science.
Kirill Eremenko: 00:02:23
So, in this podcast you'll hear about the data science workflow by Oracle and data science design thinking as one of the big three topics that we discuss today. Then another topic is literate statistical programming, why it's important and how it is different from literate programming. And finally, the third big pillar of today's episode is data unit testing, probably my favorite topic of today.
Kirill Eremenko: 00:02:51
It’s a cool tip on how to test your incoming data to explain and avoid or prevent model drift so that your models can last longer. So, there we go. That’s what this podcast is about today. Let’s put the science into data science. And without further ado, I bring to you John Peach, principal data scientist at Oracle.
Kirill Eremenko: 00:03:19
Welcome everybody to the SuperDataScience podcast. Super excited to have you on the show. Today’s guest is John Peach, calling in from California. John, how are you going today?
John Peach: 00:03:30
I am doing great. Thank you so much for having me on. I appreciate it.
Kirill Eremenko: 00:03:33
Amazing. It’s great to have you. Great to have you. You mentioned that this is your second time moving to the U.S from Canada. Tell us about how that first time went, and why is this your second time?
John Peach: 00:03:48
Yeah. So, I moved to Vermont in '97 to do a master's and PhD in mechanical engineering at the University of Vermont. And when I was finished with that I moved back to Canada, to the French part of Canada (I didn't speak a word of French), where I met an amazing woman. She was doing her PhD and I was a faculty member at a different university, so I wasn't dating a student.
Kirill Eremenko: 00:04:18
[crosstalk 00:04:18]. No conflict over there.
John Peach: 00:04:22
No conflict. She’s a neuroscientist. So then the second move happened in 2013. She finished her PhD and got a postdoc position at University of California, Berkeley. And so, we moved to San Francisco and I left academia. I’m an academic at heart, but got into industry. If you’re in San Francisco, you do the startup thing.
John Peach: 00:04:48
So, I went into a bunch of startups as a data scientist. We moved around the Bay Area a little bit. She ended up going to Stanford for her second postdoc. So we moved closer to there. And now we’re in Southern California for her third postdoc at UC Irvine.
Kirill Eremenko: 00:05:03
Wow, third postdoc. Wow. Amazing. What’s the topic of her thesis?
John Peach: 00:05:10
So, she studies-
Kirill Eremenko: 00:05:12
Oh, sorry. Research.
John Peach: 00:05:12
Yeah, research.
Kirill Eremenko: 00:05:13
Research not thesis.
John Peach: 00:05:13
So, she studies the mechanisms of vision, basically: how the different circuitry wires together in the brain in people with a disorder called amblyopia, where they just use one eye, and how that produces vision in the brain. So, she uses a mouse model for that.
John Peach: 00:05:35
And I help her with her research. She does all the bench work and I help her with the analytics. I’ve taught her literate statistical programming and stuff like that to scale up her research.
Kirill Eremenko: 00:05:47
Wow.
John Peach: 00:05:47
She uses huge amounts of data.
Kirill Eremenko: 00:05:49
Oh, wow. You guys are like a power couple.
John Peach: 00:05:53
Yeah. We’ve heard that a couple of times.
Kirill Eremenko: 00:05:54
That’s awesome. That’s awesome.
John Peach: 00:05:56
Super nerdy couple.
Kirill Eremenko: 00:05:59
Very cool. All right. Okay, cool. So, how are you enjoying Irvine?
John Peach: 00:06:04
I love it. It’s an amazing place. The weather is spectacular, especially coming from Canada. I love Canada, but I have not had to shovel snow once in Irvine. My house might fall down or burn to the ground, but I haven’t had to shovel snow.
Kirill Eremenko: 00:06:18
Gotcha. Gotcha. I was in Irvine once, I think at the end of last year or somewhere around October, right after DataScienceGO in San Diego. And I learned a lot about this city. It's not too far from L.A., about two hours south. And it's highly underrated in terms of its significance for the tech world. I believe a lot of companies have offices there, like Amazon, a lot of big four consulting firms, I think Ernst & Young, maybe I'm mistaken about which one, but quite a big presence of companies and even headquarters of different organizations there. What is going on in Irvine? Why is it a tech hub that not many people know about?
John Peach: 00:07:06
Yeah. It’s kind of interesting. It’s a real big hub for the financial services industry. Like Experian is here and there’s, like you said, Amazon and Oracle. And there’s also a large number of small to medium size startups in the tech field. And you’re right, it’s just really underrated. It’s a great place to live and there’s lots going on and we have a great tech community that’s happening here and it’s just outside of L.A.
Kirill Eremenko: 00:07:40
And UCI, University of California, Irvine, fantastic place. When I was there, the campus is so big; they have their own restaurants on campus. You can walk from one restaurant area to another. There’s, I think, I don’t remember the number, but tens of thousands of students go to that university. It’s huge.
John Peach: 00:07:59
It’s huge. And it’s like living in a park. I actually live on campus.
Kirill Eremenko: 00:08:04
Oh, really? Wow.
John Peach: 00:08:07
Yeah. Yeah, it’s a great school. They do a lot of really good AI research that’s happening here that’s just not well-known. People know about the UCI machine learning repository for data and stuff like that, but there’s a very large research group here in AI.
Kirill Eremenko: 00:08:26
Mm-hmm (affirmative). Wow. Fantastic. Fantastic. Well, I’m very excited to have you on board. And as I mentioned before the podcast, congrats on the new job at… Well, relatively new, seven months at Oracle. How are you enjoying your time there?
John Peach: 00:08:40
I love it. It’s been a big change. Before that I was at Amazon and that was a great job. I moved to Oracle basically because I have an interest in the data science workflow and they just started up a new service, a data science service on the Oracle Cloud. I want to get down on the ground floor and help direct that product so that data scientists would have better workflows.
John Peach: 00:09:09
It’s a great company to work for, work with great people doing lots of really amazing things. We’re very early days in our product. It’s general availability. There’s not a whole lot of past history so that we can go and do amazing things and help data scientists work better and faster.
Kirill Eremenko: 00:09:32
Wonderful. What is a data science workflow in a nutshell?
John Peach: 00:09:37
Yeah, that’s a very good question. You see all kinds of graphs where, you see this life cycle of data science or you see a data science pipeline. And I think those very much are oversimplifications. The data science workflow really starts with identifying the right questions, which is actually probably the hardest thing to do, like figuring out the right thing to ask.
John Peach: 00:10:08
Even if your boss is saying, “Hey, can you go figure out X,” X may not really be what they need. And it’s a very iterative process. It’s not like you do your data collection, then you do your analysis and then you build your model. That’s not really the way it works. It’s you kind of follow that path but you often circle back to earlier stages within that. So, you think about the problem, you ideate over it and then you develop your hypothesis and then you collect the data and prototype it.
John Peach: 00:10:47
As you’re doing that, you may go like, “Oh, maybe I didn’t ask really the right question. Maybe I need to rephrase that question.” Or, “Maybe I don’t have the data to answer that question, but I have data that’s proxy to it so I can answer this other question that gets us closer to what we really truly want to know.”
John Peach: 00:11:04
So, it’s this very iterative process. And I think there’s a problem around tooling, around expertise and being able to track your work when you’re doing this iterative back and forth, branching off, checking out an idea and then coming back and then branching off again. I think that’s really the data science workflow that people do and not this, like you do one thing then the next. It’s not a waterfall process.
Kirill Eremenko: 00:11:33
Mm-hmm (affirmative). Gotcha. Gotcha. It kind of ties into what you spoke about at DataScienceGO 2019 last year, Design Thinking, right? Data science design thinking that you got to iterate and talk to your clients as well. But let’s talk a bit more about this data science workflow, to the extent, of course, you can share without disclosing any secrets or giving spoilers about what’s coming up for us in the Oracle space. How are you going to solve that problem? What is your vision for a tech solution, or online solution, or whatever kind of solution it is for people to help data scientists with this data science workflow exploration process?
John Peach: 00:12:21
Yeah. So, right now, the core product that we have is a Jupyter Notebook. And there are a couple of advantages to the service, such as being able to scale your compute, but we really want to go beyond that. In the Cloud, you can integrate with other services that give you the flexibility to share your work and to have immutable models.
John Peach: 00:12:48
So, we have a model catalog where you create a model and you push that model to the model catalog. And what comes with it is the immutability of the model, so the model can't change, and also metadata, like which notebook and what data were used to build that model. So you push that model into production, and you might have four or five models in production that are all slightly different, that came out of slightly different notebooks.
John Peach: 00:13:14
And then you want to go back and you want to look like, “Hey, this one’s doing well here. And this one’s failing here. Can I get back to the exact data that was used, the exact code that built the model so that I can do a postmortem on it?” So, bringing that metadata into the data catalog is really important.
John Peach: 00:13:38
So, it’s building tools around that integration with Git so that you can keep your notebook. I mean, how many of us have Notebooks that say final notebook, final final notebook, final final notebook? So, integrating with Git changes all of that. You can tag like, “Hey, this is the version that I sent off to my boss for analysis. And then this was the second version that I sent to him.” And, “Hey, this was another version that I sent to another person that has some information retracted or had an extra little bit of analysis.”
John Peach: 00:14:14
But it’s all in the same compartment. It’s all in the same, Git Repo. Any models you’ve push out, you can pull that back and say, “Hey, this is the notebook that I used it in.” And then we want to tie into other services. So, we have all the large scale compute, Spark and databases and stuff like that.
John Peach: 00:14:34
And you can have all of that in your notebook tied to the data that was there so that you have that ability to do a postmortem to analyze what you did. If you had five as an answer somewhere, you can go back and figure out exactly how that five was calculated.
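To make that reproducibility idea concrete, here is a minimal sketch in Python of saving a model alongside the metadata a postmortem would need: a fingerprint of the training data, the git commit, and the notebook path. This is a generic illustration of the concept, not the Oracle model catalog API, and every function and file name in it is hypothetical.

```python
import hashlib
import json
import pickle
import subprocess
from datetime import datetime, timezone


def save_model_with_metadata(model, train_df, notebook_path, out_prefix):
    """Persist a model next to the metadata needed to reproduce it later.

    Generic sketch only (not the Oracle model catalog API): record a hash of
    the training data, the current git commit, and a timestamp, so a
    postmortem can recover exactly what built the model.
    """
    # Fingerprint the training DataFrame so we can detect later whether it changed.
    data_hash = hashlib.sha256(
        train_df.to_csv(index=False).encode("utf-8")
    ).hexdigest()

    # Record the current git commit (assumes this runs inside a git repo).
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    metadata = {
        "notebook": notebook_path,
        "git_commit": commit,
        "train_data_sha256": data_hash,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

    # Save the model and its metadata side by side.
    with open(f"{out_prefix}.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(f"{out_prefix}.metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```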
Kirill Eremenko: 00:14:53
Mm-hmm (affirmative). I now remember I had a podcast with Greg Pavlik who is, I think, one of the executives in this space of Oracle Cloud. And we spoke about this. I just didn’t know it was going to be called a data science workflow, or maybe that just escaped my memory.
Kirill Eremenko: 00:15:15
So, I remember being super excited about this. This is definitely something that the world of data science needs, because as you said, there are so many variables, like the model version and the data version, that are changing, and why you did what you did. If all of that can be somehow archived or traced, that will add a lot of certainty to these data science projects and how we experiment with things. I think a lot of rework will definitely go away because of that. Have you had any early feedback from testers?
John Peach: 00:15:55
Yeah. So, the feedback that we're getting on the service has been very positive. We're growing our customer base. There are definitely pain points and we're very actively fixing those pain points, but yeah. We also have an accelerated data science library that comes with it that is meant to automate a lot of the mundane stuff, like exploratory data analysis: most of the time you're just generating dozens of graphs and looking at your features and stuff like that. We have a single command that will do that for you-
Kirill Eremenko: 00:16:36
Nice.
John Peach: 00:16:37
… and that type of thing. People really appreciate that-
Kirill Eremenko: 00:16:39
Nice.
John Peach: 00:16:39
… so they’re not sitting there running the same plot 50 times.
Kirill Eremenko: 00:16:44
That sounded so effortless, like, "we have a single command that just does it." It's like, you guys have got it all sorted, huh? How come more people are not using this? I wish more people had access to this somehow, to be able to do things like that with a single command. Right? How convenient would that be?
John Peach: 00:17:04
Yeah. I think there’s a real tooling issue in the data science community. And I think that comes out of science. I mean, most people that go and get their PhD, which is where data science started, they’re used to rolling their own. They come up with solutions that are one off. There’s not standardized tooling. The lab down the hall that’s doing basically the same thing is using different software. Even within the same lab people use different software and do things differently. There’s not this sense of, we need to productionalize our workflow. We need to streamline as much as possible.
John Peach: 00:17:44
And you see people coming out of a CS background that they’re doing that. CS workflows are very streamlined, standardized tooling. They understand the cost of doing something differently than the guy beside them. And we’re starting to see this slow movement towards doing that, but the tooling just isn’t there. And the tooling that we have now is largely CS type tooling, which is quite often unfamiliar to data scientists coming out of university. They’re not used to using Git. It’s a strange, weird thing to them.
John Peach: 00:18:26
And wrapping your head around Git, it takes a bit of a learning curve. So, they just don’t do it. So, they produce final report, final final report and like, “Oh, I’ll just change this now and I’ll remember that I made the change,” and they don’t. They’re not used to sharing.
Kirill Eremenko: 00:18:45
Got you. Just quickly. I’m sure we spoke about this with Greg on the podcast, but if you could reiterate for maybe those who haven’t heard that episode, is this tool available for free, is it paid, what’s the price and things like that?
John Peach: 00:18:57
Sure. So, it’s part of the Oracle Cloud infrastructure. And if you go to the Oracle website, you can sign up and get 30 days of access where you get $300 free to do whatever you want on the service. After that, there is some services that have a free tier. So, you can get a small database that’s free forever. You can get some small compute instances that are free forever. Unfortunately, on the data science side we need bigger machines, so after 30 days you would have to pay. But during that 30 days you get $300 worth of free credits-
Kirill Eremenko: 00:19:38
Yeah, got you.
John Peach: 00:19:39
… which is a lot.
Kirill Eremenko: 00:19:40
Great to try things out and then if it makes sense, especially on a business side of things, it’s good because startups can get into that, right? Like startups-
John Peach: 00:19:48
Exactly. Yeah.
Kirill Eremenko: 00:19:49
… can scale up their way. Hope you’re enjoying this amazing episode. We’re going to break for a quick announcement and we’ll be straight back to it. This episode is brought to you by our very own virtual data science conference called DataScienceGO. If you haven’t been to DataScienceGO yet, if you haven’t heard of DataScienceGO, check it out at datasciencego.com/virtual.
Kirill Eremenko: 00:20:09
There you’ll see a recap of the incredible event we had in June this year. We’re hosting DataScienceGO Virtual number two in October. Make sure to be there. We’re going to have amazing speakers, amazing energy, and we’re going to have virtual networking, three minutes sessions to connect with your peers, with mentors, with hiring managers, with mentees, with whoever is at the conference, random lightning networking, three minutes each. You can stay in touch with these people, expand your data science network. Be there. It’s in October. It’s absolutely free. The best part, it’s absolutely free. Just go to datasciencego.com/virtual, register for the event today. All right, let’s get back to the podcast.
Kirill Eremenko: 00:20:54
Awesome. Well, let’s talk a bit about your data science design thinking framework, because it ties really well into this data science workflow. That was a title of your talk at DataScienceGO 2019. And I think this framework can help people understand better and save time on the data science workflow and be a bit more rigorous about it. So, yeah. So, what is data science design thinking?
John Peach: 00:21:23
So, before I answer that, let me bring up what I think one of the problems is that I'm trying to solve. Typically, new fields start with people that come out with PhDs, and then you start seeing universities develop PhD programs around them, and then you start seeing master's programs around them. And then, when the field is fairly well-defined, you start seeing undergraduate programs around them.
John Peach: 00:21:49
You haven’t seen that in data science. The university started with master’s programs. We’re now starting to see a couple of undergraduate programs in data science. And what’s getting missed is the science aspect of data science, which I think is extremely important. And when you were in grad school doing research, you don’t take a class in how to think. But you learn to think in a certain way, you learn to think as a scientist. And some of that is getting missed, so how do you take somebody that wants to get into data science but doesn’t want to spend eight years getting a PhD, which makes sense.
John Peach: 00:22:34
You don’t need a PhD to be a really good data scientist, but what do you need to do is learn to think like a scientist. And there are several different ways you can do that. Like if you’re an undergrad, I encourage people to go volunteer in a research lab and just get some exposure to it. But if you’re out working in industry or don’t have the time as an undergrad, is there something that you can do to learn to think like a scientist? And that’s where the data science design thinking comes from.
John Peach: 00:23:06
I’m stealing it from the design industry. They have this thing called design thinking. It’s a process that they use that overlaps really well with the way that a scientist works. So, I’ve basically taken that and adapted it. There’s five primary components to it. There’s the empathize, define, ideate, prototype and test. So, in the empathize phase, this is, I think, one of the really hardest phases to get right.
John Peach: 00:23:38
You need to understand what the customer needs. So, you may have an ask like, “Hey, can you go figure out X, Y, Z?” And that’s really not what they need. It’s understanding what the real question is and then getting the customer on board with that so that you’re defining for them something that they really want to understand. And that takes a lot of practice and it takes a lot of gumption. It’s hard to go back to the customer or the boss and say, “Hey, I don’t think that’s really the right question. Can we talk about these other areas that I think may be the right question that we really want to get at?”
John Peach: 00:24:20
And then the next phase is to define the problem. So, problems can be, in design thinking terms, either wicked or tame, or as I like to refer to them, ill-defined or well-defined. An ill-defined question would be something like, how do we get our customers to click on the blue button? A well-defined question would be, what features or attributes do we need on the website that would encourage customers to click on the button so that we see a 10% increase in clicks? There you're defining exactly what you're going to do, and you have something that's a bit of a null hypothesis test, right? Can we increase the number of clicks? If the value went up by one, did you really increase it? You need something that you can run a null test against.
Kirill Eremenko: 00:25:15
Measurable.
John Peach: 00:25:15
It’s measurable and objective.
Kirill Eremenko: 00:25:17
Mm-hmm (affirmative). Mm-hmm (affirmative).
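As a concrete illustration of the kind of null test John is describing for a well-defined question, here is a minimal sketch of a two-proportion z-test on click-through rates. All the numbers are made up, and the use of statsmodels is simply one reasonable library choice.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical experiment: did the redesigned button lift click-through rate?
clicks = [560, 640]           # clicks on [old button, new button]
visitors = [10_000, 10_000]   # visitors shown each variant

# H0: the new variant's click rate is no higher than the old one's.
# alternative="smaller" tests whether rate(old) - rate(new) < 0.
stat, p_value = proportions_ztest(clicks, visitors, alternative="smaller")
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value is evidence that the new button genuinely increased clicks.
```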
John Peach: 00:25:19
And then next phase is to ideate. And this is to really think about the problem. To go talk to experts, think really broadly about the problem. I’m a big believer in stealing. So, don’t reinvent the wheel, go read literature. Most problems you’re working on have been solved or lots of research has gone into them. So, read the scientific literature. That’s a hard skill to get used to because most of that is really opaque.
John Peach: 00:25:53
But I think it’s something that anyone can learn to do. Go read patents. Most people skip over patents and don’t think about them. Patents are a great place. Patents fully disclose exactly how something is done, where the scientific literature it’s often hard to reproduce what somebody has written in a paper because of the restrictions on length. But patents have to tell you, by definition, they have to tell you exactly how something is done.
John Peach: 00:26:23
So, if you can go find a patent that says, “This is how we figured out how to get people to click on red buttons and you’re working on blue buttons,” there’s probably a lot of really good information in there. And then obviously brainstorm. You might have good ideas, but a collection of people have better ideas than you do. And so, go talk to people and brainstorm. And don’t be afraid to propose dumb ideas. And then I think one of the most important things is to sleep on it. Yeah. To give yourself-
Kirill Eremenko: 00:26:54
[crosstalk 00:26:54] framework.
John Peach: 00:26:58
Sleep on it. Sleep on it. Sorry boss, I’m not coming in today. I’m sleeping on it.
Kirill Eremenko: 00:27:01
This is part of my job. I’m doing my-
John Peach: 00:27:04
It’s part of my job.
Kirill Eremenko: 00:27:06
… seven hours of sleep.
John Peach: 00:27:10
Rarely has my first idea been the right idea.
Kirill Eremenko: 00:27:12
Yeah.
John Peach: 00:27:15
Mulling it over in the back of your brain for a couple of days goes a long way. You may think that you have a great solution and then a couple of days later you realize, “Oh, I need to go back and redefine this problem because there’s some issues with the way that I actually defined the problem.”
John Peach: 00:27:36
And then once you understand what the customer needs, you have a testable hypothesis, you know what you’re going to do, you use abductive reasoning to build your prototype. So, most people are from the-
Kirill Eremenko: 00:27:51
What is abductive reasoning?
John Peach: 00:27:53
Yeah. So, most people are familiar with deductive reasoning, which is: A leads to B, which leads to C. Abductive reasoning is based on observations, and it's not about the logical conclusion, like "this has to be the answer." It's "I'm moving in the direction of the answer." I don't know what the answer is, but what I know now is better than what I knew before. So, you don't think of solving your problems as a one-shot thing. You're going to iterate.
John Peach: 00:28:22
So, it’s all right to have uncertainty, build your models even though your models aren’t perfect. Don’t prematurely optimize. Just build something that’s a little better. Start with the null model, the model that just is the average. And then say, “Can I do a little better than the average?” And then learn from that.
John Peach: 00:28:44
So, once you have that prototype, you go and you test it. And a lot of people try to get the model perfect the first time. George Box, the famous statistician that we all know from his box plots, he has-
Kirill Eremenko: 00:29:02
Was his surname actually Box?
John Peach: 00:29:04
It was actually Box. Yeah.
Kirill Eremenko: 00:29:05
And so they’re called Box plots because of him?
John Peach: 00:29:08
Yeah. Even though the plot looks like a box.
Kirill Eremenko: 00:29:09
Yeah, oh my gosh. I never knew. Wow. That is so cool.
John Peach: 00:29:15
So, he has a quote that I live by and it’s like, “All models are wrong, but some are useful.” A model is approximating our interpretation of the world. We’re never going to get it perfectly right. But can I glean some information from the model? Does it help me solve the problem better than not having that model?
John Peach: 00:29:38
So then once we do test, most people look at the tests and like, oh, accept the null hypothesis, reject the null hypothesis. And yeah, that’s important, but I think more importantly is to look at your model, and hopefully you’ve released multiple models so that you can run multiple experiments simultaneously. And you go like, “Where did this model do well? Where did this model fail?”
John Peach: 00:30:05
And then you use that information to go back and think about the problem again. Empathize with the customer: "Are we asking the right questions?" Go and redefine the problem, ideate again, prototype again. Or maybe you just need to go back and prototype. Like, you think things are good, it's just, "Hey, the model is not really good in this particular segment. So, let's just go back and see if we can fix that."
John Peach: 00:30:32
And so, you build new models and you rinse and repeat. And it’s this very iterative process that we go through. It’s not linear, it’s not circular. You’re bouncing all over the place, but you’re using the tests to learn about the data and the system that you’re working on. And those tests guide your further decisions.
Kirill Eremenko: 00:30:57
Reminds me a lot of lean startup approaches too. Like sprints and agile instead of a waterfall methodology.
John Peach: 00:31:08
Exactly, exactly. They’re very similar.
Kirill Eremenko: 00:31:11
They’re very cool because I was actually thinking recently, is there an agile manifesto, but not for developers or for product creation, but rather for data science? And it sounds like this is what you’re describing here. And do you have any insights, have you used this or others used this in real data science projects and what’s the feedback from that?
John Peach: 00:31:40
Yeah. So, I’ve been developing this over the last decade or so. And I’ve used it in A series startups or startups that have been around, are fairly stable like C series startups. And I’ve used it in large corporations like Amazon. And I’ve been tweaking it along the way and the feedback that I’ve been getting from the people that I’ve been teaching it to has been extremely positive. And they’ve given me some good feedback on how I can make it better and how I can rephrase some things. And when I talk to people that have PhDs, a lot of them are like, “Eh.”
Kirill Eremenko: 00:32:18
That’s not how we do it.
John Peach: 00:32:23
No, it’s like, “Yeah, that’s not what I do.”
Kirill Eremenko: 00:32:25
I knew it all along.
John Peach: 00:32:26
Yeah, I knew it all along, but I just didn't know that I knew it. And I think the real strength is coming from people that... and I shouldn't be saying PhDs, I'm saying PhD as shorthand for research backgrounds. From people that don't have solid research backgrounds, the feedback has been really positive. They've been like, "I've been doing something like that, but not quite."
John Peach: 00:32:48
And it reemphasizes some of the things that they need to do. There's a tendency for people to take an ill-defined question and just jump in and build a model. But really, the modeling is the last little thing you do at the end. You're almost done by the time you build the model.
Kirill Eremenko: 00:33:08
Got you.
John Peach: 00:33:09
And trying to teach people that the model is the icing on the cake. And it’s all the stuff that goes before that that’s really important.
Kirill Eremenko: 00:33:17
I got a slogan for you, putting science into data science.
John Peach: 00:33:22
Yeah. I like that.
Kirill Eremenko: 00:33:24
Exactly what you said at the start. It all clicks now in my head. What you said at the start, that data science is missing that science piece, right? We jump in, we start building models, it's useful for the business and so on, but really, if you approach it as a scientist, there's a lot of questioning you have to do at the start, understanding the problem, refining the problem, experimenting, looking in different directions and things like that. There are a lot more frameworks that can be built around the creative element of data science. It's not just a rule book that you have to follow.
Kirill Eremenko: 00:34:03
There’s a lot of creativity, but hey, let’s put some… How do we normally do it in science? It’s through these hypothesis testing, through brainstorming, as you said, abductive reasoning. Super cool. Super cool. I think more data scientists should be thinking about that for sure.
Kirill Eremenko: 00:34:20
What’s an example, if you have, of a project among your experience that you’ve mentioned, of a project that you used this data science design thinking approach on but had you not used it, had you done it the typical waterfall approach or another method, your results would have been different or not as good? Something where you can say that this method really made a huge difference and added a lot of value.
John Peach: 00:34:47
Yeah. So, I worked for this startup that failed, because that’s what startups do.
Kirill Eremenko: 00:34:54
Most of them, yes.
John Peach: 00:34:55
Most of them do. Yeah.
Kirill Eremenko: 00:34:56
Yeah.
John Peach: 00:34:58
We were building a smart cup. So, the idea was that this cup would sense what you’re drinking and it would log it for you, and we had intelligence built into it. So let’s say you want to control the amount of caffeine that you were drinking or if you wanted to make sure that you were staying hydrated throughout the day. And we’d tie it into your Fitbit.
Kirill Eremenko: 00:35:18
Love that. That’s so cool.
John Peach: 00:35:19
Yeah. It was a spectacular project. And one of the problems that we had is we had to figure out how much liquid was in the cup. So, the way that the engineering was done was, there was a capacitance sensor. Sorry, yeah, a capacitance sensor along the side just like the touch sensors on your smartphone. And when you pour a liquid into it, it will sense the ionic charge in the liquid.
John Peach: 00:35:57
And it worked really great if the liquid you poured in was at room temperature, and it didn’t move, and the cup was at room temperature, and you let it sit there for five minutes. Not particularly helpful if you’re drinking from it and your liquid is hot or cold. So, we were getting these erroneous readings.
John Peach: 00:36:20
And so, I was like, "Well, what's the real problem here? How can we figure out, with the sensors that we have, how much liquid is in the cup?" So, I talked to the engineering team, and it turns out capacitance sensors are temperature sensitive. There are these little thermocouples, thermal resistors, along the side that measure the temperature, and the capacitance sensor is supposed to adjust for that. But then, and this is where my engineering background kicked in, you have a rate of heat transfer through the cup.
John Peach: 00:36:58
And so, we used those temperature readings to adjust the capacitance reading so that we were getting a better measure of the liquid in the cup. And here it's understanding the engineering that's going into it and not just, I need to measure the liquid level. Right? It's really thinking about what the problem is that we're having and what the physics behind it is, and then diving into that.
John Peach: 00:37:33
And we got better results, but it wasn't great. So, then I got thinking about it more, and there were other sensors on the cup that we could reuse for different things. There was a sensor for touch, for when you touch the cup or tilt the cup, that type of thing. And the capacitance sensor would give us weird readings if you tilted the cup, because the liquid level would drop.
John Peach: 00:37:56
So, it was thinking about that and iterating and just solving little bits of the problem. So, how do I deal with the temperature? I did that. And then, are there other things I can do to give me more accurate results? So, I started using other sensors that were meant for something else. And then I started looking at the accelerometer to get the tilt of the cup so that we could adjust for it, so it didn't look like somebody drank just because they tilted the cup, or the cup was in their backpack and it's sloshing around. We just look at the accelerometer.
John Peach: 00:38:26
So, it’s incrementally solving each little problem, breaking it down, diving into the literature to figure out how to do different things. So, I had to learn about the circuitry. I had to learn about heat transfer, all this type of stuff, reading patents on people that have done this before and brainstorming and building little prototypes and tons and tons of models that failed and didn’t do well. But learning from each failure and going through this process.
John Peach: 00:39:05
If I wasn’t using design thinking to do this, it would have been an insurmountable problem. We would have just said, “Oh, the technology is wrong. We have to go back and re-engineer this.” And that would have cost us a fortune.
Kirill Eremenko: 00:39:18
Mm-hmm (affirmative). Why is that? Is that because there would have been a lot of fear in front of this problem?
John Peach: 00:39:27
Doing engineering is extremely expensive when you have to make a change, so just replacing a sensor would have required an entire redesign of the cup, changing the entire manufacturing pipeline, the supplier pipeline, that type of thing. So, it’s easier to work within the constraints that you have than it is to go back and physically re-engineer something.
John Peach: 00:39:55
You might have to, and you might get to that point like, “Hey, we just can’t solve this problem.” But using this data science design thinking, we were able to redefine problems and redefine solutions and step outside of our comfort level by getting into the literature and seeing what other people had done. And you notice what I didn’t talk a whole lot about there was the modeling.
John Peach: 00:40:19
The modeling was important, but that wasn't the hard part. Modeling has gotten to the point that it's often just a function call and having a really good understanding of the results that you get back. But the hard part is defining the problem and then figuring out a way to test it.
Kirill Eremenko: 00:40:34
Interesting. Would you say that's the main difference: that the standard, unconscious data science approach is to just model, model, model, whereas in data science design thinking you prioritize the empathizing and defining?
John Peach: 00:40:51
Yeah. That’s it exactly.
Kirill Eremenko: 00:40:53
Mm-hmm (affirmative). Amazing.
John Peach: 00:40:53
Make sure you’re asking the right question.
Kirill Eremenko: 00:40:56
Gotcha. Why did the project fail ultimately?
John Peach: 00:41:00
The company basically ran out of money. Startups are tough. Startups where you’re building a product are even tougher. The costs are generally the engineering costs upfront and that can run into a hundred million dollars to produce a consumer electronic good.
Kirill Eremenko: 00:41:22
Wow. Wow.
John Peach: 00:41:23
And then you make your money after the fact. And it's a little different when you're talking about a software product: you can release your MVP, your minimum viable product, and get some customers on board. And that's what we're doing at Oracle. We didn't wait until we had the perfect product. We had a good product and we released that. Then you start getting some income from that, and then you start building out the product more and more. A consumer electronic good is a one-off. It's either a good product or not a good product when you ship.
Kirill Eremenko: 00:42:01
Mm-hmm (affirmative). Gotcha. Gotcha. One thing I’ve been just thinking about when you were talking about this cup was like, “Gosh, I hope nobody ever puts those into a microwave.” Imagine that.
John Peach: 00:42:16
It would be an expensive cup to lose.
Kirill Eremenko: 00:42:20
Wow. Okay. Gotcha. So that was very useful and something that definitely should take center stage for data scientists. Is there any place where people can find out more about data science design thinking?
John Peach: 00:42:41
I’m thinking of writing a book about it.
Kirill Eremenko: 00:42:44
You should. You should totally do that.
John Peach: 00:42:45
Yeah. There’s the DataScienceGO talk that I gave is probably the best talk on it. You can sign up for that and see a ton of great talks.
Kirill Eremenko: 00:42:58
Awesome.
John Peach: 00:42:59
And one of mine is there.
Kirill Eremenko: 00:43:01
That’s good. Yeah, that’s the 2019 conference. Awesome. Thank you. For sure, write the book. I’d love to have a copy of that. That sounds like a great book. Okay. Let’s shift gears and talk a bit about literate statistical programming. I am very curious what this is and what it’s all about. Could you please explain? You’ve done a talk on this and sounds like a very interesting concept.
John Peach: 00:43:29
Yeah. So again, this is around trying to work on solving problems with the way that data scientists work. And the history of it is, there’s something called literate programming that Donald Knuth from Stanford, a very famous computer scientist, wrote a book on and pretty much every software developer does it today. And the idea basically is that your documentation and your code sit side by side.
John Peach: 00:44:03
So, the old school IBM approach was you spent two years writing specifications and documentations, and then you threw that over the wall and the software developers implemented it. We don’t do that anymore. This is where the agile comes in, where you iterate. And the idea of having the documentation right beside the code is when you change the code, you just change the documentation. It’s right there. The software developer is responsible for it.
John Peach: 00:44:30
And you have a single stream of human readable and machine readable code living together. And then you perform a process of tangling and weaving. So, the weaving process is where you run the code and what you get out is a human readable document. And tangling is when you run it and what you get out is machine readable code.
John Peach: 00:44:58
So, Jupyter Notebooks, which is what we have in the Oracle service, does a decent job of this. You have two types of cells. You have a markdown cell where you would put the human readable output, and then you have your code. So, the traditional approach would be, you make some graph and then you cut and paste that graph into a word document and you write some text around it that explains the graph.
John Peach: 00:45:24
In literate statistical programming, you don't do that. The code and the graph sit right beside each other. So, if you want to understand how that graph was made, you just go look at the code. If you want to make a change to the graph, you change the code and the graph changes right there. And you don't have this disconnect between the documentation and the code.
John Peach: 00:45:49
So, the big difference between literate programming and literate statistical programming is that in literate programming, code is primary and the documentation is optional. Hopefully you're documenting your code, but code is what you want to produce. As data scientists, we write code to do our analysis, but that's really not what we're there for. Our primary role is to communicate what the data is saying.
John Peach: 00:46:16
So, the human readable documentation is primary. You need to have the code there, but you shouldn't be handing over reports to your boss that have code in them. He doesn't want to look at that. He wants to look at a nice document that he can read, that explains the problem to him. And this is where the tangling and the weaving happen.
John Peach: 00:46:41
So, you don’t want them separated from each other. You don’t want to be cutting and pasting statistical analysis. When your data changes, you just want everything to change with it. So, the end product is going to be a written report, or it may be a model, but like we were talking about before, we don’t want our model and our code decoupled from each other, we want them sitting together.
John Peach: 00:47:07
So, literate statistical programming, that workflow, handles that for you. And what you're saving is not the output; what you save is the pipeline. So, you're trying to pipeline everything that you do as much as reasonably possible. Don't prematurely optimize on something. Another thing I do is I'll write code that generates a graph, and then right below that I write my interpretation of the graph. I don't just say, here's the graph.
John Peach: 00:47:39
If I’m going to spend five minutes interpreting it, I’ll spend a minute writing that up so that when I go back, I don’t have to spend another five minutes interpreting it. So, we’re trying to address a bunch of different problems. And one is the fragmentation problems. So, the code, the analysis and the model live all together in one thing. And so, Jupyter Notebooks in the Oracle data science platform does this for us, the analysis is there, how they get the code is all there. We save the model all on the same platform.
John Peach: 00:48:12
And what we’re saving is the state of the system. And we can check this in to Git so that when we make changes, we know those changes what happened.
Kirill Eremenko: 00:48:22
Gotcha. Gotcha. So, this is what we talked about at the beginning of the podcast.
John Peach: 00:48:26
Exactly.
Kirill Eremenko: 00:48:27
Okay. Understood. Understood. Interesting. Why such a complex name though? Literate statistical programming. Is there any reason for the etymology of this concept?
John Peach: 00:48:43
Basically it’s a play on literate programming and I just stuck the word statistical in it. So, I didn’t come up with this idea. I had heard about it from one of the professors at John Hopkins. And I’ve been expanding on it over the years. And other people do it, like knitr in R Studio is an amazing tool for doing literate statistical programming. They just don’t call it that. They just call it knitr, but that’s a product. And what I want to convey is, we should be agnostic to the product and it’s the workflow. It’s the mindset.
Kirill Eremenko: 00:49:18
Okay. Okay. Gotcha. Gotcha.
John Peach: 00:49:19
If you have a better name, I’m open to it.
Kirill Eremenko: 00:49:23
No, got you. It ties in quite well. So, Jupyter Notebooks, different types of cells. Do you think there is a way of doing it even better than Jupyter Notebooks currently is able to save that information side by side?
John Peach: 00:49:38
Yeah. So, knitr in RStudio does a really good job. knitr ties a bunch of technologies together. You basically write your documentation in markdown. And if you're not familiar with markdown, it is a very, very simplified markup language. You can learn markdown in 15 minutes. There are only a couple of things you can do with it, but it produces nice output.
John Peach: 00:50:07
And then the R part of R markdown that knitr uses is that you can put code right in this markdown document. And there’s two ways you can put code. You can put a code block where you can have large chunks of code and do complicated analysis and generate graphs. And you can turn that off so that your boss doesn’t see it, but it’s sitting right there.
John Peach: 00:50:31
They also have inline code, so you can write things like "the p-value was" and then some R code that prints out the p-value, so that when the data changes, you don't have to go back and remember that you need to change that p-value. It's automatic for you. It's part of that pipelining process. It also allows you to use LaTeX. LaTeX is a bit tough to learn, but I really encourage people to learn it. It allows you to write mathematical equations and have a little better formatting, but I think you should avoid doing a lot of formatting stuff and let markdown just do it for you.
John Peach: 00:51:10
So, it does a really good job, and my typical output from that is a beautiful, production-worthy PDF that I can send off to my boss. He asks for a change, it's just a quick little change and boom, I send it off to him again. Or I get new data and I can just re-run it. So, that's a really good workflow. The Jupyter Notebook workflow is fairly similar, but it doesn't have the LaTeX ability and it doesn't have the ability to turn off code chunks.
John Peach: 00:51:43
So, the output is not as production-worthy. I wouldn't hand that over to an external customer, but it might be fine for my boss.
Kirill Eremenko: 00:51:53
Mm-hmm (affirmative). Gotcha. Okay. Do you prefer R or Python?
John Peach: 00:51:58
I prefer the right tool for the job. R is a purpose-built language for doing analysis, in particular statistical analysis. So, when I'm doing that type of stuff, I much prefer R. It's just a beautiful language. If you come from a traditional programming background, you might find it very uncomfortable. It's a functional language as opposed to an imperative language, so it's odd, but if you're coming from a non-programming background, it's a great language to use. And the development environment is really good. RStudio is not the only choice, but it's a really good choice.
Kirill Eremenko: 00:52:44
Absolutely.
John Peach: 00:52:45
It’s better tooling than what’s available on the Python environment. If you’re doing more engineering type of coding or you’re doing natural language processing, Python is a much better language for that. Python is a general purpose computing language. It’s not designed for statistical analysis. It’s had some really good packages like NumPy and scikit-learn bolted onto it. But it’s not purpose built for that.
Kirill Eremenko: 00:53:19
Gotcha. Absolutely. And John, one more topic. I know this one's quite advanced, but let's see how much we can cover in the remaining time. Data unit testing, what is that? Tell us about data unit testing.
John Peach: 00:53:34
Yeah. So again, borrowing ideas from other industries, software developers do this thing called unit testing, where they write a function and then they write tests that just test that function. So, they test what the input is, what the output is, they test the error conditions. And often you spend more time writing tests than you actually do writing the real code.
Kirill Eremenko: 00:54:04
Absolutely. Yeah. Yeah.
John Peach: 00:54:05
Absolutely.
Kirill Eremenko: 00:54:06
Our developers spend half their time on tests.
John Peach: 00:54:10
Yeah. And most people hate writing them.
Kirill Eremenko: 00:54:11
Yeah.
John Peach: 00:54:13
But they’re very important. So, the problem that I’m trying to solve is, we do the same thing in data science, but we don’t capture those tests. We spend a lot of time doing exploratory data analysis. That’s the biggest part of what we do. My quip joke is that I’m a data garbage man because I spend all kinds of time cleaning the data, and understanding the data, and visualizing it, and saying, “Oh, okay, this data here looks like it’s a Poisson distribution and this data over here looks like it’s normally distributed. And this data, it’s dirty in this way. I need to write some code to clean it up.” I don’t manually clean my code. I write code to clean my code. And so, when I’ve done all of that, the data’s going to change underneath me. What I thought was a Poisson distribution may change to be a more normal distribution.
Kirill Eremenko: 00:55:23
Or log normal or something.
John Peach: 00:55:24
Or a log normal or something like that. And now my model’s wrong. Right? I built my model assuming that was a Poisson distribution. And how do I detect that? Do I go back and do my exploratory data analysis? Not going to happen. It’s just way too time consuming. Or I was expecting that this data has these three things in it, and now there’s a fourth one that’s been put in, but my model doesn’t account for that.
John Peach: 00:55:54
So, I would write tests, and this is the data unit testing, where I would write tests based on all the assumptions that I have about my data. I'm assuming that there are three categories in this data, so I write a test for it. I'm assuming that these categories are evenly distributed. So, let's say my original data was fairly balanced between male and female, but then the data shifts over time and there are more females than males. I need to capture that, because that just shifted my model. My model is assuming a certain state of the data, and I need to know when that changes.
Kirill Eremenko: 00:56:32
It’s not like you, by doing your EDA through manipulations, you inadvertently changed the data. What you’re saying is that over time, populations may change, the underlying assumptions may change, and that’s what you need to capture. Is that true?
John Peach: 00:56:46
Exactly. Exactly. So, we talk about model drift and we monitor model drift, and it's like, oh, the model was 80% good and now it's down to 75, now it's down to 70. Then you need to go back and figure out, well, why is it 70? You have to do your EDA all over again. If you had written your tests, you would go, oh, when the model hit 78%, it was because we had a demographic shift. So, let's go fix that right now. I know exactly where that problem is. And it's a quick fix.
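A minimal sketch of the kind of data unit test that would catch the demographic shift John describes, written as assertions over a new batch of data. The column name, categories, training proportions, and thresholds are all hypothetical.

```python
import pandas as pd
from scipy import stats


def test_gender_assumptions(new_batch: pd.DataFrame) -> None:
    """Fail loudly if assumptions baked into the model no longer hold."""
    # Assumption 1: only the categories the model was trained on appear.
    expected = {"male", "female"}
    observed = set(new_batch["gender"].dropna().unique())
    assert observed <= expected, f"unexpected categories: {observed - expected}"

    # Assumption 2: the male/female balance is still roughly what it was
    # when the model was built (proportions recorded at training time).
    training_proportions = {"male": 0.51, "female": 0.49}
    counts = new_batch["gender"].value_counts()
    observed_counts = [counts.get("male", 0), counts.get("female", 0)]
    expected_counts = [p * sum(observed_counts) for p in training_proportions.values()]
    _, p_value = stats.chisquare(observed_counts, expected_counts)
    assert p_value > 0.01, f"gender balance has shifted (chi-square p = {p_value:.4g})"
```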
Kirill Eremenko: 00:57:16
Gotcha. Gotcha. Wow. That is such an insightful tip. So, when you're building a model originally, note down all the assumptions that you're putting into your model, and not just on paper: actually build out the tests, build out the code to verify them. Then, with the help of tools like what you're building at Oracle, this data science workflow, you're going to be able to capture, at a moment in time, what your tests told you when you originally built the model.
Kirill Eremenko: 00:57:46
Then you can rerun them again after six months and explain the model drift through the exploratory data analysis, or data unit testing, that you had prepared for this specific validation.
John Peach: 00:58:04
Exactly. And it doesn’t have to be six months down. When I was at Amazon, we implemented this and we had a batch processing where every night we re-updated a bunch of stuff and we ingested a whole bunch of new data. And all we did was run the tests on the new data.
Kirill Eremenko: 00:58:18
Wow.
John Peach: 00:58:19
So, we would know ahead of time. It wouldn’t stop production. It would still go to production, but I would get an email saying, “Hey, there’s been this shift.”
Kirill Eremenko: 00:58:27
Okay, interesting. What happens if there is no shift in your data unit testing, but your model does drift, what does that mean?
John Peach: 00:58:35
You probably didn’t write enough tests.
Kirill Eremenko: 00:58:37
I love it because there’s no other explanation. Right?
John Peach: 00:58:40
Exactly.
Kirill Eremenko: 00:58:41
Why would it drift if your underlying distributions in your data are the same, then it should be predicting the same way.
John Peach: 00:58:50
Exactly.
Kirill Eremenko: 00:58:51
It’s very interesting. That helps you go back, re-do your EDA and come up, okay, which tests do I need to add this time? And so, for the next drift, you’re more prepared.
John Peach: 00:59:02
Exactly. And you don’t have to redo the whole EDA. You basically know where the model drifted.
Kirill Eremenko: 00:59:08
Wow. Could you give us some examples of these data unit tests? You already mentioned the distribution of a category. What are some of the typical ones that come to mind?
John Peach: 00:59:17
So, there’s, you want to test data quality coming in, that’s a big problem. So like, I’m assuming this is an integer. I’m assuming that it’s between 1 and 15. I’m assuming that I will have no more than 3% loss of no value in this data. So, they’re all really, really simple tasks. And this is what software developers do. They write really simple tasks.
John Peach: 00:59:44
It’s like, call this function with a bad value and I’m expecting this error. If they’re not complex things they’re all one liners, it’s basically if something’s true. And so, you want to check all the assumptions that you have in the model. So, the distribution, the volume of data that you’re getting, that type of thing.
Kirill Eremenko: 01:00:06
Mm-hmm (affirmative). Mm-hmm (affirmative). Okay. Very interesting. A good diverse range of tests. How many tests would you say is required for… I know it will depend on the complexity of the model, but from your experience on average, how many data unit tests? Is it in the tens, is it in the hundreds, is it in the thousands that are required per model?
John Peach: 01:00:27
Every feature that you have, it would be 10, 20 tests at a minimum.
Kirill Eremenko: 01:00:35
Per feature?
John Peach: 01:00:36
Per feature.
Kirill Eremenko: 01:00:37
Okay. Gotcha. Well, it’s quite a lot. Quite a lot.
John Peach: 01:00:40
It’s quite a lot, but a lot of them are cut and paste.
Kirill Eremenko: 01:00:42
Yeah. Yeah, yeah.
John Peach: 01:00:43
Usually there’s not a huge difference from one feature to the next.
Kirill Eremenko: 01:00:47
Got you. Understood. Thank you very much. John, I just have one more question for you. So, today I learned from you data science design thinking in conjunction with the data science workflow, then about literate statistical programming, and about data unit testing.
Kirill Eremenko: 01:01:08
I want to understand, what drives you? It’s clear that you pick up these topics, you think about them, you develop them in your mind, you do talks and presentations. Where does this passion and these ideas come to you from? These three things you’ve been developing synchronously over the past couple of months, years, where do you find the drive?
John Peach: 01:01:34
Kind of a fear of boredom and being inefficient. I hate doing things twice and really, really, really dislike doing things three times. And so, I try to automate as much as I can. I also have a really poor memory. So, I have to document everything that I do. And that has pushed me in this direction of being a better scientist by not having the old traditional lab notebook where you write down everything that you did in the experiment. But taking that over into, I write down everything that I did by coding it. And then when I code it, what problems can I solve around that?
John Peach: 01:02:19
And that falls into these workflow issues that we’ve been talking about and the different techniques that I’ve been working on to solve those issues and communicate those issues out to people so that they can be better data scientists. I have a strong desire to help people get better at what they do because the end result is that I get better at what I do.
John Peach: 01:02:44
I want to be the best data scientist that I can be. I want to drive real value for the customer, and the faster I can iterate on the model and the better the question I can ask, the more value I'm going to bring. And that's exciting to me, and therefore I don't get bored.
Kirill Eremenko: 01:03:03
Fantastic. Love it. Thank you so much, John. Very insightful and inspiring as well. I’m sure a lot of people will pick up inspiration from here. Thank you so much.
John Peach: 01:03:13
Well, thank you very much for having me on. I very much enjoyed this.
Kirill Eremenko: 01:03:15
Awesome. Thanks, John. John, before you go, please tell us, where can our listeners contact you, connect with you, or find out more about the amazing products that you guys are developing at Oracle?
John Peach: 01:03:28
Yeah. So, the best place to connect with me is on LinkedIn. If you search for John Peach, Peach just like the fruit, you'll either come across me or my father. He's the handsome one, but I'm the younger one. The address for that is linkedin.com/in/jpeach. And to find out about the Oracle service, just go to oracle.com and sign up for an account. And when you log in, there's the data science service on the left hand side of the console menu.
Kirill Eremenko: 01:04:00
Awesome. Fantastic John, and one final question for you today, what’s a book that you can recommend to our listeners?
John Peach: 01:04:07
Yeah. I read a lot of books, and I think one of the ones that had the most impact on me in my early days was a very popular book called Introduction to Statistical Learning: With Applications in R. It's not really an R book; the problems at the end of each chapter are in R. It's written by Trevor Hastie and Robert Tibshirani, who are at Stanford, along with the first authors, Gareth James and Daniela Witten.
John Peach: 01:04:39
The book is really good in that it provides a lot of intuition into how different algorithms work. It focuses on statistical learning, which is like basically all of data science except for machine learning; basically, machine learning based around statistics. And it doesn't ignore the math, it gives you the math equations, but it doesn't do proofs and stuff like that.
John Peach: 01:05:07
It’s great for understanding how a decision tree really works and what’s the bias-variance trade-off? And why were random forest built to compensate for some of the problems of decision trees and what’s the trade-off in a random forest, that type of thing? It’s, I think, a great book. You can download it for free or you can buy it on Amazon. And again, it’s called Introduction to Statistical Learning: With Applications in R.
Kirill Eremenko: 01:05:32
Fantastic. Fantastic. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, the book.
John Peach: 01:05:40
Yeah.
Kirill Eremenko: 01:05:41
Wonderful. John, once again, thank you so much for coming on the show and sharing your insights. It was a pleasure.
John Peach: 01:05:45
Thank you very much. Appreciate it. You have a great day.
Kirill Eremenko: 01:05:46
Thank you very much everybody for sharing this hour with us. I hope you enjoyed this episode and got as much value out of it as I did. I learned quite a lot of interesting things. I love the whole notion of putting science back into data science or putting science into data science. My favorite part was the data unit testing that we discussed at the end. When you think about it, it’s such an obvious tip, but somehow it’s not as popularized as it should be.
Kirill Eremenko: 01:06:29
Why not have these statistical tests, or, as John calls them, data unit tests, that will help you better understand model drift when, or if, that does happen? Or not if, it will happen. When it happens, you'll have these data unit tests to explain your model drift. I'm very excited about what they're creating at Oracle. The whole notion of the data science workflow can help capture these data unit tests, among other things, at a moment in time.
Kirill Eremenko: 01:07:01
So, to help you then look back at what was going on: what was the dataset? What was the model version? What were my comments? What were the results of these statistical tests, and so on? So, very exciting times. And thank you very much to John for coming on and sharing his insights. As always, you can get the show notes for this episode at www.superdatascience.com/393, that's www.superdatascience.com/393.
Kirill Eremenko: 01:07:27
There you’ll find the transcript for the episode, any materials we mentioned on the show, and of course, a URL to John’s LinkedIn, make sure to connect, and a URL to where you can check out more about the solution that Oracle is working on, the data science workflow. And you can sign up if that is something that you are interested in.
Kirill Eremenko: 01:07:49
If you enjoyed this episode, then make sure to share it with somebody you know, love and care about. Very easy to share, just send the link, www.superdatascience.com/393. On that note, I appreciate you being here today and I look forward to seeing you back here next time. Until then, happy analyzing.