Jon Krohn: 00:00:00
This is episode number 629 with Dr. Jodie Burchell, data science developer advocate at JetBrains. This episode is brought to you by Iterative, the open-source company behind DVC.
00:00:11
Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple.
00:00:42
Welcome back to the Super Data Science Podcast, we’ve got the magnificent Dr. Jodie Burchell on the show today. Jodie is the data science developer advocate for JetBrains, the beloved developer tools company behind PyCharm, one of the most widely used integrated development environments for Python there is, and Datalore, their new cloud platform for collaborative data science.
00:01:02
Previously, Jodie was a data scientist, and later lead data scientist, at several tech companies, developing specializations in search, recommender systems and natural language processing. She co-authored two books on data visualization libraries, namely The Hitchhiker’s Guide to Ggplot2 and The Hitchhiker’s Guide to Plotnine. Prior to entering industry, Jodie was a postdoctoral fellow in biostatistics at the University of Melbourne, and she holds a PhD in psychology from the Australian National University. Today’s episode is primarily intended for a technical audience as it’s packed with practical tips and software for data scientists. In this episode Jodie details what a data science developer advocate is, and why you might want to consider it as a career option yourself; how to work effectively, efficiently and confidently with real-world data; her favorite Python libraries, such as ones for data vis and natural language processing; how to have reproducible data science workflows; and the subject she would have majored in if she could go back in time. All right, you ready for this stellar episode? Let’s go.
00:02:09
Jodie, welcome to the Super Data Science Podcast. Where in the world are you calling in from?
Jodie Burchell: 00:02:15
I am coming from Berlin today.
Jon Krohn: 00:02:17
Nice. So Jodie, you hold a PhD in psychology, where you researched topics ranging from infidelity to child stress. And for many years you worked in several research roles leveraging healthcare data from Australian hospitals, in use cases from mental health, to cancer, to cardiology. How did these experiences shape your understanding of real-world data?
Jodie Burchell: 00:02:42
Yeah, that’s a great question. Basically, when you sort of do an education in psychology in Australia, because we can’t directly measure anything, like how do you measure love for example, or infidelity, or jealousy, or hurt? Like, how do you measure these things and be relatively sure that you’re measuring them? We get drilled into our head from basically our first year that you really need to have rigorous methodology and statistics training. We do classes in a topic called psychometrics, which is all about how to measure, and you basically have it drilled into you like, you need to be super careful about firstly how you collect your data and then how you prepare it. So I had so many classes where we were taught how to look for missing data, or we were taught how to screen for the relationships between variables, we were taught how to work out if we were validly measuring something.
00:03:37
So that was sort of the background, and it’s actually how I got into data science. I completely fell in love with the statistics and methodology part of psychology, which is not usual. Then sort of when I got past my PhD, when I was in my postdoc, that was my first time working with what’s called routinely collected data in this context, what we would normally work with in data science. And that was really interesting for me, because I didn’t actually understand the origin of some of these fields, I didn’t understand the intent. So that taught me a lot of lessons about firstly applying that stuff that I’d learned, you know?
Jon Krohn: 00:04:13
Mm-hmm.
Jodie Burchell: 00:04:14
Can I be sure that this is measuring what I think? But also a lot of lessons about how to process data to get the information you need out of it, rather than having to collect that yourself.
Jon Krohn: 00:04:27
Cool. And so, are there a lot of big differences in your experience between the kinds of real-world data that you get from psychology studies for example, relative to the kinds of datasets that you see in statistics or methodology or machine learning textbooks?
Jodie Burchell: 00:04:49
Yeah, yeah. This is something I’ve really been noticing, so obviously when you’re learning machine learning you are going to start with the algorithms, because there’s a lot to cover and you really need to sort of have very gentle datasets. So if you’re going to learn clustering, you’re going to learn it on Iris. If you’re going to learn regression, you’re probably going to learn it on the Boston Housing dataset. And these datasets have been curated within an inch of their lives, they are beautiful, they’re just like, the clusters fall out perfectly. But when you come to working with real data, oh, man, it’s just a mess, and you really need to be, I guess, aware of how to get data to tell you the story that it wants to tell you, because it won’t necessarily be obvious initially.
00:05:35
So yeah, there’s a lot of also quite nasty pitfalls that real-world data can have that can actually completely mess up your models and your analyses. And yeah, they are definitely not there in those toy datasets.
Jon Krohn: 00:05:50
Right. So then I guess it’s pretty important for data scientists to be involved in how the data are collected, in the data generation process?
Jodie Burchell: 00:06:01
Yes, or at least have someone at the company. Because most of us who are working in industry, you’re going to be working with data that was not… It wasn’t collected for your models, like you’re important but you’re not that important. So basically you’re going to be using data that was collected, I don’t know, as part of the core running of the business, or that was collected for, I don’t know, some other monitoring. And you may not necessarily be involved in the collection of that data, you might not even have a say. But you at least need to, in the best cases, have some sort of definition, or someone to explain how that data was created, and the sort of I guess assumptions behind how it was put together.
Jon Krohn: 00:06:51
Mm-hmm, mm-hmm. As people are preparing data themselves, or collecting data that somehow their company is generating, maybe through user behaviors, that kind of thing, are there particular challenges or particular tips that you have for data preparation using these real-world datasets?
Jodie Burchell: 00:07:12
Yeah, so I think there’s… Probably the first, most fundamental thing is actually making sure that the things that you think, or the fields that you’ve collected are measuring what you think they measure. So for example, you might have a field, and it has, I don’t know, a particular name, it could be engagement or something, and you maybe make an assumption about what that means. But it may not mean exactly what you think it does, so maybe one of the data engineers can explain how that field is created, or one of the BI analysts. There may also not be someone there to explain what this field actually means, and that’s when you have to get a bit creative and maybe use statistical techniques.
00:08:01
So for example, you would expect that engagement, if we’re talking about something like click behavior or purchasing behavior, would correlate really highly with either of those. And if it doesn’t, then it’s probably not measuring exactly what you think it does.
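To make that concrete for readers following along, here is a minimal sketch of that sanity check in pandas, using entirely made-up data (the engagement, clicks and purchases columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: if "engagement" really reflects user activity,
# it should correlate strongly with clicks and purchases.
df = pd.DataFrame({
    "engagement": [0.1, 0.4, 0.5, 0.9, 1.0],
    "clicks": [2, 10, 12, 25, 30],
    "purchases": [0, 1, 1, 3, 4],
})

# Correlation of the opaque field with the behaviors it should track.
corr = df.corr()["engagement"]
print(corr.round(2))

# If no behavioral proxy correlates with it, the field probably does
# not mean what its name implies.
suspicious = corr.drop("engagement").abs().lt(0.3).any()
print("field looks suspicious:", suspicious)
```

On real data you would of course use far more rows, and treat a weak correlation as a prompt to go ask the data engineers, not as proof by itself.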
Jon Krohn: 00:08:18
Yep, that is a really good one. And then given your background in Australia, particularly working with healthcare data, what’s the quality of data like from that scenario? It’s a genuinely open-ended question, I don’t have any biases going into what Australian healthcare data may or may not be like.
Jodie Burchell: 00:08:39
You’ve got no agenda, you’re not attacking the healthcare industry in Australia, yeah. So there’s an interesting thing in Australia, which is that healthcare is actually handled by the states. And so each of the states has their own way of justifying the funding that they get from the federal government. So I was working in Victoria, for those of you who are not familiar with Australian geography that’s where Melbourne is. And Melbourne, or Victoria, has what’s called casemix funding. So basically what this is is, the reason you collect these sorts of records is so you can go, “Okay, this particular person was admitted with this, they had this particular procedure done, and this procedure pays this certain amount.
00:09:22
“So we did this amount of procedures, so the government should give us X amount of money for that for the next year.” So this is where our data came from. So it’s actually really high quality, because it’s audited, because it’s really important, because it’s how they get funding. But, there was a very interesting thing that happened with that data. So, because this is obviously like some of the most sensitive data you can work with, the departments that handle it actually break the data down into de-identified datasets. So you don’t have like patient identification numbers, what we would call our Medicare number linked to each patient record. So you have to do probabilistic matching, and sometimes it fails and you’ll have people who died multiple times.
00:10:13
It’s not really funny that they died, but you’re like, “Okay, there’s like a little bit of error in this fuzzy matching of compiling one person’s hospital records.” But generally it’s pretty reliable, but that’s a small amount of error in it.
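As a toy illustration of the kind of probabilistic matching Jodie mentions (not the departments’ actual algorithm, and with invented records), one can score similarity on quasi-identifiers and accept pairs above a threshold:

```python
from difflib import SequenceMatcher

# Toy sketch: compare de-identified records on quasi-identifiers and
# accept a pair as "the same person" when similarity clears a threshold.
def similarity(a: dict, b: dict) -> float:
    key_a = f"{a['surname']}|{a['dob']}|{a['postcode']}"
    key_b = f"{b['surname']}|{b['dob']}|{b['postcode']}"
    return SequenceMatcher(None, key_a, key_b).ratio()

rec1 = {"surname": "nguyen", "dob": "1954-03-02", "postcode": "3052"}
rec2 = {"surname": "nguyin", "dob": "1954-03-02", "postcode": "3052"}  # typo
rec3 = {"surname": "smith", "dob": "1987-11-30", "postcode": "2000"}

THRESHOLD = 0.9
print(similarity(rec1, rec2) > THRESHOLD)  # likely the same person
print(similarity(rec1, rec3) > THRESHOLD)  # likely different people
```

Real record linkage uses calibrated match weights per field rather than a single string ratio, and the occasional mismatch is exactly where “people who died multiple times” come from.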
Jon Krohn: 00:10:27
Interesting. But Jodie, leveraging your rich experience, some of the PhD stuff that you’ve talked about, and you’ve also had other roles as a data scientist which we’ll dig into a bit later in the episode. That rich experience as a data science practitioner has led you now to be a data science developer advocate at JetBrains. And so for listeners who aren’t aware of JetBrains, JetBrains is one of the best-known companies for developing developer tools, not just for data scientists but for software developers in general as well. Probably the best-known tool of all from JetBrains is PyCharm. At my company Nebula, we have a PyCharm license for every software developer in the company; we think it’s an invaluable tool for allowing people to write code more easily than they otherwise could, or we wouldn’t be doing this.
00:11:20
And this is really a bottom-up thing, this is like, everybody in the company wants a license to PyCharm, and we’re very happy to provide it to them, because the productivity lift is well worth the cost of the tool. But then in addition JetBrains has developed other products, DataSpell and Datalore. So maybe it would be helpful, Jodie, to talk us through PyCharm in more detail than I could, and then also introduce us to DataSpell and Datalore.
Jodie Burchell: 00:11:56
So PyCharm, as Jon already gave a lovely introduction to, is the flagship Python product at JetBrains. Basically PyCharm is an all-in-one package for Python engineering needs, and this does include some data science support. But over the years what we realized is we have a lot of data scientists in this field, I’m one of them, as you can see my background is not engineering, who are not amazingly comfortable with heavy engineering tools. And the company realized that we needed to develop tools that were much more fit for purpose for data scientists. So Datalore and DataSpell are both our data science tools, and this is a relatively new initiative by the company, it’s just been the last couple of years.
00:12:44
So I’ll really talk about DataSpell first, but I wanted to kind of get into Datalore a little bit more. So DataSpell, the way I describe it is it’s like the little sibling of PyCharm, specifically focused on data science. So it’s much more lightweight, and we have a really heavy focus on Jupyter capabilities, so there’s a lot of support for really cool sort of Jupyter addons. One feature I really, really like is basically, I’m so used to working in Jupyter notebooks at this point, I find it actually really frustrating to work in Python scripts because I’m really not on the engineering side of things, I never have been. And there’s this really cool feature within DataSpell, and you can also do it in PyCharm as well, where you can add these separator lines, like sort of a special string, in the middle of your Python scripts.
00:13:34
And then you can execute the script not against the non-interactive shell, but against a Python… Sorry, against a Jupyter console, and that basically allows you to interact with the Jupyter variables that have been created without messing with your Python script and having to comment things out. So yeah, that’s a really nice feature. Yeah, it’s cool.
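The separator lines Jodie describes are cell markers in an ordinary Python script, conventionally written `# %%`. A minimal sketch of what such a script looks like:

```python
# A plain .py file split into cells with "# %%" markers. In DataSpell or
# PyCharm (and several other editors) each block can be run on its own
# against a Jupyter console, so variables stay live between runs.

# %%
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})

# %%
# Re-run just this cell while iterating; df persists in the console.
summary = df.describe()
print(summary.loc["mean"])

# %%
# The file is still a valid script: `python script.py` runs top to bottom.
```

Nothing needs commenting out: the markers are plain comments, so the same file works both interactively and as a batch script.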
Jon Krohn: 00:13:54
Yeah. This episode is brought to you by Iterative, the open-source company behind DVC, one of the most popular data and experiment versioning solutions out there. Iterative is on a mission to offer machine learning teams the best developer experience, with solutions across model registry, experimentation, CI/CD training automation, and the first unstructured data catalog and query language built for machine learning. Find out more at www.iterative.ai, that’s www.iterative.ai. And so to kind of summarize what you just said back to you, so PyCharm is a software engineering IDE designed for serious software development.
Jodie Burchell: 00:14:42
Yes.
Jon Krohn: 00:14:43
And yeah, for the most advanced software developers on the planet, it’s often their first choice to work with. Lots of bells and whistles to make a lot of development easier, to allow for lots of efficiencies for developers. But all of those bells and whistles can also, like you’re saying, feel heavyweight, and feel foreign to data scientists. Particularly as PyCharm, notwithstanding the example that you just gave where you can have special lines that work more like the interactive Jupyter Notebooks, is designed for handling Python scripts, which is the way that software developers are used to working with code. It’s not necessarily the way that data scientists are used to working with code, unless they’ve come from a software developer background.
Jodie Burchell: 00:15:38
Exactly.
Jon Krohn: 00:15:39
So data scientists tend to be, the way that we tend to learn, and the way that we tend to be taught, is in Jupyter Notebooks these days, that’s kind of like the modern way of doing things, and these Jupyter Notebooks are highly interactive, they lend themselves really well to data analysis, data science, because you can explore variables very easily, you can plot variables very easily, you can print out tables of information very easily. And all these kinds of things, they’re not ways that software developers think. So even though PyCharm does have tricks for allowing that kind of interactivity, it’s not the kind of native way of working in it. DataSpell, on the other hand, takes all of the best kind of ideas around an integrated development environment, an IDE that you have in PyCharm, but it’s Jupyter Notebook-first, it’s interactivity-first.
00:16:35
Allowing data scientists to have that experience that they’re familiar with, realtime execution, and yeah. So, that makes a lot of sense to me. Now, tell us about this Datalore product. So we’ve covered PyCharm, the engineering IDE, we’ve talked about DataSpell, which is the IDE for that kind of interactive Jupyter experience. And then yeah, what’s this latest tool, Datalore, all about?
Jodie Burchell: 00:17:03
Yeah, so Datalore is a little bit of a new direction for JetBrains. Most people would know us for our locally-downloaded, individual use IDEs. Whereas, Datalore was designed to overcome these problems where you need to set up data science infrastructure for teams. In my previous job we were doing that using JupyterLab, which is quite a good solution but it does have some limits. And if you’re doing it locally, it does require that you generally will have an ops team that’s going to manage those resources for you. So Datalore is our own cloud-based solution which is designed to do everything that JupyterHub does, but manages a lot of these things like connecting to databases, connecting to cloud buckets, getting access to resources in a much easier way.
00:17:58
And it also allows realtime collaboration, so team members can enter your notebooks and actually see everything that you’ve been working on up to that point. So, that whole problem of being able to share results and collaborate with team members. And then in addition to that, you can pivot reports off what you’ve been doing. So you can basically, directly from a notebook take all of the outputs, choose which ones you want, arrange them in a dashboard, and that becomes an interactive or non-interactive report that you can share with anyone. So it’s this sort of like, all-in-one tool that’s designed to overcome these issues that you have in the research space of data science, and take a lot of those annoyances that come with dealing with the ops side of things, or I don’t know, data privacy and access that can come up in organizations, and just build them all into one tool.
Jon Krohn: 00:18:56
Nice. Okay, yeah, so I understand that now. So PyCharm is this local engineering IDE, DataSpell is an IDE designed to be used locally by data scientists, and then Datalore is now a cloud-first tool, it’s a collaborative Jupyter environment for teams. It’s kind of like JupyterHub, but it doesn’t require a DevOps team in order to be implemented.
Jodie Burchell: 00:19:25
No.
Jon Krohn: 00:19:25
So then we’ve already talked in this episode a lot about best practices for finding and collecting and using data. How can Datalore help with some of these challenges?
Jodie Burchell: 00:19:40
You can probably tell from my background that a big part of my process when I start working with data is, I spend a lot of time screening for these little pitfalls and traps that I talked about earlier. And I also just spend a lot of time getting to know the data, because I think I said it earlier but I feel like before you start modeling you should try to understand the story that your data’s trying to tell you. And it won’t all become obvious, because obviously you have complex interactions in models, but you can get a sense of what story is going to be there. And so something I really like about Datalore is out the gate, it’s sort of pushing this exploration. So immediately when you load in a data frame, let’s say we’re working in a Python kernel.
00:20:28
You will have several tabs which will appear when you print out the data frame. And so the first is just the raw data, but something that’s really cool is you can scroll through that data with full interactivity. And we now have the ability where you can also do exploratory stuff like sorting, or filtering, and it’ll actually export those changes that you’ve done to code in another cell.
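What “export those changes to code” amounts to, roughly, is that the point-and-click sort or filter becomes an ordinary pandas expression in a new cell. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"country": ["AU", "DE", "AU"], "sales": [3, 7, 5]})

# The code equivalent of clicking "filter to AU, sort by sales descending"
# in an interactive data frame viewer.
result = df[df["country"] == "AU"].sort_values("sales", ascending=False)
print(result)
```

Because the interaction ends up as code, the exploration stays reproducible and shareable.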
Jon Krohn: 00:20:54
Cool.
Jodie Burchell: 00:20:55
So, just from the start it’s prompting you to play.
Jon Krohn: 00:20:57
Wow, so you can interact with the platform like a click and point tool, and then it automatically converts that into code so that it’s reproducible and easily sharable with your colleagues?
Jodie Burchell: 00:21:11
It’s so cool. Like the whole tool, I feel like, has reproducibility and data exploration in mind. Apart from the collaborative aspects, which are also a strength. I just think it’s so brilliant. And then the other kind of cool thing is, so I’ve told you that you have the tab with the raw data frame and the infinite scroll. But there are two other tabs which are amazing, so the first is my favorite, which is the statistics tab obviously. So for every single field in your data frame, you’ll basically get a whole bunch of important statistics. So you’ll get things like the outliers, what’s the min, what’s the max, what’s the distribution? It even has a little violin plot for your continuous variables, and you have the counts of the levels of the categorical variables.
00:22:00
And the amount of missing data of course, which is also important. So at a glance you’re basically able to go, “Ooh, that doesn’t look right. That count data shouldn’t have negative values,” or you know, “That value seems pretty implausible.” The other cool tab is a visualization tab, so just off the data frame that you’ve just read in you can do point and click, and start creating visualizations to explore it. One of those that I particularly like is a pairwise correlation chart, so you’re basically able to get the strength of the relationship between every single continuous variable just at a glance.
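For readers without Datalore to hand, the same checks can be approximated manually in pandas. A sketch, with a deliberately flawed toy dataset:

```python
import numpy as np
import pandas as pd

# Toy data with the kinds of problems a statistics tab surfaces:
# a missing value and an implausible negative age.
df = pd.DataFrame({
    "age": [34, 29, np.nan, 41, 23, -5],
    "ward": ["icu", "general", "general", None, "icu", "general"],
})

numeric = df.select_dtypes("number")
print(numeric.describe().loc[["min", "max", "mean"]])  # the min of -5 stands out

# Level counts for categorical fields, including missing values.
for col in df.select_dtypes(exclude="number"):
    print(df[col].value_counts(dropna=False))

print(df.isna().sum())      # missing data per field
print((numeric < 0).sum())  # values that shouldn't be negative
```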
Jon Krohn: 00:22:42
Great.
Jodie Burchell: 00:22:43
And similarly, you can export those charts using the underlying code, and you can start customizing those further.
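Outside Datalore, the pairwise-correlation view can be approximated in one call; the chart export Jodie describes would give you code along these lines (synthetic data for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic columns: one strongly related pair, one unrelated column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "mostly_x": x + rng.normal(scale=0.3, size=200),
    "noise": rng.normal(size=200),
})

# Strength of the relationship between every pair of numeric variables.
corr = df.corr()
print(corr.round(2))

# With seaborn installed, sns.heatmap(corr, annot=True) renders this
# matrix as the kind of chart described above.
```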
Jon Krohn: 00:22:51
Very cool, yeah. I can see how this would make working with data frames really easy. So right now, not using Datalore, when I’m in a Jupyter notebook I have to kind of think through, or maybe have even built up myself these functions of, okay, what are the key stats that I want to look for in any variables in my data frame? And then I have to also think through like, “Oh yeah, so the things that I need to look for with categorical variables are different than ones with continuous variables.” So, you kind of have to have these different sets of functions that you’re used to using and exploring the data with. And I think the thing that probably happens most often, and I’m guilty of this myself, is that instead of having a rigorous function library that you’ve built yourself, you do it off the cuff.
00:23:39
You’re just like, “Oh, let’s look at these data, okay, I’ll look at it this way.” But that doesn’t have the kind of rigor that… People at JetBrains have already spent a lot of time thinking about, for a continuous variable, for a categorical variable, what are the key statistics that you should know? And then you can just go to that stats tab and have all that information right away. So it isn’t as if, using Python, it’s enormously arduous to figure out what the min and the max of a variable are. But it’s just something that you kind of have to remember to do, and if you don’t remember to do it you might not notice that, oh, there’s negative values in here? That doesn’t make any sense.
Jodie Burchell: 00:24:22
Yep. Yeah, like it’s not inviting. I love pandas, do not get me wrong, I think we all do. But it’s not inviting to do it, and it also… If you’ve got a big data frame, that output is so long. Like my usual workflow outside of Datalore would be literally just, divide up the categorical and continuous variables, and then just do a loop for the value counts for the categorical ones, and you just end up with this ridiculous amount of output. And you’re just like, “Oh God, this is tedious.”
Jon Krohn: 00:24:55
Nice. So yeah, that sounds like an invaluable tool for getting those key stats, and then same kind of thing for the visualizations, to just be able to go to a visualization tab. And once again, it isn’t enormously arduous to grab your favorite plotting library, your matplotlib or seaborn, and make a histogram, or decide to do a pairwise correlation chart. But even with a relatively simple dataset, it’s going to take a few hours to do that, and to try to think comprehensively about all the key ways of visualizing individual variables, as well as interactions between those variables. So yeah, this sounds like a brilliant tool to save a lot of time for data scientists.
00:25:44
It sounds pretty similar to the PyCharm situation, where I know that the lift that the developers at my company get is enormously greater than the cost of a license each year. And this is a perfect example, with Datalore it sounds like you could basically offset the cost of it in a day of a [inaudible 00:26:07], yeah.
Jodie Burchell: 00:26:09
Yep. It’s also like, because I pretty much primarily use those three tools that I talked about, when I have to go back to vanilla Jupyter without all the code completion, and the built-in documentation, all these extra nice tools, it’s like, “Ah, how did I work this way for so long?” You just, yep.
Jon Krohn: 00:26:31
Yep, yep, yep. That’s so true, the code completion and the documentation references, it is so critical. I spend a lot of time using Google Colab, and that code completion, yeah, it’s really… Like the hover-over and bit of information, yeah, those kinds of things, enormously valuable. But yeah, the interesting thing about Google Colab is despite its name, it actually isn’t amazing for collaborating. And then the other big thing about Google Colab is that you’re stuck with whatever libraries are loaded in it at any given time, so that’s like… Yeah, I think that’s a big difference here, like that lack of flexibility around selecting exactly what software libraries are available in it.
00:27:21
So Colab is amazing for teaching, I love teaching with it because I can open up Google Colab, I can ask the students in the class to open up Google Colab, and they can either upload their notebook or they can just start typing, and we’re all on the same page. But the big downside of that is some day, I’m going to get burned really bad by coming into a lecture that I’m giving with hands-on tutorials that I’ve prepared well, and they’ve worked for years, and they’re just not going to work because a library version under the hood will have changed. And so I haven’t been burned really badly by that yet, but I know it’s just a matter of time. Usually [inaudible 00:28:04] that happens, I can like figure it out. I’m like, “Okay, there’s like a warning, so let’s quickly resolve this.” You don’t have to take like a coffee break.
Jodie Burchell: 00:28:13
Yes, yes. And it’s actually, so not to kind of say that Datalore’s doing everything better, but another thing that I really do love that Datalore does, because reproducibility’s really important to me, I think it’s a super important topic. And obviously one of the major ways in which analyses can become non-reproducible is your environment, it can no longer be replicated. And I find some packages are so fragile, I would say particularly TensorFlow can be very fragile.
Jon Krohn: 00:28:47
Oh, yeah.
Jodie Burchell: 00:28:51
Yeah. And so a really nice thing about Datalore is the environment is one-to-one with a notebook, and it’s completely fixed, it’s fixed to the Python version, it’s fixed to all of the versions of packages you installed. And you’ve got complete freedom, you can install whatever you like as long as it’s a Python package.
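Datalore pins all of this for you; in an ordinary notebook environment, a rough stand-in (a sketch, with a hypothetical pin list) is a guard cell that fails fast when the environment has drifted:

```python
import sys
from importlib import metadata

# Check the interpreter before any analysis runs.
assert sys.version_info >= (3, 8), "notebook was written for Python 3.8+"

# Hypothetical pin list; in practice the names and exact versions would
# come from a requirements or lock file checked in with the notebook.
pinned = ["pandas"]
for pkg in pinned:
    print(pkg, metadata.version(pkg))  # compare against the pinned version
```

This only detects drift rather than preventing it, which is the gap a fixed per-notebook environment closes.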
Jon Krohn: 00:29:09
Right, that’s really cool. All right, so that gives us a really good sense of some of the collaboration features that Datalore has. Is there anything else that we haven’t mentioned that’s pretty critical?
Jodie Burchell: 00:29:22
I would say some other just nice little features it has, I’ve never seen this anywhere else, but you can actually use SQL and Python in the same notebook. So basically, you can have these native SQL cells, and they’ll have code completion, and they’ll have syntax highlighting. And then as soon as you’ve completed that query, it’ll dump out the results into a pandas data frame, so you can start using it immediately.
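Outside a tool with native SQL cells, the pattern Jodie alludes to is a query string passed through a connector into pandas. A self-contained sketch using an in-memory SQLite database:

```python
import sqlite3

import pandas as pd

# Stand-in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 9.5), (2, 20.0), (3, 12.5);
""")

# The query result lands directly in a DataFrame, ready for analysis.
df = pd.read_sql_query("SELECT * FROM orders WHERE amount > 10", conn)
print(df)
conn.close()
```

A native SQL cell removes the string-wrangling step but ends in the same place: a pandas data frame.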
Jon Krohn: 00:29:50
Nice.
Jodie Burchell: 00:29:50
Which I really like, because writing it-
Jon Krohn: 00:29:53
That’s what you were going to do with it anyway.
Jodie Burchell: 00:29:55
… in a string… Exactly, because you’ve probably written it as part of a connector package, and you’re going to use pandas to read that in. What else is nice? So we have the realtime collaboration, as I talked about. And so that means, basically like Google Docs, you can work in the same cell at the same time.
Jon Krohn: 00:30:13
Oh.
Jodie Burchell: 00:30:14
Yeah, yeah, yeah.
Jon Krohn: 00:30:17
Wow, that’s cool.
Jodie Burchell: 00:30:18
So cool. And okay, imagine this, imagine I just spent five hours training a model. You can basically come into my notebook, and because that’s a Jupyter variable you can start using my model and making predictions from it without needing to do a thing. Because, you have essentially entered the exact environment that I am currently in.
Jon Krohn: 00:30:38
Oh wow, that’s wild.
Jodie Burchell: 00:30:40
Yeah, it’s-
Jon Krohn: 00:30:40
I’ve never had an experience like that before, that sounds really cool.
Jodie Burchell: 00:30:43
It’s really fun.
Jon Krohn: 00:30:44
That would be so amazing on a Zoom call, trying to debug some aspect or just explore some data together, yeah, yeah.
Jodie Burchell: 00:30:51
Exactly, yes.
Jon Krohn: 00:30:54
In this now increasingly remote world that data scientists and software developers live in. That sounds like a dream, wow.
Jodie Burchell: 00:31:03
This is a thing, I wish I had this at my last job, because we were at home most of the time, because it was still COVID days. There were times where I needed to onboard onto a project, or help out someone with an analysis, and we’d basically have to go and open each other’s notebooks, and then re-run everything, and that was an okay solution but it definitely wasn’t optimal. So the last thing I wanted to talk about is access to compute resources, because I think this is a real pain in the butt for most data scientists. So pretty much the way that it’s set up in Datalore is you can attach different types of machines depending on your plan, and if you like, some of those machines can be GPU machines.
00:31:52
So that means that within the same notebook you can basically connect to a different type of machine just with a click of a button. So in [inaudible 00:32:01] to go and do some complicated setup, you can just with one click get access to GPU, and then when you don’t need it switch back to a CPU machine.
Jon Krohn: 00:32:12
Right, very cool. What do you think about the Super Data Science Podcast? Every episode I strive to create the best possible experience for you, and I’d love to hear how I’m doing at that. For a limited time we have a survey up at www.superdatascience.com/survey, where you can provide me with your invaluable feedback on what you enjoy most about the show, and critically, about what we could be doing differently, what I could be improving for you. The survey will only take a few minutes of your time, but it could have a big impact on how I shape the show for you for years to come. So now’s your chance, the survey’s available at www.superdatascience.com/survey, that’s www.superdatascience.com/survey.
00:32:56
Yeah, so that could come in super handy if you’re training a deep learning model, like a machine vision model with convolutional neural nets, or with transformer architectures in natural language processing. These kinds of model architectures are really well-suited to GPUs, so you can often get a 10X speedup by popping over to GPUs as opposed to relying on CPUs alone. That’s because GPUs are adept at massively parallel processing, with hundreds or thousands of parallel cores doing the very simple matrix algebra operations that are abundant in those kinds of models, like convolutional neural networks and transformer architectures. So yeah, that sounds really cool. So you could be developing your model architecture in the typically much lower-cost CPU-only environment. And then you could get to a point where you’re like, “Okay, this code runs. Now let’s set some model training going, pop over to the GPU,” and set that running so that you’re only using the more expensive GPU when you need it.
Jodie Burchell: 00:34:07
Exactly. And it’s super nice, it’ll automatically reinstall your full environment on that GPU machine and CPU machine every time you transition, so you don’t need to do a thing.
Jon Krohn: 00:34:19
That does sound handy. There’s a data scientist on my team at Nebula, and so much of the time when I all of a sudden need to be using GPU resources, there’ll be some reason, and it’s always a new reason, why I can’t get it to work properly, and I have to get in touch with Grant and be like, “Grant, what am I doing wrong?” And he always makes me feel really dumb. He’s like, “Oh, why didn’t you do this?” I’m like, “Okay.” So he’s kind of like-
Jodie Burchell: 00:34:50
“Well, if I knew how to do that then I would have done it.”
Jon Krohn: 00:34:54
Exactly. I don’t know how he’s always on top of these things, but he’s a data scientist on our team, and he’s also like our resident MLOps expert, and [inaudible 00:35:04] constantly to be coming in. So I’ll get stuck late at night my time in New York, and then have to wait until he’s up in London the next day to get the GPUs running in some slightly new scenario that I’m unfamiliar with.
Jodie Burchell: 00:35:22
I feel your pain, I was constantly that person asking questions at my job.
Jon Krohn: 00:35:27
Yeah. So yeah, sounds really cool. So yeah, we’ve ended up digging a lot into Datalore features here, because they genuinely sound super useful to me, and hopefully to a lot of our listeners. So it’s like a JupyterHub replacement, but it doesn’t require a DevOps team to implement. You get key stats on all of your variables very easily, generated automatically, same kind of thing for visualizations. You get code completion in both Python and SQL. Unlike Google Colab, you can pin library versions to a specific notebook, and within that notebook’s environment you can flip between the resource scales that you want pretty easily, even if that means flipping from CPU-only to an environment that has a GPU, which can often be tricky in my experience.
00:36:21
And then I think the coolest thing of all, and I didn’t fully grasp this aspect of what Datalore does until just now, and I’m really excited about this, is the ability to collaborate in real time with your colleagues, just like you can on a Google Doc. I think that’s something that I’ve become so used to with so many Google Office products, is this ability to just see each other, and be able to work alongside each other very easily. Yeah, I mean I’m basically remote all the time now, so pre-pandemic it was something where I’d just spin around in my chair and work with a data scientist on my team on a problem. But now, something like this sounds brilliant for collaborating with them. So, thank you for giving us that tour, Jodie. Is there anything else that I missed, that you think we’d need to cover for listeners?
Jodie Burchell: 00:37:19
I don’t think so. I think I’ve given you the highlights of the product.
Jon Krohn: 00:37:26
Nice.
Jodie Burchell: 00:37:27
All I would really say is basically, if you want to try it we do have a free Community Edition. So I would highly recommend you go and register, and you can play with it… We obviously have some restrictions on the machine types and also some of the features. But it definitely will give you a good idea of how the tool feels, and personally I just think it’s so much fun. So yeah, I would recommend giving it a go.
Jon Krohn: 00:37:51
Nice, that’s a great tip, yeah. I mean, if you can try it out for free why not? And yeah, and it makes sense that you have some restrictions on resource limits, otherwise you’d just have a bunch of our listeners going out and mining Bitcoin.
Jodie Burchell: 00:38:03
Yes, true, true. We do also, if you need it… Obviously not everyone’s going to want to use our servers. We do also have some enterprise plans, but you can read about those on the website, and you can install it on basically your own servers, or on-prem if you like.
Jon Krohn: 00:38:22
Nice. Yeah, so on-prem or use the cloud resources, I guess?
Jodie Burchell: 00:38:27
Yeah, your own cloud resources. So if you’ve got an AWS, GCP, Microsoft Azure.
Jon Krohn: 00:38:32
Nice. Very cool, all right. So thank you for guiding us through that tour. Now, something that I’d love to know more about is your title and what it all means. So you’re a data science developer advocate, it’s been a long time since we’ve had anyone with a role like that on the show. So what does that mean exactly, and how is it related to a developer relations role, which I think is potentially something similar that I also hear about. And for our listeners out there who might be interested, what kind of data scientist or software developer becomes a developer advocate, how do you get into it?
Jodie Burchell: 00:39:19
Yeah, so developer advocacy is sort of different at each company. So basically what it means at JetBrains is we’re a point of contact between our community and the company. So what we do is, we go out and we talk to people in our community, we act as their kind of advocate within the company, hence the title, and we make sure that the tools that are being developed are actually suiting their needs. So you can see in my day-to-day, I’m using the tools extensively because I need to understand what works, what doesn’t. I also spend quite a lot of time going out and talking to people in the community, so I’ll go to meetups, I’ll go to conferences, I might work the booth in the conference.
00:40:09
But it’s also very important that I remain as a data scientist, so I also spend a lot of time researching topics and presenting on them. So some of those topics might be related to the tools, but a lot of the time it’s not. So for example I just spent two and a half weeks going between conferences and just doing conference presentations, and they had nothing to do with the tools really. So yeah, it’s sort of a way of making sure that obviously I keep my skills sharp, and I understand what people are actually doing. But it’s also a way for the company to give back as well, especially within Python it’s super important that we contribute back to a really huge, open-source community. Cool.
00:40:54
So yeah, to answer your question about the difference between developer advocacy and developer relations, my understanding is the titles are kind of interchangeable. You kind of get it from the name, like it’s about relationships, it’s about advocating. It really depends on the company, what they want their developer advocates to do. So I work a little bit with our marketing department, but I by design stay separate from them. Whereas in some companies, you will have a much closer relationship with the marketing department. So I think it’s whatever your company needs, the size of the company, and the choice of the individual advocate, what they like doing.
Jon Krohn: 00:41:40
Nice. So it sounds like potentially a great opportunity for listeners who have already become expert as data scientists. So like, you have an extremely strong background in data science, you have a PhD in applying data science to real-world data, and then several data science roles over the years at a number of great companies. And so you can then move from that kind of role into one where you’re keeping sharp, which sounds really cool, I’d love to just be able to invest time every week into being sharp on my data science skills. And then you get to present on those to audiences at conferences, on podcasts like this, and it’s awesome that at least in your developer advocacy role you get to have this great level of independence where you’re not tied directly to marketing, and where you can be doing presentations it sounds like on whatever you think is really exciting right now. It doesn’t have to be directly related to products that JetBrains is developing.
Jodie Burchell: 00:42:44
Yeah. But to be honest, for me it’s kind of a dream job.
Jon Krohn: 00:42:49
There you go.
Jodie Burchell: 00:42:50
A friend of mine referred me to the company, and I remember about three months after I started telling her it feels like academia without the parts that I didn’t like as much, because obviously you don’t have to write grants, you don’t have to do publications, which I didn’t enjoy so much. Yeah, if it’s a job that you’re interested in getting into, I would definitely say just sort of establish with the company what they’re looking for. But if you’re a person who’s very self-directed, you’re okay working… Because you won’t really be on a team, you’ll be more by yourself. So if you’re okay doing that, and you really like the idea of constantly learning and also teaching, then I think it could be a really interesting role for anyone out there who yeah, as I said, already has found their niche, and wants to sort of see what their next career move could be.
Jon Krohn: 00:43:44
Awesome, all right. So in your journey towards your dream job, is there anything that you would have done differently?
Jodie Burchell: 00:43:54
Yeah. Do you know, I had this very kind of… When I left academia I had a lot of conflict about what I had done my PhD in, I was like, “Oh God.” Like, “I am so behind all of these people who have these engineering backgrounds.” So to be honest, I wouldn’t change my PhD now. I think actually those skills led me to exactly where I needed to be. But, it would have been nice if I had maybe done some more computer science or engineering-focused stuff during uni.
Jon Krohn: 00:44:29
I feel exactly the same. My PhD is in neuroscience, and I self-taught machine learning to be-
Jodie Burchell: 00:44:37
Yes, I know.
Jon Krohn: 00:44:37
… analyzing large datasets, identifying patterns in neuroscientific data that I was working with in my PhD. But yeah, if I could go back in time… And even things like during the PhD, I could have been doing computer science courses, like just auditing them-
Jodie Burchell: 00:44:57
I know.
Jon Krohn: 00:44:57
… or taking them online, I had so much time.
Jodie Burchell: 00:45:01
Yep. In Australia you can do dual degrees, so you can basically complete two bachelor’s degrees simultaneously.
Jon Krohn: 00:45:09
Very common, it seems like almost everyone from Australia has a law degree and something else.
Jodie Burchell: 00:45:14
Yeah. I didn’t do law, I didn’t take law.
Jon Krohn: 00:45:16
I know, it seems like something I see a lot though.
Jodie Burchell: 00:45:20
Yeah, yeah, yeah, because it’s like one extra year, and you get a whole other degree, so why would you not? But my extra degree, I loved it so much but it was in evolutionary biology. I have some great stories from that degree, but not so many transferable skills. So, really kind of wish I’d maybe done comp sci, or statistics, or engineering instead. But oh well, we live and learn. The decisions we make at 17, right?
Jon Krohn: 00:45:48
Yeah, you were wild, out of control, rebellious, studying psychology.
Jodie Burchell: 00:45:55
Yeah, studying, and evolutionary biology.
Jon Krohn: 00:45:59
Exactly.
Jodie Burchell: 00:46:00
Oh, no.
Jon Krohn: 00:46:01
Nice. So it’s nice to kind of get this insight into what you do differently, and it’s reassuring that it’s similar to what I would have wanted to do differently. We’ve talked a lot about your background, but amazingly we’ve gotten this far in the episode without talking about the books that you’ve co-authored. So you’ve written two, there’s the Hitchhiker’s Guide to Ggplot2, and The Hitchhiker’s Guide to Plotnine. These books are both interesting to me in different ways. I guess let’s start with ggplot2, because my question is more straightforward. So ggplot2 stands for grammar of graphics-
Jodie Burchell: 00:46:48
Graphics, yeah.
Jon Krohn: 00:46:51
… plot, yeah. So it’s this library that I loved from R, which I seldom use these days. If I do pull up R these days it is literally to use ggplot2, and some people have tried to develop ported versions of it for Python that are somewhat similar. But at least a few years ago when I was last looking at that, they weren’t very well-developed, they didn’t have all the functionality of ggplot. And so, I really miss that about R. So one, maybe you want to tell us a bit more about ggplot2 and why you like it so much. But then also, we haven’t talked about R in this episode at all. So I-
Jodie Burchell: 00:47:29
No.
Jon Krohn: 00:47:29
… don’t know if you want to give us some insight into when you’re using that, or if you’re using that at all these days.
Jodie Burchell: 00:47:36
Yeah. I think we probably had a similar path, because I started my data science career with R, because I started using it during my postdoc. And I was sort of getting into R when the whole tidyverse thing was taking off, so I think dplyr had come out relatively recently when I started using it. And I think actually the piping operations had just been introduced, so yeah.
Jon Krohn: 00:48:06
Yeah, yeah, yeah, yeah, yeah. That made a huge difference, yeah, that and… We have had Hadley Wickham on the podcast, so he’s the brilliant mind behind this tidyverse and the idea of piping. So episode 337, Hadley Wickham was on to discuss that. And also I should mention just while we have me mentioning old episodes, is you were talking about pandas earlier in the show, and we had Wes McKinney, the creator of pandas in episode number 523. Anyway, while I have my index of old episodes up I thought I might mention that as well. But yeah, the tidyverse and the ability to pipe, I mean that’s also something… So we also talked a lot about this in… There’s an episode with Matt Harrison.
Jodie Burchell: 00:48:57
Mm-hmm.
Jon Krohn: 00:48:58
Oh, you know him as well?
Jodie Burchell: 00:49:00
Yeah, we did a webinar together in June I think. I met him at PyCon US actually, yeah.
Jon Krohn: 00:49:07
So, amazing Python content developer. He’s in episode number 557 of the Super Data Science Podcast, and he talked a lot about piping, which is something that you can now also do in Python. And he recommends it as a best practice, because it allows you to chain together operations, and you can kind of see this flow that is very human-understandable. It’s like watching a data processing pipeline step-by-step, for example. And so yeah, so I got on this whole tangent because R having that piping ability, similar… And literally that terminology comes from Unix, where you [inaudible 00:49:51] functions in your Bash shell. And so yeah, super helpful functionality, and yeah. So you were saying before I interrupted you that kind of like-
Jodie Burchell: 00:50:01
That’s fine.
Jon Krohn: 00:50:02
… the piping, dplyr, the whole tidyverse that allows for tidy code, and also tidy data frames in your R code, was just coming about. So yeah, so you were into R in your postdoc.
Jodie Burchell: 00:50:15
Yes. So my first industry job, I was actually working in R and SQL, obviously. I haven’t actually used it for a long time now, and one of the reasons I bring up the tidyverse is I’m well aware that if I pick up R again I’m going to be so embarrassingly antiquated with the things that I do. Because I can see on Twitter-
Jon Krohn: 00:50:37
It moves so rapidly, yeah.
Jodie Burchell: 00:50:38
… it moves so fast. But yeah, going back to the piping first, and then we’ll come back to ggplot, it was interesting because this piping was something that I got really used to when I was first using R. And it was super logical, like this idea of… You’re essentially doing an ETL, like you’re doing your whole transformation. And funnily enough, when I went to pandas I got into this habit of just saving everything to interim variables. And it wasn’t until I started using Spark, where this chaining is very common, that I got back into doing it with pandas. And so then it was very interesting, because when I met Matt I was like, “Hey, you are like the only other person I met who writes pandas code like this.” And he’s like, “Yeah, this is like my core philosophy, like this whole function mutation.” Or function composition, sorry.
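A quick hedged sketch of that chaining style in pandas, with made-up data (not anything from Matt’s book or Jodie’s work): each method returns a new DataFrame, so the whole transformation reads top to bottom without the interim variables Jodie mentions, and without side effects on the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["You won't believe this", "Quarterly results", "10 shocking facts"],
    "clicks": [900, 120, 650],
})

# Method chaining: derive a column, filter, sort -- one readable pipeline,
# leaving the original `df` untouched.
result = (
    df
    .assign(is_popular=lambda d: d["clicks"] > 500)  # derive a flag column
    .query("is_popular")                             # keep only popular rows
    .sort_values("clicks", ascending=False)          # order the output
    .reset_index(drop=True)
)
print(result["title"].tolist())  # ["You won't believe this", '10 shocking facts']
```

The `lambda d:` inside `.assign` is what makes each step operate on the output of the previous one, which is the function-composition idea Jodie and Matt were talking about.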
Jon Krohn: 00:51:35
Oh, I didn’t know it was that rare. To me it’s so obviously the way to go.
Jodie Burchell: 00:51:41
I know, I know, because you have so many fewer side effects, yeah.
Jon Krohn: 00:51:46
Yeah. Maybe we should be using this Super Data Science platform more as a platform for advocating for piping, piping people.
Jodie Burchell: 00:51:54
Yes, come on.
Jon Krohn: 00:51:55
Doesn’t matter what language you’re using, Bash shells-
Jodie Burchell: 00:51:58
Do it.
Jon Krohn: 00:51:59
… R, do it, yeah. Python, do it. Matt Harrison’s screaming somewhere as he hears this.
Jodie Burchell: 00:52:05
He’s very excited we’re advocating for this. And Matt actually, I don’t know if we covered it in our webinar together but he has some really cool tricks for actually being able to debug inside of your chain without breaking the code open. So yeah, definitely I would-
Jon Krohn: 00:52:22
Wow.
Jodie Burchell: 00:52:23
… recommend his book. Yeah, it’s a really good book, and he’s a very sort of approachable teacher. So anyway, just spruiking Matt’s book, he’s going to be happy with me next time he sees me. So yeah, back to ggplot.
Jon Krohn: 00:52:38
Something really cool, this is going to be irrelevant to listeners by the time that this recording gets to them. But at the time of recording, I’m going to get to meet Matt Harrison in-person for the very first time next week at ODSC West in San Francisco, the Open Data Science Conference.
Jodie Burchell: 00:52:54
That is so cool.
Jon Krohn: 00:52:57
I’m going to have to do… So something that will be relevant to listeners is, I’m hoping to be able to record like a whole bunch of short episodes like we do… Jodie, on the Super Data Science Podcast we have long episodes on Tuesday, like this episode will be, this guest episode. And then we have these shorter episodes on Friday, which sometimes we call Five-Minute Fridays, and it’s just me talking for about five minutes on something. But increasingly we’ve been having guests on those, so I’m hoping to be able to pull people aside at the conference and just ask them like one or two questions. People like Matt Harrison, I know he’s going to be there, I know Ben Taylor’s going to be there at ODSC West, I know that Sadie St Lawrence is going to be there. And so these are some of the most beloved guests that have ever been on the show, Pieter Abbeel is going to be there.
00:53:44
And we are recording a whole guest episode live onstage with Dawn Song, who is huge. Yeah, she’s been on Lex Fridman. Anyway, what a tirade. But yeah, I think there’s really cool things ahead as there’s more and more kind of real-life conferences. I know it’s inevitable that at some point you and I will also be able to meet in-person, and yeah, it’s really exciting. And then the way that we can take advantage of in-person interactions to create cool podcast experiences, I’m so excited to explore that more and more. Anyway, we’re not here to just talk about the podcast. So yeah, so you were talking about piping, Matt Harrison’s book. Where did I cut you off most recently?
Jodie Burchell: 00:54:42
So, we were going to chat about plotting, and yeah.
Jon Krohn: 00:54:48
Oh yeah, that’s [inaudible 00:54:49] anything for a long time.
Jodie Burchell: 00:54:48
That’s where we started, we went on a little wander.
Jon Krohn: 00:54:51
As we were talking about your book, The Hitchhiker’s Guide to Ggplot2. So yeah, so ggplot2 is an R library, you can correct me if I’m wrong. I don’t think that there is anything that approximates it in Python.
Jodie Burchell: 00:55:04
Well interestingly, that’s what the second book that I wrote is about. So plotnine is I think the most comprehensive port of ggplot. So there you go, I’ve answered your question about, do you still need to use R?
Jon Krohn: 00:55:22
Oh no, all the R lovers out there are like, “Oh no.” We still had ggplot2, that was like one really good reason for some people to be using R, and now… Yeah, so plotnine.
Jodie Burchell: 00:55:35
They’ll be like, “Damn it.”
Jon Krohn: 00:55:35
Before researching for your episode I hadn’t even heard of it. So how is it that this enormously popular library, ggplot2 in R, it’s kind of like the default way that people plot these days in R. How come I haven’t even heard of plotnine in Python?
Jodie Burchell: 00:55:51
This is the thing, I can’t even remember how I came across it. I think I was basically just frustrated. Okay, I’m going to do full disclosure here, I’m someone who has never learned matplotlib properly. I always just used seaborn, yeah.
Jon Krohn: 00:56:08
Yeah, I mean, that’s fine. It’s pretty clunky, like every once in a while I go and use it. But if you want to create nice-looking plots, it is hard to do in matplotlib. They are functional and functional only, whereas seaborn does pretty easily create beautiful charts.
Jodie Burchell: 00:56:26
Yes. So I was like a total seaborn stan, but I was so used to the syntax in ggplot. I was like, “It would be really nice if there was something that was kind of close.” And I can’t remember how I came across it, but I came across plotnine. And yeah, basically when I wrote the book I would say that the API was relatively complete. There were still some things that needed to be added in, the maintainers are still working on it, it seems to get better and better every time I use it. And to be honest, I still use it as my default plotting package in Python. The one thing I will say is with larger datasets it does seem to use a lot of memory, so it can be very slow to render the plot. But for small datasets for quick plotting, it is excellent. So if you are used to ggplot, it’s a pretty seamless transition to plotnine.
Jon Krohn: 00:57:26
Wow, super cool, I can’t wait to try that out. That is another big thing that I’ve learned from you in this episode. It goes to show how somebody who’s in a developer advocate role like yours gets to spend so much time exploring different tools, and then can come on a podcast like this and be like, “You’ve got to try all these things.” Oh man, that’s really good.
Jodie Burchell: 00:57:50
See, that’s the best part of the job though, telling other people about it, because then we all get excited.
Jon Krohn: 00:57:55
Yeah, and you’re doing everything you can to be spreading the word about plotnine. I mean, writing a book about it, that’s as committed as you could be. So yeah, hopefully it’s going to take off. You heard it, maybe listener, you heard it here first.
Jodie Burchell: 00:58:14
Yes, yeah. Unless you already own my book, you know? In which case, yeah, yeah.
Jon Krohn: 00:58:21
Right, exactly, yeah. There will be a fraction of the audience that certainly had heard of it first, before here. But then, do you have any other Python packages that you’re really excited about, that we need to know about?
Jodie Burchell: 00:58:35
Yeah. This is not going to be like a secret or anything, but at the moment my latest obsession is the Transformers package from Hugging Face. So I spent a big chunk of my career in NLP, so obviously a big favorite of mine is Gensim, and we used to use Gensim pretty intensively when I was working two jobs ago. We did like a huge amount of natural language processing work. For those of you who don’t know Gensim, it’s probably the most mature and well-established package for creating word embeddings, so like word2vec, GloVe, those sort of models. But, Transformers is the package to implement a lot of the latest generation stuff. The docs are amazing, they have these incredible videos just explaining everything really simply. So it takes something that I think is potentially very intimidating, a bit scary for people to try.
00:59:38
Breaks down exactly what they’re doing, and then packages it up in this beautiful, easy to use API, great tutorials. So yeah, I’ve been having fun using some BERT models in order to do classification lately, and…
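To give a flavor of how little code that easy-to-use API needs, here is a hedged sketch using the Transformers `pipeline` helper. The checkpoint name is just an example of a fine-tuned BERT-family model that Hugging Face hosts, not necessarily what Jodie used for her clickbait classifier.

```python
from transformers import pipeline

# A ready-made text-classification pipeline; the model checkpoint is
# downloaded automatically on first use.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

result = classifier("This plotting library is an absolute joy to use.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': ...}]
```

Fine-tuning that on your own labels is more work than this, but the docs and tutorials Jodie mentions walk through exactly that adaptation.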
Jon Krohn: 00:59:54
Nice.
Jodie Burchell: 00:59:54
Man, I built this clickbait classifier, I got 99% accuracy straight out of the box.
Jon Krohn: 01:00:00
Oh, wow.
Jodie Burchell: 01:00:01
It was so good, it was like, it was super easy. I just adapted a tutorial, I was like, “This is rewarding.”
Jon Krohn: 01:00:10
Cool, yeah. And we have had the CEO of Hugging Face, Clem Delangue, on the show. So that was in episode number 564, so you can hear more about Hugging Face. Although it isn’t like… He’s the CEO of the company, not a technical person. So it’s not like we went into even the level of detail that you just went into around building a classifier with the BERT architecture. Super cool. So I imagine that’s going to be something that’s kind of available as a learning resource from you in the future? Like, are you going to be presenting on it at a conference or something?
Jodie Burchell: 01:00:51
I actually just presented on it at a conference.
Jon Krohn: 01:00:55
Oh, nice.
Jodie Burchell: 01:00:57
So yeah, basically my conference presentation was at PyCon Portugal, and there’s also some associated notebooks. So if you are curious about it you can check my Twitter, I actually posted it relatively recently. And yeah, you can access all the notebooks on GitHub.
Jon Krohn: 01:01:16
Nice. Yeah, we will look that up and be sure to include the link to your GitHub for those notebooks.
Jodie Burchell: 01:01:25
[inaudible 01:01:26].
Jon Krohn: 01:01:27
Very cool. All right, and then so, another topic that you have presented on a fair bit at conferences in the past is these ideas of applying rigor and skepticism to data validation, and to create reproducible data workflows. So succinctly, tell us about these very broad concepts.
Jodie Burchell: 01:01:59
Of course, of course, I’ll cover it in five minutes. Yeah, so I would say reproducibility, it’s a real ongoing topic not just in data science, but in science in general. I first started talking about this when I was pretty fresh out of academia, and one of the reasons I started talking about it is there was a very well-documented reproducibility crisis that went on in psychology, in medicine, in economics, and probably in other sciences but pretty much it got looked at most thoroughly in the social sciences and in medicine. And I can kind of see why it happens, because the usual workflow is not designed for people to actually inspect the data that you used, to inspect the code that you used.
01:02:50
And it’s really only been a recent thing in academia, where we’ve started to think about it. And I think a lot of data scientists, maybe coming from academic backgrounds, don’t really tend to think about it in our work. And so it means like you do an analysis, while you’re in the middle of it you’re like, “Okay, this is fine, everything’s working.” And then you just drop it and you’re like, “I’m done.” And then maybe you need to come back to it six months later and you’re like, “What did I do?” Like, “I can’t even get the environment to work.” So for me, I think I’ve always seen it as not really… It’s not really a matter of not caring, I think it’s a matter of, there’s already so many things that you need to master.
01:03:39
It’s just an additional overhead that, it’s not going to stop you doing your work in the immediate moment. But adding in checks and balances to ensure reproducibility can actually increase that overhead, and maybe you think, “Well, it’s not worth it.” I mentioned earlier, and I’m not going to go on about it, but I think Datalore is perfectly set up for reproducibility, so that’s one way in which tooling can enable that. But there’s also sort of other ways you can use tooling, so make sure if you are writing code you document it as well as possible. You have docstrings, you include the input and return types, you make naming clearer, you tidy up your code as much as possible.
01:04:26
You use the Markdown properly, you actually explain your process, and just make sure you’ve documented your environment somewhere. It’s just sort of small practices that need to change. So yeah, I think it’s not necessarily the easiest thing to get into the habit of doing, but it’s definitely important. And I think especially if something really important comes out of your research, you really should be able to reproduce it end-to-end later.
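Those habits, docstrings, input and return types, clear naming, look something like this in practice. This is a made-up example, not code from Jodie’s talks:

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Scale a list of scores to the 0-1 range.

    Args:
        scores: Raw numeric scores; must contain at least two
            distinct values.

    Returns:
        The scores rescaled so the minimum maps to 0.0 and the
        maximum to 1.0.
    """
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

print(normalize_scores([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Six months later, the docstring and type hints tell you exactly what the function expects and returns, without re-reading the body, which is the whole reproducibility point.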
Jon Krohn: 01:05:00
Super cool. So that was an amazing litany of habits that you just ran through, that would allow us to be better data scientists producing reproducible code. I love that, and it’s amazing that you were able to just do that off the cuff.
Jodie Burchell: 01:05:17
I talked about it a lot though, so this is why it’s like, up here.
Jon Krohn: 01:05:22
“Every night I wake up in a sweat, thinking about this list of things.”
Jodie Burchell: 01:05:26
No, it’s like my nightly prayers, you know? The five tenets of reproducibility.
Jon Krohn: 01:05:33
Nice. And so, all right, this episode has been amazing, I’ve learned so much. We’ve gotten such a great overview of tools that are useful for data scientists to be doing their job more effectively, more easily. As we get now near to the end of the episode, Jodie, other than your own books, do you have a book recommendation for us?
Jodie Burchell: 01:06:02
Yeah. I haven’t really been reading, I would say, data science stuff outside of work, because I get to learn a lot on the job. So, now I get to have fun in my spare time. I did just finish reading a book on my recent travels called The Shortest History of Germany, by James Hawes. Obviously I am not a native German, I don’t know if you could tell that by the name and the accent, so I’m always kind of interested in learning more about Germany, and finding out a bit more of the history beyond World War Two and the Cold War. And yeah, this was a great little book, it’s like 200 pages and very entertaining.
Jon Krohn: 01:06:48
Nice, that’s a cool recommendation. It is a country that has a lot of rich history for sure, has shaped a lot of world history and continues to today. So, sounds like a very interesting read, Jodie. All right, so how can people stay in touch with you after this episode? Clearly you have a lot of really valuable tips for data science listeners. How can they keep up with them after the episode?
Jodie Burchell: 01:07:16
Yeah, so my main sort of social media platform is Twitter, unsurprisingly. I think most people in tech are on Twitter. So yeah, if you give me a follow on Twitter, I post pretty much everything I’m doing there. If you want to message me you can reach out on LinkedIn, I don’t check it so often but I will get back to you eventually. And I also have a blog, I’m sometimes better at maintaining it than not. But I’ve been maintaining that-
Jon Krohn: 01:07:45
Standard error.
Jodie Burchell: 01:07:46
Yes, yes. So if you want to give that a look, there’s… I think I’ve been writing for that blog since about 2015, so it’s my first kind of brave attempt to leave academia at the beginning, and yeah, just write about things that interest me at the time. It could be math, it could be AWS stuff, could be NLP, there’s a whole bunch of stuff on there.
Jon Krohn: 01:08:15
Nice. Well, some great resources there to check out, we’ll be sure to include them in the show notes. Jodie, thank you so much for this super informative episode at the Super Data Science Podcast. We’ll have to have you on again in a couple of years or something, so you can give us a refresh on all of the coolest libraries and tools that we need to be aware of.
Jodie Burchell: 01:08:38
And thank you so much, it was an absolute blast. Yeah, I really enjoyed talking to you, and yeah, please reach out if you want to chat more NLP or data science stuff. I would love to hear from you.
Jon Krohn: 01:08:56
Nice, another outstanding, highly practical episode today. In it, Jodie filled us in on how you need to understand what all of the fields in your data mean, or use statistics to infer what they mean in order to use them effectively, how Datalore enables data science teams to collaborate in real-time within Jupyter Notebooks, how developer advocacy might be a great role in data science for you if you like working independently, presenting technical content, and constantly keeping sharp on the latest data science tools and approaches. How she believes plotnine is the best ggplot-like Python library, how the Hugging Face Transformers package is exciting for training state-of-the-art natural language processing models, and how if she could go back in time she would have studied computer science or software engineering to optimally prepare for a data science career.
01:09:44
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Jodie’s social media profiles, as well as my own social media profiles at www.superdatascience.com/629, that’s www.superdatascience.com/629.
01:10:01
Every single episode I strive to create the best possible experience for you, and I’d love to hear how I’m doing at that. For a limited time we have a survey up at www.superdatascience.com/survey, where you can provide me with your invaluable feedback on the show. Again, our quick survey’s available at www.superdatascience.com/survey.
01:10:20
Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara and Kirill on the Super Data Science team for producing another awesome episode for us today. For enabling this super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Last but not least, thanks to you for listening all the way to the end of the show. Until next time, my friend, keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.