SDS 557: Effective Pandas

Podcast Guest: Matt Harrison

March 15, 2022

In this episode, Matt Harrison joins us to reveal his best Pandas tips, tricks and best practices to help you get the most out of the Python data analysis library.
About Matt Harrison
Matt is a world-renowned expert on Python and data science. He has a CS degree from Stanford University. He is a best-selling author on Python and data subjects. His books Effective Pandas, Illustrated Guide to Learning Python 3, Intermediate Python, Learning the Pandas Library, and Effective PyCharm have all been best-selling books on Amazon. He just published Machine Learning Pocket Reference and Pandas Cookbook (Second Edition). He has taught courses at large companies (Netflix, NASA, Verizon, Adobe, HP, Exxon, and more), universities (Stanford, University of Utah, BYU), as well as small companies. He has been using Python since 2000 and has taught thousands through live training both online and in person.
Overview
With multiple Python and Pandas books to his name, expert Matt Harrison is here to provide his best tips for mastering the programming language and its data analysis library. After a short introduction about his collection of books, the O’Reilly Python and Pandas master kicked off the episode with six valuable tips for using Pandas masterfully: chaining (his most controversial tip); working with your raw data; executing Jupyter notebooks from top to bottom; avoiding the use of ‘apply’; typing the columns of your DataFrames; and, lastly, becoming adept at aggregating and pivoting data.
When Matt’s not writing books, he’s busy teaching and leading organizations like Netflix, Stanford, and NASA via his consulting company MetaSnake, which offers custom live Python training for teams. What he finds interesting is that when training includes the use of client data, learning happens at a much faster rate. “Students come in, and they’re already experts on their data, and now they’re getting incredible insights…and creating visualizations, and having conversations with colleagues that probably wouldn’t have been able to happen had we just used canned data,” he says. “Live training for corporations is the best way to level up your team.”

But when it comes to people learning new topics independently, Matt looks to high-performance athletes or public figures like Steph Curry or presidents, who all had good coaches by their sides to help them learn. “If you have someone who’s a master who can take you along, that’s great,” Matt shares. But if you can’t find a mentor, he recommends building a clear path to follow and applying what you’re learning practically.
While it may come as a surprise when Matt admits that “a lot of the time [he’s] not motivated,” he shares a few productivity hacks for those working on their own. For example, he suggests limiting the scope of your projects and making your commitments public for the sake of accountability. And with four kids at home, he is also a big fan of deep work and always fits in a walk or a nap when the time is right.
Lastly, Matt closed off the episode by delivering valuable answers to your community questions.
In this episode you will learn:
  • Pros and cons of self-publishing and working with a publisher [5:05]
  • Matt’s six tips for using Pandas [17:13]
  • The best way for corporate teams to level up their skills [40:04]
  • How to learn anything effectively [47:14]
  • Matt’s tricks for staying motivated [50:00]
  • Matt’s recommendations for using Git and the Unix command line [1:00:14]
  • Matt’s recommended software libraries for working with tabular data [1:19:45] 

Episode Transcript

Jon Krohn: 00:00:00

This is episode number 557 with Matt Harrison, bestselling author and expert on the Pandas Library. Today’s episode is brought to you by Monte Carlo, the data observability leaders, and by the new course from Data Science Workshops called Embrace the Command Line. 
Jon Krohn: 00:00:21
Welcome to the SuperDataScience podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex, simple. 
Jon Krohn: 00:00:52
Welcome back to the SuperDataScience podcast. We are joined today by Matt Harrison, a world-leading expert on Pandas, the most popular software library for working with two-dimensional tables of data. Matt is the author of seven bestselling books on Python and machine learning, with three that specifically focus on the Pandas Library. His most recent book, Effective Pandas, was released in November. Beyond being a prolific author, he teaches exploratory data analysis with Python at Stanford University. Through his consultancy MetaSnake, he’s also taught Python at leading global organizations like NASA, Netflix, and Qualcomm. Prior to focusing on writing and education, he worked as a CTO and software engineer, and he holds a degree in computer science from Stanford. 
Jon Krohn: 00:01:34
This episode focuses primarily on Matt’s top tips for effective Pandas programming. On top of that, we discuss how having a computer science education and having worked as a software engineer have been helpful in his data science career, how to squeeze more data into Pandas on a given machine, and his recommended software libraries for working with tabular data once you have too many data to fit on a single machine. Today’s episode will appeal primarily to practicing data scientists who are keen to learn about Pandas, or to become an even deeper expert on Pandas, by learning from a world-leading educator on the library. All right, you ready for this? Let’s go. 
Jon Krohn: 00:02:15
Matt, welcome to the SuperDataScience podcast. It’s awesome to have you on air. Where in the world are you calling in from? 
Matt Harrison: 00:02:23
Thanks, Jon. I’m in Utah. I’m in snowy Utah right now, and happy to be on. Thanks for having me. 
Jon Krohn: 00:02:32
Utah, famous for its Alpine skiing this time of year. Is that something you engage in, or do you take advantage of the great outdoors you have around you out there? 
Matt Harrison: 00:02:40
It is. Yeah. I live about 45 minutes away from, I think, six ski resorts. 
Jon Krohn: 00:02:50
That’s nice. 
Matt Harrison: 00:02:51
Yeah. I can’t complain, other than this year. The snow has not been great. So, I’m going to cross my fingers that it snows a little bit more. 
Jon Krohn: 00:03:00
There just hasn’t been enough snow. That’s annoying. 
Matt Harrison: 00:03:02
Yeah. January’s been the third-lowest-producing snow month for Januarys on record. 
Jon Krohn: 00:03:10
So. Yeah. Well, we’re recording early February, so hopefully by the time this episode airs in March, you’ve been dumped on heavily and you get some nice powder days out there. 
Matt Harrison: 00:03:22
Yeah. 
Jon Krohn: 00:03:23
So we’ve known each other indirectly for a while. We both teach data science courses on the O’Reilly platform. We both follow each other on LinkedIn and Twitter, and we engage with each other’s content on social media. And Noah Gift, who was in episode number 467, kind of introduced us. He didn’t formally introduce us, but there was this group of collaborators or potential collaborators where he said, “You guys might like to talk to each other.” But we never had. We’d never had a phone call. We never met in person. So, we’re meeting on air, and it’s really exciting because your expertise is something that I’m super keen to dig into. I’m sure our audience is as well. You’ve written seven bestselling books: The Illustrated Guide to Learning Python 3, Intermediate Python, Learning the Pandas Library, Effective PyCharm, The Machine Learning Pocket Reference, and The Pandas Cookbook, which is now in its second edition. 
Jon Krohn: 00:04:28
So, that’s six. Number seven just came out in November, and it’s called Effective Pandas. So, we’re going to dig into that in this episode. And if you’re just listening, you might want to check out the YouTube version, because Matt just picked up all seven of those books in a row as I was going through them and showed them to the screen. So, you’ve written a ton of content. Congrats on that. And I should be asking you questions later in the episode about how you set out to do these books. But one question for you first: some of your books are self-published, whereas others are with the most well-known publishers in technology, like O’Reilly. So, what are the pros and cons of self-publishing versus working with a big publisher? 
Matt Harrison: 00:05:20
Yeah, that’s a great question, Jon. So, people write books for different reasons. Some people, I’ve heard them say that they literally just want to have a book that has an animal on the cover. And if they get that, then that’s the badge that they want, which is great. And some people have different reasons for publishing. Maybe they just want their content to be open, and then they’re going to self-publish it. Other people do some cross of that, where they have open content, but then it eventually gets published. So, I would say one of the nice things about working with a publisher is that you’ve got someone maybe dangling a carrot in front of you, or slapping you with a stick and making you make progress. A lot of people like, “I want to write a book. I want to write a book.” And they might say they want to write a book, but when you get into it, writing a book is a lot of hard work. It’s a lot of getting up early or staying up late, and just putting in the work. It’s sort of like, I guess, training for a marathon. A lot of people are like, “I want to do a marathon.” Well, are you out training for it? Are you out running almost every day? If you’re not, it’s going to be painful to do a book. 
Matt Harrison: 00:06:45
And so, you do get a lot of support from your publisher, especially on the editing. Another nice thing about a publisher is, generally, they make you write a proposal at the start, where you put in some thought about a path that you want to take someone down. And with self-publishing, it’s anything goes. It’s easy to self-publish these days. Anyone can do it. But I still think, if you’re doing it that way, you should do it very structured. You would want to think about the path that you’re taking someone along. You would want to work with people who can do editing. You would probably want to also think about what your distribution mechanisms are as well. So, a lot of people are like, “Well, what is the big pro and the con?” I would say, for a lot of people, it does come down to money, and I’m not really going to get into specifics here, but generally, when you are working with a publisher, you’re getting a 10-15% royalty. If you self-publish and you just put something on Amazon, and it’s priced at certain price points, you get between 30 and 70% royalty. That, in and of itself, might be sufficient, just because Amazon is the big elephant. They’ve got large distribution. 
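As a back-of-the-envelope illustration of that gap, here is a quick calculation with an entirely made-up list price and sales volume; only the royalty percentages come from the conversation:

```python
# Hypothetical numbers: a $40 book that sells 1,000 copies.
price, copies = 40.0, 1_000

publisher_royalty = price * copies * 0.125  # mid-range of the 10-15% he cites
self_pub_royalty = price * copies * 0.70    # Amazon's top tier at certain price points

print(publisher_royalty)  # 5000.0
print(self_pub_royalty)   # 28000.0
```

Even with a publisher's larger marketing reach, the per-copy economics can differ by several multiples, which is why Matt says it often "comes down to money."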
Matt Harrison: 00:08:21
And so, that’s something to consider. Also, you might want to consider your platform. A lot of people, the publisher can do some marketing blast and whatnot, but at the end of the day, they’ve got multiple books they need to push. And after a few months, your book is not really on the top of their mind. They’ve got other books that they’re working on. And so, a lot of the marketing comes down to you and what your platform is. So, those are some of the pros and cons. I actually have a course on this. We can link to it in the show notes. 
Jon Krohn: 00:08:56
For sure. 
Matt Harrison: 00:08:58
I’ve interviewed over 12 authors who have written books in the tech space, and asked them what their process was. Some of these are traditionally published authors. Some of them are self-published. Some of them have done both. So, if you are considering writing a book, that might be something that you might want to check out. But yeah, a lot of people say they want to write a book, but they don’t really want to write a book. They just think they do. 
Jon Krohn: 00:09:32
Yeah. It is a really arduous process. And certainly one of the most stressful of my life, but also, then, when it’s done, it’s one of the most rewarding things in life. So, I guess that’s how you end up doing it seven times. 
Matt Harrison: 00:09:44
Yeah. I guess, for me, it’s an itch that I wanted to scratch. In programming terms, there’s this thing called bike-shedding, where people like to have their bike shed painted a certain color or whatnot. And it’s infamous for people rewriting code and saying, “I really wanted to write the code myself,” basically. And I guess I can say that I bike-shed books. This is the book that I wanted. It’s the pink book instead of the green book. 
Jon Krohn: 00:10:15
Right, right, right. So, yeah. Let’s talk about that a little bit. So, how do you choose what book you’re going to write next? Why did you write Effective Pandas? Why was that the shed that needed that color at this time? 
Matt Harrison: 00:10:32
I like that question. So, maybe I can give a little bit of history about my Pandas books. My original Pandas book, Learning the Pandas Library, came out in 2015 or ’16, and it was one of the earliest Pandas books out there. And I had been using Pandas almost since it was released publicly. Prior to Pandas, I had started a company that was doing business intelligence reporting. And so, we were doing basically retail sales reports for a large chain of companies. And I had basically written something very similar to Pandas in Python for generating these reports. And there was a server that was running, they could query the server. And then there were batch processes running that would generate these reports, which eventually got spit out as huge spreadsheets with thousands of rows and often hundreds of columns that people wanted for sales reporting. 
Matt Harrison: 00:11:43
And when Pandas came out, I’m like, “Oh, that’s cool. It does a lot of what I’m doing. It has some overlapping features, some features that are different.” But the key part of Pandas was that it was leveraging NumPy, and mine was just pure Python. And so, by leveraging NumPy, you get a pretty huge performance increase. So, for those who are listening who aren’t aware, Pandas is a library for manipulating structured or tabular data. And so, I started using Pandas. I paid for my education by tutoring. And so, my background is software engineering, but I consider myself an educator. Every time I learn something, I’m like, “How would I want to learn this?” And so that’s the impetus for me for writing books: what gaps are there? What holes need to be filled? What are some hints or best practices that I would want to learn if I was learning this? 
Matt Harrison: 00:12:59
So, that was basically the impetus for my first Pandas book. After that, I started MetaSnake and started doing a bunch of consulting and training in Python and Pandas. And I was approached to do the second edition of the Pandas Cookbook. So, I’m not the original author of that, but I did read it because, as someone who’s interested in Pandas and as an educator, I tend to try and consume a lot of content around that. And I like the book. However, it didn’t feel like my book. As I did the second edition, I added a few chapters and rewrote or tweaked a lot of the code, but it still didn’t feel like my book. At that point, my opinions around Pandas had become, I guess, a little bit stronger about proper ways of using it. And so, I was like, “Okay, I need to revisit my original book and redo that.” And so, that’s what Effective Pandas was supposed to be. It was supposed to be the second edition of Learning the Pandas Library. 
Matt Harrison: 00:14:13
But as I started doing that, and this, again, goes back to writing a book: you think planning out how long software takes to write is hard. You should try planning out how long writing a book takes. So, I’m like, “Oh, this is going to take a couple weeks. I’ll just tweak a few things here and there, and then I’ll be done.” And then I started looking at it like, “I’m basically going to throw out everything and start over.” And so, that was the birthplace of Effective Pandas: after many years of teaching and consulting and seeing a lot of people’s Pandas code, seeing how people are using it, what is the book that I would want, that I could point them to, that would help them write better Pandas code? 
Jon Krohn: 00:15:03
Perfect. 
Jon Krohn: 00:15:05
Struggling with broken pipelines, stale dashboards, missing data? You’re not alone. Look no further than Monte Carlo, the leading end to end data observability platform. In the same way that New Relic and Datadog ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of data downtime. As detailed in episode number 499, with the firm’s brilliant CEO Barr Moses, Monte Carlo monitors and alerts for data issues across your data warehouses, lakes, ETL, and business intelligence, reducing data incidents by 90% or more. Start trusting your data with Monte Carlo today. Visit www.Montecarlodata.com to learn more. 
Jon Krohn: 00:15:55
So, yeah. So Pandas, as you mentioned, is great for working with tabular data. Not only that, it’s the most popular software package for working with tabular data today, and it does so in Python, which is the most popular programming language period, but also the most popular programming language for data science in particular. So, it makes sense to be focusing on Pandas in general, and then it’s super cool to hear how you’ve made this journey towards this most recent book, Effective Pandas. And yes, it is thoroughly comprehensive. So, you cover how to get started with Python and Pandas, and just how to get it on your computer and work with it. You talk about data structures: series for working with one-dimensional data in Pandas, data frames for working with two-dimensional data, how to effectively apply operations, how to plot, memory usage considerations, how to export data, how to debug. It covers everything from end to end. And that sounds like the kind of thing you would be able to write if you had created a consulting company and were teaching people all the time on how to use Pandas and Python as effectively as possible. So, it’s super cool that reference exists now. And yeah, I definitely encourage folks to check it out. 
Jon Krohn: 00:17:12
So, something that we can probably do on air, Matt, is to dig into your top Pandas tips. So, we actually had the creator of Pandas, Wes McKinney, on a recent SuperDataScience episode, number 523, and we talked about the genesis of Pandas. And then we talked a lot about libraries that he has been working on since, and companies that he’s built. We didn’t actually talk about Pandas and how to use it, and what his top tips might be. I’d love to hear yours. And I think you have some great ones for us. 
Matt Harrison: 00:17:46
Sure. Yeah. So, maybe I’ll start off with what might be the most controversial tip. So, as I said, I think my thoughts around Pandas and the proper care and feeding of Pandas have grown stronger as I’ve used it and seen it used. But my first one is to leverage what is called chaining. Now, if you go to my Twitter or my LinkedIn, I don’t generally post a lot of cat photos. I often will post code as images, and you’ll see a lot of my Pandas code. And this tends to elicit a strong response, either positively or negatively. I’ve had people say, “This is the worst code that I’ve ever seen.” I’ve had people say, “I would never work with you.” And then, on the flip side, I get people like, “This is awesome. This changed how I write code. My life is much better after doing that.” 
Matt Harrison: 00:18:47
So, let me maybe just explain chaining and how I see it. In Pandas, we have basically two data structures. We have a series, which you can think of, if you’re thinking in database terms, as a column from a database. And then we have the data frame, which, if you’re thinking of databases, is like a table. So, those are the two main structures that we have, and most operations on either a data frame or a series (and there are about 400 different methods on both of them, if you look) will return back one of those objects, a series or a data frame. Sometimes, if they’re reducing, they might return a scalar object instead. So, think about it from a data science perspective, especially data janitor work: cleaning up your data, prepping it for machine learning or whatnot. Most of the data in the wild is not great as is. It needs a lot of nudging, cleaning up, maybe reformatting it or restructuring it. And I like to write those restructuring instructions as a series of, basically, steps, step by step. 
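A minimal sketch of that chaining style, with made-up data and column names rather than code from the episode, looks like this:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 40.0], "qty": [2, 1, 3]})

# The outer parentheses let the chain span multiple lines without
# backslashes: one operation per line, read top to bottom like a recipe.
result = (
    df
    .assign(total=lambda df_: df_.price * df_.qty)  # add a computed column
    .query("total > 15")                            # then filter on it
    .sort_values("total")                           # then order the rows
)
```

Each method call returns a new data frame, so the next line can keep operating on the result without any intermediate variables.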
Matt Harrison: 00:20:04
So, what I’ll do is I’ll actually put a parenthesis at the top of my code, and I’ll put a parenthesis at the end. Parentheses mean a couple things in Python, but in this case, this is a parenthesis for basically a parenthetical, like if you’re doing a math operation and you want to add two numbers before multiplying them. And what it allows you to do in Python is it allows you to basically escape the white space rules. So, I can say, “I’m going to start off with my raw data, and then I’m going to go to the next line. And then I’m going to put just the single operation that I’m going to do on the next line.” I actually have, on my screen here, some Pandas code that I’ve written for a sales report that I generated. And so, maybe I can just describe it for- 
Jon Krohn: 00:20:49
A sales report of your own. 
Matt Harrison: 00:20:51
Yeah. 
Jon Krohn: 00:20:51
Sale? 
Matt Harrison: 00:20:52
Yeah, of my own sales from the MetaSnake website. And so, I guess I can put that on the screen here, for YouTubers. 
Jon Krohn: 00:21:03
Amazing. 
Matt Harrison: 00:21:04
You can see that, at the top there, I’ve got sales. So, I’m just going to take my sales data frame, and then the next thing I’m going to do is I’m going to query it. So, I’ve got a single call to the query method, and I’m just filtering sales that were actually paid, and I’m filtering the bundle. And so, that is going to return a data frame, so I’m going to just keep operating on that. The next thing I’m going to do is a group by. This is one of those super powerful features of Pandas. It lets you basically pivot the data, and I’m going to group this by the date, at a month frequency, and I’m also going to pivot it by category. So, this is going to give me what’s called a hierarchical index, or multi-index, when I return this. And then, on that grouping, I want to apply two aggregations. So, that’s the next line, the aggregations. I’m going to total the sales, and then I’m going to count the number of items in there. 
Matt Harrison: 00:21:56
That’s going to return me another data frame. Now, because of how I grouped it, with date and category, it’s going to have a hierarchical index. And so, the next line here, I’m going to unstack that, which is going to rotate one of the indexes into a column. And then the next thing I’m going to do is pull off one of those columns, because these are now hierarchical columns. I’m going to pull off the total sales column, and then I’m going to pull off the course column from that. And this should give me, basically, the sum of all the course sales and the count of all the course sales. And then I’m going to add some exponentially weighted moving averages to do a very simple prediction on that, which is the next line after that. So, there are about eight steps in this, which is basically: I want to do a simple prediction on what my course sales will be like. But I’ve written it in this chain, and generally, when I’m writing this, I can step through it one line at a time. 
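A chain along the lines Matt describes might look roughly like the following. The column names, values, and exact calls here are guesses reconstructed from his description, not his actual screen-shared code:

```python
import pandas as pd

# Toy stand-in for the sales data frame described in the episode.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-03", "2021-11-20", "2021-12-05",
                            "2021-12-18", "2022-01-09", "2022-01-22"]),
    "category": ["course", "book", "course", "course", "book", "course"],
    "status": ["paid", "paid", "refund", "paid", "paid", "paid"],
    "total": [100.0, 30.0, 100.0, 150.0, 30.0, 200.0],
})

course_sales = (
    sales
    .query("status == 'paid'")                                 # keep paid sales
    .groupby([pd.Grouper(key="date", freq="M"), "category"])   # month x category
    .agg(total_sales=("total", "sum"), count=("total", "size"))  # two aggregations
    .unstack()            # rotate category out of the index into columns
    ["total_sales"]       # pull off the total-sales block of columns
    ["course"]            # then the course column from that
)

# Exponentially weighted moving average as a very simple smoothing/prediction.
smoothed = course_sales.ewm(span=3).mean()
```

The groupby on a `pd.Grouper` plus a second key is what produces the hierarchical (multi-) index that the subsequent `unstack` rotates into columns.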
Matt Harrison: 00:23:03
I’m starting at the raw data and then I’m just building this chain, which, if you get used to it, starts looking like a recipe, and then your code is very easy to come back to. What I find generally is that 90-plus percent of people, when they’re writing Pandas, will write each of these eight steps as either a new individual line, storing the intermediate data frame, or they’ll put them in different cells, and they might not even put the cells in the right order. And so what I find is that makes it really hard to understand, and it also makes it really hard to come back to. Hard to understand because our brains have limited capacity. If you look at working memory, what you can store in your brain, commonly people say seven plus or minus two. 
Matt Harrison: 00:23:45
And if I’ve got all these intermediate variables that I’m just keeping around, to me that’s just digital noise, it’s getting in the way. I don’t really care about those, I care about the end result. It’s like saying, I’m going to put some restrictions on how you code and it might feel like those restrictions are harder, but after you embrace the restrictions it’s going to force you to write better code. What I can do after I’ve written this chain is I can then take that whole chain, and oftentimes I’ll make a chain to clean up, just do janitorial work, and I’ll take that chain, put it into a function and put it at the very top of my notebook. And then when I come back to my notebook, all I have to do is load my raw data and then run this function that cleans it up and I’m good to go. I don’t have to worry about running these cells in arbitrary order or keeping track of these intermediate things. 
Matt Harrison: 00:24:37
So, that’s probably a practice. And I find it similar to… I also teach Python, and when I teach people about Python, one of the things that’s weird in Python is white space. It’s novel to the language, and a lot of people, especially those who have years of, say, Java experience, come to Python, and maybe their company is saying, “We’re using Python,” because, like you said, it’s the most popular language for a certain application, “so we’re going to learn Python even though you’re a Java expert.” And a lot of these people are like, “Oh, it’s got white space, and that white space really bothers me, that you have to indent,” even though these people indent their code normally. But they’re like, “That bothers me.” 
Matt Harrison: 00:25:19
And what I find is most of the people, after a day of just learning the rules, are like, “This doesn’t even matter. It’s not even a big deal.” And it’s a similar thing here with the chaining. It is different, because most people aren’t doing it and most people don’t see it. But if you adopt this, my take is that it’s generally going to make it so you are focused on making a recipe of code, steps that you’re going to do, and it’s going to force you to be more clear, and you’re going to be able to read your code and come back to it. But also, if you start sharing it or collaborating with others, it’s really easy to do that. And if you write it in a single chain, you can start testing it as well: you put it into a function and now you can test it. You really don’t care about the intermediate state. You only care about the end result. So this basically wraps up the recipe from raw data to end result. So, that’s my first tip. 
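Wrapping a chain in a function, as Matt describes, might look like this sketch; the function name `tweak_sales` and all the data here are hypothetical stand-ins, not his code:

```python
import pandas as pd

def tweak_sales(raw: pd.DataFrame) -> pd.DataFrame:
    # The whole cleaning chain in one function, kept at the top of the
    # notebook, so re-running it from scratch needs only raw data + one call.
    return (
        raw
        .query("status == 'paid'")
        .assign(month=lambda df_: df_.date.dt.to_period("M"))
    )

raw = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-09"]),
    "status": ["paid", "refund"],
    "total": [50.0, 80.0],
})

sales = tweak_sales(raw)  # no intermediate variables to keep track of
```

Because the chain lives in a function, it can also be unit-tested on its own, which is the testability point made above.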
Jon Krohn: 00:26:20
Yeah. So I have good news for you, Matt, which is that I am firmly in the chaining camp. So I absolutely love chaining for… There’s no point in me repeating everything that you just said, you articulately made the case for chaining and then even summarized it. So I’m not going to bore the audience by just repeating it again. But I am a big chainer, I love not having intermediate variables. They just, as you said, they make it harder to understand my own code, to understand other people’s code. I love chaining. And I’ve been into the idea ever since the dplyr library in R. So I was big into R before I got into Python and the dplyr library allowed the same thing. And my first experiences with that with piping in R, I was like, this is incredible. My code is so much cleaner and easier to understand. You just have this process, just this pipe of processing, you showed us the example in code, actually, in your screen share there. And yeah, your data just flow through this pipeline and you can see exactly what that pipeline is from start to finish. It’s so neat and clean. So I do the same with my Pandas code as well, thank you for that first tip. 
Matt Harrison: 00:27:29
Awesome. You’re in the 1% club. 
Jon Krohn: 00:27:33
Well, we are, I think, a very elite 1%. 
Matt Harrison: 00:27:38
I’m trying to make that not 1%, but this is something that I see: a lot of blog posts on how to use Pandas just don’t even talk about this. So I feel like, for some reason, a lot of people from R are used to the pipe, but for people from the Python world, it’s just novel to them. 
Jon Krohn: 00:28:01
Which is interesting, because even in Unix at the command line piping is super common. 
Matt Harrison: 00:28:07
Yeah. 
Jon Krohn: 00:28:09
Anyway. Well, yeah, so that was your first tip? 
Matt Harrison: 00:28:11
That’s my first tip. 
Jon Krohn: 00:28:12
What’s the next? 
Matt Harrison: 00:28:12
My next one would probably be working with raw data, and this isn’t necessarily Pandas-specific, but maybe you can combine it with chaining: if you are making this chain and you do what I said, you load your data and then you have this chain, I like to take that and put it into a function. Then I put the cell that loads my data at the very top, and the next one, that just cleans it, below that as a function. Invariably, what I’ve found in consulting arrangements, or even in work or teaching, someone will say, “Why is that? Why did you get that number?” And so, if you are not working with raw data, making those explanations and tracing your code is hard. But if you’re using chaining and you work with raw data, you can actually trace the code, and you can trace the data through every step of that. And you can say, “Oh, this is why the average is this.” It makes explaining to the higher-ups very easy if you work with the raw data. That would probably be my second tip. 
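One common way to do that tracing, sketched here with made-up numbers (commenting out the later steps and re-running is a widely used tactic, not necessarily Matt's exact workflow):

```python
import pandas as pd

raw = pd.DataFrame({"dept": ["a", "a", "b"], "sales": [10.0, 20.0, 6.0]})

# "Why is the 'a' average 15?"  Because the chain starts from raw data,
# you can re-run it with later steps commented out and inspect each
# intermediate result, tracing the number back to the original rows.
avg = (
    raw
    .query("sales > 5")   # step 1: run with only this line to see what survives
    .groupby("dept")      # step 2: then re-enable the grouping
    .sales
    .mean()               # step 3: and finally the aggregation
)
```

Since every step flows from `raw`, nothing depends on hidden intermediate variables left over from earlier cells.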
Jon Krohn: 00:29:21
Nice. That one was quite concise, after chaining was so in-depth. But yeah, working with raw data is almost self-explanatory. If you don’t work with the raw data, how are you going to be able to explain in any detail if there’s any further probing on some summary stats that you have? So, I love that. Working with raw data, you can also identify issues. There could be issues with the incoming data pipelines that, in summary stats, are obscured and misleading. All right, so we’ve already got chaining, and we’ve got working with raw data. What’s your third tip for us, Matt? 
Matt Harrison: 00:29:57
Yeah. My next one is related to both of those, but I’m going to mention it separately. And that would be organizing your Jupyter code, which, again, might not be Pandas-specific. But if you follow that chain, and then you put your code in a way where you can execute your cells one by one, what that does is it starts to enable collaboration and enables you to work with your code easily. One of the common complaints that you hear in the Python realm is that Jupyter, although it’s nice because you can execute things out of order and whatnot, makes things hard. Yeah, I get that. I mean, sometimes when I’m doing loosey-goosey exploratory data analysis, I might just make random cells all over the place. But if I were to run that from start to end, it would not work. So, leverage the best practice of chaining, and then… I like to tell my students, anytime you’re going to start to collaborate with someone, you do want to make sure that you can take your notebook and run it from the start to the end without issues. In notebooks, there are actually numbers on the side that tell you the order in which the cells were executed, and there’s nothing more frustrating than coming to a notebook, going through those, and seeing higher numbers above lower numbers, indicating that a cell above was run after the cells below it. 
Jon Krohn: 00:31:17
No matter what tools and program languages you use, I believe that every data scientist should be able to use the Unix command line. In episode 531, I spoke with Jeroen Janssens, CEO at Data Science Workshops and author of the book, Data Science at the Command Line. Well, I just learned that Jeroen has a new cohort based course called Embrace the Command Line. You can apply now for the first cohort which is 50% off and starts March 20th. You’ll learn together with other data scientists and during live sessions with Jeroen himself. For details and upcoming cohorts, visit datascienceatthecommandline.com. That’s datascienceatthecommandline.com. 
Jon Krohn: 00:31:58
Yeah, you should be able to clear all outputs, restart the notebook, and just run all cells; all the numbers should be in order, and no errors should be thrown. Exactly, it’s maddening if you get a notebook sent to you that only works if you execute the cells out of order. 
Matt Harrison: 00:32:17
Yeah. So, again, that’s probably one of those things where you just have to put some constraints on and make sure that you’re doing things inside a certain framework, and that will help you. The next one would probably be another one where you see a lot of people throwing around advice on the internet, which is this apply method, which, if you think about it, is like pipe, but it has a drawback. For those who aren’t aware, again, we have series and we have DataFrames, and a series is basically a vector of data. How Pandas works is basically by leveraging NumPy, and what NumPy gives us is that instead of having a series of individual Python objects, like Python integers or Python floats, it’s going to give us a buffer of data in memory, and we don’t have the overhead of Python for a series. And so if you want to add something to a series, you can say plus two, and it will leverage modern computer architecture, SIMD instructions, and basically say, here’s the buffer, add two to that, and it will give you a new buffer. Versus apply, which will pull out each individual number, convert it to a Python object, and then run some code on it. 
Matt Harrison: 00:33:42
You’ll hear this thrown around: “Oh, you can use apply and you just write Python code and it works.” It does work; however, at this point, you’re going down what I call the slow path, because you’re taking something that’s very optimized and you’re pulling it back into Python, which is a slow language. So if you can avoid apply, generally your code will run faster. Now there are cases where, I think, apply is okay, so maybe I’ll just put a caveat on that. If you’re doing numeric operations and you’re using apply, to me, that’s a code smell, a hint that you probably could be doing this in a different way, and it would probably run 10 to 50 to 100 times faster. However, for strings, Pandas doesn’t have an optimized storage mechanism. It basically has a buffer, but those buffers are pointing back to Python objects for the strings. So I’m okay with apply if you’re doing apply on a series that has string data in it, because at that point you’re already on the slow path. So that would be my next one: just look for instances of apply. If you’re using apply with numeric data, you probably could be doing it faster. 
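A minimal sketch of the two paths Matt contrasts (the series values are arbitrary):

```python
import pandas as pd

s = pd.Series(range(1_000))

# Vectorized: one operation over the whole NumPy buffer (the fast path)
fast = s + 2

# apply: boxes each value as a Python object and calls a Python
# function on it, one value at a time (the "slow path")
slow = s.apply(lambda x: x + 2)

# Same answer, very different performance characteristics at scale
print(fast.equals(slow))
```

At this size the difference is invisible, but on millions of rows the vectorized form is typically one to two orders of magnitude faster.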
Jon Krohn: 00:34:49
Awesome. So chaining, working with raw data, Jupyter effective use, avoiding apply. You got a couple more for us, Matt? 
Matt Harrison: 00:34:57
Sure. Maybe I’ve got two more. One is using the correct types. So this is another thing that is pretty important, and you see that in the tweak as well. When we’re loading our data, oftentimes people will load their data from a CSV file. CSV files are nice in that they’re human-readable, but that’s about the extent of the niceness of a CSV file. The other nice thing is that they’re all over the place, which may or may not be nice. But oftentimes you’ll get these CSV files and they might be encoded in some weird Windows encoding, or they might have characters that Pandas doesn’t understand. Pandas will try and convert numeric columns to numbers, but if a column has a string or some value that Pandas doesn’t understand, then it’s going to leave it as a string. So you have those sorts of issues. 
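A small illustration of that failure mode (the column name and the stray "unknown" value are invented); `pd.to_numeric` with `errors="coerce"` is one common way to flush out the bad values:

```python
import io
import pandas as pd

# A "numeric" column with one value Pandas can't parse
df = pd.read_csv(io.StringIO("price\n100\n200\nunknown"))

# Pandas falls back to object dtype for the whole column
print(df["price"].dtype)  # object

# Coerce to numeric, turning unparseable values into NaN you can inspect
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].dtype)  # float64
```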
Matt Harrison: 00:35:47
So you do want to make sure that you look at your types and just make sure that things you thought were numeric are numeric. Another one, and this goes back to our strings and the fact that Pandas really doesn’t optimize strings, is categorical data. If you have categorical data with low cardinality, by default, if you read a CSV file, Pandas is going to represent that as a bunch of Python strings. If you’ve got car makes, and you’ve got 20 car makes and 50,000 rows, it’s going to be 50,000 Python strings. Well, that’s probably not optimal, and Pandas does have a way to represent that, with what’s called a categorical type. And if you use a categorical type for that column, you’re going to have a huge memory savings. Because if there are only 20 unique values, what it can do is basically make a mask of integers that reference the list of 20. So each row is basically going to be a small number that indexes into those 20 values. 
Matt Harrison: 00:36:41
And then if you want or need to do string operations on the make, now instead of doing string operations on 50,000 Python objects, you’re only doing them on 20. So not only do you get a memory savings from doing that, you can also potentially get a huge speed improvement. There is a point of crossover, when the cardinality gets high enough, where, because of that layer of indirection, it makes sense to leave the column as Python strings rather than categoricals. 
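A rough sketch of that savings (the make names and row count are made up):

```python
import pandas as pd

# 20 fake car makes repeated over 50,000 rows (low cardinality)
makes = pd.Series([f"make_{i % 20}" for i in range(50_000)])

as_cat = makes.astype("category")

# The categorical stores 20 unique strings plus compact integer codes
print(len(as_cat.cat.categories))  # 20
print(as_cat.cat.codes.dtype)      # int8 -- tiny per-row cost

# Memory comparison: 50,000 Python strings vs. 20 strings + codes
saved = makes.memory_usage(deep=True) - as_cat.memory_usage(deep=True)
print(saved > 0)
```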
Jon Krohn: 00:37:11
Nice. Great tip. All right. And then sixth and final one? 
Matt Harrison: 00:37:14
Yeah. The next one would just be: learn to master aggregation. So this would be pivoting or group by. This is a syntax that can be a little bit different if you’re not used to it, because, generally, it’s done in two or three steps, where we specify what we want to group by, then we might pull out what we want to aggregate, and then we apply aggregations to it, instead of doing it in one step. Or, alternatively, there is a pivot-table syntax. My advice would just be, start playing around with that and get used to it. It might seem a little bit overwhelming or confusing at first, but if you can master it, it’s going to make slicing and dicing your data easier. If you need to start making reports, you can do that, or if you need to aggregate things at a certain level to prep them for machine learning, it’s going to make it really easy to do that. 
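A minimal sketch of the two syntaxes side by side (the tiny DataFrame is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "make":  ["Ford", "Ford", "Toyota", "Toyota"],
    "year":  [2020, 2021, 2020, 2021],
    "price": [20_000, 22_000, 25_000, 27_000],
})

# Step by step: specify the grouping, pull out a column, then aggregate
by_make = df.groupby("make")["price"].mean()

# Alternatively, the pivot-table syntax does it in one call
pivoted = df.pivot_table(index="make", values="price", aggfunc="mean")

print(by_make["Ford"])                 # 21000.0
print(pivoted.loc["Toyota", "price"])  # 26000.0
```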
Jon Krohn: 00:38:06
Nice. All right. Amazing tips, Matt. I expected nothing less. And just so the listener is aware, Matt wasn’t prepared with these. I ambushed him just before we started recording and said, “I’d love to have this episode just focus on your top tips for Pandas.” And he was like, “Sure.” And so there you go, his six top tips: chaining, working with raw data, effective use of Jupyter, avoiding apply, typing your columns (making sure that you get those types right), and then mastering aggregation or pivoting. I’m sure many of these tips that you provided are ones that I already implement, but there were a few here that I wasn’t aware of. For example, the avoiding apply. Again, mentioning how I used to be big into R before Python: in R, apply is actually a miraculously helpful family of functions that can make things like for loops run orders of magnitude more quickly. So if I see that in Pandas, I say, “Oh, great. Something that’s going to speed everything up for me.” Meanwhile, it could be making everything orders of magnitude slower- 
Matt Harrison: 00:39:11
That’s interesting. 
Jon Krohn: 00:39:12
.. in Pandas. Yeah. 
Matt Harrison: 00:39:13
Yeah. To be honest, I’m not an R user at all. So that actually is interesting, and maybe that explains a lot of it. 
Jon Krohn: 00:39:21
Why people end up using it at all, and you’re like, “Why are these people doing this?” 
Matt Harrison: 00:39:24
Yeah. 
Jon Krohn: 00:39:24
Yeah, it’s interesting. So you come from a computer science background, and people with a computer science background tend to be much stronger on team Python for data science, whereas people who come from a statistics background are more likely to have come from an R background. At least for people our age; for younger people, it might not be the case anymore, I’m not sure. So that could be the genesis of why you still see people using apply a lot in Pandas, even when it might be slowing them down. So super cool. Thank you, Matt. I learned a ton, and I’m sure our listeners did as well. You’ve mentioned MetaSnake, your consulting company, and you have educated tons of leading organizations. There is an enormous list of them, but some of the big names include Stanford, Netflix, and NASA. You are teaching some of the smartest people on the planet how they could be using Python or Pandas more effectively. How do your engagements work? Are they typically shorter engagements, longer engagements? I guess it depends on what the client’s looking for? 
Matt Harrison: 00:40:39
Yeah. Typically, my live training, which generally is done through Zoom or some online medium these days, tends to be three to five half-days in length. And that would be sufficient for an introductory Python course, an intermediate Python course, or an introduction to Pandas course, that sort of thing. People often say it’s basically a semester-long course packed into a few days. And the feedback on those is pretty awesome. One of the nice things about something like that is you can take your team, and even if people start at different levels, at the end of the course they’re all leveled up: they all have the same knowledge gaps filled in, and they’re not speaking past each other. What I find, especially with a lot of people coming from the data science side, is that they don’t really have the software engineering background, so they’re using Python as a tool, but they might have a bunch of knowledge gaps in there. 
Matt Harrison: 00:41:48
So it’s often useful to fill in some of those knowledge gaps, like how does Python work? Everything is an object in Python, and if you master that idea, it opens the doors to understanding more complex things, but it also lets you understand why certain operations in Pandas might be slow. Another thing that I like to do with some of my live courses is adapt them to the client, which is nice. For example, when I’m teaching a Pandas course, I have canned data that I use, but I often will take the client’s data, and rather than teaching Pandas with canned data, we will teach Pandas with their data. When I’ve done that, those have probably been my best courses, because students come in already experts on their data, and now it’s just, “Oh, this is how I can use Pandas or whatever tool to slice and dice my data.” And there have been incredible insights, with people just digging into it and then creating visualizations and having conversations with colleagues that probably wouldn’t have been able to happen had we just used the canned data. So that’s something really powerful that you get with live training. As much as I do like on-demand courses and that sort of thing, I think live training is the best way for a corporation to take a team and level it up. 
Jon Krohn: 00:43:25
Yeah. It’s the most impactful option, I think. It obviously means that you then have to get everybody aligned on timing, and in a non-pandemic world, the instructor also needs to be there physically. And I think that actually does make a big impact. Being physically there isn’t the be-all end-all, you can absolutely have entirely online trainings be effective, but when you’re there with the people, there are all kinds of questions that they might not raise their digital hand and ask, yet would ask if they’re talking to you at the side of a classroom or when you go to the water cooler or whatever. 
Matt Harrison: 00:44:11
Yeah. Definitely, I agree with that. Since March 2020, I have not done in-person training, but I’m curious to see what that’s going to look like. I mean, even some of my clients who were very much “we need people in chairs” have gone virtual. So it’ll be interesting to see how that progresses, whether it moves to quarterly all-hands where people are actually congregated. There is something to be said for looking at someone’s face, watching them do it, and then being able to help them directly, which is the most powerful. But we are where we are right now, so we make the best of what we have. 
Jon Krohn: 00:44:52
Yeah. I could see that being a really effective use of this hybrid scenario where people are mostly working from home, but we say, “Okay, let’s fly everybody in for a week for lessons on Pandas, intermediate Pandas lessons from Matt, and he’s going to be working with our data,” and that could be a good reason to get everyone together. And then, you mentioned how it’s half-days, and that’s probably because you can only learn for so many hours in a day; I think half a day is definitely the max. So you could have half a day of in-classroom instruction, and the other half of the day could be spent on projects or meetings or planning, or just going out for a team dinner, since you haven’t been able to do that for a quarter. So that makes a lot of sense to me. 
Jon Krohn: 00:45:34
Also, you mentioned something in there that I think is helpful for any learning with software in particular, which is that if you want to learn something really well, do it with your own data. So, you’re talking about that in a training context and how your trainings are more effective, if you are using the client’s data as opposed to just random data. And the same is true for you, when anybody is learning, it could be with an on-demand course or a YouTube video or a textbook. You’re going to get a much more in depth understanding if you import your own data and do some aggregating, pivoting, plotting with your data, as opposed to just the demo data. 
Matt Harrison: 00:46:17
Yeah. And so one of the things I did do with Effective Pandas that I didn’t do with Learning the Pandas Library is I added exercises. However, a lot of the exercises are actually more project-based and take this point of using your own data. Rather than saying, given some data that Matt likes, do this, it’s: using a data set of your choice, do this. That might seem like a cop-out, but if you actually think about it, to your point, if you have an interest in something, maybe it’s a hobby or maybe it’s data from work that you’re paid to be interested in, you’re going to take learning a lot more seriously. And once you start digging into it and applying it, your learning will be a lot more effective than just reading a book or listening to someone talk about it. 
Jon Krohn: 00:47:06
Yeah. 100%. Awesome general learning tips. And I’ve got one more question for you in that vein Matt. So when people want to learn anything, it could be a software development concept or data science concept, anything, do you have tips on how they can most effectively do that? 
Matt Harrison: 00:47:23
Yeah. This is a common question I’m asked quite a lot, and obviously I’m highly biased, right, because I’m in the education space. But as I mentioned previously, I think for a team, the most effective thing is to get everyone in the same room and sort of force them to do it. For individuals, I think a similar thing applies. If you look at the highest-performing individuals, look at maybe Steph Curry or presidents, they all have coaches, right? They all have someone who can help them. So if you can, find someone who can help you. For me, that’s been really beneficial in my career. It might be someone at work, it might be a friend of the family, or it might be someone that you pay to coach you. Go back to our skiing example, right? If you’ve never skied, it’s a really weird situation to be on basically really long planks that slide, because most people, when their feet are on the ground, they don’t slide. And so learning to ski can be a challenge. 
Matt Harrison: 00:48:33
And so you can read about learning to ski all you want, but when you get on the slopes, if you have someone there who can coach you, I can guarantee you that you will learn a lot faster. You’ll learn more than from reading a book if you have a good coach there to help you. And I think similarly, this applies to learning anything. If you have someone who’s a master at it, who can pick out your holes or where you need to practice, that can be really useful. And then there’s just a broad swath of learning opportunities right now, right? You’ve got books, you’ve got courses, you’ve got free things as well, like YouTube. I mean, you could spend your whole life just researching and listening to YouTube. But if you never apply it, if you never use it, that’s certainly not useful. So where I push someone is: if you want to learn something, and you have someone who’s a master who can take you along, that’s great. Otherwise, you want to think about the path you’re going on. And while random blog posts, or Stack Overflow, or random YouTube videos might be great, you really want to have some sort of track that is going to fill in knowledge gaps along the way. A book or a dedicated course that has thought about the path it’s going to take you along, I think, is more effective. But you also need to couple that with practice. You really need to take your own data or start a project and try it out. Because if you don’t, invariably what’s going to happen is it’s going to go in one ear and out the other. 
Jon Krohn: 00:50:15
Nice. Amazing tips, Matt. Coaches, and practically applying things, making sure that what you’re applying is on the path to where you’d like to go, as opposed to just randomly following things. I love both of those learning tips. At the beginning of this episode, we talked about how many bestselling books you’ve written and how some of those are self-published. And you mentioned how it’s long hours and you’ve got to stay motivated. There might not be a publisher or some other stakeholder on the project dangling a carrot in front of you or prodding you with a stick. So particularly when you are the primary stakeholder on a project that you take on, like writing a self-published book, what are your tricks for staying motivated and continuing to be productive over long periods of time? 
Matt Harrison: 00:51:05
Yeah, that’s a good question. And it’s a challenge. I mean, I feel like a lot of times I’m not motivated, but maybe here are some hints. One thing is that I’m sort of forced to, just because of the nature of my work, where I’m on my own, right? I don’t have anyone else doing anything for me. So if I want to continue doing what I’m doing, teaching and helping others, I need to go out and find my food, so to speak. And for me, as someone who is an engineer and who feels like they’re not great at selling, one of the most effective ways of getting myself out there is writing books. So that might be some advice for anyone who’s considering consulting or anything like that. It’s a lot different at a conference, back when we used to go to conferences, to say, “Hey, my name’s Matt,” versus saying, “Hey, I’m Matt, and I wrote the book on such and such topic.” 
Matt Harrison: 00:52:21
And so I think this is really a powerful hack for people who maybe want to go out on their own or want to start consulting: consider making a book. And again, that can be somewhat stressful, so maybe some hints for making a book. One thing, if you aren’t working with a publisher, would be to make some public commitments. These are just sort of hacks. If people know that you are making it, and you’re putting it out there, or you maybe make a landing page, that might be something that motivates you to start working on it or continue working on it. Another thing might be to limit the scope of it. Once you get into some of these things, you’re like, oh, there’s so much I could cover. I’ve already had people say the book doesn’t cover this. Yeah, it doesn’t cover geographic information systems with Pandas, and it doesn’t cover extension arrays in Pandas. So you might want to consider limiting the scope; that’s something else. Another thing: I’m a fan of deep work and limiting interruptions, so set aside time for when you can work on something. I’ve got four kids. So oftentimes, 
Jon Krohn: 00:53:42
No kidding. Wow. 
Matt Harrison: 00:53:43
Evenings are sort of set apart for family. So it might be that you get up an hour earlier in the morning for writing, something like that, where you can just work on it without distractions. So those are some hints. I guess one more would come back to the idea of deep work. For me, I’ve found that if I’m just cranking away, I burn out very quickly, so I do need to interrupt myself. It might be taking a walk or taking a nap. I do live pretty close to a ski resort, so if it’s a powder day, I might go skiing in the morning. So those are some ways you can split that up: I’m going to be productive at this point in time, but separate that out. As far as compensation goes, you should set your expectations at an appropriate level. And I would say, for most people, you would probably be paid more effectively doing consulting rather than writing a book. 
Jon Krohn: 00:55:07
Right. Certainly. 
Matt Harrison: 00:55:07
But for me, I’m playing the long game, right? And, 
Jon Krohn: 00:55:10
Right, right, right. 
Matt Harrison: 00:55:11
So for me, writing a book and getting myself out there is what brings me business, rather than just doing one-off consulting jobs, and it might pay off in the long run or bring in further clients. But that’s a trade-off that you might want to consider as well. 
Jon Krohn: 00:55:31
That long-term strategy, I think, is a great one, Matt, because the more books that you write, the more expertise that you demonstrate, and that probably means that your consulting rates go up. So maybe in the short term, in year one of book writing, you’re not maximizing your returns, because you’re spending so much of your time writing the book as opposed to doing consulting. But by year 10 or year 20, after there’s a series of bestselling books, or maybe not bestselling, but books that demonstrate a clear expertise in a focused area that you can consult on, that could mean you’re charging 10x per hour of consulting. And now all of a sudden your returns are exponentially more than they would’ve been in the scenario where you just tried to maximize consulting from the beginning. 
Matt Harrison: 00:56:31
Yeah. Again, in my course on book writing, I interview a bunch of authors. Some of them have made six-plus figures directly from their books, some of them have made significantly more than that, and some of them have made a lot less. So you do have these outliers where certain books are going to sell a lot, and that’s just the name of the game with content creation. But you’re absolutely correct. And this doesn’t just go for people who want to be consultants; I think it goes to the meta question that people might have of, how do I get a job, or how do I get a better job? Things like writing a book put you out there, but they also demonstrate your expertise. And it just greases a lot of the skids. If you want to interview some place, or do consulting some place, or even speak at a conference, oftentimes something like a book can push the odds in your favor. 
Jon Krohn: 00:57:46
Totally. Well, you’re preaching to the choir in my case, Matt, but those are some great tips for the SuperDataScience audience as well. So we do actually have some SuperDataScience audience member questions coming up for you, but quickly, before we get to that: you have a bachelor’s degree in computer science from Stanford, and we’ve alluded a couple of times in this episode to the fact that you have this computer science background and came into data science from that perspective. How has a formal computer science education been helpful to your career, as a data scientist specifically, and then maybe more generally as a business person and consultant? 
Matt Harrison: 00:58:37
Yeah. And I think there are generally two takes on this from the data science crowd, right? There’s the, I would say the more mathy statistical take, and then there’s more the programmatic take. And I don’t know that either one of them is necessarily correct. I can only speak from my experience, which is going to be from the programming side. Now, 
Jon Krohn: 00:58:58
I actually really like the joke that a data scientist is somebody who isn’t good at statistics or programming, which is definitely what I am. 
Matt Harrison: 00:59:07
Yeah. I would say that my statistics is probably lacking quite a bit, and some people might see that as a huge flaw. But I would say that my software engineering is probably significantly better than most data scientists’. So there are gives and takes there, right? What I’ve seen, and I think I alluded to this previously, is that a lot of people who are data scientists don’t necessarily want to be programmers, but basically they are programmers. I mean, you could say the same thing about Excel users who are using Excel in some capacity, doing VLOOKUPs. And so there are certain practices that, if you adopt them, are going to make your life easier and make collaborating and working with others easier. I get that you don’t want to be a programmer or whatnot, but once you start having to work with others, you really want to start adopting some of these practices. 
Matt Harrison: 01:00:14
So I think it’s certainly useful for data scientists to have some software engineering background. Some things that would be useful for a lot of them: learning to understand Git to manage source control. I think a lot of people I’ve seen have very limited exposure to the command line, and in the Python world, the command line is basically used all over the place. It’s not that you have to have that, but you are going to make things harder for yourself if you don’t have some basic command line usage there as well. And then there are general programming practices that I see violated all the time. Probably the most egregious one, and Jupyter sort of encourages it, is using globals all over the place, which, to a trained software engineer, is a huge no-no, but we sort of overlook that in Jupyter land. 
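As an editorial aside, one way to tame that globals habit in a notebook is the function-per-step layout Matt described earlier for chaining. This tiny sketch (the function and column names are invented) passes data explicitly instead of mutating a shared global:

```python
import pandas as pd

# Instead of mutating a global `df` across many cells, each step
# takes data in and returns data out, so cells can run top to bottom
def load():
    return pd.DataFrame({"x": [1, 2, 3]})

def double_x(df):
    return df.assign(x=df["x"] * 2)

result = double_x(load())
```

Because nothing depends on hidden cell-execution order, the notebook restarts and runs cleanly from the top.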
Jon Krohn: 01:01:24
Yep. Those are great tips. So Git, the Unix command line, and avoiding global variables. On the note of the Unix command line, there’s an interesting episode, number 531, with Jeroen Janssens. It is specifically on data science at the command line; actually, he’s another O’Reilly author and has written a book of that title. And that is a really interesting episode. It isn’t from the perspective of, hey, you should know Unix because Python is going to be running on top of it and it’ll be easier in some cases. It’s from this interesting perspective of Unix being a glue between all different kinds of programming languages. So you can use it flexibly, as kind of a Rosetta Stone between different programming languages. You can learn some Pandas for some specific task, you can learn some R package for some plotting task, and you can have Unix as a glue, blending everything together. So, kind of a cool episode. Anyway, those are really great tips, Matt. And yeah, I can see how programming best practices would definitely be useful for collaborating with colleagues, so I really appreciate those tips. 
Jon Krohn: 01:02:43
All right. So let’s move on to some audience questions. We got some great ones here on Twitter for you. Some of them are quite pragmatic and some of them are quite high-level, so let’s start with one that’s pretty high-level. This user here, Jagriti, is wondering where a data science enthusiast should begin if they’re just getting started. 
Matt Harrison: 01:03:17
Yeah. The infamous question. You ask 10 data scientists, you’ll get 10 different responses. Where should you begin? Again, the term data science is so broad that it’s kind of a challenge, right? I mean, I think we’re finally seeing roles come into play where you have, like, a machine learning engineer, or someone who’s more interested in DevOps. So one thing that you might want to consider is, where do you want to focus? If your thing is diving deep into the data, then maybe you focus on understanding how to slice and dice data, and you start learning about making some models and doing some visualizations. If your thing is more like deploying and monitoring, you might want to pick up more of the programming skills. I think a hard thing, Jon, is that a lot of people just hear that data scientists make a lot of money, so they want to do that, but they really don’t know what they want to do. 
Jon Krohn: 01:04:30
Right. 
Matt Harrison: 01:04:30
Right. So that’s sort of what [crosstalk 01:04:32]. 
Jon Krohn: 01:04:32
They want to make money, Matt. They know what they want to do. They want to make a lot of money. 
Matt Harrison: 01:04:37
Yeah. So, I mean, you can probably interview people. This would be what I would say in pre-pandemic life: leverage your network. In the past, I would say a killer tip would be to go to some meetups. Like in Salt Lake, 
Jon Krohn: 01:05:00
Totally. 
Matt Harrison: 01:05:01
There are Python meetups, there are data science meetups, there are Python data meetups. And I would say, go to those and throw yourself in. Don’t just sit back, but actually introduce yourself and say, hey, I’d like to talk to some people because I’m a student, or I’m interested in breaking into this. Generally, there are a lot of people there who are hiring, and there are a lot of people there who are willing to talk, because that’s why people go to these things: there are these introverted nerds who at some point need to demonstrate that they can talk and interact with others. But that would give some insight into what people are really doing. So: interviewing, figuring out what you really like. And then I go back to that best practice for learning. I’ve got a blog post on this, on what’s the best way to learn Python in 2021, but I think you could adapt it; you could replace Python with whatever you wanted to. At some point, after you’ve figured out what you want to do, you need to come up with this path and game plan of breaking in there. So what are some things that might get you in the door? Maybe I’ll speak broadly rather than just data scientist versus data engineer. A lot of times, people will want to see a degree or some sort of pedigree. So that may or may not be your case. You do have the bootcamp option. 
Matt Harrison: 01:06:37
I’m going to say, if you go the bootcamp route, you’re going to have to have a lot of projects and be able to talk about those proficiently, such that they’re basically like, we don’t have any doubt that you’re the equivalent of someone who has a degree. I see a lot of people who are coming from PhDs in fields like physics or math, and they’re like, basically, I don’t want to wait for a prof to die, so I’m going to pivot over to data science. So again, a lot of these people have the math, but they’d need to have some sort of portfolio, right, to demonstrate to someone that they’re serious about it. There’s a lot of projects that you can do, right? There’s Kaggle, you can take my book and start working out the assignments. But I would say, come up with a project that demonstrates proficiency. I mean, you can also be a little bit more direct if you already have certain companies that you want to work with. You might go out and do a project that’s semi-related to what they’re doing exactly. And then sort of, 
Jon Krohn: 01:07:45
Great idea. 
Matt Harrison: 01:07:45
Link to someone in LinkedIn, or try and connect with someone and then like pitch that to them directly or ask them for feedback on that. 
Jon Krohn: 01:07:52
Great idea. I love that. 
Matt Harrison: 01:07:55
The least effective thing is just to like, say, I’m going to send out a bunch of resumes. 
Jon Krohn: 01:08:00
Yeah. 
Matt Harrison: 01:08:02
That can work, but basically you’re playing a numbers game then, and it’s not effective. So you want to network. Again in the past, meetups were a great way to do that. There are virtual meetups. LinkedIn might be a great way to do that. Even Twitter might be a great way to do that and get your foot in the door and then start demonstrating proficiency. 
Jon Krohn: 01:08:23
Nice. I love all of that guidance. It was a big open-ended question and you gave an excellent answer with lots of different options that are appropriate to the disparateness of the field. And actually that ties back to something you were just talking about, the computer science education you have and the relative software engineering strength that you have as a data scientist. Something that I could’ve mentioned then and will mention now is that because there is such a broad range of specializations in data science, it lends itself well to people coming from all different kinds of backgrounds. And so there are lots of opportunities for people, wherever you’re coming from. Just get started. 
Matt Harrison: 01:09:05
And maybe one more thing is don’t get frustrated and maybe this might sound pessimistic, but don’t put all your eggs in one basket. Just because someone rejects you does not mean that you’re not qualified or whatnot. I’ve been rejected at a lot of companies, and so it is still a numbers game, but if you’re like, “My dream job is to work at such and such company,” and then you interview with them and they reject you, that’s not the end of the world. Keep going on. Keep applying. And you might have had a bad day, they might have had a bad day, it’s not the end of the world. So that can be a challenge because a lot of people are like, “Well, I’ve interviewed at this place and they rejected me and so I’m not qualified.” That doesn’t mean anything. Again, various roles, how people describe various roles, mean different things to different people. 
Jon Krohn: 01:10:04
Yeah. Keep at it. And actually on the flip side of that, if you aren’t getting rejected from anything, then you are not aiming high enough. 
Matt Harrison: 01:10:12
Yeah. 
Jon Krohn: 01:10:13
If you’re always being accepted to everything that you do, or you’re always succeeding at everything that you try, you’re not taking big enough risks. 
Matt Harrison: 01:10:20
That’s a great way to look at it. I love that. 
Jon Krohn: 01:10:24
All right. So then following on from that general data science question, there is one I think you might have a very specific answer to this, Adil asks, how do you master pandas? What should be the essential steps to be taken to get to that level? 
Matt Harrison: 01:10:41
Yeah, I mean, highly biased, right? So, I mean maybe I’ll just give general points rather than just say, “Read my book and do that.” But one thing that you need to be aware of is that from my point of view, after I wrote my first book about pandas, and I think when someone geeks out on something, all of us have things that we geek out on, it might be skiing, or it might be pandas, or it might be whatever, but you tend to study up on that. And my phone pops up anything that has pandas. It’s like, “Here’s 50 medium articles about pandas for the last week,” so I might scroll through those. So a lot of what you see in that is regurgitated content and a lot of that is bad regurgitated content. So again, that’s part of the reason why I put my book out there because there’s a lot of people who want to start blogging and they kind of don’t know what they’re doing so they start writing blog posts. It’s like, “Okay, how to use pandas,” but the content they put out is actually not good advice. 
Matt Harrison: 01:11:53
So I would recommend trying to go to a more authoritative source rather than just random blog posts. So, that might be a book like mine, it might be taking a course, it might be going to the pandas documentation as well to get sort of best practices from experience rather than just the blind teaching the blind. But again, it comes back to practicing. And if you just read about it, I think most people’s brains are going to forget about it unless they do some sort of spaced repetition to keep it in there. So the science tells us that if you practice it and physically type it out, you’re going to make different connections in your brain and you will remember it better. So get good information, practice it, and then don’t be afraid to revisit it or review it or share it with others. We talked about the importance of a mentor, someone who can guide you, and don’t be afraid to ask for feedback on that from someone who’s more experienced as well. 
Jon Krohn: 01:13:11
Nice, great tips. Again, definitely I can confirm that doing something for yourself, especially if it’s not just taking, as we already talked about earlier, the data that are provided to you in the example, but pulling in your own data. Then you could do something very similar. You could use the same method as you’re going through in the tutorial, but use it on your own data and then maybe just change a few of the arguments. Maybe you’ll have to change a few of the arguments because your data are different, but yeah, just experiment a little bit and you will quickly pick things up. And I can recommend, because you’re being too modest Matt, but Adil, you could definitely work through the Effective Pandas book by Matt Harrison. You won’t be a master just by reading, but if you go through it and apply everything that you’ve read on your own data, then you’ll be pretty close to being a master. 
Matt Harrison: 01:14:10
Oh thanks, Jon. I hope so. 
Jon Krohn: 01:14:12
All right. And then Vipin, I don’t know if this is a question, I’m going to be interested to hear what your answer is. Vipin has a very, after those kind of two big broad questions, there’s one here that I think might be so specific it’s kind of impossible to answer, but maybe. You’re the expert, so you might be able to cut right into it, which is, “How many rows and columns can a pandas data frame load?” 
Matt Harrison: 01:14:34
58. Oh no, 42, 42. 
Jon Krohn: 01:14:36
Ah yes, 42. 
Matt Harrison: 01:14:43
Interesting question. Maybe I can answer that in a different way that might be a little bit more roundabout, but I like to term pandas as a small data tool and what I mean by small data is data that will fit on a single machine. Some people call that different things, but I just call that small data versus, from my point of view, big data is multiple machines. So what small data means to you is somewhat context dependent, right? I mean, 10 years ago, small data was like eight gigs or 16 gigs or something, but nowadays you can get, I mean my laptop has 64 gigs in it, but you can go out to the cloud and you can rent a computer that has many multiples of that in it as well, right? 
Matt Harrison: 01:15:28
So that’s sort of your first constraint, but some other things that you need to take into account again is that pandas is small data, so it has to keep things in memory. But due to the nature of how pandas works, in that generally we’re doing these chained operations that return data frames along the way, you’re going to have to have some overhead for copying your objects. And some people are probably going to say, “Well, Matt, if you just use this inplace parameter on your methods, you’ll get around that.” For anyone who’s saying that, if you open up the source code for most of these operations that have inplace, they actually make a copy under the covers and then shim it in. So inplace really doesn’t do what you want, and there’s actually an open issue with the intent to remove it completely from pandas, just because most people see it and it does not reflect what’s really going on with pandas. So, I like to say you need to have three to 10X the amount of memory as the size of your data set. 
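To make that copying point concrete, here is a minimal sketch (the DataFrame and the choice of `rename` are illustrative, not from the episode): `memory_usage(deep=True)` reports a frame’s actual in-memory footprint, and `inplace=True` returns None while many such methods still build a copy under the covers.

```python
import pandas as pd

# Illustrative frame -- small on purpose; the point is measurement, not scale.
df = pd.DataFrame({"a": range(1_000), "b": ["x"] * 1_000})

# deep=True counts the actual bytes behind object (string) columns,
# which is what matters when budgeting 3-10x your data-set size.
total_bytes = df.memory_usage(deep=True).sum()

# inplace=True returns None, and under the covers many such methods still
# build a new object and shim it in -- so it rarely saves you the copy.
result = df.rename(columns={"a": "c"}, inplace=True)
assert result is None
assert "c" in df.columns
```

This is also why inplace methods break the chaining style: returning None means there is nothing to chain the next method onto.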
Jon Krohn: 01:16:32
Gotcha. 
Matt Harrison: 01:16:32
That might be somewhat problematic, but again, remember one of my other hints was use the correct types. So another thing that you can do that I’ve seen, oftentimes, especially if you have a lot of string data, especially if it’s low-cardinality categorical data: by changing those strings to categories, you can save a huge amount of memory that way. Also, for numeric data, if you have integers, by default pandas is going to use an eight-byte integer to represent them. If you have smaller integers, you can use smaller sizes, and you can do the same thing with floats as well. So I have seen cases where you load your data set as a raw CSV, and by doing a few of these changes, you get it to like 5% of that size without any loss of fidelity of the data, and that’s not even considering whether you need to filter out some of the columns or rows. 
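A rough sketch of those two dtype tricks (the column names and sizes here are made up for illustration): low-cardinality strings become categories, and small integers get downcast from the default eight-byte int64.

```python
import pandas as pd

n = 99_999  # divisible by 3 so the repeated list lines up
df = pd.DataFrame({
    "state": ["UT", "CA", "NY"] * (n // 3),  # low-cardinality strings
    "count": [1] * n,                        # small ints, stored as int64 by default
})
before = df.memory_usage(deep=True).sum()

# Strings -> category: each distinct value is stored once;
# each row holds only a small integer code pointing at it.
df["state"] = df["state"].astype("category")

# int64 -> the smallest integer type that holds the values (int8 here).
df["count"] = pd.to_numeric(df["count"], downcast="integer")

after = df.memory_usage(deep=True).sum()
# `after` is a small fraction of `before`, with no loss of fidelity.
```

The same `downcast` parameter accepts `"float"` for float columns, per the `pd.to_numeric` documentation.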
Matt Harrison: 01:17:27
So those are some things that you can do, but do remember that pandas, the library itself, is a small data library. Now I can sort of side note briefly on that. What we’re also seeing these days is there’s not only pandas the library, but there’s now pandas the API, and the Python and data science community has basically standardized on: if you’re using Python, then we want to provide a pandas-like API. And so you see that in a tool called Dask, you see that in a tool called Modin, you see that in a tool called Spark. Spark, with I think 3.2, just recently merged a commit which added the pandas API. The idea being that pandas, for better or for worse, is the de facto API for data manipulation. And so a lot of banks, bio companies, insurance companies who are using pandas and now need to start scaling out want to leverage that pandas code, and so these other platforms are offering them a path to do that. 
Jon Krohn: 01:18:43
Yeah. Super cool answer. I am blown away that you were able to go into that level. I mean, I shouldn’t be surprised, but I am blown away at the level of detail that you went into on answering what sounded like a simple question, how many rows and columns can a pandas data frame load. And not only did you answer the question, but you also gave us lots of cool tips on how to maximize how much data you can fit on a given single machine. But now digging into that question about going from small data with just a single machine to big data with multiple machines processing your data or holding your data, our final question is from Alex Monahan, who actually has specific questions about these alternatives. So you mentioned some of them; he also mentions Dask, Modin. You mentioned Spark. He also mentions Vaex and Arrow. So he says, “Do you have any thoughts on some of the larger than memory, pandas like alternatives, which ones are easiest to write clean and effective code with, which ones scale most easily?” 
Jon Krohn: 01:19:56
So it’s interesting, you kind of actually already answered one of those questions because it sounds like some of these, including with the Spark 3.2 update that you mentioned, will use the pandas API, so that sounds like they could potentially be some of the easiest to write clean and effective code with. But I’ll leave it for you to answer, so which ones are the easiest to write clean and effective code with and which ones scale most easily? 
Matt Harrison: 01:20:18
Yeah, that’s a great question, and I’ll caveat this with: this is a sample size of one. In my consulting and training engagements, I’ve had limited experience with Dask and Spark, certainly not anywhere near the same as my small data experience. So I don’t consider myself a big data expert, but let me just maybe give a little bit more insight into that. So I have a list of alternate platforms of pandas here on my desk, and I count one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, at least 13 different alternate platforms where it’s pandas the API versus pandas the library that sort of started this out. So as far as scaling out, I think for your most robust solutions, you’re going to be looking at Spark, you’re going to be looking at Dask, and I would say Modin as well. And let me maybe talk about some pros and cons of those. Spark, if you’re not familiar with it, is written in Scala on the JVM, but because people are doing a lot of data science in Python, they do have a Python interface for that. I think the pandas API is kind of like their third iteration of a data frame like API. So I have not actually used that in happiness or in anger, but that was merged there. My understanding is it’s not 100% compatible, but a good chunk compatible. But my understanding is that you have pretty good scale out there as well. 
Matt Harrison: 01:22:12
Dask has a pandas API that is pretty good as well, and then you have Modin. My understanding, actually from speaking to some of the Modin folks, is that their goal is a 100% API implementation, to the extent that they will even replicate bugs in pandas. The intent there is that some of these pandas users that are industrial strength or big enterprises do want to scale out, and Modin is looking to be the tool for them, to say just drop your code in here. Rather than “we’ve got 90% of the API,” they are so committed to the API that they are even writing the same bugs, so that your code won’t magically behave differently on Modin than on pandas. There are some other ones as well. A new one I just heard of the other week was Terality, which I guess is a service doing serverless pandas scaling, right? I don’t know the extent of the API conformity there, but I mean, all of these are offering pandas APIs, so it is certainly a space to watch, and I think competition is good in general because it makes everyone play harder and play better. 
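The “drop your code in” idea can be sketched like this. Modin here is an assumption (used only if it happens to be installed), and the sketch falls back to plain pandas, so the same analysis lines run unchanged on either engine:

```python
# Same code, swappable engine: if Modin is installed, `pd` is the scale-out
# implementation; otherwise it is plain pandas. The pandas API is the contract.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Everything below is ordinary pandas-API code.
df = pd.DataFrame({"group": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
out = df.groupby("group")["val"].sum()
```

Dask’s dataframe module follows a similar pattern via `import dask.dataframe as dd`, though its API coverage is a subset rather than a full drop-in.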
Matt Harrison: 01:23:48
But from my point of view, that just makes investing in pandas a good investment right now, because as you said, Python is the most popular data science language and people can be upset at me for saying that, but I think the numbers play that out and pandas is probably one of the most popular, if not for structured or tabular data, the most popular tool for doing that. So I think it really, if you are in that place where you have database tables or spreadsheets and you need to start using Python, pandas is the place to go right now and an investment in pandas is a wise investment. 
Jon Krohn: 01:24:30
Yep. And I think it’s a time investment as opposed to trading advice. It’d be fun if you could trade in open-source libraries. 
Matt Harrison: 01:24:39
Yeah, you might have to say this is not investment advice, right? The disclaimer. 
Jon Krohn: 01:24:43
Exactly. Bet all your money on pandas. It’s the sure thing. 
Matt Harrison: 01:24:52
I’m starting a pandas ICO, a pandas NFT. 
Jon Krohn: 01:24:57
Nice. I’m sure it will skyrocket. So we’ve gotten through the audience questions, Matt. My only last real question for you is if you have a book recommendation for us? 
Matt Harrison: 01:25:13
Yeah. So I’m going to offer this book here, The Programmer’s Brain, which I think is probably relevant to almost everyone reading this or listening to this. The Programmer’s Brain is not language specific or job specific, even though it says programmers, but again, I think people who are listening to this should be interested in it. It’s basically taking a step back and looking at how we learn, how to grok code, how you think about code, how your brain works. I think those are good things to do. It’s got some really good assignments that are basically reflections. I tried to do some of this recently when I did the Advent of Code, which is a coding competition in December, and I went through and actually documented every error that I made in doing that and sort of classified those types of errors, just as a meta exercise in being more mindful about how you code and how you create things. And so, I think for anyone who’s working with a computer, this is a great book to read, and it’ll give them a chance to think about how they’re doing that. I think a lot of times we don’t even think about it, but if you start thinking about it, I think that can give you some insight into what you’re doing and may even make it so you do things in a more effective way. 
Jon Krohn: 01:26:46
Nice. That’s a really cool recommendation, Matt, and I had not heard of that book. It sounds super interesting. Sounds like a pragmatic tip for data scientists or anybody interacting with code. All right, and then it’s abundantly clear, Matt, that you are a go-to expert on any topics related to Python, particularly the pandas library. How can people stay in touch with you and stay up to date on your latest, your book releases, your thoughts and so on? 
Matt Harrison: 01:27:17
Awesome. Yeah, I’m on Twitter. So dunder M Harrison, __mharrison__. I’m on LinkedIn, happy to connect with people on LinkedIn. I also have a newsletter. If you go to metasnake.com, I have a newsletter there. People can connect with me there. So again, I tend to share, not a lot of cat pictures, but more code pictures. So definitely try and provide a lot of good content on those platforms. 
Jon Krohn: 01:27:51
All right, so for those SuperDataScience listeners looking for lots of cat photos, sorry, you’ll have to wait until our special cat episode coming up later. I don’t know. All right, Matt, thank you so much. It’s been so much fun having you on the show. I’ve learned a ton. I have no doubt that our listeners did too. Thank you so much for taking the time. 
Matt Harrison: 01:28:14
Yeah. Thanks Jon. Let me offer your listeners a discount. So I’ll give you a discount code and that will work on my pandas bundle or on the Effective Pandas book. The bundle includes the book and some courses on pandas. 
Jon Krohn: 01:28:29
Awesome, yeah. So we will put that in this show notes so that you can use that special discount code. What kind of percentage are we talking about here, Matt? 
Matt Harrison: 01:28:37
Oh, let’s do 30% off. 
Jon Krohn: 01:28:40
Wow. All right, 30% off, there you go. Amazing Matt. 
Matt Harrison: 01:28:46
Let’s do 42% off. I’ll do 42% off. 
Jon Krohn: 01:28:49
Oh my goodness. 
Matt Harrison: 01:28:51
Just for SuperDataScience listeners. 
Jon Krohn: 01:28:53
All right. That’s incredible. 42% off for SuperDataScience listeners. That’s amazing. Thank you so much, Matt. Thank you for being on the show and we’ll catch you again soon. 
Matt Harrison: 01:29:02
Yeah, my pleasure. Thanks Jon. 
Jon Krohn: 01:29:09
I thoroughly enjoyed filming this episode with Matt and I learned a ton. I hope you did too. In today’s episode, Matt filled us in on his six top tips for pandas, namely chaining, working with the raw data, executing Jupyter notebooks from top to bottom, avoiding apply, typing the columns of your data frames, and becoming adept at aggregating and pivoting data. He also talked about how to learn anything effectively by having a coach and applying what you’ve learned to the broader path you’re trying to follow. He talked about how programming best practices are useful for collaborating in data science. He provided his recommendations to become familiar with Git version control and the Unix command line, and to avoid global variables in your code. And he filled us in on how Spark, Dask, and Modin could be great options for scaling the processing of tabular data to multiple machines, particularly if you’d like to remain within the comfort of programming to the pandas API standard. 
Jon Krohn: 01:30:01
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Matt’s Twitter and LinkedIn profiles, as well as my own social media profiles at www.superdatascience.com/557. That’s www.superdatascience.com/557. If you’d like to ask questions of future guests of the show like several audience members did of Matt during today’s episode, then consider following me on LinkedIn or Twitter. That’s where I post who upcoming guests are and ask you for your thoughtful inquiries. All right, thank you to Ivana, Mario, Jaime, JP and Kirill on the SuperDataScience team for managing and producing another deeply educational episode for us today. Keep on rocking it out there folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 