SDS 827: Polars: Past, Present and Future, with Polars Creator Ritchie Vink

Podcast Guest: Ritchie Vink

October 15, 2024

Ritchie Vink, CEO and Co-Founder of Polars, Inc., speaks to Jon Krohn about the new achievements of Polars, an open-source library for data manipulation. This is the episode for any data scientist on the fence about using Polars, as it explains how Polars managed to make such improvements, the APIs and integration libraries that make it so versatile, and what’s next for this efficient library. 

Thanks to our Sponsors:
Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.
About Ritchie Vink
Ritchie Vink is the author of the Polars DataFrame library and query engine. He has been working as a software engineer and machine learning engineer for 8 years. Before he started Polars, he did many side projects on various topics in computer science and statistics.
Overview
We covered Polars’ incredible speed improvements over Pandas in a previous Super Data Science Podcast episode (episode 815) with Marco Gorelli. Ritchie takes this story further, detailing the API options for eager and lazy execution, as well as how Polars might soon be able to manage massive distributed datasets. He explains that Polars builds on Arrow, a memory specification that lets processes share data without the expensive serialization required by Python’s multiprocessing package. For this reason, Ritchie calls Arrow the “de facto standard for sending and sharing data over the wire.” [24:58]
Jon also asked Ritchie about his position on Moore’s Law and its potential negative influence on Polars’ ability to keep increasing its efficiency. Ritchie pointed to growing core counts and single instruction, multiple data (SIMD) as two developments that may help alleviate the plateaus predicted by Moore’s Law. Increasing core counts enable better scalability, including across multiple machines, which can enhance performance in distributed computing. SIMD, meanwhile, allows for faster data processing within individual threads, reducing inefficiencies we might expect in parallel operations.
As CEO of Polars Inc., a company that has recently secured $4 million in seed funding, Ritchie also explained how essential scalability is to a tech company. To this end, Ritchie and his team are building an entirely new Polars engine, also open source, as well as Polars Cloud. With the cloud offering, Ritchie aims to give enterprises fault tolerance and ease of use, with support for partitioning data and handling larger datasets.
Listen to the episode to hear Ritchie’s advice for first-time users of eager and lazy execution in Polars, what composable expressions are in Polars and why they matter for anyone transforming data, and how Ritchie plans to add to the ever-growing Polars ecosystem.
In this episode you will learn:
  • Why Polars is so efficient [05:20]
  • Polars’ easy integration with other data-processing tools [21:23]
  • Eager vs lazy execution in Polars [32:15]
  • Polars’ data processing of large- and small-scale datasets [38:28]
  • Ritchie’s plans to scale his company [46:14]
  • Upcoming features in Polars [58:06] 

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 827 with Ritchie Vink, creator of Polars. Today’s episode is brought to you by epic LinkedIn Learning instructor Keith McCormick, by Gurobi, the Decision Intelligence Leader, and by ODSC, the Open Data Science Conference. 
00:00:22
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple. 
00:00:53
Welcome back to the Super Data Science Podcast. Today, we’re super fortunate to have Ritchie Vink, creator of the extremely hot Python DataFrame library, Polars. Ritchie is CEO and co-founder of Polars Inc. as well. It’s his startup that has raised $4 million in seed funding to support the Polars open-source project. He previously worked as an ML engineer, data scientist and data engineer at companies like Adidas and Royal Dutch Airlines. That’s KLM. He holds a master’s in structural engineering and worked as a civil engineer prior to catching the data science bug. 
00:01:28
Today’s episode will appeal most to hands-on practitioners like data scientists and ML engineers. In today’s episode, Ritchie details how Polars regularly achieves 5 to 20x, sometimes even 100x speed improvements over Pandas for most DataFrame operations. He talks about the Eager and Lazy execution APIs Polars offers, and when you should use one or the other. He describes his vision for scaling Polars to handle massive, distributed datasets. And he talks about how we can continue to make data processing efficiency gains even as Moore’s Law slows down. All right, are you ready for this efficiency increasing episode? Let’s go. 
00:02:11
Ritchie, welcome to the Super Data Science Podcast. It’s great to have you on the show. How’re you doing today? 
Ritchie Vink: 00:02:17
Hey, thanks. I’m doing great, thanks. How’re you doing? 
Jon Krohn: 00:02:21
I’m doing great too, yeah. It’s an honor to have you on. I just said before we pressed the record button that it’s such a surreal thing for me to be able to have rock stars like you, the brains behind the most exciting software projects in the world, as guests on the show. So, yeah, I’m doing great. Where are you calling in from today, Ritchie? 
Ritchie Vink: 00:02:42
I’m calling from my home, which is in Maarssen, which is very close to Utrecht in the Netherlands. 
Jon Krohn: 00:02:48
Say Amsterdam? 
Ritchie Vink: 00:02:50
Amsterdam is also very close. Yeah, it’s in the Netherlands. 
Jon Krohn: 00:02:53
What was the word you said? 
Ritchie Vink: 00:02:54
Utrecht. 
Jon Krohn: 00:02:55
What’s the city? Oh, Utrecht. Okay. 
Ritchie Vink: 00:02:57
Yes. Amsterdam is 20 minutes by car. 
Jon Krohn: 00:03:00
Nice. Yeah, that’s the easy one for me. I think, yeah, four trips to the Netherlands and I’ve only been to Amsterdam, actually, central Amsterdam. 
Ritchie Vink: 00:03:11
Really? 
Jon Krohn: 00:03:12
Should I be making a trip out to Utrecht next time? 
Ritchie Vink: 00:03:15
Yes, I think you’ll like it. If you like Amsterdam, Utrecht has some unique canal structure, which is also nice. It’s different from Amsterdam. It’s a bit smaller though.
Jon Krohn: 00:03:26
Awesome. I will have to check that out. So, we were introduced by Marco Gorelli, who is a developer of Polars. He was our guest in episode number 815, and that episode had rave reviews. People loved hearing about Polars. It had some of the most engagement this year on the social media post where I announced the episode.
Ritchie Vink: 00:03:49
Wow. 
Jon Krohn: 00:03:52
You commented or something on the episode, and Marco had recommended you as a guest. He said, “You should have Ritchie on the show. You should have the creator of Polars on the show.” And so, my initial thought, we had an initial exchange where I reached out and I said, “I’d love to do another Polars episode, but let’s wait a few months so that we’re spacing out the Polars episodes.” But then I realized that actually having them close by is great because then Marco’s episode is still fresh on people’s minds. That was about a month ago that it came out, episode 815, and we’ve been careful with the topics that we’ve curated for this episode to complement that episode as opposed to being duplicative. So, it should be great. 
Ritchie Vink: 00:04:38
Yeah, so I can skip the context and dive right in.
Jon Krohn: 00:04:43
Yeah, nice. So, you’re the author of Polars. It’s a DataFrame library that took the data science world by storm by being 5 to 20 times faster than Pandas for most operations. In Marco’s episode, he talked about some circumstances where he’s seen 100x speed-ups. But yeah, 5 to 20x faster for most DataFrame operations relative to Pandas, which is the incumbent open-source DataFrame framework within Python. And in addition to being faster, there’s also much less memory use.
00:05:19
So, Ritchie, what is the secret sauce? Or I guess it’s not so secret because you have blogged about it. What is the not-so-secret sauce behind Polars being so much faster and more memory efficient relative to the incumbent out there? 
Ritchie Vink: 00:05:36
For relational data processing, DataFrame-like data processing, there are a few things you can do. Relational data processing is actually pretty old. It’s what databases have done for decades, and it’s what SQLite does, and it’s what ClickHouse and Snowflake do. So, all these databases exist, and they have different performance characteristics for various reasons. For instance, SQLite is row-oriented, which is great for transactional processing.
00:06:22
Transactional data processing is when you have a database and you use it to do transactions. For instance, if you buy a product, you update that row and then you need to check if that transaction has succeeded or not. Otherwise, you have to roll back. That’s one application of databases. 
00:06:45
Another one is analytical data processing, and that’s more where Polars, Pandas or Snowflake come in. And in that case, doing things columnar is way faster. Columnar means that you process data column by column. This is something Pandas does as well; it’s based on NumPy. But there are other things you need to do, which Pandas has ignored, and that’s multiprocessing or, to be precise, multithreaded parallel programming. I mean, my laptop has 16 cores available. I want to use them. It’s a waste of those resources if you only use one core for expensive operations like joins or group bys. 
00:07:37
The other one is that Polars is close to the metal. Pandas has just taken NumPy, which was meant for… For numerical data analysis, it’s great. But when you had string data before NumPy took over, there wasn’t a really good solution for that. And if you talk about nested data like lists and structs and arbitrary nested data, Pandas actually gave up because it just used the Python object type, which means, “Hey, we don’t know what to do with this. We let the Python interpreter see what to do with this.” And in that sense, Pandas took NumPy and built on top of that. NumPy was never really meant to build a data processing tool like a database on top of that. 
00:08:35
And Polars is written from scratch. It’s written from scratch in Rust. And every performance-critical data structure, we control. And by that control, we can have very effective caching behavior, very effective resource allocation, very effective control over memory. That’s the most important part, because a lot of compute cost comes down to control over memory.
00:09:21
And then the third one, which I think is very important, is that we also use an optimizer. If you look at databases, they, A, can be really fast because of how you write the code, how you write the kernels that execute the compute. But there’s also an optimizer, and this optimizer will make sure you only do the compute that’s needed. And this is very similar to what a C compiler does. If you write your C, you can be sure that the computer will never execute the code exactly as you’ve written it. There will be a compiler in between that will try as hard as possible to prove that it doesn’t have to do certain kinds of work. And that’s actually quite similar to data processing. If you don’t need to load a column, it saves a whole IO trip. It saves a whole resource allocation. So, this can save a huge amount of work. 
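To make that concrete, here is a minimal sketch in Python of a lazy Polars query (the file name and column names are hypothetical). Because nothing runs until collect(), the optimizer can push the filter and the column selection down to the scan, so only the needed rows and columns are ever read:

```python
import polars as pl

# Build a lazy query; no data is read yet.
lazy = (
    pl.scan_parquet("sales.parquet")     # hypothetical file
    .filter(pl.col("year") == 2024)      # predicate pushdown candidate
    .select("store", "revenue")          # projection pushdown: only these
)                                        # columns (plus "year") get read

# Inspect the optimized plan before running anything.
print(lazy.explain())

df = lazy.collect()  # now the optimized plan actually executes
```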
Jon Krohn: 00:10:26
Yeah. Let’s dig into that Rust one a little bit more. So, were you a Rust developer before you started working on Polars, or did you see an opportunity in Rust specifically for tackling this problem? How did this come about? Did you notice there was this problem that, okay, the kinds of issues that you described where Pandas, for example, was based on NumPy and you’re like, “Oh, for string operations and lists, it’s not going to be efficient. We’re going to have to start from scratch”? Did you notice that opportunity, and then you’re like, “Okay, I’m going to need to learn Rust”? Or you just already knew Rust and you were like, “That’s going to give me lots of memory and concurrency advantages relative to doing this in Python or C++.” How did that come about? 
Ritchie Vink: 00:11:12
No, it wasn’t top-down. It wasn’t from observing Pandas and thinking, “Hey, this could be better.” Actually, I was jumping on the hype train that Rust was back in the day, a little bit later, I guess. I think five years ago, a friend of mine said, “Hey, look at this Rust. It’s very cool.” And I came from a Python data science background, and I thought, “Ah, I don’t need that low level of language.” But I always dabbled a bit with functional programming and wanted to learn more languages just for the fun of it. And so, I also learned Rust. 
00:11:52
In the beginning, what I did was, if I had an algorithm for some data machine learning use cases, I implemented that algorithm in Rust if it was slow in Python. And I found it super powerful because you could implement that algorithm and make a Python binding to it, and you had a very fast library, which you could use in Python in an interactive manner with the performance benefits of Rust. But that still didn’t give me the idea for a DataFrame library. When I was getting more mature in Rust and I was implementing a web server for my work back in the day, I needed to join two tables that were on disk. And I thought, “Let’s take the DataFrame library from the Rust ecosystem,” which didn’t exist.
00:12:52
So, I thought, “Oh, maybe I can build this join algorithm from scratch. How does it work? How do you write a join algorithm?” So, I wrote one, finished it. It worked, and I was just curious, how does it perform against Pandas, and it was super slow. 
Jon Krohn: 00:13:09
Keith McCormick, the data scientist and prolific LinkedIn learning author is giving away a course called “Executive Guide to AutoML” and he’s giving it away exclusively to Super Data Science Podcast listeners. Nearly every ML platform has some support for AutoML, but there is both confusion and debate about which aspects of the ML pipeline can be automated. With this course, you’ll learn how to automate as much as possible and how to explain to management what can’t be automated! You may know Keith from episodes 628 or 655. Be looking for his return on an upcoming Friday episode. You can access his “Executive Guide to AutoML” course by following the hashtag #SDSKeith on LinkedIn. Keith will share a link today, on this episode’s release, to allow you to watch the full new course for free.
00:13:58
That was unexpected. I thought that was going to end with- 
Ritchie Vink: 00:14:02
No, no, no. 
Jon Krohn: 00:14:02
… it was crazy fast and you’re like, “Hey, it was terrible.” 
Ritchie Vink: 00:14:05
It was terrible. Back in that time, I wasn’t good at performance programming yet. I didn’t know anything about hardware optimizations, et cetera, but it gave me a challenge. I thought, “Okay, let’s make this thing faster than Pandas.” So, I made it faster. And then I thought, “Maybe I can add more methods and build a DataFrame library for the Rust ecosystem.” And that took me another four months, I think, before there was something that I would call a DataFrame library, the very beginnings of it. 
00:14:48
And then there was a database benchmark back in the day, the db-benchmark, which was hosted by the R data.table guys. And I added Polars to the benchmark, and it did pretty well. I think we were fourth or so. Yeah, we were fourth or so, after Julia, but R data.table was far ahead. It was far ahead, and that gave me another goal. How did they do that? So, I put my head down, and after a few months, or I don’t know how many months, but now, it’s faster than R data.table. We beat them in performance. 
00:15:34
But it was more about implementing that stuff because I wanted to learn about it, and setting those performance goals. And along the way, I started to learn about how databases work, how query optimizers work. Can I elaborate on this? Because it’s quite a long story or- 
Jon Krohn: 00:15:59
Absolutely, man. Go for it. 
Ritchie Vink: 00:16:02
Okay. As I said, in the beginning, I still wanted to build a DataFrame in Rust. And when that got complete, I started to make a Python API for it. And I thought, “Let’s take the Pandas Python API and build those methods.” And that was a really rough decision because with the Pandas API, it’s very hard to predict what the output will be. Actually, the output will change depending on the data, which is something that isn’t allowed in Polars. 
Jon Krohn: 00:16:34
What does that mean, that it changes based on the data? Like the format will change automatically? 
Ritchie Vink: 00:16:40
The output data types are not known statically, and this is a rule that Polars doesn’t want to break. So, if you have a schema, you have a schema and you apply a join or you apply some operation, you need to know what the output schema will be. Because that way, you, as a programmer, can predict what will happen independent of the data that’s coming in.
00:17:09
The only exception to this rule is when you do data inference. So, if you have a CSV file, you need to infer what the data types will be. At that point, you say, “Okay, this is based on the data, this inference.” But after that, we know what the schema is. The operations may lead to another schema, to another data type, but it should be known statically. Statically means before running the operation, we need to know what the output type will be.
00:17:38
And this is an important one for not creating bugs. As a user, it’s very convenient to understand what my output type will be. So, if I expect an integer and I index into a dictionary, that will keep working. And if it suddenly changes to a float, my index doesn’t make any sense, or if I index into a list. Other than that, it’s also very important for an optimizer. If the optimizer doesn’t know what an output type will be and needs the data to determine that, you have a dependency: you can only start optimizing when you have the data, and then it’s too late. So, it’s also very important for an optimizer to know the schema. 
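A minimal sketch in Python of what static schema resolution looks like (assuming a recent Polars version where LazyFrame.collect_schema() is available; the data is invented):

```python
import polars as pl

lf = pl.DataFrame({"amount": [1, 2, 3]}).lazy()

# The output schema of an operation is resolvable before execution:
out = lf.with_columns((pl.col("amount") * 2.5).alias("scaled"))

# No data has been processed yet, but the types are already known:
print(out.collect_schema())  # Schema({'amount': Int64, 'scaled': Float64})
```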
00:18:35
I started to make a Python interface and wanted to copy the Pandas API, and that was really hard because the output types were unknown, depending on the data. But I still tried it. And at that point, I also started investigating and researching how query optimizers work. And for a query optimizer, you need to know the data types. So, I had to come up with a different API and actually add a Lazy API to get some distance between the API and the execution. Those should not be mapped one-to-one. There should be an intermediate representation in there, an intermediate step, so you can make the DSL, the domain-specific language. In Python, the Python syntax is a DSL. In Polars, it’s the Polars query language, which is a DSL. It should be different from execution. It should not be tied closely together. 
00:19:44
So, when I started doing research on databases, at that point, Polars became much more familiar to people who know Polars as it is now. Because at that point, we started the expression API. We started with Lazy API and the unification of those APIs, and it became, at that point, it was also a realization for me, hey, this is super powerful and I think this is something that even if it would have the same performance as Pandas, would still be very useful for people because it gives you a lot of versatility. And I think, at that point, I think that was two and a half years ago, it became a project that really had a goal of being something that takes the Pandas use case. 
Jon Krohn: 00:20:38
Yeah, it really has been taking off. Another library, other than Pandas, that we haven’t talked about yet and that is intimately related to Polars is Arrow. And you’ve previously mentioned that Arrow is becoming the go-to memory model for big data processing, big data, cringey, basically, for processing where there’s too much data to fit onto our local machine. So, we could call that big data when you’re using multiple different devices for storage. And so, yeah, you previously mentioned that Arrow is becoming the go-to memory model for these kinds of large-scale processing situations. 
00:21:23
How does Polars’ use of Arrow enhance its integration with other data-processing tools? And maybe even give us some context on how Arrow is different from Polars and Pandas, but how it interrelates with all of them. 
Ritchie Vink: 00:21:41
Yes. There’s also quite some misconception about what Arrow is. That’s because PyArrow, the core implementation of the Arrow spec in Python, does not only do Arrow memory, but also Arrow compute. And sometimes, people say Polars is fast because of Arrow, and that doesn’t make any sense. So, first, Arrow is a specification of how memory should be laid out for specific data types. That’s all Arrow is. And that’s the same as how JSON is a specification of how you should store the bytes of a JSON-serialized data structure.
00:22:34
JSON is not a very optimal way to store data, so it’s not very good for performance. Arrow is much better if you are doing columnar data processing. And if we think about numeric data, the way we store those bytes is actually quite simple, because we just store the values back to back. First, the integer value one, then the integer value two, value three, et cetera, all the way to the end. This is no different from what NumPy does. Only, Arrow also accounts for missing data. So, if you have missing data, it adds another buffer with a bit mask, a bit buffer that just stores the bits. And if the bit is one, the value is valid. If the bit is zero, the value is invalid.
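A small Python sketch of that layout using PyArrow (the values are arbitrary, purely for illustration):

```python
import pyarrow as pa

# An Arrow array with one missing value: values are stored back to
# back in a data buffer, and a separate validity bitmap marks which
# slots hold valid values (bit = 1) versus nulls (bit = 0).
arr = pa.array([1, None, 3], type=pa.int64())

validity, data = arr.buffers()  # primitive arrays: [validity, data]
print(arr.null_count)           # 1
print(validity.to_pybytes())    # bitmap with bits 0 and 2 set
```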
00:23:27
That’s for numeric data, which is quite trivial. But Arrow also has a specification for nested data, for lists, for structs, for strings. It has multiple specifications for strings actually. And that way, if you know that a process can deal with Arrow data, you can say, “Hey, I have some memory laying around here. It’s laid out according to the Arrow specification.” At that point, you can say to another process, “This is the specification, and this is the pointer to where the data is.” If you read this from this, according to this specification, you can use this data as is. You can read it without needing to serialize any data to the other process. 
00:24:17
If you compare that to Python multiprocessing, if you need to send some data from process A to process B, you often need to pickle the data and send it over the wire to the other process. The other process needs to deserialize that because the pickle protocol can differ, or the Python version can differ. The memory layout can differ. So, there’s a lot of overhead involved. And with Arrow, processes can interoperate the memory, and only one process will own the memory and the other can just read it. And this can be super-fast when you share data between processes.
00:24:56
So, that’s where Arrow is becoming the de facto standard for sending and sharing data over the wire. Other than that, it’s still up to the tool that uses Arrow to build a query engine on top of it. Pandas will use PyArrow. Polars has its own query engine. Pandas is not building a query engine, by the way. Pandas is using PyArrow kernels to do the compute. So, yeah, that’s what Arrow is and that’s how it can help with saving performance between processes. 
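Here is a minimal Python illustration of that interoperability (for most data types these conversions wrap the existing Arrow buffers rather than copying them):

```python
import polars as pl
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Polars' in-memory format follows the Arrow spec, so this is
# typically zero-copy: the DataFrame reuses the Arrow buffers.
df = pl.from_arrow(table)

# And back again, to hand the data to any Arrow-consuming library.
table_again = df.to_arrow()
```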
Jon Krohn: 00:25:37
Nice. So, how do you see… Do you think that these different ways of doing DataFrame operations… Obviously, Pandas, Polars, Arrow’s evolving as well. How do you see these different projects developing over time and providing more performance gains in the future? 
Ritchie Vink: 00:26:06
Per project or? 
Jon Krohn: 00:26:10
Yeah, I guess you could speak about it primarily from a Polars perspective. So, let’s focus on that, I guess. 
Ritchie Vink: 00:26:18
Yes. Let’s give a high-level introduction to what Polars does if you write a query. So, if you write a query, you’d write something that might look a little bit similar to Pandas, a little bit, not too much. If it looks too much like Pandas, you’re not using Polars properly, but somewhat similar.
Jon Krohn: 00:26:45
Let’s dig into that in a little bit more detail before you move on. So, you’re saying if somebody naively hears, okay, on the Super Data Science Podcast, Ritchie Vink said that if I use Polars, I’m going to get a 5 to 20x speed-up relative to Pandas. So, then they start using Polars and they write code that looks just like their Pandas code would’ve. What you’re saying is that in that kind of circumstance, they might not experience the speed-ups that we’re talking about? 
Ritchie Vink: 00:27:16
Yes, for a few reasons. A, it means they are using the Eager API. And with the Eager API, per operation, you force Polars to materialize the result. And to give a very silly example, if I say to you, “Can you get me a coffee from the kitchen and a knife and a fork and a spoon?” if you would walk back and forth for every item, it would be very inefficient. If you would have been a little bit more lazy and just awaited the full list and then went to the kitchen, you could have grabbed everything at once. And that’s similar with a query. 
00:28:08
If you write a query, if you finish a query before you say, “Okay, now, fetch me the result,” the optimizer can inspect that query and can see what you want to do holistically from a holistic approach. It knows what’s the data that you will use later. If we see you only use two columns, we’re going to only fetch two columns. If we see you’re going to use a filter later, we can apply this filter before we do the join operations because then the join operations will be much cheaper. In Parquet data, maybe we don’t even need to download certain amounts of bytes, which can save a huge amount of work. So, this all depends.
00:28:49
Another one is that in the Eager API, in Pandas, it’s quite normal to use lambdas, or you have to assign. And in Polars, you need to use expressions. And expressions are a way for us to understand what you’re doing. It allows us to parallelize the work and to optimize the work. And I need some code examples to make this more clear. But in Pandas, there’s a lot of procedural programming. And then people will probably say, “Yeah, but that’s not idiomatic Pandas anymore. That’s Pandas from a few years ago.” But it’s still how a lot of people write Pandas. 
00:29:40
If you say DataFrame[“A”] and you assign a series, and then DataFrame[“B”] and you assign a second series, then in between those assignments, you go back to Python. And there is no way we can be parallel, because we have to apply the operation eagerly and hand the result back to Python. And if you’ve written everything in a single context with multiple expressions, then we are allowed to run all those expressions in parallel because we don’t have to go back to Python in between. That’s sort of the in-depth answer. 
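As a rough Python sketch of the difference (the column names are made up):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Procedural, Pandas-style: each assignment hands control back to
# Python, so the two computations cannot run in parallel.
df = df.with_columns((pl.col("a") * 2).alias("a2"))
df = df.with_columns((pl.col("b") * 3).alias("b3"))

# Idiomatic Polars: both expressions in one context, so the engine
# is free to execute them in parallel.
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).with_columns(
    (pl.col("a") * 2).alias("a2"),
    (pl.col("b") * 3).alias("b3"),
)
```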
Jon Krohn: 00:30:22
In a recent episode of this podcast, the mathematical optimization guru Jerry Yurchisin joined us to detail how you can leverage mathematical optimization to drive commercial decision-making, giving you the confidence to deliver provably optimal decisions. This is where Gurobi Optimization comes into play. Trusted by most of the world’s leading enterprises, Gurobi’s cutting-edge optimization solver, lightweight APIs, and flexible deployment simplify the data-to-decision journey. And, thankfully, if you’re new to mathematical optimization approaches, Gurobi offers a wealth of resources for data scientists, including hands-on training, comprehensive Jupyter-notebook examples, and extensive, free online courses. Check out Episode #813 of this podcast to learn more about mathematical optimization and all of these great resources from Gurobi. That’s Episode #813. 
00:31:11
Nice. That’s great. Let’s talk a bit more about that Eager versus Lazy API in Polars. I love that analogy there with going to the kitchen and grabbing different items. So, that makes things crystal clear here. When you’re doing Eager execution, you’re executing each item one by one as it’s requested, like having to walk to the kitchen to go get the cup of coffee. Come back to the living room, go to the kitchen, pick up a fork, and so on through all the items that you requested.
00:31:39
Whereas with the Lazy API, you’re like, “Okay, I’m going to wait to get all of the instructions before anything’s executed.” And then you could figure out what’s the optimal way of doing all of these requests together. And so, in that example, we make one trip to the kitchen, pick up all the items, the coffee, the fork, the knife, the spoon, whatever, and then we come back to the living room. So, that’s great for understanding the differences between Eager and Lazy execution.
00:32:14
When you’re thinking about a data engineer or a data scientist who’s using Polars for the first time, what advice would you have for them around when they should be doing Eager or Lazy execution? Because it sounds to me like, oh, you might think, well, I should just always be doing Lazy execution if that’s always going to be more efficient, but yet, you built an Eager API, so there must be reasons why that’s useful as well. 
Ritchie Vink: 00:32:38
Yes, I built an Eager API that’s almost exactly the same as the Lazy API. That was a big requirement, because you don’t want to have two APIs that differ in ways that don’t make sense. Knowledge needs to transfer between those APIs. So, actually, if you’ve written against the Eager API, you can switch to Lazy by just changing the read function to scan, which starts a LazyFrame, and then finalizing the whole query with collect. But the Eager API, we made it because we recognized that interactive programming and doing data exploration is valuable to people. People who are in a notebook want to just read the CSV file or read the Parquet file and explore what they have, then they want to do an operation and see the result of that.
00:33:44
So, the Eager API, I would say, is for data exploration, for understanding your data, grooming your data. And when you have something that needs to run in production, or something that needs to run more often than once, I would say take those operations out of a notebook into a script, make a query out of it, and then make it a Lazy query. This will save a lot of compute.
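In Python, the switch looks roughly like this (file and column names are hypothetical):

```python
import polars as pl

# Eager: fine for exploring data interactively in a notebook.
df = pl.read_csv("data.csv")

# Lazy: the same pipeline, but read_* becomes scan_* and the query
# ends with collect(), so the optimizer sees the whole plan first.
df = (
    pl.scan_csv("data.csv")
    .filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").sum())
    .collect()
)
```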
Jon Krohn: 00:34:18
Nice. All right, thanks for that overview of when we should be choosing Eager or Lazy API execution. Something else that is unique about Polars is composable expressions. So, what are those, and why do they make a difference? 
Ritchie Vink: 00:34:40
So, here, I want to go back to Pandas. There is a, I think it’s called the assign method in Pandas. In the assign method, you often say, “Assign A = lambda.” And then you get the DataFrame as argument in the lambda, and then you can take the column from the data. But in Polars, we see the requirement of a lambda, the requirement of a custom Python function, as a failure of our API because you needed to go into Python to execute Python code because we didn’t provide you the API you required to do your data exploration.
00:35:29
And of course, there will always be the cases where we cannot do everything, but we want to offer you an API that’s composable. So, the realization is, if we need to make methods for everything you want to do, the API would explode because we cannot predict what you want to do. And if we could, you would get a humongous API, which would do everything under the sun, and it would not be discoverable. And also, knowledge of those methods would not transfer.
00:36:07
So, what the expressions are, the expressions are ways of composing data, and they are similar to a programming language. And if you recall, for instance, in Python, if you look at the vocabulary of Python, it’s not that big. You have if, else. You have some built-in structures, lists. But by combining all those words in the vocabulary, you can do anything you want. And with expressions, it’s similar. So, we give you operations and you can combine them. And by combining them, you can implement all kinds of stuff we couldn’t have predicted. And those expressions will run on their own engine, and they will run vectorized and parallel, and they will be optimized. 
00:36:56
So, you will get a lot of versatility. A lot of places where you would write a lambda in Pandas, here, you can use Polars expressions. We don’t have to use any Python. We can still optimize those expressions. We can run them in parallel, and they will be executed on the engine. So, they will be executed on kernels that are written from scratch in Rust and are vectorized, so they are very fast. You can think of it this way: what NumPy does with arithmetic, Polars’ expressions do with all kinds of operations.
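A small Python sketch of composing expressions instead of reaching for a lambda (the data and derived columns are invented for illustration):

```python
import polars as pl

df = pl.DataFrame({"name": ["Ann Lee", "Bo Chen"], "score": [7.0, 9.5]})

# Small building blocks composed into new logic, all of it running
# vectorized on the engine with no Python callbacks:
out = df.with_columns(
    pl.col("name").str.split(" ").list.first().alias("first_name"),
    ((pl.col("score") - pl.col("score").mean())
     / pl.col("score").std()).alias("score_z"),
)
```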
Jon Krohn: 00:37:37
Nice. That’s a great way of summarizing it there. Very cool. All right, so that covers this first set of questions that we had for you around Polars development. In this next section, I’d like to dig into your philosophy and the industry trends that you see relating to data and data processing. So, for example, with Moore’s Law slowing or ending as we make the features of chips so fine-grained that electrons could start hopping across them, we’re going to need more efficient hardware use. How does Polars aim to unify data processing for both small-scale and large-scale datasets?
Ritchie Vink: 00:38:37
Yeah, so with Moore’s Law ending, there’s one observation that’s different from the ’80s: back then, when you wrote an algorithm, you were not going to optimize it, because in the next two years, it would be twice as fast anyway. So, in certain fields, maybe that could make developers lazy. One thing that’s changing, because processor speed is not improving a lot per thread, is that there are still new developments. One is SIMD, where you have single instruction, multiple data. That means that you have special registers on a thread that can… 
00:39:23
For instance, for arithmetic, instead of doing a plus operation on a single float, you can do it on four floats at a time, which is a four times parallelism without actually using multiple threads. So, it’s four times parallelism on the same thread. So, that’s already something you can have but sometimes, the compiler can auto-generate this SIMD code. Other times, you will have to implement this manually. So, there’s one shift that, instead of writing the algorithms for a single thread, you now also have to make them SIMD aware or in a way that compiler generates SIMD code. 
00:40:04
The other one is that you actually need to change the algorithms because the single thread is slower. It’s not slower, but it’s not improving as much. What is improving is the amount of cores we have. As I said, I have 16 cores on my laptop, and I think 10 years ago, this was four. And I think this will keep increasing. I think we can horizontally get more and more cores on machines. I believe my phone has eight cores, which is already huge. But to make use of those cores, you need to write multithreaded code. So, that’s one thing. That’s what Polars open-source already does.
00:40:52
As a company, we’re now building Polars Cloud, which wants to do the same, but then for multiple machines. We want to be able to take a Polars query and scale that up to multiple machines. This, again, has whole different strategies, whole different challenges, but the principle is the same. The principle is, hey, one thing Polars is, is a query. And with that, there’s the description of what you want to run. And we can take that description and build a way of implementing the physical side of it, the physical side of taking that description and returning you the data. 
00:41:42
And for SQL, there are already tons of implementations. For Polars, there is currently only Polars open-source, which runs on a single machine. But if you want to scale Polars to larger datasets that don’t fit on a single machine, that’s not something that’s possible yet. So, that’s what we want to build. 
Jon Krohn: 00:42:05
Nice. Yeah, that’s an interesting take. And yeah, it brings together a lot of different concepts across hardware and software, particularly relevant to people working with data, like a lot of our listeners. Would you say that there are challenges associated with trying to have Polars be as efficient with large-scale data as small-scale data? So, it seems like Polars, it seems designed to be extra efficient for large-scale data processing. Would you say that that’s true? And do you think that there’s challenges with being efficient for both small-scale and large-scale datasets? 
Ritchie Vink: 00:42:48
Not per se. I mean the small scale is pretty easy, if it’s fast. 
Jon Krohn: 00:42:55
I guess so. That’s right. 
Ritchie Vink: 00:42:57
The only thing we have sometimes is a little bit of threading overhead. So, if we know we are in Eager mode and the data is super small, we choose to not go into the thread pool. So, there are some adaptations for that in Polars. Sometimes, we have a check in place, but the solution is just not going into a thread pool, which is a trivial solution. So, the small scale is pretty easy. The large scale, we’re doing at the moment, but we want to turn it up a notch. So, currently, Polars still requires your data to fit in memory after optimization. What we want to go towards is a new streaming engine, where the goal is to be able to process datasets that don’t fit into RAM; as long as it fits on disk with some spare space, it should be possible. That’s what we want to focus on. 
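For context, a streaming path already exists today as an opt-in; a minimal Python sketch (the dataset path is hypothetical, and the exact flag may differ across Polars versions):

```python
import polars as pl

# Process a larger-than-RAM dataset in batches instead of
# materializing everything in memory at once.
result = (
    pl.scan_parquet("big_dataset/*.parquet")
    .group_by("user_id")
    .agg(pl.col("amount").sum())
    .collect(streaming=True)  # opt in to the streaming engine
)
```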
Jon Krohn: 00:43:57
Cool. 
Ritchie Vink: 00:43:59
Another thing, because there’s a distinction between the query… So, the query turns into a DSL. You can think of it as an AST in Python, and you can hook any engine into that. You can translate that AST into different physical plans and hook it into different physical engines. So, last week, together with NVIDIA RAPIDS, we released GPU support for Polars. This is entirely possible because we can take this DSL and transform it into a physical plan for the GPU, which is a whole different use case. So, what I want to say is that you can have different engines, different back ends, for different data sizes, because you have the distinction between the front end and the back end. 
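Sketched in Python, that back-end swap is a one-argument change (assuming the optional GPU package, installable as polars[gpu], is present; unsupported operations fall back to the default CPU engine):

```python
import polars as pl

lazy = (
    pl.scan_parquet("data.parquet")  # hypothetical file
    .group_by("key")
    .agg(pl.col("value").mean())
)

# Same lazy plan, different physical engine: hand it to the
# RAPIDS-backed GPU engine instead of the default CPU one.
result = lazy.collect(engine="gpu")
```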
Jon Krohn: 00:45:01
I’ve had the pleasure of speaking at ODSC West many times over the years, and without question, ODSC West is one of my favorite conferences. Always held in San Francisco, this year it’ll be taking place from October 29th to 31st. As the leading technical AI conference, ODSC West brings together hundreds of world-class AI experts, speakers, and instructors. This year’s offering will feature hands-on sessions with cutting edge techniques from machine learning, AI agents, AI engineering, LLM training and fine tuning, RAG and more. Whatever your skill level beginner or experienced practitioner, you’ll leave ODSC West with new in-demand AI skills. Register now at odsc.com/california and use our special code podcast for an additional 10% off on your pass.
00:45:51
Nice. Yeah, thanks for that insight into the opportunities that lie ahead as well with Polars. Let’s shift gears a bit here from just the open-source project Polars, which every listener today can be using within their Python environment. In addition to that, there is a Polars company that you founded and that you’re a CEO of, and it’s now secured $4 million in seed funding, as far as we could tell online. And so, what are the plans to scale this company while maintaining the integrity and innovation that made the Polars project so popular in the open-source community? 
Ritchie Vink: 00:46:37
Yes. When I was thinking of starting this company, we were looking for something that we believe in that keeps Polars open-source as is, or makes it even better, because a successful Polars open-source also means a successful Polars company. And the important thing was to make a distinction between what’s open-source terrain and what’s company terrain. And what Polars open-source terrain is, is a single machine. That’s what my initial goal was for Polars open-source, and we’re still improving upon that goal. 
00:47:26
So, Polars, the company, is sponsoring a lot of man-hours to improve Polars open-source. We’re building a completely new engine that will be completely open-source. And yeah, we’re making Polars open-source better, because if we get more users, if Polars gets used more widely, I also think the Polars company will benefit from that. And as a company, we want to build Polars Cloud, which is actually an extension where you say, “Okay, we have Polars queries and now, I want to orchestrate them. Now, I want to run them serverless somewhere else, and I want this to be easy. I want to have fault tolerance. I want to have schema validation.” All that stuff, that’s Polars, the company, terrain. 
00:48:18
And if we look at the horizon of things, hopefully, in a few years, we can actually run Polars queries on any dataset size. There are still a lot of challenges to be solved there. So, first, we want to go into serverless, into partitioned data. But one thing we’re doing is using the Polars open-source engine as runners in our cloud. This also makes sure we’re committed to making the open-source engine better, because we’re actually using it in Polars Cloud as well. 
Jon Krohn: 00:48:54
Nice. So, as this Polars Cloud starts to develop, why would, say, an enterprise or a company choose to use Polars Cloud? What are the advantages of that, going for that commercial option, relative to using open-source Polars?
Ritchie Vink: 00:49:09
One is fault tolerance, ease of use, but also horizontal partitioning. The first releases will allow horizontal scaling if your data can be partitioned. And this is something you cannot do with open-source, or you need to implement it yourself. And eventually, we also want to support more and more queries where we can scale to larger datasets. But with distribution, there come a lot of challenges. Making sure the nodes are alive. Making sure that if a node has died, the query still finishes, or that we can pick the query up where it left off. All these complications are something we are going to build. 
00:50:00
And also, if you’re only interested in running Polars in memory and you want to run a query remotely, currently, you need to set up a server somewhere, install Polars, and then you need to serialize the query plan. But the serialization isn’t stable between versions, so you need to make sure you have the same versions on both sides. There’s just a lot of engineering hassle and stuff that can break very easily. It’s similar to how you can run SQL: if you write SQL here, you can run it remotely very easily. That’s something we want to enable with Polars as well. You run it locally and run it remotely, serverless, very easily. 
Jon Krohn: 00:50:45
Nice. This also gives us a great segue, this topic, into our audience questions. So, we had tons of reactions. I typically try to post a week in advance of me recording with a guest so that our audience has lots of time to ask questions. And we typically get a lot of reactions, a lot of interest in upcoming guests. Sometimes, we don’t get any questions. Sometimes, we just get a couple.
00:51:13
I posted that you were going to be my guest 12 hours before we started recording. And just since we started recording, we got another thousand impressions on the post. And so, we’re going to have tens of thousands of impressions very easily on this post. In just those 12 hours, we had already over a hundred reactions, over a dozen comments. And many of those comments are actually questions, which is nice. Yeah, so clearly, a lot of interest in what you’re doing.
00:51:46
And the segue, from talking about Polars Cloud, is that someone named Gaurav Singh Gupta, who’s the head of data engineering based in London, he says, “Ritchie, I hope you have the foresight and product curiosity to change Polars” to build Polars, I think he means, “into the next Databricks.” He says, “You obviously can do it.” So, that’s a good challenge. 
Ritchie Vink: 00:52:10
I hope I can. I hope I can. As I said, initially, this is not the problem we want to tackle, because you shouldn’t bite off more than you can chew, as the expression goes, right? But I foresee we can do a lot of cool stuff already. And I coined this diagonal scaling. So, there are two things I believe in. One is that vertical scaling gets more and more interesting, because the vertical scale of machines is increasing a lot. The amount of RAM you can get on a single machine is improving, and also, the amount of cores on a single machine is increasing.
00:53:01
That means that a lot of operations, a lot of queries that had to run horizontally in the past can now run in a single machine. And if you can run on a single machine, you absolutely should because you don’t have any overhead of sending data over the wire between the machines. You don’t have to shuffle the data. Shuffling data is one of the most expensive operations in the distributed environment. 
00:53:27
On a single machine, you can keep the data; you can share data between threads just by sharing pointers, so that’s free. We’re also building the new streaming engine, which will allow datasets that don’t fit into memory to run on a single machine. So, we think vertical scaling gets more and more interesting. Other than that, we want to slowly do some parts with horizontal scaling. So, if we can see, hey, this query starts with something that can run in a MapReduce manner, the optimizer can make a distribution plan that does part vertical, part horizontal, and it’s up to the optimizer to decide what will be faster.
00:54:19
So, that’s stuff we want to look at. And hopefully, in a few years, we can say, “Hey, we can run the whole Polars API in a distributed manner.” But then still, we want to have an optimizer that can recognize when it’s most cost-effective to do it vertically or horizontally, because I don’t think there’s a clear winner, and it depends on the query. 
Jon Krohn: 00:54:43
There you go. The vision is there, Gaurav. Yeah, there’s a lot of people out there who have a lot of faith in you. Someone named Mathias Colpaert just said hero.
Ritchie Vink: 00:54:55
Wow. 
Jon Krohn: 00:54:57
And Vincent Warmerdam, whose name I’m probably butchering, as I butcher all Dutch names. 
Ritchie Vink: 00:55:04
It’s close. 
Jon Krohn: 00:55:08
He was our guest in episode number 659. And actually, that was the most popular episode of our show in 2023. 
Ritchie Vink: 00:55:15
Really?
Jon Krohn: 00:55:16
And so, Vince said that I should ask you how you got your Polars keycaps made. He said it’s the best dev swag ever. What is a keycap? Oh, okay, yeah. So, in the video version, people can see that Ritchie’s holding one for me on camera. But what is that? Oh, it’s for your keys, for your keyboard. 
Ritchie Vink: 00:55:44
Yes. So, if you have a keyboard, for the listeners, the escape button, we changed it with a keycap with a Polars logo. 
Jon Krohn: 00:55:53
I was thinking it was something to do with house keys. Of course. 
Ritchie Vink: 00:55:59
Vincent, I gave him one on the PyData Amsterdam last week. So, he was really happy with it. 
Jon Krohn: 00:56:05
Nice. Yeah, he says, “Best dev swag ever.” That’s great. And it’s good product placement for you. You get to have your Polars key in front of your key target audience all the time, 24/7. Well, I guess they’re probably not working 24/7, but anytime that they’re working, they’ve got that reminder. That’s brilliant. Nice.
00:56:30
Next comment is from Magdalena Kovalczuk, who is a longtime listener. She’s interacted with us a lot. She sent me a photo of outdoor artwork that she did. Some people would call it graffiti, I guess, but it’s more like artwork. And so, Magdalena created this polar bear artwork in Amsterdam. It’s a gigantic mural, and she sent me a selfie of you in front of it. So, in the video version, we’re going to overlay that as I’m describing it right now, so that people can see it. So, that’s really cool. I don’t know if you want to tell us a little bit more about that.
Ritchie Vink: 00:57:11
Yeah, it’s really cool. I saw on LinkedIn one day, two murals: one for Narwhals and one for Polars. And she said it was for the Polars project. I was like, “Whoa, this is insane. This is great.” I mean, look at the thing. You cannot think that someone makes a mural for your project. That’s the biggest honor. So, she gave me the geolocations and it was in Amsterdam. I had to bike out there before there was real graffiti on top of it. I had to look for a while, took an hour of biking and wandering around, but I found it. 
Jon Krohn: 00:57:54
Nice. And Magdalena also had a question for you. She said that she can’t wait for this episode. She’s really excited to hear Ritchie’s stance on expanding the Polars ecosystem. So, given that Polars is a relatively new data manipulation library, it’s understandable that its ecosystem isn’t very extensive yet. Does the Polars team expect the ecosystem to grow organically, or are there strategic efforts in place to encourage the development of tools that integrate with Polars? 
Ritchie Vink: 00:58:24
Yes, there are a few things. I am a big fan of organic growth, by implementing the technical requirements for that growth to secure- 
Jon Krohn: 00:58:39
So, on that note, she mentions Narwhals, which is Marco Gorelli’s project. We talked about him at the outset. He’s the person that introduced me to you. He’s in episode 815. We talked about Narwhals in that episode. And so, that, for example, is a lightweight compatibility layer that allows… You could probably speak about it better than me.
Ritchie Vink: 00:58:58
Yeah, yeah. So, the observation here is that when Pandas came out, Pandas was, and probably still is, very much synonymous with the word DataFrame in Python, right? If you wanted DataFrame support in Python, every library just added Pandas support and was done with it, because there were no other DataFrame libraries that people used in significant numbers. But I hope Polars is challenging that status quo. This also poses a problem for library maintainers, because now, if you’re a plotting library, say, and you want to support arbitrary DataFrames, you don’t want to depend on Pandas and Polars and X and then have if-else branches and duplicated code for all those libraries. 
01:00:01
So, what Marco realized, he made Narwhals. Narwhals is targeted for library maintainers, and it uses a subset of the Polars API, and it works as DataFrame in, DataFrame out. So, if you pass a Dask DataFrame, for instance, you will get returned a Dask DataFrame as a result. If you pass a Pandas DataFrame, you will get a Pandas DataFrame as a result. If you pass a Polars DataFrame, you will get a Polars DataFrame as a result. So, as library maintainer, you can use this subset, this API, and it will automatically work with any DataFrame library, the consumer, the user gives to your library. And it also makes your library very lightweight. 
01:00:48
So, Altair, the plotting library, made the switch to Narwhals and dropped Pandas as a dependency. And as a result, they don’t have any big binary dependencies anymore. They’re now super lightweight, and I believe it’s pure Python now, which saves a lot of data sent over the wire.
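A minimal Python sketch of the DataFrame-in, DataFrame-out pattern Ritchie describes, using the Narwhals package (the function and column names are invented):

```python
import narwhals as nw

def top_rows(df_native, n: int = 5):
    """DataFrame in, DataFrame out: works with pandas, Polars, Dask,
    and other supported frames, returning the same type it was given."""
    df = nw.from_native(df_native)               # wrap the incoming frame
    out = df.filter(nw.col("value") > 0).head(n)
    return out.to_native()                       # unwrap to the caller's type
```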
Jon Krohn: 01:01:14
Nice. Yeah, so great example there in Narwhals of this kind of organic development of integrations. And I interrupted you to blurt out Narwhals. Is there anything else you’d like to add on that front in response to Magdalena’s question?
Ritchie Vink: 01:01:26
Yeah. I would like this to come organically, but we give people the technical possibilities to do so. One such example is the Polars plugins, or something that NumPy did, for instance, with ufuncs: set up the architecture so people can build their own custom logic and get the benefits of that. So, with Polars plugins, there is the architecture for you to build your own expressions, and we also added Polars IO plugins, so you can come with your own data formats. This way, the ecosystem can grow beyond what we, as a company, see Polars as, toward what users themselves want Polars to be. 
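Full expression plugins are compiled Rust, but Polars also exposes a pure-Python extension point in the same spirit; here is a sketch using the expression-namespace registration API (the “units” namespace and its method are invented for illustration):

```python
import polars as pl

@pl.api.register_expr_namespace("units")  # custom, user-chosen name
class Units:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def mm_to_m(self) -> pl.Expr:
        # Composed from built-in expressions, so it still runs on
        # the engine, vectorized and optimizable.
        return self._expr / 1_000

df = pl.DataFrame({"length_mm": [1500, 2300]})
print(df.select(pl.col("length_mm").units.mm_to_m()))
```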
Jon Krohn: 01:02:21
Great answer. We look forward to more polar bear murals from Magdalena in the future. Our last questions, we’ll see. We might just pare it down to one. We’ll see. We might just pick one of the best ones, comes from Svetlana Hansen. So, Svetlana is a longtime listener to the show. She’s been commenting, interacting with me since I took over as host of this show four years ago. It’s great to have you listening for so long, Svetlana. 
01:02:47
Svetlana is a senior software engineer who works on projects for NASA spacecraft in Houston, so fascinating individual in her own right. And she said, “Great choice picking Ritchie for the show. I’m looking forward to hearing what he has to share.” And yeah, she gave us three questions on the future of Polars and data science. So, this is a very nice way to start to wind down the show, which is what new features or enhancements can we expect from Polars in the near future that would impact a data scientist in particular? 
Ritchie Vink: 01:03:28
The biggest one here is the new streaming engine. This will be a complete rewrite of the streaming engine, custom-made for the Polars API. When we built the initial streaming engine, it was a beta experiment, and we bounced against the wall with some problems. And the literature also didn’t help as much, because most literature is written for relational algebra, for SQL, for row-wise data processing. And Polars has a unique model, which we coined the DataFrame API, or the Polars DataFrame API, which has columnar semantics. And as a user, you can access columns at any point in the API, and actually also access all the data, which isn’t very streaming friendly. 
01:04:37
So, the model we had to make was different. We’re still making a streaming model, but it had to be specially adapted to the Polars API. This will be implemented. As a user, not much will change per se, but as a data science user, you will have the power to process bigger datasets on your machine and get more performance. So, it will have a lot of benefits, but it’s also a silent enhancement.
Jon Krohn: 01:05:11
Nice, Ritchie. Well, thanks for those insights into the exciting streaming functionality that is coming and will be useful to all of us. Lots of exciting things happening at Polars. I really appreciate you taking the time to answer all my questions, as well as these audience questions. There’s clearly a lot of engagement, and I know that this is going to be a super popular episode. Ritchie, before I let you go, do you have any book recommendations for us?
Ritchie Vink: 01:05:35
Yes. No, I don’t read a lot of books anymore. I cannot find the time anymore. I know that there might be some book giveaways. I’m going to give the boring answer and maybe recommend some Polars books. They are not written by myself, but I have the honor that there are people writing Polars books. Matt Harrison has a Polars book. Yuki has a Polars Cookbook, and there’s also a new O’Reilly Polars book coming in January. The new book from O’Reilly is written by two ex-colleagues of mine. And once in a while, they come to the Polars office and hammer us with questions. So, I know some of our answers might’ve flowed back into the book. 
Jon Krohn: 01:06:34
Nice. Great recommendations there on Polars books. And yeah, this is hot off the press at the time of recording. I don’t have more detail, but Matt Harrison, whom you mentioned there, the author of Effective Polars, he’s written so many books, including things like Effective Pandas, which did really well for him. So, his Effective Polars book, he has offered to give away some free copies, but I haven’t had a chance to speak with him yet about how many and that kind of thing. But you can anticipate, listener, that when Ritchie’s episode comes out on that day, we’ll have a book giveaway for some number of Matt Harrison, Effective Pandas. Sorry. Oh, my goodness, that’d be hilarious. We’re giving away his Effective Pandas book. No, we’re giving away Effective Polars.
01:07:26
And yeah, that association is really cued up in my mind because I did a whole episode with Matt Harrison about a year ago. The episode was called Effective Pandas and that’s all we talked about. So, I’ve had a strong neural connection there. But over time, I’m sure I will be thinking Effective Polars, Effective Polars more and more. So, the Effective Polars book, yeah, we’ll have some to give away. So, people who comment or reshare on the post from my personal LinkedIn account announcing Ritchie’s episode on the day that it comes out, yeah, if you just mention in that comment or reshare that you would like a copy of Matt Harrison’s book, we’ll figure out how to get you that access. 
01:08:12
All right, Ritchie, thanks so much for being on the show. For people who want to stay up to date on your latest or Polars latest, how should they do that? I know, for example, that there is a Polars Discord channel. 
Ritchie Vink: 01:08:26
Yes, it’s very active. I would definitely recommend hanging out there if you want to learn more about Polars or more about Polars insights. There are very active power users there. So, if you want to understand, hey, why does this not work or how does this… I don’t know. If you have any questions there, we have a super active Discord. You can also keep me awake if you post a bug in the middle of the night. So, yeah, see you there. 
Jon Krohn: 01:09:03
Awesome. All right, Ritchie, thanks so much for taking the time. And maybe we’ll catch up again with you in a few years and see how your Databricks dethroning Polars Cloud- 
Ritchie Vink: 01:09:14
Yeah, really. 
Jon Krohn: 01:09:14
… project is coming along. 
Ritchie Vink: 01:09:15
Cool. Thanks for having me. 
Jon Krohn: 01:09:21
What a treat to have Ritchie on the show. In today’s episode, he filled us in on Polars’ use of Rust, the Arrow memory model, and multithreading to dramatically speed up DataFrame operations. He also talked about the importance of Lazy execution and query optimization for maximizing performance, and his plans to develop Polars Cloud for serverless and distributed data processing at scale. He talked about upcoming features like a new streaming engine to handle datasets larger than RAM in Polars, and he talked about the growing Polars ecosystem, including integration libraries like Narwhals and community-created extensions. 
01:09:55
As always, you can get all those show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Ritchie’s social media profiles as well as my own at www.superdatascience.com/827. 
01:10:10
And if you’d like to connect in real life as opposed to online, on November 12th, I’ll be conducting interviews in New York at the ScaleUp:AI conference run by the iconic VC firm, Insight Partners. This is a slickly run conference for anyone keen to learn and network on the topic of scaling up AI startups. One of the people I’ll be interviewing will be none other than Andrew Ng, one of the most widely known data science leaders, and I’m very much looking forward to that.
01:10:37
All right. Thanks, of course, to everyone on the Super Data Science Podcast team: our podcast manager Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masis, writers Dr. Zara Karschay and Silvia Ogweng, and Kirill Eremenko, the show’s founder. Thanks to all of them for producing another efficiency-increasing episode for us today. 
01:10:56
For enabling that super team to create this free podcast for you, we’re deeply grateful to our sponsors. Thank you to them and thank you for supporting the show. You, listener, can support the show simply by checking out our sponsors’ links, which are in the show notes. And if you, yourself, are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. 
01:11:19
Otherwise, share this episode with people who might like it. Review it on your favorite podcasting platform. Subscribe, if you’re not a subscriber. And I don’t know, any other way you can support the show, but most importantly, just keep on tuning in. I’m so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there. And I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 