Jon Krohn: 00:00:00
This is episode number 815 with Marco Gorelli, Senior Software Engineer at Quansight Labs.
00:00:05
Today’s episode is brought to you by AWS Cloud Computing Services, by Babbel, the science-backed language learning platform, and by Gurobi, the decision intelligence leader.
00:00:20
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple.
00:00:51
Welcome back to the Super Data Science Podcast. Today we’ve got a deeply technical episode for you. I know many of you love that. This one’s with the tremendously talented communicator of complex technical topics, Marco Gorelli. Marco is a core developer of the popular Python libraries Pandas and Polars, as well as being the creator of the Narwhals library. He’s spoken at several major Python conferences such as PyData. He’s taught Polars professionally and he wrote the first complete Polars plugin tutorial. He currently works as a Senior Software Engineer at Quansight Labs. Previously he worked as a data scientist and was one of the prize winners from amongst over a hundred thousand entrants of the M6 forecasting competition. He holds a master’s in Mathematics and the Foundations of Computer Science from the University of Oxford.
00:01:36
Today’s episode will appeal primarily to hands-on technical folks like data scientists, ML engineers, and software developers. In this episode, Marco details what the hot, fast-growing Polars library for working with dataframes in Python is. It already has 65 million downloads and 28,000 GitHub stars. He also talks about how Polars offers up to 100x speed-ups relative to Pandas on dataframe operations, how the lightweight, dependency-free Narwhals package he created allows for easy compatibility between different dataframe libraries such as Polars and Pandas, how he got addicted to open source development, and the simple trick he used to become a prize winner in a super-popular forecasting competition. All right, you ready for this dazzling episode? Let’s go.
00:02:24
Marco, welcome to the Super Data Science Podcast. It’s awesome to have you here. We’re in London, it’s sweltering hot. We took the train in from Cardiff to be here. Welcome to the show. I really appreciate you making that trip. So you were introduced to me by Reshama Shaikh, who has recommended many wonderful guests on the show. I said that I was going to be in London and asked who I should speak to. She wrote back right away and said, “Marco is who you should talk to.” And I reviewed your work. I reviewed some of the talks you’ve done in the past and I couldn’t be more excited to be here interviewing you for an episode particularly focused on Polars, which I’ve been excited to learn about for a long time.
00:03:04
There’s also an interesting connection here. In episode number 765, we had your CEO Travis Oliphant, so we’ll talk about your company Quansight later on in the episode, and he’s a huge player. He was the originator behind NumPy, behind SciPy, and so maybe someday some of the packages of yours that we’ll be talking about today, like Narwhals, will also be as invaluable to data scientists. Anyway, that was a long intro welcoming you here. Welcome to the show, Marco.
Marco Gorelli: 00:03:37
Thank you for having me.
Jon Krohn: 00:03:39
Nice. Yes. So Polars, you’ve been deeply involved in the development and maintenance of popular open-source dataframe manipulation libraries. So, Pandas is probably the one that our listeners are most familiar with. I think probably anyone who’s a hands-on data science practitioner is used to manipulating dataframes in the Pandas library in Python. But Polars is increasingly popular and it’s developed by Quansight Labs. Actually, I guess a lot of support for Polars comes from Quansight Labs.
Marco Gorelli: 00:04:18
I think I’m the only person in Quansight Labs who’s contributing to Polars. Polars came out of Ritchie Vink, a developer in the Netherlands. It was originally his lockdown project in 2020, and then last year he started a company around it, imaginatively called Polars.
Jon Krohn: 00:04:39
Yes, and I guess the idea of the name Polars is that it comes from a panda bear and a polar bear. That must be it, right?
Marco Gorelli: 00:04:47
That’s part of it. The other part is that it ends with R-S. Polars is written in Rust and the file extension for Rust files is typically dot R-S.
Jon Krohn: 00:04:58
Actually, because I know Rust is going to be important to this conversation, tell us a bit about the Rust programming language and why somebody should maybe consider using it over other languages.
Marco Gorelli: 00:05:10
All right, yeah, that’s a fun one. I started learning Rust, not because I particularly wanted to learn Rust, but because I wanted to contribute to Polars. I just tried using Polars one Saturday afternoon while procrastinating on some life admin, found a little bug, thought it might be fun to fix it, and got a bit addicted to the process. What people usually highlight about Rust that’s nice is memory safety. So Rust has some built-in mechanisms which make it quite difficult for you to make certain kinds of mistakes, which are a lot easier to make in certain other programming languages. It’s also got quite a readable syntax, and Rust has been around for, I think, at least 10 years now, so IDE support is really nice. You get lots of really nice support if you are writing Rust in VS Code, and I think that’s why it consistently ranks as one of the most admired languages in the Stack Overflow developer surveys.
Jon Krohn: 00:06:13
It does, it does. That I’ve seen for sure. And so what is it that makes it easier to program in Rust? I’ve never looked at a line of Rust code myself, but I’ve read and heard things like: it compiles very easily, and you’re very unlikely to run into programming errors that you put in.
Marco Gorelli: 00:06:33
Well, it compiles slowly. I don’t know about easily. In fact, most people, when they try writing Rust, experience fighting the borrow checker. That’s the part of Rust which enforces certain constraints, which make it hard to run into certain kinds of bugs, as you’ve said. But it can also be a bit annoying, especially at first, when you can’t necessarily understand why the compiler is rejecting code which to you looks perfectly safe. But it’s doing it to save you, and you really appreciate that later. Once you get it working, you’re a lot more confident in what you’ve written.
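To make the borrow checker concrete, here is a minimal sketch of the kind of code it rejects; the comments paraphrase what rustc typically reports:

```rust
fn main() {
    let mut scores = vec![1, 2, 3];
    let first = &scores[0]; // immutable borrow of `scores` begins here
    scores.push(4); // rejected: cannot borrow `scores` as mutable
                    // while it is also borrowed as immutable
    println!("first score: {first}"); // immutable borrow still in use here
}
```

The rejection is not pedantry: `push` may reallocate the vector’s storage and leave `first` dangling, which is exactly the class of memory bug Marco is describing.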
Jon Krohn: 00:07:11
So it’s actually the opposite. It isn’t that Rust compiles easily, it’s that Rust compiles with quite a bit of complaining, and that’s what makes it such a desired language. So it does the complaining for you instead of, I guess, your downstream users or clients.
Marco Gorelli: 00:07:26
I’d say if you’re in it for the long run, it’s a good choice. If you just need to do some quick experimentation, some quick EDA, maybe not. I think that leads to one of the design decisions behind Polars. So Polars is written in Rust, but it’s got a Python API. The idea is that most data scientists should interact with Polars directly through the Python API. That’s something they’re probably familiar with, that can fit in with the rest of their toolchain, but development of the library itself happens in Rust.
Jon Krohn: 00:08:02
Very cool. So yeah, so back to Polars more specifically. So now we know its Rust background. We know that even the RS suffix on it is related to the Rust filenames, so that’s clever. We know that you develop in Rust in order to be developing the Polars library. For somebody who is a data scientist who isn’t necessarily a software developer like you are, but for somebody who wants to be taking advantage of Polars, why should somebody install Polars into their Python instance instead of Pandas?
Marco Gorelli: 00:08:40
I hate to give the boring answer of, it depends, but that’s often the answer to lots of technology questions. So my general advice is if it’s not broken, don’t fix it. If you’ve got an existing Pandas project that works absolutely fine for you, then I think there’s probably better things for you to focus on than rewriting it in Polars. But if you’re starting a new data science project, then that’s when I typically recommend people, “Okay, this is a good time to give Polars a go.”
00:09:08
I think if you start a new project and you try to think in Polars right from the start, you’ll end up writing idiomatic code and you’ll have a lot of fun. Something a lot of Polars users say is that it’s surprisingly pleasant to write Polars code and it’s nice to see what the library does for you. The syntax is very nice. I think the API is one of the major innovations that the library has brought, aside from just a phenomenally good implementation.
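For listeners who haven’t seen that syntax yet, here’s a small sketch of what expression-style Polars code looks like; the column names are invented for illustration:

```python
import polars as pl

df = pl.DataFrame({
    "store": ["a", "a", "b"],
    "sales": [10.0, 20.0, 5.0],
})

# expressions like pl.col(...) describe a computation; contexts such as
# with_columns and filter are where they actually run against the frame
result = (
    df.with_columns(sales_share=pl.col("sales") / pl.col("sales").sum())
      .filter(pl.col("store") == "a")
)
print(result)
```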
Jon Krohn: 00:09:38
I would’ve maybe assumed that the API, the syntax, would be similar to Pandas, but actually what you’re saying is it’s quite different.
Marco Gorelli: 00:09:48
That’s right. Yeah. So the idea of trying to reimplement the Pandas API but with a different, faster backend and all of that, it’s been tried with varying degrees of success. With Polars, I think this is a really nice success story. Ritchie just had the courage to try something different, to say, “Well, Pandas, it’s successful, it’s popular, it does what it does. Let’s try doing something different. Let’s try not having row labels. Let’s just not have an index.”
00:10:19
I think any of your listeners who are familiar with Pandas, most of them are probably used to having to call reset_index every two or three lines of Pandas code in order to get things to work. There are Pandas users who use the index very intentionally and they can make great use of it. You can get performance improvements from using the index very intentionally, but for the majority of Pandas users, I think it’s probably more of an annoyance than anything else. And so I think Polars has really made a good design decision here. Most users don’t need to worry about their rows having labels. A side effect of this is that it makes certain performance optimizations easier, and the company is now working on distributing Polars.
Jon Krohn: 00:11:11
The company, Quansight?
Marco Gorelli: 00:11:13
Sorry. The Polars company.
Jon Krohn: 00:11:14
The Polars company?
Marco Gorelli: 00:11:15
Yeah, exactly. So when it comes to distributing Polars, then it should be easier to do that if you don’t have to worry about having an index. Whereas companies that have tried distributing Pandas, like Dask, they do have an index, but it does cause some difficulties.
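To make the index point concrete, a rough sketch with a toy frame: a Pandas group-by moves the key into the index, which is why reset_index shows up so often, while Polars has no row index to manage at all.

```python
import pandas as pd
import polars as pl

data = {"store": ["a", "a", "b"], "sales": [10, 20, 5]}

# pandas: the group key becomes the index, so it's common to reset it
totals_pd = pd.DataFrame(data).groupby("store").sum().reset_index()

# polars: there is no row index; group_by returns an ordinary dataframe
totals_pl = pl.DataFrame(data).group_by("store").agg(pl.col("sales").sum())
```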
Jon Krohn: 00:11:33
I see, I see. So there is a Polars company that is commercializing the Polars open source library that anybody can access and install?
Marco Gorelli: 00:11:42
Yeah, that’s right. So there’s a company that’s behind the open source software. Most of the core developers are hired by the company. The open-source software Polars is and always will be open source, according to Ritchie. However, they’re also going to make some other offerings, like a cloud offering and distributed Polars. These are things that are going to be paid services, and that’s what the company is working on.
Jon Krohn: 00:12:15
Makes perfect sense. Hopefully this isn’t a controversial question, but Quansight Labs knows that you spend a fair bit of your time on the Polars project, and we have some more questions for you later in the episode on how Quansight supports their employees splitting time between consulting and open source development. Do you know why that works so well? Do you know why you get so much support on developing Polars?
Marco Gorelli: 00:12:42
Well, I’ve brought into the company some clients who have wanted training in Polars, both for teaching their employees how to use Polars and teaching their employees some more advanced tricks, like how to extend Polars with Rust plugins. We’ve also had some clients who’ve specifically wanted help with solutions that have heavily leveraged Polars. So for the company, it works well to say that they’ve got somebody who’s invested in contributing to Polars who can help clients, and it works for me because I get to do a bit of both. I’m really happy to have this balance at the moment.
Jon Krohn: 00:13:23
Are you stuck between optimizing latency and lowering your inference costs as you build your generative AI applications? Find out why more ML developers are moving toward AWS Trainium and Inferentia to build and serve their large language models. You can save up to 50% on training costs with AWS Trainium chips and up to 40% on inference costs with AWS Inferentia chips. Trainium and Inferentia will help you achieve higher performance, lower costs, and be more sustainable. Check out the links in the show notes to learn more. All right, now back to our show.
00:14:00
Awesome. Another aspect of Polars that I understand, so you’ve mostly so far been talking about Polars being a great choice for people who want to be manipulating dataframes and have more fun, have an easier time with the syntax relative to what they might be used to. But previously, in another interview, you described Polars expressions as functions that only take effect once you put them inside the dataframe context. Can you provide an example of how this lazy evaluation benefits data processing, and any concerns people should have as users when they do evaluate in this way?
Marco Gorelli: 00:14:37
Oh, that’s fantastic. Yeah. Expressions are really one of Polars’ innovations. I don’t think it’s something that Polars invented. PySpark had something similar, and some R libraries I think had something similar, but the way they work in Polars, I think of an expression as a function from a dataframe to a sequence of series. Most users don’t think of it in these terms. Most users just think of it as grabbing a column from a dataframe and then doing some operation on it. People usually get an intuition for what expressions do fairly quickly. In terms of advantages, apart from just how nice the syntax is to manipulate: the fact that an expression is just a function means it doesn’t need to be evaluated right away.
00:15:31
It means that when you’ve got the dataframe context, Polars can analyze the different expressions which you’ve passed in and they can apply certain optimizations. For example, the classic example that Ritchie gives is if you are taking a column and doing a sort and then selecting the first five elements, then this has got N log N complexity, but you could just do a top K algorithm, and then the complexity there should be linear, I think something like that.
00:16:03
Another example is you might be doing feature engineering, you might be making two features which both start with something very similar. I don’t know, take the absolute value of the logarithm of something, and then in one feature you’re doing a shift one and in the other feature you’re doing a shift two. People are often making features where part of the calculation is very similar. So then Polars can do common subplan elimination. It can see that some parts of the expressions are very similar. It can just assign that to a temporary variable, just calculate that once and then reuse that between the different features.
00:16:43
Another advantage of using expressions in dataframes is that it lends itself very nicely to parallelization. So if you’re just making a single operation on a single column, then it’s often just not worth it to set up the overhead of doing multi-threading. But if you’re calculating, let’s say, five different features which are independent of each other, then it’s quite natural to say, “Okay, we’ll do these five in parallel.” People can often get 10x, 20x, even 100x improvements by writing things in Polars compared to what they might’ve got with some other frameworks.
Jon Krohn: 00:17:27
Wow. That 10x, 50x, 100x, that includes the parallelization?
Marco Gorelli: 00:17:33
Including everything. Including parallelization, including query optimization that we get from doing things lazily, just the whole package. It’s going to give you quite a significant advantage both in terms of runtime and in terms of memory.
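A minimal sketch of those ideas together in code, with invented column names; the optimized plan can be inspected with explain() before anything runs:

```python
import polars as pl

lf = pl.LazyFrame({"x": [3.0, 1.0, 4.0, 1.5], "y": [2.0, 0.5, 1.0, 3.0]})

query = (
    lf.with_columns(base=pl.col("x").log().abs())  # shared sub-expression
      .with_columns(
          f1=pl.col("base").shift(1),  # independent features like these
          f2=pl.col("base").shift(2),  # are candidates for parallel execution
      )
      .sort("y")
      .head(5)  # the optimizer can rewrite sort-then-head as a top-k
)

print(query.explain())    # show the optimized plan; nothing has executed yet
result = query.collect()  # execution only happens here
```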
Jon Krohn: 00:17:50
Nice. And let me try to break down that lazy term a bit for listeners who might not know it, maybe in the context of what you just said. So say the code that I was working with was working in an unlazy way, which could be a Pandas dataframe. If I have a Pandas dataframe with only a hundred rows or a thousand rows and I want to do a sort like you described before, where I take the top five after a sort, then with only a hundred rows or a thousand rows in my dataframe, in real time I’m not going to notice any problems with that kind of evaluation. But if I have a million rows or a billion rows, then with that Pandas dataframe, I’m going to be just sitting there for who knows how long while I’m waiting for that sort to actively execute. But with the kind of lazy evaluation that is supported by Polars behind the scenes, it doesn’t actually execute the code until I ask for some kind of output.
00:19:00
And when I ask for that output, there’s lots of performance optimizations behind the scenes, like you described in much better detail than I could. But the net effect is that if I need that sort on a huge dataframe to happen, it’s not actively executed in a simple-minded way. It’s lazily executed in a more clever way, lazy meaning that it doesn’t execute until it has to. Because of that lazy, doesn’t-execute-until-it-has-to, performance-optimized, behind-the-scenes execution, you get these huge speedups like you described with the sort scenario. To use computer science terminology, it was a linear increase in compute as your dataframe gets larger, as opposed to N log N, which is much more computationally expensive as things get larger. Did I do an all right job of trying to recap what you said there?
Marco Gorelli: 00:20:01
Yeah, totally. I think you got the spirit of it perfectly.
Jon Krohn: 00:20:05
Nice. All right, so another aspect of Polars that allows it to differ from other libraries is that it optimizes string operations and data processing in particular. Do you want to talk about that?
Marco Gorelli: 00:20:20
Sure. Right. We need to make a little Pandas and NumPy comparison here. So we need to go back in history a bit. Pandas was originally built on top of NumPy, and NumPy has not traditionally had a string data type. It does have one since NumPy version 2, but traditionally, if you wanted to store strings, and maybe they’re of different lengths and all of that, you were going to have to just use an object data type in NumPy. With the object data type, every element is just a pointer to a string, and that comes with all kinds of performance and memory footguns. So that’s the historical part, then.
00:21:12
In Pandas, this has traditionally been a bit of a weak point. I think since Pandas 1.5, it’s been possible to leverage PyArrow to use specialized string storage. How that works is there’s a really long string behind the scenes, and for each string in your series, Pandas is recording where the string for that particular row starts and where it ends, and like this, it ends up with better performance and memory characteristics compared to just using the classic NumPy object data type. Polars has taken it even further, and they’ve got a whole different kind of string. They’ve written a whole blog post about this, and it enables further optimizations, especially if you’ve got repeated strings. So that’s the deal. Polars makes working with strings really nice. It also just does this natively. You don’t need PyArrow installed in order to make use of Polars strings.
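In code, opting in and out of those string representations looks roughly like this; the PyArrow-backed dtype needs pyarrow installed, while Polars needs nothing extra:

```python
import pandas as pd
import polars as pl

words = ["narwhal", "polar", "panda"] * 1_000

s_object = pd.Series(words)                          # classic: object dtype,
                                                     # one Python object per row
s_arrow = pd.Series(words, dtype="string[pyarrow]")  # the PyArrow-backed path
s_polars = pl.Series(words)                          # native Polars String dtype
```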
Jon Krohn: 00:22:29
And I guess this is of ever-increasing importance with how natural language processing is becoming more and more of data science. So there was a time, when Travis Oliphant, who we talked about at the outset of the episode, would’ve created NumPy and SciPy, when for almost everyone who was using Python, I don’t have the stats on this, but just based on my experience and seeing what was happening out there, most of the time you were working with tabular data. And those tabular data, by and large, they were numeric. I mean, for sure with NumPy, and Pandas was designed to go a bit beyond that and be able to handle lots of different data types in one matrix structure, where you have one column that’s strings, one column that’s numbers, and so on.
00:23:14
So more like working with the kind of data that are in a spreadsheet that you might have in Excel. But we are now in this era of data science where natural language processing capabilities are so profound thanks to things like large language models, transformers, generative AI, we have so much more interest in natural language processing than ever before. And so it seems to me like having these string optimizations will come in handy.
Marco Gorelli: 00:23:46
Yeah, definitely. I mean, even if you’re not working in NLP, if you’re working in traditional data science, you’re probably working with some columns which are strings, like maybe you’ve got a column which tells you the name of your vendor or the name of your supplier and all of that. You can see the difference that this makes with the TPC-H queries. So this is a set of popular database benchmarks. It was originally written for SQL engines, but it’s been adapted to dataframes, and you can see the difference between running those in Pandas with just the classic data types, and then in Pandas where the only difference you make is to use PyArrow strings instead of the classic object data type. Typically most queries get about twice as fast, even though in those queries you’re not doing anything string-specific. Just doing a join that includes string columns, even just comparing two columns for equality, any operation where strings are there in the middle, it benefits from this.
Jon Krohn: 00:24:47
Very cool. So again, you talked about how you can get big performance improvements. You talked about 2x just now, even in situations where there aren’t string operations. You talked not long ago in the episode about a 10x, 50x, or even 100x speed-up in some situations thanks to the lazy execution and other optimizations that exist in Polars. Do you happen to have any specific case studies that come to mind where these really big performance gains happened? So where you’re working with large data sets and you leveraged Polars and Rust to get a huge performance improvement relative to if somebody was using, say, Pandas and NumPy.
Marco Gorelli: 00:25:34
You’re in luck. I do.
Jon Krohn: 00:25:38
I didn’t prepare him for these questions, so that sounded almost cheesy and teed up.
Marco Gorelli: 00:25:45
No, yeah, I’ve got a case study that’s really in my mind for this. I was recently working with a client who had lots of data that they wanted to geocode. Maybe we should explain to the listeners the meaning of geocoding and reverse geocoding. So geocoding is the practice of, when you’re given an address, determining its latitude and longitude. And then reverse geocoding, that’s the opposite. You’re given some latitude and longitude and you want to work backwards and get the closest address. So this is something that’s used in a variety of sectors, from trying to identify landmarks to advertisements. Lots of industries are interested in doing these kinds of operations, and typically the way you do it is with really big data sets. You’ve got big lookup data sets with lots of addresses, you’re trying to match things. And how can you do this quickly?
00:26:40
In particular, this client was quite interested in doing this in a way that could save them money, and they were really interested in seeing how far we could get, let’s say, on a single node or even on AWS Lambda, and we found that actually we could do the entire thing on AWS Lambda. AWS Lambda is a very constrained computing environment. There, you’ve got a maximum of 15 minutes to complete a job and a maximum of 250 megabytes for your package size. So that means that installing, let’s say, the newest versions of Pandas, PyArrow and NumPy all together, it just wouldn’t fit.
00:27:19
However, we found that Polars is fairly lightweight. We didn’t need PyArrow, Pandas or NumPy for this task. We could make it work. On the packaging side, we also needed some Rust extensions, and I think that’s not super easy to get into AWS Lambda, but here this allows me to talk about one of Polars’ superpowers, and that is that you can extend it. You can make Polars plugins, you can write your own Rust extensions for Polars, which you can then distribute on PyPI, and people can just pip install them as they would any other Python package, and then it just fits in naturally with the rest of your Polars workflow as if it was part of Polars itself. So for this client, we wrote one new Polars plugin for this task. We also leveraged a couple of Polars plugins which the community had already built, which happened to perfectly suit our use case.
00:28:17
So I think we were using three Polars plugins, then Boto3, s3fs, and some other packages which are just used for cloud computing on AWS, and we could fit all of this easily within the limit. Now come the data constraints, because in AWS Lambda, I think you’ve got a limit of 10 gigabytes of RAM, but we needed to query hundreds of gigabytes of data. This is where lazy execution really helps, in particular lazy execution with Polars plugins. So we could say: scan all of this data, and then use the plugins to determine which rows need reading and only read those. So then Polars knows which parts of the data set, of all of these hundreds of gigabytes, it needs to read, based on the Rust extensions we’d written, which really customized things, and in the end it could all fit within the memory and time constraints. I think this would have been possible using other technologies, but I am pretty confident that it would not have been so easy. Polars really made it easy for us.
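The bucket, paths, and column names below are invented, but the shape of the approach Marco describes looks roughly like this: scan lazily, filter, and let Polars push the filter down so only the needed rows and files are ever read.

```python
import polars as pl

# hypothetical lookup dataset, partitioned into parquet files on S3
lf = pl.scan_parquet("s3://example-bucket/geocode-lookup/*.parquet")

result = (
    lf.filter(
        # predicate pushdown: the scan reads only rows in this bounding box,
        # so hundreds of gigabytes never need to fit in memory at once
        pl.col("lat").is_between(51.3, 51.7)
        & pl.col("lon").is_between(-0.5, 0.3)
    )
    .select("lat", "lon", "address")
    .collect()
)
```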
Jon Krohn: 00:29:31
If you’re a regular listener, you know that last year I did a European podcast tour interviewing incredible guests in Amsterdam, Paris and Berlin. While all the guests spoke perfect English, Babbel was invaluable for me to learn and practice Dutch, French and German enabling me to get directions and order my meals in the local language. Super fun, rewarding and in some cases an essential skill. Now you can do the same with a special limited time deal. Right now get up to 60% off your Babbel subscription, but only for our listeners at babbel.com slash superdata. Get up to 60% off at babbel.com slash super data spelled B-A-B-B-E-L dot com slash superdata. Rules and restrictions may apply.
00:30:18
That’s a very cool case study. To summarize back some of the key points from that: you had mentioned already earlier in the episode how one of the advantages of Polars is that it can be extended with Rust, but now we got a sense of what that really means. These Rust extensions become add-ons that you can very easily install through PyPI, just like if you were bringing in Pandas or Polars. And those Rust add-ons can be customized for tasks like going over hundreds of gigabytes of data and identifying relevant rows, therefore allowing you, even running in that highly constrained AWS Lambda environment with a small amount of code, to execute highly efficiently. It’s a really cool case study. Did I get that right?
Marco Gorelli: 00:31:15
Yeah, yeah, absolutely. I like the summarizing. Listen to the last episode, on Supercommunicators, for more on that.
Jon Krohn: 00:31:21
It’s true. Yeah, the preceding episode before I was recording with Marco today was episode number 805 with Charles Duhigg. Charles Duhigg is a Pulitzer Prize-winning journalist. He’s a many bestselling authors… a many-time bestselling author, I should say. He’s not multiple authors, not that I’m aware of. I don’t think he has a secret pseudonym. But his most recent book is called Supercommunicators and we talked about it a fair bit in that episode 805. And it’s interesting, because I don’t usually have mainstream authors on the show. We sometimes have data storytelling experts like Cole Nussbaumer Knaflic, but those are people who, even though her data storytelling book was a huge bestseller, come from a technical data science background. Charles wasn’t like that, but he was just such a big mainstream author that when he approached me about coming on the show I was like, “Sure, let’s do it.” Because who can’t benefit from knowing more about communication?
Marco Gorelli: 00:32:26
Yeah, exactly. It was a great episode. I’m sure lots of people enjoyed it. On the Rust extensions, though, something I’d like to clarify is that it’s not as scary as it sounds. I’m sure if we could see people’s faces in the audience, we’d be seeing some blank stares. When I’ve talked about Polars plugins at conferences, that’s often what happens. People think, “Who is this guy and why is he expecting us to write Rust extensions?”
00:32:51
And my claim is it’s not that difficult to make a Polars plugin. Polars has some really complicated Rust code inside it, but with the Polars plugin mechanism, that complexity is abstracted away enough that, if you can express your business logic in Python, it’s not that much of a stretch, if you just know the basics of Rust, to translate that into a Polars plugin. I’ve got a Polars plugins tutorial online. If you just search “Marco Gorelli Polars plugins”, I’d like to think you can find it. It’s a freely available resource, and it can teach you how you, coming from a Python background, can learn just enough Rust to write your own Polars plugins. It covers how to distribute them, different data types, some performance tips.
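As a flavor of what the tutorial covers, here is a tiny plugin sketch in the style of the pyo3-polars crate; the function is invented and exact signatures vary between versions, so treat this as illustrative rather than copy-paste:

```rust
// Rust side of a hypothetical Polars plugin, built with pyo3-polars
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

#[polars_expr(output_type=Int64)]
fn abs_i64(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].i64()?; // view the first input column as i64
    // apply the business logic element-wise, preserving nulls
    let out: Int64Chunked = ca.apply(|opt| opt.map(|v| v.abs()));
    Ok(out.into_series())
}
```

On the Python side, you register the compiled function as an expression and call it like any built-in; the tutorial walks through that registration step plus packaging and publishing to PyPI.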
Jon Krohn: 00:33:47
I’m sure we’ll include that in the show notes. I frequently come on and talk about our researcher Serg Masis who, as you’re already experiencing today, Marco, has done incredible research digging up lots of great topic areas and specific technical questions, which is super helpful. But someone else who is invaluable on the show, and I don’t talk about her enough on air, she also deserves praise, is our podcast manager Ivana. Anytime we mention things like this, the guest mentions some blog post of theirs or some tutorial like you just mentioned that listeners can download, she goes and makes sure that it’s there for you in the show notes, keeping everything organized and on time. And that’s how we get 104 episodes a year all released on time, for many, many years in a row, thanks to her. So anyway.
Marco Gorelli: 00:34:33
Nice one, Ivana.
Jon Krohn: 00:34:34
Yeah, exactly. So back to the amazing Serg topic flow and the kinds of questions that he has covered. You mentioned geocoding in your last example, and we didn’t actually even really get back to how that turned out. After you’d gone through the performance optimizations, what was the net effect for the geocoding? Maybe we should draw a line under that quickly.
Marco Gorelli: 00:34:58
Yeah, sure. The net effect was that the client was able to go from having to make expensive API calls, from having to run something on a cluster with multiple nodes, to just being able to run something in the most constrained possible computing environment. So for them it was a saving in terms of time, in terms of compute resources, in terms of maintainability. They were really happy with the solution.
Jon Krohn: 00:35:24
Very cool. And so when I think about geocoding, there’s something that seems vaguely related. You, maybe like me, travel around the world a lot, and something that moves around as we move around is time zones. They’re a huge pain not only for our body clocks, but also for anybody who’s developing software and having to manage across many time zones. So you’ve cautioned previously against manually managing time zones. You did this in a conference talk. Could you elaborate on the challenges that you faced when trying to manually handle time zones and why software like Polars is a more reliable solution?
Marco Gorelli: 00:36:10
Sure. I remember seeing a colleague trying to manually put in an if-then statement to deal with daylight savings time. Yeah, that’s a bad idea. Chances are you won’t be able to get it right; you won’t get the direction right. And then you need to communicate with people in different countries, and, like, in the US they observe daylight savings time at different times than we do in the UK.
Jon Krohn: 00:36:34
I think Arizona doesn’t observe daylight savings time at all.
Marco Gorelli: 00:36:37
Oh, right. Yeah. Lots of countries don’t observe it at all. I think in Morocco they do, but they go in the opposite direction. Time zones are a mess. And then you find that countries have changed time zone offsets. Like in the UK, we’re at plus zero most of the year, sometimes plus one. Some countries at some point in time decided, “Yeah, we’re going to go from minus 13 to plus 11, that’s it.” And you might think that it’s manageable to do this by hand, but it’s really not. You want to leverage a third-party library to do this. So Polars has full support for time zones, in the sense that any operation which is time-aware, you can do it respecting time zones. My advice with time zones is you should avoid them if you can. They’re an absolute mess and they also come with a bit of a performance hit. But sometimes they’re just a necessity.
00:37:34
Unfortunately, you can’t necessarily avoid time zones. If your boss is asking you for daily sales or something and your company is selling at every hour of the day, then you’d better do it in a time-zone-aware manner. If you take the trick which some people suggest, of converting to UTC, doing your analysis and then converting back, then you’re going to miss subtleties due to daylight savings time and other things. So just converting to UTC and back might be a solution in some cases, but definitely not in all cases, which is why it’s important for software like Polars to have good support for time zones. Personally, one benefit that I got from getting involved in time zones is that it led me to get involved in Polars. So, side note, open source tip. Sometimes people ask, “How do you get involved in open source? How do you start contributing to something like Polars that looks so complicated?”
00:38:29
And my advice is, if you’ve got some topic that’s valuable and that’s interesting to you but is boring to other people, then that’s your competitive advantage. When I started contributing to Polars, I noticed that a lot of the time zone stuff just hadn’t been done. The other maintainers just didn’t find it very interesting, or found it frustrating, and all of that. And I was like, “Okay, well, I don’t know Rust, maybe this can be a bit of a win-win situation. I’ll help you with your time zones, you help me with my Rust.” And yeah, it worked. I just started fixing stuff up, learned a lot about Rust in the meantime, and got totally addicted to the process. So I need to pair my open source tip with a word of caution.
00:39:17
You might get started with something, you might find that people are appreciating what you do, but it’s very difficult to then not get addicted to it. I call it a legalized drug, just something to keep in mind. A lot of people who then do a lot of open source end up doing a lot of it in their spare time, and it’s not so easy to strike a balance. I can’t offer very good advice when it comes to striking a balance between life and open source, because I’m not there yet, but I’m working on it.
Jon Krohn: 00:39:47
Yeah, I understand what that can be like. It’s definitely not the same thing. So I have not done much open source contributing at all. It’s been many years since I’ve done any. I mean, I open source Python scripts in Jupyter notebooks when I do tutorials. So for YouTube videos that I make on machine learning foundations or introductory deep learning tutorials, I do open source my code and sometimes people make issues and then I resolve them or there’s some small amount of collaboration. But that doesn’t really feel to me like the kind of open source collaboration that you’re describing if you’re working on Rust or Polars and it’s this big ecosystem with I imagine hundreds of contributors and everybody has their own little piece and everything needs to work together and execute properly. And so I haven’t done that kind of stuff to a significant extent.
Marco Gorelli: 00:40:47
I mean, that is really enabled by all of the modern tools we have, like GitHub and all of the CI minutes that GitHub gives us. Otherwise, it’d be so difficult to coordinate between hundreds of contributors. What I find the hardest is the people part, like API decisions. This part is really difficult. Something I noticed recently in Pandas is there’s a function which does not behave as its docstring says it does, and it’s been like that since at least 2019. And now what do we do? Do we correct it? But then it’s going to break the code of people who are relying on it. Do we update the docstring? But then the behavior that it does have seems rather odd. It is just so difficult to make that kind of decision. Whatever you do, you’re going to anger some people. Yeah, that’s the hardest part of open source. In the end, technical issues are relatively easy compared to some of the people ones.
Jon Krohn: 00:41:44
Yeah, version issues are a pain for sure. So anyway, we’ve kind of digressed over here. We were talking about time zones, and you got talking about how your interest in time zones was a stepping stone for you to learn Rust, but also to contribute to the Polars time zone functionality. And so it was a win-win, and since then you’ve been addicted to open source contribution.
Marco Gorelli: 00:42:11
Something like that. Yeah.
Jon Krohn: 00:42:14
Now that you’ve been working on it, what is it about the way that Polars handles time zones? How does that look and feel different for me as a Polars user?
Marco Gorelli: 00:42:25
Compared to before I started contributing?
Jon Krohn: 00:42:27
Or compared to other alternatives out there?
Marco Gorelli: 00:42:30
Well, yeah. Compared to other alternatives that work, I’d like to think you shouldn’t notice much of a difference. Compared to before I started contributing, the difference is that the time zone is typically taken into account when you’re doing calculations. I remember some of the early bugs that I would see were things like: if I try to calculate the daily average, then the daily average is done on UTC time rather than on the local time. It’s like, okay, yeah, we need to fix that. And then it’s like, oh, but if I parse this data then it just errors because of an ambiguous datetime; there should be a way to resolve ambiguous datetimes. So when we do daylight savings, we shift the clock back at one point of the year and we shift the clock forwards at another part of the year. So when we turn the clock back, then we’re essentially repeating the same times multiple times. I’m using the word “times” there to mean different things. Sorry about that.
Jon Krohn: 00:43:34
Yeah, exactly. When the clock goes backward, for example, you end up having, I think it switches at 2:00 AM or-
Marco Gorelli: 00:43:44
Yeah, exactly. So 2:30 or something. It’s going to happen twice.
Jon Krohn: 00:43:47
Twice in one night.
Marco Gorelli: 00:43:48
And if you’re trying to parse a string which contains 2:30 on the 25th of October 2020, how do you know which 2:30 it refers to? It must be an absolute nightmare being a policeman and having to go through reports where people are trying to reconstruct what happened. Anyway, if you’re doing this in Polars, there needs to be a way to deal with that. So I introduced an ambiguous argument to the datetime functions, similar to what there is in Pandas, and it at least gives you a way to deal with it.
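A small sketch of both behaviors in Polars; the data is invented, and the keyword names are as documented at the time of writing, so check the current docs:

```python
import polars as pl
from datetime import datetime

# hourly UTC timestamps spanning the 2020-10-25 clock change in London
df = pl.DataFrame({
    "ts": pl.datetime_range(
        datetime(2020, 10, 24), datetime(2020, 10, 26),
        interval="1h", time_zone="UTC", eager=True,
    )
}).with_columns(sales=pl.lit(1.0))

# daily averages computed on local wall-clock days, not UTC days
daily = (
    df.with_columns(pl.col("ts").dt.convert_time_zone("Europe/London"))
      .sort("ts")
      .group_by_dynamic("ts", every="1d")
      .agg(pl.col("sales").mean())
)

# in London the clocks went back that night, so 01:30 occurred twice;
# the ambiguous argument says which occurrence a naive string refers to
s = pl.Series(["2020-10-25 01:30:00"]).str.to_datetime()
first = s.dt.replace_time_zone("Europe/London", ambiguous="earliest")
```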
Jon Krohn: 00:44:18
In a recent episode of this podcast, the mathematical optimization guru Jerry Yurchisin joined us to detail how you can leverage mathematical optimization to drive commercial decision-making, giving you the confidence to deliver provably optimal decisions. This is where Gurobi Optimization comes into play. Trusted by most of the world’s leading enterprises, Gurobi’s cutting-edge optimization solver, lightweight APIs, and flexible deployment simplify the data-to-decision journey. And, thankfully, if you’re new to mathematical optimization approaches, Gurobi offers a wealth of resources for data scientists, including hands-on training, comprehensive Jupyter-notebook examples, and extensive, free online courses. Check out Episode #813 of this podcast to learn more about mathematical optimization and all of these great resources from Gurobi. That’s Episode #813.
00:45:08
Very cool. So Polars might be something that people have heard about for the first time from this episode, whereas even in my introduction to Polars, I assumed that pretty much anyone who’s a hands-on data scientist knows Pandas and has very likely even used Pandas. So with this rapid development of Polars, where do you see it heading? It’s blossoming in popularity. I hear people talking about it more and more. I think that’s a big part of why I was like, “We’ve got to get Marco on and have a Polars episode.” We haven’t talked about that yet, and I feel like it’s something everyone needs to know about. So where do you think Polars is heading? Where do you think it’s going next in its evolution? I don’t know if I’ve teed you up enough with this question. You might already have some thoughts on how to answer.
Marco Gorelli: 00:45:59
Sure. So I think there’s two parts of Polars we need to talk about. One is the implementation itself and the other is the API. The implementation follows its own API, of course, but the API can take on a bit of a life of its own, just like the Pandas API took on a bit of a life of its own. So Pandas follows the Pandas API, but then we saw Modin come along, which also follows the Pandas API, and then FireDucks and cuDF, and now we’re seeing that Polars might be going in a similar direction. Modin is a dataframe library which historically has promised to distribute your Pandas code, and now they’re also offering a Polars API. Now, they haven’t released details of what’s going on under the hood with regards to the engine, but the fact that Polars has become popular enough that they’re like, “Okay, we need to do something with this.”
00:46:53
It’s a good sign. We’re seeing that Nvidia are contributing GPU support to Polars. So earlier you were talking about how, when you’ve got lazy execution, at some point you want to actually see the results; you need to tell Polars that you want to see the results. What the team is working towards is that when you tell Polars you want to see the results, you can tell it, “Compute this for me, but do it on GPU.” So then you’re making use of both GPU acceleration and query optimization. I don’t think the world is ready for such levels of speed.
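Based on the preview that Nvidia and Polars have described, using it is meant to be a one-argument change at collection time; treat the details here as subject to change:

```python
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3, 4]})

# same lazy query; the optimized plan is handed to the GPU engine
# (requires the cuDF-based backend, e.g. installing polars[gpu])
result = lf.select(pl.col("a").sum()).collect(engine="gpu")
```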
00:47:29
So in terms of where we are going, I think that the library itself is going to grow, but as for the Polars API, I’d like to see it become a bit of a dataframe standard. When people make new dataframe libraries, and there will be new dataframe libraries, I don’t think Polars is the last one, I’d like to think that their API will be much more similar to the Polars one than to the Pandas one, which has dominated the dataframe API space in Python up until very recently.
Jon Krohn: 00:48:00
Very cool. It’s nice to hear your insights into what’s happening next and the interplay between different libraries and technologies, with Modin facilitating broader distribution and Nvidia supporting execution on GPUs. Something that must be very near and dear to your heart, or at least to your addiction, is another open source project that you created that allows for compatibility. This is Narwhals. So last year you described Narwhals as an extremely lightweight compatibility layer between Pandas and Polars. What is the problem that you’re addressing there with your Narwhals library?
Marco Gorelli: 00:48:52
Sure, thanks for asking about it. It still slightly cracks me up that we’ve got a library called Narwhals that we’re actually talking about. It started off as a little weekend project and I was feeling a bit silly, so I named it after a viral Mr. Weebl song, but now it’s being adopted by some people, so I guess we’re stuck with the name. Maybe before I go in too deep, just a slight correction: I think it was this year that I described it like that. I only released it back in February. It’s a very young project, but it’s quickly gaining a bit of traction.
Jon Krohn: 00:49:20
Yeah, I’m sure you’re right. That must be a rare research error.
Marco Gorelli: 00:49:26
Maybe not a research error, maybe confusion with another similar project. Before Narwhals, I was involved in a group called the DataFrame Consortium, which was trying to make a dataframe standard, like some dataframe API that different dataframe libraries could implement and that people could then write dataframe-agnostic code on top of. It’s difficult to get different people to agree, and the stakes here were pretty high. I just wasn’t able to agree with most people there. I found myself disagreeing with most of the participants about nearly everything. I wanted to bring things decidedly towards Polars. They wanted things to be not exactly like Pandas, but they didn’t want things to deviate too much from Pandas. They didn’t want things to deviate too much from what most people were familiar with and from what would be difficult for them to implement. So in the end, after having agreed to disagree, I said, “Well, let’s take all of these ideas which the consortium had rejected and let’s package them as their own thing. Let’s call it Narwhals and let’s see what happens.”
00:50:31
And the idea is it’s like what the dataframe standard was trying to be. So just some API which different backends can implement and which a library can then use to just define its transformations, to just define its dataframe logic. And then the user can bring their own dataframe, pass their own dataframe in, and they can just use it seamlessly as if that library was written specifically for their dataframe. So someone comes along with Pandas, they can just use it. Someone else comes along with Polars, they can just use it. The Pandas user doesn’t need to have Polars installed and the Polars user doesn’t need to have Pandas installed. This is what we’re aiming to enable. What kind of surprised me was that interest in it happened a lot faster than I was thinking.
00:51:18
So within about a month or two, we had scikit-lego, which is like a medium-sized library for some extra things which don’t quite fit into scikit-learn. They decided to adopt it, and aside from the fact that it’s kind of nice as a Polars user to be able to use a library without having to convert to Pandas, it also made a massive performance difference in some cases. So Polars, because of the reasons we described at the beginning of the episode, really excels at feature engineering: it can do things in parallel, it can do common subplan elimination. And so for the feature engineering functions in scikit-lego, doing it directly in Polars, as opposed to having to convert to Pandas, do the operation in Pandas and then convert back to Polars, can make a massive difference. I had one benchmark where I was seeing close to a 150x speed-up.
00:52:17
It was really quite massive. So yeah, I was pretty happy with that. And then what made me fall off my seat just a couple of weeks ago was that the major visualization library Altair has adopted Narwhals. And like this, that made NumPy optional, Pandas optional, PyArrow optional. I think PyArrow might’ve been optional from the start, but if you were trying to plot a Polars dataframe, you were required to have PyArrow installed, whereas now you can just pass a Polars dataframe to Altair. You don’t need NumPy, don’t need Pandas, don’t need PyArrow, and it’ll just plot it natively. So Polars users just need this very lightweight library and they can make beautiful, possibly interactive plots, and this is exactly what I was hoping to enable with Narwhals: just better adoption for Polars and other newer dataframe libraries, at no cost to their existing Pandas users.
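A minimal sketch of what writing dataframe-agnostic logic with Narwhals looks like; the function and column names here are invented:

```python
import narwhals as nw

def top_sales(df_native):
    """Works for pandas, Polars, and other Narwhals-supported backends."""
    df = nw.from_native(df_native)     # wrap whatever the user passed in
    out = (
        df.group_by("store")
          .agg(nw.col("sales").sum())  # Polars-style expression API
          .sort("sales", descending=True)
    )
    return nw.to_native(out)           # hand back the caller's own type
```

The caller never needs the other library installed: a pandas user gets a pandas frame back, a Polars user gets a Polars frame back.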
Jon Krohn: 00:53:12
Very cool. Let’s talk about Altair for a second here, because that is a library… It’s a Python library?
Marco Gorelli: 00:53:22
It’s a Python library, yes.
Jon Krohn: 00:53:25
I’m used to thinking about really the only two… well, okay, maybe I can think of three plotting libraries off the top of my head. Obviously, Matplotlib. Seaborn, which has been popular for years, is a slightly prettier… Because with Matplotlib, if you just stick with all the basic pre-installed configurations, you end up with pretty unattractive plots with quite abrupt colors next to each other, whereas Seaborn out of the box creates beautiful plots. The other library I can think of off the top of my head for plotting is Plotly, and I actually can’t, off the top of my head, remember why I would use Plotly.
Marco Gorelli: 00:54:12
Plotly is nice. Yeah, it makes really nice interactive plots for you pretty much out of the box.
Jon Krohn: 00:54:16
Interactive plots.
Marco Gorelli: 00:54:17
Yeah, I really recommend that one. Yeah, Seaborn’s nice as well. As you say, it’s like a wrapper around Matplotlib. With Matplotlib you can do anything, I think practically, literally anything, things you just didn’t even know were possible. You’ll find some answer on Stack Overflow where someone has given you an answer. But yeah, it’s not super user-friendly, and Seaborn makes it a lot easier. My hope is that maybe, if Narwhals can become more popular, Seaborn can trust us enough that we can rewrite Seaborn to be Narwhalified and people can pass Polars dataframes into Seaborn. Not currently possible, but hey, let’s see.
Jon Krohn: 00:54:56
Yeah, yeah, very cool. I’ve said very cool way too much in this episode, but I thought that a lot of the things that you’ve said are very cool. Hopefully my transitions have become a little bit more nuanced after I say that phrase. But with Altair specifically, this is a library that I’ve only just started to hear people talking about, but it is widely used. And so, why should a listener think about picking up Altair and using that library for plotting interactively?
Marco Gorelli: 00:55:26
I think the API is really nice and consistent and it just makes sense in your head, at least the way that I would think about making plots. They’ve got a nice grammar. There is a bit of a learning curve. You need to learn these rules, you need to learn about channels and marks, but once you get it, you can make plots and you can make them look nice and you can plot what you want. I think it might not be quite as highly customizable as something like Matplotlib. So if you need to make really super highly customized plots, then maybe it’s not the perfect solution. But I think for most data scientists who need to tell stories with their plots who need to understand data, I think it’s a really good solution.
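A tiny taste of that grammar, marks plus encoding channels, with invented data; thanks to the Narwhals work, a recent Altair can take a Polars frame directly:

```python
import altair as alt
import polars as pl

df = pl.DataFrame({"year": [2021, 2022, 2023], "downloads": [5, 20, 65]})

# the mark says how to draw; encoding channels map columns to visual properties
chart = alt.Chart(df).mark_line(point=True).encode(
    x="year:O",       # O = ordinal channel
    y="downloads:Q",  # Q = quantitative channel
)
chart.save("downloads.html")  # interactive HTML output
```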
Jon Krohn: 00:56:19
This Narwhals project, which you only started on a year ago, it sounds like-
Marco Gorelli: 00:56:23
February.
Jon Krohn: 00:56:24
You only started-
Marco Gorelli: 00:56:25
Less than a year.
Jon Krohn: 00:56:26
Right, right. It already has been picked up by places like Altair. It’s already making a big impact. Part of what’s so impressive about it is its minimal overhead and its lack of dependencies in its design. How hard or easy was it to get that level of efficiency? Were there some big technical challenges that you faced while developing it?
Marco Gorelli: 00:56:51
Sure. So on the dependency side, I don’t think it’s that difficult. I think it’s just a matter of willpower. If you want to keep your library dependency-free, it’s usually not that much of a stretch. In this case, suppose that you are a library which receives a dataframe from the user, and you want to know whether it’s a Pandas dataframe. The obvious solution is to try importing Pandas, and if that succeeds, you check if it’s a Pandas dataframe. But we can actually do better than that, because if somebody has passed us a Pandas dataframe, it means that they must have already imported Pandas. So we can just do import sys, check all of the libraries that the user has already imported in sys.modules, and see if Pandas is in there. And if it’s not, then obviously this cannot be a Pandas dataframe, so we don’t even need to try importing Pandas.
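A sketch of that trick; the helper name is invented:

```python
import sys

def is_pandas_dataframe(obj) -> bool:
    pd = sys.modules.get("pandas")  # look, but never import
    if pd is None:
        # pandas was never imported by anyone, so obj cannot be a
        # pandas DataFrame; we avoided paying for the import entirely
        return False
    return isinstance(obj, pd.DataFrame)
```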
00:57:42
So what we’re very strict about in Narwhals is we don’t import anything like this. We don’t risk introducing dependencies, and we don’t risk slowing things down by importing things just to check against types which the object just isn’t. So that’s a part of it. The other part of it is the overhead. That, I think, isn’t so immediate, because we’re translating syntax. The key there is to just have a good mental model of what Polars expressions are. And to me, an expression is just a function from a dataframe to a sequence of series. Once you define it that way, then chaining expressions together, chaining these calls, it’s just a matter of chaining lambda functions, one after the other. You just need to be very rigorous about recursively applying this definition everywhere, and it all just kind of happens. There is potential for overhead in the sense that Pandas does a lot of things with the index which you don’t necessarily want.
00:58:49
Pandas is always aligning indices with each other, but Polars doesn’t have a notion of an index. So if you want to make Pandas behavior mirror the Polars one with the same API, you need to be careful to avoid automated index realignment. The naive solution to that would be to do reset_index all the time, which is, in fact, what we see in a lot of users’ code. reset_index is not a free operation, though. So what we do in Narwhals relies on the fact that, because of how the expressions API works, when you’re comparing columns they are typically derived from the same dataframe. So we just do a quick check of whether the index of the left-hand side is the same as the index of the right-hand side.
00:59:35
We don’t even compare the values, we just check: left index is right index. If it is, then we leave it alone. And if it’s not, we set the right-hand side’s index to be the left-hand side’s index. And what I’ve observed empirically is that, compared to the naive way of writing Pandas code, we can often make things a little bit faster, although I need to caveat that. In Pandas version three, copy-on-write will become the default. This is an optimization, so once that becomes the default, then writing via Narwhals or writing Pandas code directly shouldn’t make a difference.
01:00:14
Before that, I’ve noticed that Narwhals will often make things a bit faster, unless you’re dealing with a 100-row dataframe. If you’ve got something so small, then the overhead of just the extra Python calls within Narwhals is going to be detectable. You’re going to have an extra half a millisecond there. So if you need to write a reactive web application using dataframes, yeah, maybe just use the dataframe library directly, don’t use Narwhals. But for anything more than a hundred rows, for anything where half a millisecond of overhead is tolerable, then I’d like to think it’s a good solution. And for Polars users especially, that half a millisecond of overhead, compared to the overhead of converting to Pandas, is nothing.
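A sketch of the index fast path Marco describes, with an invented helper name; the real Narwhals internals differ, but the identity check is the point:

```python
import pandas as pd

def align_like_polars(left: pd.Series, right: pd.Series) -> pd.Series:
    # identity check only: no element-wise comparison of index values
    if right.index is not left.index:
        # cheap relabel, avoiding pandas' automatic index realignment
        right = right.set_axis(left.index)
    return right
```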
Jon Krohn: 01:01:00
Very cool. Great summary point there. Very cool again, that’s what you’re getting from me. You have expressed hope for a future where data science becomes dataframe-agnostic. So could you explain for us what dataframe agnosticism is?
Marco Gorelli: 01:01:15
Sure. Yeah. Well, you very kindly introduced the topic earlier by bringing up Seaborn. Seaborn just takes a dataframe and then visualizes it. There’s nothing about the logic of the library that should be tied to Pandas, so why is it that it only works on Pandas? It does accept other dataframe libraries, but if it receives anything else, the first thing it does is convert it to Pandas, and then it does everything else. There’s no theoretical reason why that should be the case. I don’t know, what does Seaborn do inside? It takes a column, it does a group-by, and it finds the sum. I think every dataframe library can do that, certainly every dataframe library that we’re supporting in Narwhals. So I’d like to think that we can aim for a future where libraries such as Seaborn can just define their logic, and then the library can be dataframe-agnostic.
01:02:10
So long as you are either supported by Narwhals or you comply with the Narwhals API, then your library can just slot in there. And the good thing about standards is that they really enable freedom. Because as much as I love open source, I’m not an open source absolutist. And the nice thing about having a standard out there, about having a Narwhals specification and its API is that someone can come along with their closed source solution and we don’t need to know about it. As long as their closed source solution respects the Narwhals API, then it’ll work seamlessly without them having to ask us for permission to do anything, without them having to open source their library. I prefer it if people open source things, but as I said, I’m not an open source absolutist. And I think this is one thing that a standard specification like the Narwhals one can enable.
Jon Krohn: 01:03:04
That was a great summary to give us a sense of why the Narwhals project is so important, and of this idea of a dataframe-agnostic future. And I could imagine that, not just with dataframes, there are probably lots of ways in development in general that we could have more interoperability between libraries, by thinking about commonalities and not having discrete silos of specific projects that are segregated from each other.
Marco Gorelli: 01:03:31
That’s the thing. Yeah, and it’s not just the silos thing, though that’s an important part. Lots of projects just develop their own things in isolation, whereas in writing Narwhals, I’ve been interacting with people from lots of different projects. It’s a lot of fun. It’s just a lot of fun to collaborate with communities from different projects. That was an unexpectedly positive benefit of this project, and probably the part that I’m enjoying the most.
Jon Krohn: 01:03:58
Nice. So we’ve heard now a lot about specific open source projects. I’ve alluded to how Quansight Labs, where you work, has a hybrid employment model which balances time between community projects, like open source development, and consulting work, which brings in revenue directly. So how does this model, where you’re splitting time between open source and consulting work, benefit both maintainers like yourself as well as commercial clients of Quansight Labs?
Marco Gorelli: 01:04:32
Sure. Well, you asked how it benefits maintainers first, so let’s start with that. We often get started with open source because we’re excited about fixing things, we’re excited about adding new features that we might want. But then what happens five years down the line to those features which you’ve added? Someone’s going to have to keep them working, and the reality of open source is that most people make one or two contributions somewhere and then vanish. When it comes to sustaining things for a long time, it’s fairly difficult to do this just on willpower, just with volunteer work alone. A lot of these open source projects have become so big, so widely used, that they’ve practically become critical infrastructure, and it’s just not feasible for everything to be done by volunteers. So fortunately, we’ve been seeing funding come in for open source projects. We’ve been seeing CZI, the Chan Zuckerberg Initiative.
01:05:33
We’ve been seeing NASA donate money to open source projects, and lots of other companies too. It’s nice to see that the Python Software Foundation itself has been able to hire, I think, even two people to work on Python development, as opposed to relying on volunteers alone. So in terms of how it helps maintainers to receive some money: it means that a lot of the tasks which you just would not be able to do as a volunteer, you can do. Some big-picture things, like totally reworking how something functions in Pandas: as a volunteer, if you’ve just got a couple of hours each Sunday, you’re not going to have a chance to do that. You can maybe work on some incremental improvements, but you can’t rework how something functions. But if you’ve got some funding behind it, it can work. The same goes for reviewing other people’s pull requests: if you’ve got time, if you’re paid to do that, you can do it.
01:06:30
People are often much more motivated to work on their own things than to review other people’s. The other side, though, is that, yeah, Quansight Labs is not a charity. They don’t give maintainers time to do things just out of the goodness of their heart. It also helps the bottom line, because there are companies that then know Quansight as experts in open source. So open source maintenance also benefits Quansight itself. Companies are coming in for training, for help with how to use software, but also sometimes with very bespoke features. They’re like, “I really want Pandas to support non-nanosecond resolution. Can someone please do this for me?”
01:07:23
That was an actual request we got once, and it’s not something that was completely delivered by Quansight, but Quansight really enabled it. If Quansight had not been part of the picture, I question whether that would’ve happened at all. Maybe it would’ve taken an extra three years to happen. So yeah, it’s nice that at Quansight we’ve got people involved in sales and marketing who know how money works, as opposed to just people coding in their spare time who don’t necessarily have the skills or knowledge to deal with that.
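For context, that non-nanosecond resolution support landed in Pandas 2.0. A quick illustration, assuming Pandas 2.0 or later:

```python
# Non-nanosecond datetime resolution, available since Pandas 2.0.
import pandas as pd

idx = pd.to_datetime(["2024-01-01 12:00:00"])
print(idx.dtype)          # datetime64[ns], the historical default

# Nanoseconds in an int64 only span roughly the years 1677-2262; coarser
# units like seconds extend the representable date range enormously.
idx_s = idx.as_unit("s")
print(idx_s.dtype)        # datetime64[s]
```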
Jon Krohn: 01:08:04
Makes perfect sense. And long may commercial supporters of open source work continue. I’m a huge fan of open source work. Most of the libraries, if not a hundred percent of the libraries, that I’ve been programming with for more than a decade now have been open source. There was a time when I was using MATLAB.
Marco Gorelli: 01:08:25
I remember that. Yeah.
Jon Krohn: 01:08:28
But yeah, I don’t think I’ve written a line of MATLAB code in over a decade. Since then it’s been all open source programming for me, all of the time. All of the training that I do is in open source. A huge amount of the code that we use at my software company, Nebula, for developing our data science models, for our whole backend and frontend of the platform, everything is open source, and so I’m hugely grateful. You mentioned the Chan Zuckerberg Initiative. Meta also, obviously, has been pouring huge amounts of capital into training and open sourcing large language models, like the Llama series of models, and all of these things make a big difference. They allow all of us as data science lovers, listeners to this podcast, to be able to do more and make a big impact. So yeah, I hope that this trend continues.
01:09:26
It’s great to hear behind the scenes how things work at Quansight Labs in particular, and maybe that will inspire some consultancy owners out there who are listening, or other people, to be thinking about how you can support open source workers or open source projects in order to help your bottom line while also doing a service to the whole world. Now, a question here that our researcher Serg dug up: according to a source that he found, only about 3 to 5% of open source contributors are women, which is a really low percentage. I don’t know the exact percentage off the top of my head, but I know that the percentage of women working in, say, data science and software development is much higher than 3 to 5%. And so, do you have any thoughts on proactive steps that people could take so that open source projects like Pandas, like Polars, like Narwhals, have more diversity than today?
Marco Gorelli: 01:10:34
Yeah, totally. Really important topic to talk about, because you are right about these percentages not being aligned. Sometimes people explain it away by saying, “Oh yeah, but it’s a pipeline problem. There’s fewer women in tech, so obviously there’s going to be fewer women contributing to open source,” but the percentage who do contribute is a lot lower than the percentage who are in tech. And there’s a variety of reasons for that. I think we can’t discount the fact that women do most of the unpaid labor in society, and if open source is primarily a leisure activity, then it means they’ve got less time for it. It’s not the only reason, though. I think things are getting better, but there’s a lot of projects where it does feel a bit like an old boys’ club, in the way maybe that people use humor, the kinds of things that people might say or discuss. There’s a lot of things that have historically been tolerated that probably shouldn’t have been.
Jon Krohn: 01:11:37
Right. Locker room talk.
Marco Gorelli: 01:11:38
Yeah. Yeah, exactly. I mean, now it’s quite rare to find a project that doesn’t have a code of conduct, but we should remember that it was not always the case. It wasn’t that long ago that just bringing up the question, “Should a project have a code of conduct?” would spark a whole load of controversy, with people saying, “Oh, but it’s not really necessary. We haven’t had any harassment yet. Why do we need this?” And, well, okay, just because you haven’t seen it doesn’t mean it’s not happening. You did ask a specific question, though, which is: what can we do about it? That’s a tough one. So first I’d like to defer to Dr. Maren Westermann’s talk at EuroSciPy last year, in which she talks about the importance of mentoring, because it’s not just about getting people to try contributing, it’s about sustaining people.
01:12:33
I was involved for a while with PyLadies London, trying to run some Pandas sprints, trying to get people from underrepresented groups in tech to contribute. And these sessions were fairly well attended and people made contributions. We got a lot of people involved, but it’s very difficult then to sustain people. It requires an active effort. I think unless you are actively going to set aside time and money towards mentoring people, it’s very difficult. This becomes doubly difficult in a project which has already been going for 15 years or something, and which has historically been all male.
01:13:14
At that point, to rectify that, you’re going to have to put in twice as much effort as if you were just starting from scratch. Now in Pandas, I don’t think we’ve got much hope of making a significant difference there, to be honest. I mean, we do now have one woman core developer, but realistically, the percentage of women is unlikely to get close to the percentage of women in tech, at least without active efforts. And yeah, active efforts take time and money.
01:13:57
Unfortunately, Pandas didn’t even receive CZI funding this year, so it’s going to be tricky. With Narwhals, starting from the beginning, we’ve been a lot more careful about this. I was messaging talented women that I knew, saying, “Hey, are you maybe interested in trying out this project? I can help you out, provide quick reviews.” And yeah, we’ve got lots of women who are contributing, some of whom have been given commit rights too. We’re probably going to take part in the Grace Hopper conference later this year, which I think is primarily aimed at women. So with Narwhals, from the start, we’re just making this a priority, and maybe we’ll be able to do things differently. I don’t know. We’ll see.
Jon Krohn: 01:14:42
Nice. That sounds like a step in the right direction to me. It also sounded like you had some great tips in there for projects in general. So more mentorship would make a big difference. Your technique has been active reach-outs, and that kind of thing matters, because if somebody like you, who created the project, taps you on the shoulder and says, “Hey, you have the skills I need, would you like to be involved?”, that can flip things in someone’s mind right away. You go from thinking, “Oh, this isn’t something that’s for me, because maybe I haven’t seen people like me do this before, or I just didn’t imagine that I could,” to, “Okay, yeah, maybe I can do this. I can give it a shot.” Especially, I guess, if some mentorship is paired in there. And then you made an interesting point there at the beginning of your answer around more paid roles.
01:15:34
You didn’t say this explicitly, but you said that in a lot of, maybe all, cultures around the world, women disproportionately do unpaid work in the household, in child-rearing, these kinds of things. There are exceptions on an individual basis, but in society as a whole, this is what we see. And so there’s maybe some big problem of society that could be rectified over long periods of time, but in the immediate term, having more and more paid open source developer roles would alleviate some of this problem, because then somebody doesn’t have to take on this additional unpaid work on top of the paid work that they already do; it can be part of their paid job instead.
Marco Gorelli: 01:16:30
Totally. Yeah. It needs to be an active effort. It’s not going to fix itself.
Jon Krohn: 01:16:34
Yeah. Great answer. I’ve got one last topic area for you. We’ve heard a lot in this episode about the brilliant software development that you’ve done on data libraries, on Polars and also on Narwhals, which supports Polars, but you have done some data science work in the past, specifically around forecasting. You achieved impressive results in several forecasting competitions, such as the M5 and M6 forecasting challenges. Do you want to tell us about what those challenges are?
Marco Gorelli: 01:17:08
Sure. Let’s see if I can remember. That was a few years ago. Yeah. I used to work in data science. Well, my background’s in mathematics, but then realized I wasn’t good enough to be a mathematics academic, so became a data scientist.
Jon Krohn: 01:17:26
Zing.
Marco Gorelli: 01:17:27
Then got addicted to open source and became a software engineer. But for four or five years or something like that, I did work in data science and got quite interested in forecasting. It was quite related to work I was doing in companies, and I just found that taking part in competitions was a fun way to improve your skills. People sometimes say that these competitions are not real data science, and I agree it’s not real data science in that it doesn’t show you the complete lifecycle of a data science project, but it can teach you some real lessons, which can then be useful when you are doing real data science.
01:18:07
So the M5 competition, that was a fun one. There, you had to forecast Walmart sales. It was real data. And there were two tracks to it: the uncertainty one and the point-prediction one. I worked with a friend of mine on both of them. We were, just as people often did on Kaggle back then, blending solutions together. And what we generally found, what the Kaggle community generally finds, is that the most important thing isn’t using the most unusual model, but having a good way of cross-validating your data, of estimating how well your model is going to perform on unseen data.
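Blending here often means nothing fancier than a weighted average of different models’ predictions. A minimal sketch, with made-up numbers:

```python
# "Blending" in the Kaggle sense is often just a weighted average of
# several models' predictions (the numbers here are made up).
import numpy as np

preds_model_a = np.array([100.0, 120.0, 90.0])  # e.g. from a gradient booster
preds_model_b = np.array([110.0, 115.0, 95.0])  # e.g. from a linear model

# The weights would typically be chosen via cross-validation.
blend = 0.6 * preds_model_a + 0.4 * preds_model_b
print(blend)  # [104. 118.  92.]
```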
01:18:58
When people talk about Kaggle and real-life data science, I think the fact that it teaches you to do cross-validation well is the biggest benefit that it’ll bring you. Then came the M6 competition, and that was financial forecasting. And there I just took a bit of a gamble. I figured, well, most people are going to overfit. I don’t know anything about finance, so if I just submit the simplest possible thing, then maybe it’ll land in the top 10% and I can put that on my CV.
01:19:31
Unexpectedly, I came second in the first quarter and was awarded $6,000 for that. So yeah, I put that on my CV to look smart, but I don’t know anything about finance. I don’t have any insight here. If someone wants to come to me for trading advice, I can’t tell you anything useful other than don’t do anything wild. I can tell you about how to beat other competitors in financial forecasting competitions, but not necessarily how to do-
Jon Krohn: 01:20:04
By keeping it simple.
Marco Gorelli: 01:20:05
Exactly.
Jon Krohn: 01:20:06
So two tips there. You might be able to out-compete people in forecasting competitions by sticking to simpler models that are less likely to overfit, and also through the importance of cross-validation, which you mentioned there. If you’re not already aware of it, a common way of describing cross-validation is with k-fold cross-validation, where you split your data into some number of partitions.
01:20:34
And so if you did, say, five-fold cross-validation, you would split your data into five parts of equal size, randomly putting samples into each of those five equally sized buckets, and you train on 80%. So you train on four of the buckets and evaluate on the fifth, and then you can repeat that five times, each time leaving out a different 20% for evaluation: the first 20%, the second 20%, the third 20%, going like that through all five 20 percents. And in this way you are training and validating on all of your data, taking advantage of all of it.
Marco Gorelli: 01:21:19
Yeah, maybe just one small note: in time series, you need to be especially careful with how you make your buckets, so that you’re not training on future data and predicting past data. But that’s the idea, yes.
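For readers who want to try this, scikit-learn ships both styles of splitter; TimeSeriesSplit only ever trains on the past and validates on the future, which is exactly Marco’s caveat:

```python
# Plain k-fold versus time-series-aware cross-validation splits,
# using scikit-learn's model_selection utilities.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten observations, oldest to newest

# Standard five-fold CV: train on 80% of rows, validate on the other 20%.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on X[train_idx], score on X[test_idx]

# Time-series CV: validation indices always come after training indices.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx, test_idx)
```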
Jon Krohn: 01:21:32
Great point. Glad that you pointed that out there. So this has been a fascinating episode. I’ve loved it. It’s been illuminating to hear so much about Polars from you. You describe in such crisp, clear detail every aspect of what you’re talking about and make it so easy to understand. So really appreciate you doing that, Marco. Before I let you go, do you have a book recommendation for us?
Marco Gorelli: 01:21:54
We’re doing fiction, nonfiction?
Jon Krohn: 01:21:56
Whatever. You can do one of each if you really want to.
Marco Gorelli: 01:22:01
Let’s be greedy and do that, then. Take two. Okay, so a technical book: I think Programming Rust, published by O’Reilly, is really good, as is The Rust Programming Language. So yeah, if you are a Python programmer and want to get into this, want to write your Polars plugin, then it’s a really accessible way to get into the language. Fiction? The last fiction book I remember really enjoying is called All That’s Left Unsaid by Tracey Lien, about some Vietnamese immigrants in Australia. This girl’s brother has been murdered, but her family, they’re really distrustful of the police, really distrustful of the authorities, don’t want to speak to anyone about anything, and she’s trying to understand what’s happened to her brother. Really good book. Recommend it.
Jon Krohn: 01:22:50
Great recommendations. Thank you, Marco. And for people who would like to follow you on your thoughts after this episode, how should they do that?
Marco Gorelli: 01:22:58
If people want to follow me on social media, they can find me on GitHub at MarcoGorelli. Other social media, I’m on LinkedIn and Fosstodon.
Jon Krohn: 01:23:11
Nice. Fosstodon, one of the many, although I think probably the most popular, kind of post-Twitter social media places to be. Do you think?
Marco Gorelli: 01:23:23
Maybe, yeah. People still call it Twitter, much to Musk’s annoyance. Oh, well. Yeah, I’m on there. Well, Mastodon, I’m still not totally sure how this federation thing works, but I log on to fosstodon.org, so I’m going to call it Fosstodon.
Jon Krohn: 01:23:42
Nice. Cool. Well, maybe we can dig into that kind of stuff, the social media stuff, this post Twitter options, maybe dedicate an episode to that at some point. Thank you so much, Marco. It’s been great having you on the show, and thank you again for making the trip to London from Cardiff. Maybe we can check in again in a few years and see how Narwhals, Polars, whatever other exciting projects you’ve gotten yourself into by then are coming along.
Marco Gorelli: 01:24:08
Sure. Thanks for having me.
Jon Krohn: 01:24:15
Absolutely fascinating technical discussion with Marco today. In today’s episode, he filled us in on how Polars excels at feature engineering and allows up to 100x speedups, especially on large dataframes, thanks to lazy execution. He talked about how, on string evaluation, such as for natural language processing, Polars is optimized natively, so it outperforms the leading data manipulation libraries in Python, that is, NumPy and Pandas.
01:24:37
He talked about how his Narwhals library allows other libraries such as the popular declarative visualization library, Altair to be dataframe agnostic, allowing support for Polars without any detriment to Pandas users. He told us how he won $6,000 in prize money in the M6 forecasting competition by assuming that most teams would overfit their models to the training data. And he talked about how more paid roles, more mentorship and active reach outs could increase diversity amongst open source software developers.
01:25:06
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs from Marco’s social media profiles, as well as my own at www.superdatascience.com/815. Thanks of course to everyone on the Super Data Science podcast team. You’ve got our podcast manager, Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masis, writers Dr. Zara Karschay and Silvia Ogweng, and founder Kirill Eremenko.
01:25:34
Thanks to all of them for producing another dazzling episode for us today. For enabling that super team to create this free podcast for you, I am so grateful to our sponsors. You can support this show by checking out our sponsors’ links, which are in the show notes. And you yourself, if you are interested in sponsoring an episode, you can do that; you can find the details on how by making your way to jonkrohn.com/podcast. Otherwise, please share, review, subscribe, and all that good stuff. But most importantly, just keep on tuning in. I’m so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.