SDS 827: Polars: Past, Present and Future, with Polars Creator Ritchie Vink

Podcast Guest: Ritchie Vink

October 15, 2024

Ritchie Vink, CEO and Co-Founder of Polars, Inc., speaks to Jon Krohn about the new achievements of Polars, an open-source library for data manipulation. This is the episode for any data scientist on the fence about using Polars, as it explains how Polars managed to make such improvements, the APIs and integration libraries that make it so versatile, and what’s next for this efficient library. 

Thanks to our Sponsors:
Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.
About Ritchie Vink
Ritchie Vink is the author of the Polars DataFrame library and query engine. He has been working as a software engineer and machine learning engineer for 8 years. Before he started Polars, he did many side projects on various topics in computer science and statistics.
Overview
We covered Polars’ incredible speed improvements over Pandas in a previous Super Data Science Podcast episode (episode 815) with Marco Gorelli. Ritchie takes this story further, detailing the API options for eager and lazy execution, as well as how Polars might soon be able to manage massive distributed datasets. He explains that Polars builds on Arrow, a memory specification that lets processes share data without the expensive serialization required by Python’s multiprocessing package. For this reason, Ritchie calls Arrow the “de facto standard for sending and sharing data over the wire.” [24:58]
Jon also asked Ritchie about his position on Moore’s Law and its potential negative influence on Polars’ ability to keep increasing its efficiency. Ritchie pointed to growing core counts and single instruction, multiple data (SIMD) as two developments that may help alleviate the plateaus predicted by Moore’s Law. Increasing core counts enable better scalability, including across multiple machines, which can enhance performance in distributed computing. SIMD, meanwhile, allows for faster data processing within individual threads, reducing inefficiencies we might expect in parallel operations.
As CEO of Polars Inc., a company that has recently secured $4 million in seed funding, Ritchie also explained how essential scalability is to a tech company. To this end, Ritchie and his team are building an entirely new Polars engine, also open source, as well as Polars Cloud. With the cloud offering, Ritchie aims to give enterprises fault tolerance and ease of use, with support for partitioning data and handling larger datasets.
Listen to the episode to hear Ritchie’s advice for first-time users of eager and lazy execution in Polars, what composable expressions are in Polars and why they matter for anyone transforming data, and how Ritchie plans to add to the ever-growing Polars ecosystem.
In this episode you will learn:
  • Why Polars is so efficient [05:20]
  • Polars’ easy integration with other data-processing tools [21:23]
  • Eager vs lazy execution in Polars [32:15]
  • Polars’ data processing of large- and small-scale datasets [38:28]
  • Ritchie’s plans to scale his company [46:14]
  • Upcoming features in Polars [58:06] 

Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 827 with Ritchie Vink, creator of Polars. Today’s episode is brought to you by epic LinkedIn Learning instructor Keith McCormick, by Gurobi, the Decision Intelligence Leader, and by ODSC, the Open Data Science Conference. 
00:00:22
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple. 
00:00:53
Welcome back to the Super Data Science Podcast. Today, we’re super fortunate to have Ritchie Vink, creator of the extremely hot Python DataFrame library, Polars. Ritchie is CEO and co-founder of Polars Inc. as well. It’s his startup that has raised $4 million in seed funding to support the Polars open-source project. He previously worked as an ML engineer, data scientist and data engineer at companies like Adidas and Royal Dutch Airlines. That’s KLM. He holds a master’s in structural engineering and worked as a civil engineer prior to catching the data science bug. 
00:01:28
Today’s episode will appeal most to hands-on practitioners like data scientists and ML engineers. In today’s episode, Ritchie details how Polars regularly achieves 5 to 20x, sometimes even 100x speed improvements over Pandas for most DataFrame operations. He talks about the Eager and Lazy execution APIs Polars offers, and when you should use one or the other. He describes his vision for scaling Polars to handle massive, distributed datasets. And he talks about how we can continue to make data processing efficiency gains even as Moore’s Law slows down. All right, are you ready for this efficiency increasing episode? Let’s go. 
00:02:11
Ritchie, welcome to the Super Data Science Podcast. It’s great to have you on the show. How’re you doing today? 
Ritchie Vink: 00:02:17
Hey, thanks. I’m doing great, thanks. How’re you doing? 
Jon Krohn: 00:02:21
I’m doing great too, yeah. It’s an honor to have you on. I just said before we pressed the record button that it’s such a surreal thing for me to be able to have rock stars like you, the brains behind the most exciting software projects in the world, as guests on the show. So, yeah, I’m doing great. Where are you calling in from today, Ritchie? 
Ritchie Vink: 00:02:42
I’m calling from my home, which is in Maarssen, which is very close to Utrecht in the Netherlands. 
Jon Krohn: 00:02:48
Say Amsterdam? 
Ritchie Vink: 00:02:50
Amsterdam is also very close. Yeah, it’s in the Netherlands. 
Jon Krohn: 00:02:53
What was the word you said? 
Ritchie Vink: 00:02:54
Utrecht. 
Jon Krohn: 00:02:55
What’s the city? Oh, Utrecht. Okay. 
Ritchie Vink: 00:02:57
Yes. Amsterdam is 20 minutes by car. 
Jon Krohn: 00:03:00
Nice. Yeah, that’s the easy one for me. I think, yeah, four trips to the Netherlands and I’ve only been to Amsterdam, actually, central Amsterdam. 
Ritchie Vink: 00:03:11
Really? 
Jon Krohn: 00:03:12
Should I be making a trip out to Utrecht next time? 
Ritchie Vink: 00:03:15
Yes, I think you’ll like it. If you like Amsterdam, Utrecht has some unique canal structure, which is also nice. It’s different from Amsterdam. It’s a bit smaller though.
Jon Krohn: 00:03:26
Awesome. I will have to check that out. So, we were introduced by Marco Gorelli, who is a developer of Polars. He was our guest in episode number 815, and that episode had rave reviews. People loved hearing about Polars. It had some of the most engagement this year on the social media post where I announced the episode.
Ritchie Vink: 00:03:49
Wow. 
Jon Krohn: 00:03:52
You commented or something on the episode, and Marco had recommended you as a guest. He said, “You should have Ritchie on the show. You should have the creator of Polars on the show.” And so, my initial thought, we had an initial exchange where I reached out and I said, “I’d love to do another Polars episode, but let’s wait a few months so that we’re spacing out the Polars episodes.” But then I realized that actually having them close by is great because then Marco’s episode is still fresh on people’s minds. That was about a month ago that it came out, episode 815, and we’ve been careful with the topics that we’ve curated for this episode to complement that episode as opposed to being duplicative. So, it should be great. 
Ritchie Vink: 00:04:38
Yeah, so I can skip the context and dive right in.
Jon Krohn: 00:04:43
Yeah, nice. So, you’re the author of Polars. It’s a DataFrame library that took the data science world by storm by being 5 to 20 times faster than Pandas for most operations. In Marco’s episode, he talked about some circumstances where he’s seen 100x speed-ups. But yeah, 5 to 20x faster for most DataFrame operations relative to Pandas, which is the incumbent open-source DataFrame framework within Python. And in addition to being faster, there’s also much less memory use.
00:05:19
So, Ritchie, what is the secret sauce? Or I guess it’s not so secret because you have blogged about it. What is the not-so-secret sauce behind Polars being so much faster and more memory efficient relative to the incumbent out there? 
Ritchie Vink: 00:05:36
For relational data processing, DataFrame-like data processing, there are a few things you can do. Relational data processing is actually pretty old. It’s what databases have done for decades, and it’s what SQLite does, and it’s what ClickHouse and Snowflake do. So, all these databases exist, and they have different performance characteristics for various reasons. For instance, SQLite is row-oriented, which is great for transactional processing.
00:06:22
Transactional data processing is when you have a database and you use it to do transactions. For instance, if you buy a product, you update that row and then you need to check if that transaction has succeeded or not. Otherwise, you have to roll back. That’s one application of databases. 
00:06:45
Another one is analytical data processing, and that’s more where Polars, Pandas or Snowflake come in. And in that case, doing things columnar is way faster. Columnar means that you process data column by column. This is something Pandas does as well; it’s based on NumPy. But there are other things you need to do, which Pandas has ignored, and that’s multiprocessing or, to be precise, multithreaded parallel programming. I mean, my laptop has 16 cores available. I want to use them. It’s a waste of those resources if you only use one core for expensive operations like joins or group bys. 
00:07:37
The other one is that Polars is close to the metal. Pandas has just taken NumPy, which was meant for… For numerical data analysis, it’s great. But when you had string data before NumPy took over, there wasn’t a really good solution for that. And if you talk about nested data like lists and structs and arbitrary nested data, Pandas actually gave up because it just used the Python object type, which means, “Hey, we don’t know what to do with this. We let the Python interpreter see what to do with this.” And in that sense, Pandas took NumPy and built on top of that. NumPy was never really meant to build a data processing tool like a database on top of that. 
00:08:35
And Polars is written from scratch. It’s written from scratch in Rust. And every performance-critical data structure, we control. And by that control, we can have very effective caching behavior, very effective resource allocation, very effective control over memory. That’s the most important part, because a lot of compute cost comes down to control over memory.
00:09:21
And then the third one, which I think is very important, is that we also use an optimizer. If you look at databases, they, A, can be really fast because of how you write the code, how you write the kernels that execute the compute. But there’s also an optimizer, and this optimizer will make sure you only do the compute that’s needed. And this is very similar to what a C compiler does. If you write your C, you can be sure that the computer will never execute the code exactly as you’ve written it. There will be a compiler in between that will try as hard as possible to prove that it doesn’t have to do certain kinds of work. And that’s actually quite similar to data processing. If you don’t need to load a column, it saves a whole IO trip. It saves a whole resource allocation. So, this can save a huge amount of work. 
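To make that concrete, here is a minimal sketch in Python of a lazy Polars query (the file name and column names are hypothetical). Because nothing runs until collect(), the optimizer can push the filter and the column selection down to the scan, so only the needed rows and columns are ever read:

```python
import polars as pl

# Build a lazy query; no data is read yet.
lazy = (
    pl.scan_parquet("sales.parquet")     # hypothetical file
    .filter(pl.col("year") == 2024)      # predicate pushdown candidate
    .select("store", "revenue")          # projection pushdown: only these
)                                        # columns (plus "year") get read

# Inspect the optimized plan before running anything.
print(lazy.explain())

df = lazy.collect()  # now the optimized plan actually executes
```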
Jon Krohn: 00:10:26
Yeah. Let’s dig into that Rust one a little bit more. So, were you a Rust developer before you started working on Polars, or did you see an opportunity in Rust specifically for tackling this problem? How did this come about? Did you notice there was this problem that, okay, the kinds of issues that you described where Pandas, for example, was based on NumPy and you’re like, “Oh, for string operations and lists, it’s not going to be efficient. We’re going to have to start from scratch”? Did you notice that opportunity, and then you’re like, “Okay, I’m going to need to learn Rust”? Or you just already knew Rust and you were like, “That’s going to give me lots of memory and concurrency advantages relative to doing this in Python or C++.” How did that come about? 
Ritchie Vink: 00:11:12
No, it wasn’t top-down. It wasn’t from observing Pandas and thinking, “Hey, this could be better.” Actually, I was jumping on the hype train that Rust was back in the day, a little bit later, I guess. I think five years ago, a friend of mine said, “Hey, look at this Rust. It’s very cool.” And I came from a Python data science background, and I thought, “Ah, I don’t need that low level of language.” But I always dabbled a bit with functional programming and wanted to learn more languages just for the fun of it. And so, I also learned Rust. 
00:11:52
In the beginning, what I did was, if I had an algorithm for some data machine learning use cases, I implemented that algorithm in Rust if it was slow in Python. And I found it super powerful because you could implement that algorithm and make a Python binding to it, and you had a very fast library, which you could use in Python in an interactive manner with the performance benefits of Rust. But that still didn’t give me the idea for a DataFrame library. When I was getting more mature in Rust and I was implementing a web server for my work back in the day, I needed to join two tables that were on disk. And I thought, “Let’s take the DataFrame library from the Rust ecosystem,” which didn’t exist.
00:12:52
So, I thought, “Oh, maybe I can build this join algorithm from scratch. How does it work? How do you write a join algorithm?” So, I wrote one, finished it. It worked, and I was just curious, how does it perform against Pandas, and it was super slow. 
Jon Krohn: 00:13:09
Keith McCormick, the data scientist and prolific LinkedIn learning author is giving away a course called “Executive Guide to AutoML” and he’s giving it away exclusively to Super Data Science Podcast listeners. Nearly every ML platform has some support for AutoML, but there is both confusion and debate about which aspects of the ML pipeline can be automated. With this course, you’ll learn how to automate as much as possible and how to explain to management what can’t be automated! You may know Keith from episodes 628 or 655. Be looking for his return on an upcoming Friday episode. You can access his “Executive Guide to AutoML” course by following the hashtag #SDSKeith on LinkedIn. Keith will share a link today, on this episode’s release, to allow you to watch the full new course for free.
00:13:58
That was unexpected. I thought that was going to end with- 
Ritchie Vink: 00:14:02
No, no, no. 
Jon Krohn: 00:14:02
… it was crazy fast and you’re like, “Hey, it was terrible.” 
Ritchie Vink: 00:14:05
It was terrible. Back in that time, I wasn’t good at performance programming yet. I didn’t know anything about hardware optimizations, et cetera, but it gave me a challenge. I thought, “Okay, let’s make this thing faster than Pandas.” So, I made it faster. And then I thought, “Maybe I can add more methods and build a DataFrame library for the Rust ecosystem.” And that took me another four months, I think, before there was something that I would call a DataFrame library, the very beginnings of it. 
00:14:48
And then there was a database benchmark back in the day, the db-benchmark, which was hosted by the R data.table guys. And I added Polars to the benchmark, and it did pretty well. I think we were fourth or so. Yeah, we were fourth or so, after Julia, but R data.table was far ahead. It was far ahead, and that gave me another goal. How did they do that? So, I put my head down, and after a few months, or I don’t know how many months, but now, it’s faster than R data.table. We beat them in performance. 
00:15:34
But it was more about implementing that stuff because I wanted to learn about it, and setting those performance goals. And along the way, I started to learn about how databases work, how query optimizers work. Can I elaborate on this? Because it’s quite a long story or- 
Jon Krohn: 00:15:59
Absolutely, man. Go for it. 
Ritchie Vink: 00:16:02
Okay. As I said, in the beginning, I still wanted to build a DataFrame in Rust. And when that got complete, I started to make a Python API for it. And I thought, “Let’s take the Pandas Python API and build those methods.” And that was a really rough decision because with the Pandas API, it’s very hard to predict what the output will be. Actually, the output will change depending on the data, which is something that isn’t allowed in Polars. 
Jon Krohn: 00:16:34
What does that mean, that it changes based on the data? Like the format will change automatically? 
Ritchie Vink: 00:16:40
The output data types are not known statically, and this is a rule that Polars doesn’t want to break. So, if you have a schema, you have a schema and you apply a join or you apply some operation, you need to know what the output schema will be. Because that way, you, as a programmer, can predict what will happen independent of the data that’s coming in.
00:17:09
The only exception to this rule is when you do data inference. So, if you have a CSV file, you need to infer what the data types will be. At that point, you say, “Okay, this is based on the data, this inference.” But after that, we know what the schema is. The operations may lead to another schema, to another data type, but it should be known statically. Statically means before running the operation, we need to know what the output type will be.
00:17:38
And this is an important one for not creating bugs. As a user, it’s very convenient to understand what my output type will be. So, if I expect an integer and I index into a dictionary, that will keep working. And if it suddenly changes to a float, my index doesn’t make any sense, or if I index into a list. Other than that, it’s also very important for an optimizer. If the optimizer doesn’t know what an output type will be and needs the data to determine that, you have a dependency: you can only start optimizing when you have the data, and then it’s too late. So, it’s also very important for an optimizer to know the schema. 
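A minimal sketch in Python of what static schema resolution looks like (assuming a recent Polars version where LazyFrame.collect_schema() is available; the data is invented):

```python
import polars as pl

lf = pl.DataFrame({"amount": [1, 2, 3]}).lazy()

# The output schema of an operation is resolvable before execution:
out = lf.with_columns((pl.col("amount") * 2.5).alias("scaled"))

# No data has been processed yet, but the types are already known:
print(out.collect_schema())  # Schema({'amount': Int64, 'scaled': Float64})
```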
00:18:35
I started to make a Python interface and wanted to copy the Pandas API, and that was really hard because the output types were unknown, depending on the data. But I still tried it. And at that point, I also started investigating and researching how query optimizers work. And for a query optimizer, you need to know the data types. So, I had to come up with a different API and actually add a Lazy API to get some distance between the API and the execution. Those should not be mapped one-to-one. There should be an intermediate representation in there, an intermediate step, so you can make the DSL, the domain-specific language. In Python, the Python syntax is a DSL. In Polars, it’s the Polars query language, which is a DSL. It should be different from execution. It should not be tied closely together. 
00:19:44
So, when I started doing research on databases, at that point, Polars became much more familiar to people who know Polars as it is now. Because at that point, we started the expression API. We started with Lazy API and the unification of those APIs, and it became, at that point, it was also a realization for me, hey, this is super powerful and I think this is something that even if it would have the same performance as Pandas, would still be very useful for people because it gives you a lot of versatility. And I think, at that point, I think that was two and a half years ago, it became a project that really had a goal of being something that takes the Pandas use case. 
Jon Krohn: 00:20:38
Yeah, it really has been taking off. Another library, other than Pandas, that we haven’t talked about yet and that is intimately related to Polars is Arrow. And you’ve previously mentioned that Arrow is becoming the go-to memory model for big data processing, big data, cringey, basically, for processing where there’s too much data to fit onto our local machine. So, we could call that big data when you’re using multiple different devices for storage. And so, yeah, you previously mentioned that Arrow is becoming the go-to memory model for these kinds of large-scale processing situations. 
00:21:23
How does Polars’ use of Arrow enhance its integration with other data-processing tools? And maybe even give us some context on how Arrow is different from Polars and Pandas, but how it interrelates with all of them. 
Ritchie Vink: 00:21:41
Yes. There’s also quite some misconception about what Arrow is. That’s because PyArrow, the core implementation of the Arrow spec in Python, does not only do Arrow memory, but also Arrow compute. And sometimes, people say Polars is fast because of Arrow, and that doesn’t make any sense. So, first, Arrow is a specification of how memory should be laid out for specific data types. That’s all Arrow is. And that’s the same as how JSON is a specification of how you should store the bytes of a JSON-serialized data structure.
00:22:34
JSON is not a very optimal way to store data, so it’s not very good for performance. Arrow is much better if you are doing columnar data processing. And if we think about numeric data, the way we store those bytes is actually quite simple, because we just store the values back to back. First, the integer value one, then the integer value two, value three, et cetera, all the way to the end. This is no different from what NumPy does. Only, Arrow also accounts for missing data. So, if you have missing data, it adds another buffer with a bit mask, a bit buffer that just stores the bits. And if the bit is one, the value is valid. If the bit is zero, the value is invalid.
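A small Python sketch of that layout using PyArrow (the values are arbitrary, purely for illustration):

```python
import pyarrow as pa

# An Arrow array with one missing value: values are stored back to
# back in a data buffer, and a separate validity bitmap marks which
# slots hold valid values (bit = 1) versus nulls (bit = 0).
arr = pa.array([1, None, 3], type=pa.int64())

validity, data = arr.buffers()  # primitive arrays: [validity, data]
print(arr.null_count)           # 1
print(validity.to_pybytes())    # bitmap with bits 0 and 2 set
```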
00:23:27
That’s for numeric data, which is quite trivial. But Arrow also has a specification for nested data, for lists, for structs, for strings. It has multiple specifications for strings actually. And that way, if you know that a process can deal with Arrow data, you can say, “Hey, I have some memory laying around here. It’s laid out according to the Arrow specification.” At that point, you can say to another process, “This is the specification, and this is the pointer to where the data is.” If you read this from this, according to this specification, you can use this data as is. You can read it without needing to serialize any data to the other process. 
00:24:17
If you compare that to Python multiprocessing, if you need to send some data from process A to process B, you often need to pickle the data and send it over the wire to the other process. The other process needs to deserialize that because the pickle protocol can differ, or the Python version can differ. The memory layout can differ. So, there’s a lot of overhead involved. And with Arrow, processes can interoperate the memory, and only one process will own the memory and the other can just read it. And this can be super-fast when you share data between processes.
00:24:56
So, that’s where Arrow is becoming the de facto standard for sending and sharing data over the wire. Other than that, it’s still up to the tool that uses Arrow to build a query engine on top of it. Pandas will use PyArrow. Polars has its own query engine. Pandas is not building a query engine, by the way. Pandas is using PyArrow kernels to do the compute. So, yeah, that’s what Arrow is and that’s how it can help with saving performance between processes. 
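Here is a minimal Python illustration of that interoperability (for most data types these conversions wrap the existing Arrow buffers rather than copying them):

```python
import polars as pl
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Polars' in-memory format follows the Arrow spec, so this is
# typically zero-copy: the DataFrame reuses the Arrow buffers.
df = pl.from_arrow(table)

# And back again, to hand the data to any Arrow-consuming library.
table_again = df.to_arrow()
```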
Jon Krohn: 00:25:37
Nice. So, how do you see… Do you think that these different ways of doing DataFrame operations… Obviously, Pandas, Polars, Arrow’s evolving as well. How do you see these different projects developing over time and providing more performance gains in the future? 
Ritchie Vink: 00:26:06
Per project or? 
Jon Krohn: 00:26:10
Yeah, I guess you could speak about it primarily from a Polars perspective. So, let’s focus on that, I guess. 
Ritchie Vink: 00:26:18
Yes. Let’s give a high-level introduction to what Polars does if you write a query. So, if you write a query, you’d write something that might look a little bit similar to Pandas, a little bit, not too much. If it looks too much like Pandas, you’re not using Polars properly, but somewhat similar.
Jon Krohn: 00:26:45
Let’s dig into that in a little bit more detail before you move on. So, you’re saying if somebody naively hears, okay, on the Super Data Science Podcast, Ritchie Vink said that if I use Polars, I’m going to get a 5 to 20x speed-up relative to Pandas. So, then they start using Polars and they write code that looks just like their Pandas code would’ve. What you’re saying is that in that kind of circumstance, they might not experience the speed-ups that we’re talking about? 
Ritchie Vink: 00:27:16
Yes, for a few reasons. A, it means they are using the Eager API. And with the Eager API, per operation, you force Polars to materialize the result. And to give a very silly example, if I say to you, “Can you get me a coffee from the kitchen and a knife and a fork and a spoon?” if you would walk back and forth for every item, it would be very inefficient. If you would have been a little bit more lazy and just awaited the full list and then went to the kitchen, you could have grabbed everything at once. And that’s similar with a query. 
00:28:08
If you write a query, if you finish a query before you say, “Okay, now, fetch me the result,” the optimizer can inspect that query and can see what you want to do holistically from a holistic approach. It knows what’s the data that you will use later. If we see you only use two columns, we’re going to only fetch two columns. If we see you’re going to use a filter later, we can apply this filter before we do the join operations because then the join operations will be much cheaper. In Parquet data, maybe we don’t even need to download certain amounts of bytes, which can save a huge amount of work. So, this all depends.
00:28:49
Another one is that in the Eager API, in Pandas, it’s quite normal to use lambdas, or you have to assign. And in Polars, you need to use expressions. And expressions are a way for us to understand what you’re doing. It allows us to parallelize the work and to optimize the work. And I need some code examples to make this more clear. But in Pandas, there’s a lot of procedural programming. And then people will probably say, “Yeah, but that’s not idiomatic Pandas anymore. That’s Pandas from a few years ago.” But it’s still how a lot of people write Pandas. 
00:29:40
If you say DataFrame[“A”] and you assign a series, and then DataFrame[“B”] and you assign a second series, then in between those assignments, you go back to Python. And there is no way we can be parallel, because we have to apply the operation eagerly and hand the result back to Python. And if you’ve written everything in a single context with multiple expressions, then we are allowed to run all those expressions in parallel because we don’t have to go back to Python in between. That’s sort of the in-depth answer. 
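As a rough Python sketch of the difference (the column names are made up):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Procedural, Pandas-style: each assignment hands control back to
# Python, so the two computations cannot run in parallel.
df = df.with_columns((pl.col("a") * 2).alias("a2"))
df = df.with_columns((pl.col("b") * 3).alias("b3"))

# Idiomatic Polars: both expressions in one context, so the engine
# is free to execute them in parallel.
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).with_columns(
    (pl.col("a") * 2).alias("a2"),
    (pl.col("b") * 3).alias("b3"),
)
```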
Jon Krohn: 00:30:22
In a recent episode of this podcast, the mathematical optimization guru Jerry Yurchisin joined us to detail how you can leverage mathematical optimization to drive commercial decision-making, giving you the confidence to deliver provably optimal decisions. This is where Gurobi Optimization comes into play. Trusted by most of the world’s leading enterprises, Gurobi’s cutting-edge optimization solver, lightweight APIs, and flexible deployment simplify the data-to-decision journey. And, thankfully, if you’re new to mathematical optimization approaches, Gurobi offers a wealth of resources for data scientists, including hands-on training, comprehensive Jupyter-notebook examples, and extensive, free online courses. Check out Episode #813 of this podcast to learn more about mathematical optimization and all of these great resources from Gurobi. That’s Episode #813. 
00:31:11
Nice. That’s great. Let’s talk a bit more about that Eager versus Lazy API in Polars. I love that analogy there with going to the kitchen and grabbing different items. So, that makes things crystal clear here. When you’re doing Eager execution, you’re executing each item one by one as it’s requested, like having to walk to the kitchen to go get the cup of coffee. Come back to the living room, go to the kitchen, pick up a fork, and so on through all the items that you requested.
00:31:39
Whereas with the Lazy API, you’re like, “Okay, I’m going to wait to get all of the instructions before anything’s executed.” And then you could figure out what’s the optimal way of doing all of these requests together. And so, in that example, we make one trip to the kitchen, pick up all the items, the coffee, the fork, the knife, the spoon, whatever, and then we come back to the living room. So, that’s great for understanding the differences between Eager and Lazy execution.
00:32:14
When you’re thinking about a data engineer or a data scientist who’s using Polars for the first time, what advice would you have for them around when they should be doing Eager or Lazy execution? Because it sounds to me like, oh, you might think, well, I should just always be doing Lazy execution if that’s always going to be more efficient, but yet, you built an Eager API, so there must be reasons why that’s useful as well. 
Ritchie Vink: 00:32:38
Yes, I built an Eager API that’s almost exactly the same as the Lazy API. That was a big requirement, because you don’t want to have two APIs that differ in ways that don’t make sense. Knowledge needs to transfer between those APIs. So, actually, if you’ve written against the Eager API, you can switch to Lazy by just changing the read function to scan, which starts a LazyFrame, and then finalizing the whole query with collect. But the Eager API, we made it because we recognized that interactive programming and doing data exploration is valuable to people. People who are in a notebook want to just read the CSV file or read the Parquet file and explore what they have, then they want to do an operation and see the result of that.
00:33:44
So, the Eager API, I would say, is for data exploration, for understanding your data, grooming your data. And when you have something that needs to run in production, or something that needs to run more often than once, I would say take those operations out of a notebook into a script, make a query out of it, and then make it a Lazy query. This will save a lot of compute.
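In Python, the switch looks roughly like this (file and column names are hypothetical):

```python
import polars as pl

# Eager: fine for exploring data interactively in a notebook.
df = pl.read_csv("data.csv")

# Lazy: the same pipeline, but read_* becomes scan_* and the query
# ends with collect(), so the optimizer sees the whole plan first.
df = (
    pl.scan_csv("data.csv")
    .filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").sum())
    .collect()
)
```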
Jon Krohn: 00:34:18
Nice. All right, thanks for that overview of when we should be choosing Eager or Lazy API execution. Something else that is unique about Polars is composable expressions. So, what are those, and why do they make a difference? 
Ritchie Vink: 00:34:40
So, here, I want to go back to Pandas. There is a, I think it’s called the assign method in Pandas. In the assign method, you often say, “Assign A = lambda.” And then you get the DataFrame as argument in the lambda, and then you can take the column from the data. But in Polars, we see the requirement of a lambda, the requirement of a custom Python function, as a failure of our API because you needed to go into Python to execute Python code because we didn’t provide you the API you required to do your data exploration.
00:35:29
And of course, there will always be the cases where we cannot do everything, but we want to offer you an API that’s composable. So, the realization is, if we need to make methods for everything you want to do, the API would explode because we cannot predict what you want to do. And if we could, you would get a humongous API, which would do everything under the sun, and it would not be discoverable. And also, knowledge of those methods would not transfer.
00:36:07
So, what the expressions are, the expressions are ways of composing data, and they are similar to a programming language. And if you recall, for instance, in Python, if you look at the vocabulary of Python, it’s not that big. You have if, else. You have some built-in structures, lists. But by combining all those words in the vocabulary, you can do anything you want. And with expressions, it’s similar. So, we give you operations and you can combine them. And by combining them, you can implement all kinds of stuff we couldn’t have predicted. And those expressions will run on their own engine, and they will run vectorized and parallel, and they will be optimized. 
00:36:56
So, you will get a lot of versatility. A lot of places where you would write a lambda in Pandas, here, you can use Polars expressions. We don’t have to use any Python. We can still optimize those expressions. We can run them in parallel, and they will be executed on the engine. So, they will be executed on kernels that are written from scratch in Rust and are vectorized, so they are very fast. You can think of it this way: what NumPy does with arithmetic, Polars’ expressions do with all kinds of operations.
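A small Python sketch of composing expressions instead of reaching for a lambda (the data and derived columns are invented for illustration):

```python
import polars as pl

df = pl.DataFrame({"name": ["Ann Lee", "Bo Chen"], "score": [7.0, 9.5]})

# Small building blocks composed into new logic, all of it running
# vectorized on the engine with no Python callbacks:
out = df.with_columns(
    pl.col("name").str.split(" ").list.first().alias("first_name"),
    ((pl.col("score") - pl.col("score").mean())
     / pl.col("score").std()).alias("score_z"),
)
```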
Jon Krohn: 00:37:37
Nice. That’s a great way of summarizing it there. Very cool. All right, so that covers this first set of questions that we had for you around Polars development. In this next section, I’d like to dig into your philosophy and the industry trends that you see relating to data and data processing. So, for example, with Moore’s Law slowing or ending as we make the features of chips so fine-grained that electrons could start hopping across them, we’re going to need more efficient hardware use. How does Polars aim to unify data processing for both small-scale and large-scale datasets?
Ritchie Vink: 00:38:37
Yeah, so with Moore’s Law ending, there’s one observation that’s different from the ’80s: back then, when you wrote an algorithm, you were not going to optimize it, because in the next two years, it would be twice as fast anyway. So, in certain fields, maybe that could make developers lazy. One thing that’s changing, because processor speed is not improving a lot per thread, is that there are still new developments. One is SIMD, where you have single instruction, multiple data. That means that you have special registers on a thread that can… 
00:39:23
For instance, for arithmetic, instead of doing a plus operation on a single float, you can do it on four floats at a time, which is a four times parallelism without actually using multiple threads. So, it’s four times parallelism on the same thread. So, that’s already something you can have but sometimes, the compiler can auto-generate this SIMD code. Other times, you will have to implement this manually. So, there’s one shift that, instead of writing the algorithms for a single thread, you now also have to make them SIMD aware or in a way that compiler generates SIMD code. 
00:40:04
The other one is that you actually need to change the algorithms because the single thread is slower. It’s not slower, but it’s not improving as much. What is improving is the amount of cores we have. As I said, I have 16 cores on my laptop, and I think 10 years ago, this was four. And I think this will keep increasing. I think we can horizontally get more and more cores on machines. I believe my phone has eight cores, which is already huge. But to make use of those cores, you need to write multithreaded code. So, that’s one thing. That’s what Polars open-source already does.
00:40:52
As a company, we’re now building Polars Cloud, which wants to do the same, but then for multiple machines. We want to be able to take a Polars query and scale that up to multiple machines. This, again, has whole different strategies, whole different challenges, but the principle is the same. The principle is, hey, one thing Polars is, is a query. And with that, there’s the description of what you want to run. And we can take that description and build a way of implementing the physical side of it, the physical side of taking that description and returning you the data. 
00:41:42
And for SQL, there are already tons of implementations. For Polars, there is currently only Polars open-source, which runs on a single machine. But if you want to scale Polars to larger datasets that don’t fit on a single machine, that’s not something that’s possible yet. So, that’s what we want to build. 
Jon Krohn: 00:42:05
Nice. Yeah, that’s an interesting take. And yeah, it brings together a lot of different concepts across hardware and software, particularly relevant to people working with data, like a lot of our listeners. Would you say that there are challenges associated with trying to have Polars be as efficient with large-scale data as small-scale data? So, it seems like Polars, it seems designed to be extra efficient for large-scale data processing. Would you say that that’s true? And do you think that there’s challenges with being efficient for both small-scale and large-scale datasets? 
Ritchie Vink: 00:42:48
Not per se. I mean the small scale is pretty easy, if it’s fast. 
Jon Krohn: 00:42:55
I guess so. That’s right. 
Ritchie Vink: 00:42:57
The only thing we have sometimes is a little bit of threading overhead. So, if we know we are in Eager mode and the data is super small, we choose to not go into the thread pool. So, there are some adaptations for that in Polars. Sometimes, we have a check in place, but the solution is just not going into a thread pool, which is a trivial solution. So, the small scale is pretty easy. The large scale, we’re doing at the moment, but we want to turn it up a notch. So, currently, Polars still requires your data to fit in memory after optimization. What we want to go towards is a new streaming engine, where the goal is to be able to process datasets that don’t fit into RAM; as long as it fits on disk with some spare space, it should be possible. That’s what we want to focus on. 
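For context, a streaming path already exists today as an opt-in; a minimal Python sketch (the dataset path is hypothetical, and the exact flag may differ across Polars versions):

```python
import polars as pl

# Process a larger-than-RAM dataset in batches instead of
# materializing everything in memory at once.
result = (
    pl.scan_parquet("big_dataset/*.parquet")
    .group_by("user_id")
    .agg(pl.col("amount").sum())
    .collect(streaming=True)  # opt in to the streaming engine
)
```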
Jon Krohn: 00:43:57
Cool. 
Ritchie Vink: 00:43:59
Another thing, because there’s a distinction between the query… So, the query turns into a DSL. You can think of it as an AST in Python, and you can hook any engine into that. You can translate that AST into different physical plans and hook it into different physical engines. So, last week, together with NVIDIA RAPIDS, we released GPU support for Polars. This is entirely possible because we can take this DSL and transform it into a physical plan for the GPU, which is a whole different use case. So, what I want to say is that you can have different engines, different back ends, for different data sizes, because you have the distinction between the front end and the back end. 
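Sketched in Python, that back-end swap is a one-argument change (assuming the optional GPU package, installable as polars[gpu], is present; unsupported operations fall back to the default CPU engine):

```python
import polars as pl

lazy = (
    pl.scan_parquet("data.parquet")  # hypothetical file
    .group_by("key")
    .agg(pl.col("value").mean())
)

# Same lazy plan, different physical engine: hand it to the
# RAPIDS-backed GPU engine instead of the default CPU one.
result = lazy.collect(engine="gpu")
```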
Jon Krohn: 00:45:01
I’ve had the pleasure of speaking at ODSC West many times over the years, and without question, ODSC West is one of my favorite conferences. Always held in San Francisco, this year it’ll be taking place from October 29th to 31st. As the leading technical AI conference, ODSC West brings together hundreds of world-class AI experts, speakers, and instructors. This year’s offering will feature hands-on sessions with cutting edge techniques from machine learning, AI agents, AI engineering, LLM training and fine tuning, RAG and more. Whatever your skill level beginner or experienced practitioner, you’ll leave ODSC West with new in-demand AI skills. Register now at odsc.com/california and use our special code podcast for an additional 10% off on your pass.
00:45:51
Nice. Yeah, thanks for that insight into the opportunities that lie ahead as well with Polars. Let’s shift gears a bit here from just the open-source project Polars, which every listener today can be using within their Python environment. In addition to that, there is a Polars company that you founded and that you’re a CEO of, and it’s now secured $4 million in seed funding, as far as we could tell online. And so, what are the plans to scale this company while maintaining the integrity and innovation that made the Polars project so popular in the open-source community? 
Ritchie Vink: 00:46:37
Yes. When I was thinking of starting this company, we were looking for something that we believe in that keeps Polars open-source as is, or makes it even better, because a successful Polars open-source also means a successful Polars company. And the important thing was to make a distinction between what’s open-source terrain and what’s company terrain. And what Polars open-source terrain is, is a single machine. That’s what my initial goal was for Polars open-source, and we’re still improving upon that goal. 
00:47:26
So, Polars, the company, is sponsoring a lot of man-hours to improve Polars open-source. We’re building a completely new engine that will be completely open-source. And yeah, we’re making Polars open-source better, because if we get more users, if Polars gets used more widely, I also think the Polars company will benefit from that. And as a company, we want to build Polars Cloud, which is actually an extension where you say, “Okay, we have Polars queries and now, I want to orchestrate them. Now, I want to run them serverless somewhere else, and I want this to be easy. I want to have fault tolerance. I want to have schema validation.” All that stuff, that’s Polars, the company, terrain. 
00:48:18
And if we look at the horizon of things, hopefully, in a few years, we can actually run Polars queries on any dataset size. There are still a lot of challenges to be solved there. So, first, we want to go into serverless, into partitioned data. But one thing we’re doing is using the Polars open-source engine as runners in our cloud. This also makes sure we’re committed to making the open-source engine better, because we’re actually using it in Polars Cloud as well. 
Jon Krohn: 00:48:54
Nice. So, as this Polars Cloud starts to develop, why would, say, an enterprise or a company choose to use Polars Cloud? What are the advantages of that, going for that commercial option, relative to using open-source Polars?
Ritchie Vink: 00:49:09
One is fault tolerance, ease of use, but also horizontal partitioning. The first releases will allow horizontal scaling if your data can be partitioned. And this is something you cannot do with open-source, or you need to implement it yourself. And eventually, we also want to support more and more queries where we can scale to larger datasets. But with distribution, there come a lot of challenges. Making sure the nodes are alive. Making sure that if a node has died, the query still finishes, or that we can pick the query up where it left off. All these complications are something we are going to build. 
00:50:00
And also, if you’re only interested in running Polars in memory and you want to run a query remotely, currently, you need to set up a server somewhere, install Polars, and then you need to serialize the query plan. But the serialization isn’t stable between versions, so you need to make sure you have the same versions on both sides. There’s just a lot of engineering hassle and stuff that can break very easily. It’s similar to how you can run SQL: if you write SQL here, you can run it remotely very easily. That’s something we want to enable with Polars as well. You run it locally and run it remotely, serverless, very easily. 
Jon Krohn: 00:50:45
Nice. This also gives us a great segue, this topic, into our audience questions. So, we had tons of reactions. I typically try to post a week in advance of me recording with a guest so that our audience has lots of time to ask questions. And we typically get a lot of reactions, a lot of interest in upcoming guests. Sometimes, we don’t get any questions. Sometimes, we just get a couple.
00:51:13
I posted that you were going to be my guest 12 hours before we started recording. And just since we started recording, we got another thousand impressions on the post. And so, we’re going to have tens of thousands of impressions very easily on this post. In just those 12 hours, we had already over a hundred reactions, over a dozen comments. And many of those comments are actually questions, which is nice. Yeah, so clearly, a lot of interest in what you’re doing.
00:51:46
And the segue, from talking about Polars Cloud, is that someone named Gaurav Singh Gupta, who’s the head of data engineering based in London, he says, “Ritchie, I hope you have the foresight and product curiosity to change Polars” to build Polars, I think he means, “into the next Databricks.” He says, “You obviously can do it.” So, that’s a good challenge. 
Ritchie Vink: 00:52:10
I hope I can. I hope I can. As I said, initially, this is not the problem we want to tackle, because you shouldn’t bite off more than you can chew, as the expression goes, right? But I foresee we can do a lot of cool stuff already. And I coined this diagonal scaling. So, there are two things I believe in. One is that vertical scaling gets more and more interesting, because the vertical scale of machines is increasing a lot. The amount of RAM you can get on a single machine is improving, and also, the amount of cores on a single machine is increasing.
00:53:01
That means that a lot of operations, a lot of queries that had to run horizontally in the past can now run in a single machine. And if you can run on a single machine, you absolutely should because you don’t have any overhead of sending data over the wire between the machines. You don’t have to shuffle the data. Shuffling data is one of the most expensive operations in the distributed environment. 
00:53:27
On a single machine, you can keep the data; you can share data between threads just by sharing pointers, so that’s free. We’re also building the new streaming engine, which will allow datasets that don’t fit into memory to run on a single machine. So, we think vertical scaling gets more and more interesting. Other than that, we want to slowly do some parts with horizontal scaling. So, if we can see, hey, this query starts with something that can run in a MapReduce manner, the optimizer can make a distribution plan that does part vertical, part horizontal, and it’s up to the optimizer to decide what will be faster.
00:54:19
So, that’s stuff we want to look at. And hopefully, in a few years, we can say, “Hey, we can run the whole Polars API in a distributed manner.” But then still, we want to have an optimizer that can recognize when it’s most cost-effective to do it vertically or horizontally, because I don’t think there’s a clear winner, and it depends on the query. 
Jon Krohn: 00:54:43
There you go. The vision is there, Gaurav. Yeah, there’s a lot of people out there who have a lot of faith in you. Someone named Mathias Colpaert just said hero.
Ritchie Vink: 00:54:55
Wow. 
Jon Krohn: 00:54:57
And Vincent Warmerdam, whose name I’m probably butchering, as I butcher all Dutch names. 
Ritchie Vink: 00:55:04
It’s close. 
Jon Krohn: 00:55:08
He was our guest in episode number 659. And actually, that was the most popular episode of our show in 2023. 
Ritchie Vink: 00:55:15
Really?
Jon Krohn: 00:55:16
And so, Vince said that I should ask you how you got your Polars keycaps made. He said it’s the best dev swag ever. What is a keycap? Oh, okay, yeah. So, in the video version, people can see that Ritchie’s holding one for me on camera. But what is that? Oh, it’s for your keys, for your keyboard. 
Ritchie Vink: 00:55:44
Yes. So, if you have a keyboard, for the listeners, the escape button, we changed it with a keycap with a Polars logo. 
Jon Krohn: 00:55:53
I was thinking it was something to do with house keys. Of course. 
Ritchie Vink: 00:55:59
Vincent, I gave him one on the PyData Amsterdam last week. So, he was really happy with it. 
Jon Krohn: 00:56:05
Nice. Yeah, he says, “Best dev swag ever.” That’s great. And it’s good product placement for you. You get to have your Polars key in front of your key target audience all the time, 24/7. Well, I guess they’re probably not working 24/7, but anytime that they’re working, they’ve got that reminder. That’s brilliant. Nice.
00:56:30
Next comment is from Magdalena Kovalczuk, who is a longtime listener. She’s interacted with us a lot. She sent me a photo of outdoor artwork that she did. Some people would call it graffiti, I guess, but it’s more like artwork. And so, Magdalena created this polar bear artwork in Amsterdam. It’s a gigantic mural, and she sent me a selfie of you in front of it. So, in the video version, we’re going to overlay that as I’m describing it right now, so that people can see it. So, that’s really cool. I don’t know if you want to tell us a little bit more about that.
Ritchie Vink: 00:57:11
Yeah, it’s really cool. I saw on LinkedIn one day, two murals: one for Narwhals and one for Polars. And she said it was for the Polars project. I was like, “Whoa, this is insane. This is great.” I mean, look at the thing. You cannot think that someone makes a mural for your project. That’s the biggest honor. So, she gave me the geolocations and it was in Amsterdam. I had to bike out there before there was real graffiti on top of it. I had to look for a while, took an hour of biking and wandering around, but I found it. 
Jon Krohn: 00:57:54
Nice. And Magdalena also had a question for you. She said that she can’t wait for this episode. She’s really excited to hear Ritchie’s stance on expanding the Polars ecosystem. So, given that Polars is a relatively new data manipulation library, it’s understandable that its ecosystem isn’t very extensive yet. Does the Polars team expect the ecosystem to grow organically, or are there strategic efforts in place to encourage the development of tools that integrate with Polars? 
Ritchie Vink: 00:58:24
Yes, there are a few things. I am a big fan of organic growth, by implementing the technical requirements for that growth to secure- 
Jon Krohn: 00:58:39
So, on that note, she mentions Narwhals, which is Marco Gorelli’s project. We talked about him at the outset. He’s the person that introduced me to you. He’s in episode 815. We talked about Narwhals in that episode. And so, that, for example, is a lightweight compatibility layer that allows… You could probably speak about it better than me.
Ritchie Vink: 00:58:58
Yeah, yeah. So, the observation here is that when Pandas came out, Pandas was, and probably still is, very much synonymous with the word DataFrame in Python, right? If you wanted DataFrame support in Python, every library just added Pandas support and was done with it, because there were no other DataFrame libraries that people used in significant numbers. But I hope Polars is challenging that status quo. This also poses a problem for library maintainers, because now, if you’re a plotting library, say, and you want to support arbitrary DataFrames, you don’t want to depend on Pandas and Polars and X and then have if-else branches and duplicated code for all those libraries. 
01:00:01
So, what Marco realized, he made Narwhals. Narwhals is targeted for library maintainers, and it uses a subset of the Polars API, and it works as DataFrame in, DataFrame out. So, if you pass a Dask DataFrame, for instance, you will get returned a Dask DataFrame as a result. If you pass a Pandas DataFrame, you will get a Pandas DataFrame as a result. If you pass a Polars DataFrame, you will get a Polars DataFrame as a result. So, as library maintainer, you can use this subset, this API, and it will automatically work with any DataFrame library, the consumer, the user gives to your library. And it also makes your library very lightweight. 
01:00:48
So, Altair, the plotting library, made the switch to Narwhals and dropped Pandas as a dependency. And as a result, they don’t have any big binary dependencies anymore. They’re now super lightweight, and I believe it’s pure Python now, which saves a lot of data sent over the wire.
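A minimal Python sketch of the DataFrame-in, DataFrame-out pattern Ritchie describes, using the Narwhals package (the function and column names are invented):

```python
import narwhals as nw

def top_rows(df_native, n: int = 5):
    """DataFrame in, DataFrame out: works with pandas, Polars, Dask,
    and other supported frames, returning the same type it was given."""
    df = nw.from_native(df_native)               # wrap the incoming frame
    out = df.filter(nw.col("value") > 0).head(n)
    return out.to_native()                       # unwrap to the caller's type
```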
Jon Krohn: 01:01:14
Nice. Yeah, so great example there in Narwhals of this kind of organic development of integrations. And I interrupted you to blurt out Narwhals. Is there anything else you’d like to add on that front in response to Magdalena’s question?
Ritchie Vink: 01:01:26
Yeah. I would like this to come organically, but we give people the technical possibilities to do so. One such example is the Polars plugins, or something that NumPy did, for instance, with ufuncs: set up the architecture so people can build their own custom logic and get the benefits of that. So, with Polars plugins, there is the architecture for you to build your own expressions, and we also added Polars IO plugins, so you can come with your own data formats. This way, the ecosystem can grow beyond what we, as a company, see Polars as, toward what users themselves want Polars to be. 
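Full expression plugins are compiled Rust, but Polars also exposes a pure-Python extension point in the same spirit; here is a sketch using the expression-namespace registration API (the “units” namespace and its method are invented for illustration):

```python
import polars as pl

@pl.api.register_expr_namespace("units")  # custom, user-chosen name
class Units:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def mm_to_m(self) -> pl.Expr:
        # Composed from built-in expressions, so it still runs on
        # the engine, vectorized and optimizable.
        return self._expr / 1_000

df = pl.DataFrame({"length_mm": [1500, 2300]})
print(df.select(pl.col("length_mm").units.mm_to_m()))
```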
Jon Krohn: 01:02:21
Great answer. We look forward to more polar bear murals from Magdalena in the future. Our last questions, we’ll see. We might just pare it down to one. We’ll see. We might just pick one of the best ones, comes from Svetlana Hansen. So, Svetlana is a longtime listener to the show. She’s been commenting, interacting with me since I took over as host of this show four years ago. It’s great to have you listening for so long, Svetlana. 
01:02:47
Svetlana is a senior software engineer who works on projects for NASA spacecraft in Houston, so fascinating individual in her own right. And she said, “Great choice picking Ritchie for the show. I’m looking forward to hearing what he has to share.” And yeah, she gave us three questions on the future of Polars and data science. So, this is a very nice way to start to wind down the show, which is what new features or enhancements can we expect from Polars in the near future that would impact a data scientist in particular? 
Ritchie Vink: 01:03:28
The biggest one here is the new streaming engine. This will be a complete rewrite of the streaming engine, custom-made for the Polars API. When we built the initial streaming engine, it was a beta experiment, and we bounced against the wall with some problems. And the literature also didn’t help as much, because most literature is written for relational algebra, for SQL, for row-wise data processing. And Polars has a unique model, which we coined the DataFrame API, or the Polars DataFrame API, which has columnar semantics. And as a user, you can access columns at any point in the API, and actually also access all the data, which isn’t very streaming friendly. 
01:04:37
So, the model we had to make was different. We’re still making a streaming model, but it had to be specially adapted to the Polars API. This will be implemented. As a user, not much will change per se, but as a data science user, you will have the power to process bigger datasets on your machine and get more performance. So, it will have a lot of benefits, but it’s also a silent enhancement.
Jon Krohn: 01:05:11
Nice, Ritchie. Well, thanks for those insights into the exciting streaming functionality that is coming and will be useful to all of us. Lots of exciting things happening at Polars. I really appreciate you taking the time to answer all my questions, as well as these audience questions. There’s clearly a lot of engagement, and I know that this is going to be a super popular episode. Ritchie, before I let you go, do you have any book recommendations for us?
Ritchie Vink: 01:05:35
Yes. No, I don’t read a lot of books anymore. I cannot find the time anymore. I know that there might be some book giveaways. I’m going to give the boring answer and maybe recommend some Polars books. They are not written by myself, but I have the honor that there are people writing Polars books. Matt Harrison has a Polars book. Yuki has a Polars Cookbook, and there’s also a new O’Reilly Polars book coming in January. The new book from O’Reilly is written by two ex-colleagues of mine. And once in a while, they come to the Polars office and hammer us with questions. So, I know some of our answers might’ve flowed back into the book. 
Jon Krohn: 01:06:34
Nice. Great recommendations there on Polars books. And yeah, this is hot off the press at the time of recording. I don’t have more detail, but Matt Harrison, whom you mentioned there, the author of Effective Polars, he’s written so many books, including things like Effective Pandas, which did really well for him. So, his Effective Polars book, he has offered to give away some free copies, but I haven’t had a chance to speak with him yet about how many and that kind of thing. But you can anticipate, listener, that when Ritchie’s episode comes out on that day, we’ll have a book giveaway for some number of Matt Harrison, Effective Pandas. Sorry. Oh, my goodness, that’d be hilarious. We’re giving away his Effective Pandas book. No, we’re giving away Effective Polars.
01:07:26
And yeah, that association is really cued up in my mind because I did a whole episode with Matt Harrison about a year ago. The episode was called Effective Pandas and that’s all we talked about. So, I’ve had a strong neural connection there. But over time, I’m sure I will be thinking Effective Polars, Effective Polars more and more. So, the Effective Polars book, yeah, we’ll have some to give away. So, people who comment or reshare on the post from my personal LinkedIn account announcing Ritchie’s episode on the day that it comes out, yeah, if you just mention in that comment or reshare that you would like a copy of Matt Harrison’s book, we’ll figure out how to get you that access. 
01:08:12
All right, Ritchie, thanks so much for being on the show. For people who want to stay up to date on your latest or Polars latest, how should they do that? I know, for example, that there is a Polars Discord channel. 
Ritchie Vink: 01:08:26
Yes, it’s very active. I would definitely recommend hanging out there if you want to learn more about Polars or more about Polars insights. There are very active power users there. So, if you want to understand, hey, why does this not work or how does this… I don’t know. If you have any questions there, we have a super active Discord. You can also keep me awake if you post a bug in the middle of the night. So, yeah, see you there. 
Jon Krohn: 01:09:03
Awesome. All right, Ritchie, thanks so much for taking the time. And maybe we’ll catch up again with you in a few years and see how your Databricks dethroning Polars Cloud- 
Ritchie Vink: 01:09:14
Yeah, really. 
Jon Krohn: 01:09:14
… project is coming along. 
Ritchie Vink: 01:09:15
Cool. Thanks for having me. 
Jon Krohn: 01:09:21
What a treat to have Ritchie on the show. In today’s episode, he filled us in on Polars’ use of Rust, the Arrow memory model, and multithreading to dramatically speed up DataFrame operations. He also talked about the importance of Lazy execution and query optimization for maximizing performance, and his plans to develop Polars Cloud for serverless and distributed data processing at scale. He talked about upcoming features like a new streaming engine to handle datasets larger than RAM in Polars, and he talked about the growing Polars ecosystem, including integration libraries like Narwhals and community-created extensions. 
01:09:55
As always, you can get all those show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Ritchie’s social media profiles as well as my own at www.superdatascience.com/827. 
01:10:10
And if you’d like to connect in real life as opposed to online, on November 12th, I’ll be conducting interviews in New York at the ScaleUp:AI conference run by the iconic VC firm, Insight Partners. This is a slickly run conference for anyone keen to learn and network on the topic of scaling up AI startups. One of the people I’ll be interviewing will be none other than Andrew Ng, one of the most widely known data science leaders, and I’m very much looking forward to that.
01:10:37
All right. Thanks, of course, to everyone on the Super Data Science Podcast team: our podcast manager Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masis, writers Dr. Zara Karschay and Silvia Ogweng, and Kirill Eremenko, the show’s founder. Thanks to all of them for producing another efficiency-increasing episode for us today. 
01:10:56
For enabling that super team to create this free podcast for you, we’re deeply grateful to our sponsors. Thank you to them and thank you for supporting the show. You, listener, can support the show simply by checking out our sponsors’ links, which are in the show notes. And if you, yourself, are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. 
01:11:19
Otherwise, share this episode with people who might like it. Review it on your favorite podcasting platform. Subscribe, if you’re not a subscriber. And I don’t know, any other way you can support the show, but most importantly, just keep on tuning in. I’m so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there. And I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 