Jon Krohn: 00:00:00
This is Episode #523 with Wes McKinney, creator of the Python pandas project, co-creator of Apache Arrow and CTO of Voltron Data.
Jon Krohn: 00:00:14
Welcome to the SuperDataScience podcast. My name is Jon Krohn, a chief data scientist and bestselling author on deep learning. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Jon Krohn: 00:00:44
Welcome back to the SuperDataScience podcast. I am beside myself that we have the data legend himself, Wes McKinney, as our guest on this episode. Wes created the pandas project, an open-source Python library that has become the global data science industry standard for working with data and that is used within half a million other open-source software projects on GitHub. More recently, he co-created Apache Arrow, another open-source software library that is language-agnostic and enables the execution of efficient data analytics on modern CPU and GPU hardware. On top of that, Wes authored two editions of an international bestselling book and now classic desk reference called Python for Data Analysis. He worked as a technical expert at some of the world’s most prestigious companies, including Cloudera, RStudio, Two Sigma and AQR. He now serves as co-founder and CTO of Voltron Data, a firm focused on accelerating the development and the impact of his open-source Apache Arrow project.
Jon Krohn: 00:01:54
In today’s episode, Wes takes us on a detailed technical deep-dive through the creation story of his now ubiquitous pandas library, the content of his iconic book and a sneak peek at the forthcoming edition, what the Apache Arrow project is and why it’s poised to revolutionize the data science and software industries, perhaps even more so than pandas has. And then, he also talks about the software and hardware tools that he uses day to day to be such an epically productive software developer and entrepreneur. Today’s episode will be of interest to anyone who wants to get deep into the mind of one of the most influential open-source software developers of our lifetime. Given his rich expertise, this is an especially technically dense episode, but both Wes and I did our best to break down concepts so that their rough shape and impact can be appreciated by anyone. All right, you ready for this incredible episode? Let’s do it.
Jon Krohn: 00:02:57
Wes McKinney, welcome to the SuperDataScience podcast. I can’t believe you’re here. I’ve been so excited in the run-up to recording of this episode. How you doing and where in the world are you calling in from today?
Wes McKinney: 00:03:11
I’m good. Thanks for having me on the program. I’m here at home in Nashville, Tennessee, where I moved about three years ago.
Jon Krohn: 00:03:21
Nice, and you were in New York before, right, for quite a while?
Wes McKinney: 00:03:25
Off and on for quite a while. I was in the Bay Area for a few years before that, but otherwise I was in New York with a year in the middle at… I started a PhD at Duke in North Carolina, and then dropped out to work on open-source software.
Jon Krohn: 00:03:45
Well, that seems to have been a good decision and I suspect that the Nashville decision was probably a good one too. Would you say that the weather is superior to San Francisco or New York?
Wes McKinney: 00:03:56
Yeah, it’s definitely hot, hot, hot in the summer, but it’s pretty temperate the rest of the year. Right now, it’s beautiful. The winters are… It does get cold. Maybe it snows once or twice, but the winters are pretty mild. It rains a surprising amount.
Jon Krohn: 00:04:19
Really?
Wes McKinney: 00:04:20
But all in all, it’s great. You have mountains within driving distance, and the City of Nashville is booming. So I certainly didn’t know about COVID when I moved here, but when it hit, I was happy to have the extra space in my apartment to roam about.
Jon Krohn: 00:04:47
Yeah, I can say from experience that being in Manhattan during a COVID pandemic is not the place you want to be, but I found my way through. If the weather’s nice, things must be opening up. They’re probably almost fully back to normal in Nashville now.
Wes McKinney: 00:05:03
Yeah, they are. They’ve been pretty much back to normal for a long time. It’s funny. It seems like for some parts of the state, the State of Tennessee, there wasn’t really much of a lockdown. The City of Nashville itself did have a closure of nonessential businesses, indoor dining, for a period of time, but it opened up. Indoor dining reopened last summer-
Jon Krohn: 00:05:42
Oh, wow.
Wes McKinney: 00:05:43
… the summer of 2020. I think even during the COVID peak around December, January, things stayed open, but I was definitely happy that… Because I’m in a red state, I was able to get vaccinated right at the start-
Jon Krohn: 00:06:05
Yeah, I bet.
Wes McKinney: 00:06:06
… right at the start of April, and so that brought down my anxiety levels. And so, I’ve been comfortable resuming travel, and visiting friends and family, things like that.
Jon Krohn: 00:06:17
For our international listeners, red state there means that there’s relatively conservative politics, and so it means that the queue to get a vaccine was really short. I did have to wait around here in New York, in this blue state as we say, but all is well now. Yeah, so it seems like you definitely lucked out there with Nashville during the pandemic as far as luck goes in a situation like that. So Wes, you were introduced to me by Jared Lander, who was on Episode 501, but we also have some other connections or some other recent guests, Claudia Perlich in Episode 437, Drew Conway in 511. And with that group of people, I felt pretty confident that you had a lot of connections to the New York R community, also knowing that you speak regularly at the New York R Conference. So I asked you about that just before we started recording and discovered that, yes, indeed unsurprisingly, you have been involved in the R community for a long time, but that is not what you’re most known for, I suppose. So amongst a lot of achievements, perhaps one of the achievements that you stand out for most is being the creator of the pandas library in Python, which I’d love for you to dig into a bit of the history of how you got started with pandas, but I like the connection there between being involved with R so much a decade ago in, I suppose, R’s heyday in data science.
Jon Krohn: 00:08:01
And so, R had, still has, this data structure called the data frame that allows you to mix multiple different data types, so you don’t have to have a matrix of just float values or integers. You can mix in character strings with numbers, and so this is very helpful for working with data, and pandas that you created allows us to have that same kind of data frame functionality in Python and also quite a lot more. So I’ve given a long intro there. First of all, let me know what I got wrong, and then let us know about, yeah, the history of pandas and why you created this library that is now one of the most widely used pieces of software in the entire world and, yeah, if there’s any relationship to R data frames there.
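To make that mixed-type point concrete, here is a minimal pandas sketch; the column names and values are invented for illustration.

```python
import pandas as pd

# Like an R data.frame, a pandas DataFrame holds a different type per
# column, so strings, floats, integers and booleans can live side by side.
df = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "GOOG"],       # character strings
    "price": [189.5, 402.1, 141.8],           # floats
    "volume": [1_000_000, 750_000, 500_000],  # integers
    "active": [True, True, False],            # booleans
})

print(df.dtypes)
```

Each column keeps its own dtype, which is exactly the property a homogeneous matrix of floats or integers cannot give you.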
Wes McKinney: 00:08:54
Sure, there’s a lot to unpack there and there’s certainly a lot of history. I got a degree in pure math in college and did a dabbling of computer science. I did more mathematical optimization, like the math side of computer science and electrical engineering than I did of actual programming and software engineering, but I got a job in quantitative finance at AQR Capital Management at the-
Jon Krohn: 00:09:31
Oh yeah.
Wes McKinney: 00:09:32
… at the opportune moment of August 2007, right as the-
Jon Krohn: 00:09:36
Oh no.
Wes McKinney: 00:09:37
… right as the financial crisis was beginning. And I remember it because my first day was on August 13th, 2007, and the week before that was the much-studied equity crisis that really only… There was a hedge fund or something that failed the week of August 6th and because a lot of the quant hedge funds used the same risk models, when they sold off their portfolio, it caused these correlated disruptions in portfolios where, if you were a quant fund that used the same Barra risk model, you would lose a lot of money because of the way that the mean variance optimization worked. But as a retail investor, you would not see those same kinds of… It wasn’t like the stock market went down 20%. It wasn’t like Black Tuesday or whatever happened in 1987. Might’ve gotten the name of that wrong, Black Monday, something like that, black something, but yeah.
Wes McKinney: 00:10:51
But anyway, I was there at AQR and I had some colleagues that got me interested in the Python language. They had written some tools to automate certain business processes and I really found the language really attractive. And in our job, we were doing a ton of Excel and a ton of SQL. It just struck me how difficult it was to work with data, so I started to be interested in, “Well, maybe I could create some tools for myself to make it easier to do my job and automate my job so that I can either work less, or I can spend my free time working on other things other than the drudgery of data manipulation.” But indeed, at the same time, I had some colleagues that were users of R. And of course, R was hugely different in 2007, 2008. Now, so many of the things that we take for granted in R, like RStudio as a development environment, tools like dplyr, the tidyverse, the whole pipe concept and many of the modern libraries that have made R such… I would argue with you. You said the heyday of R was a decade ago. I think the heyday of R is-
Jon Krohn: 00:12:18
I know.
Wes McKinney: 00:12:18
… now.
Jon Krohn: 00:12:18
I know.
Wes McKinney: 00:12:18
The heyday of R is right now, I think. I tip my hat to the work of the R community and what they’ve done to really build out a really amazing set of libraries to empower data science work, but in any case, I was exposed to R. And so, when I initially started writing the code that turned into pandas, which was in April 2008, I had the positive and negative experiences of using R. And so, I was trying to capture some of the good things about R data frames and that way of working with data and going about statistical modeling, running regressions, doing data manipulation, cleaning up data, but adding a bunch of additional features that R did not have that were causing me a lot of pain and suffering. So we were working with a lot of irregular data that hadn’t gone through a lot of normalization and data cleansing, and so you would read two different tables from two different databases in Microsoft SQL Server, and then you’d have this data integration problem of how do we join and merge these things together. And so, I was hit by all manner of a cascade of problems in R, from stringsAsFactors = TRUE silently doing the wrong thing when merging data frames together. This was before dplyr and data.table, which can do these joins properly.
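The join problems Wes describes are the kind of thing pandas now makes explicit. A small sketch, with made-up tables, of a merge that surfaces mismatches instead of silently dropping or duplicating rows:

```python
import pandas as pd

# Two tables pulled from different sources, as in the story above
# (table and column names here are invented for illustration).
prices = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
names = pd.DataFrame({"id": [2, 3, 4], "name": ["b", "c", "d"]})

# An explicit join: `how` controls which rows survive, `validate`
# raises an error instead of silently producing a wrong result, and
# `indicator` records which table each row came from.
merged = prices.merge(names, on="id", how="outer",
                      validate="one_to_one", indicator=True)
print(merged)
```

The `_merge` column shows `left_only`, `right_only` or `both` for every row, so integration mismatches between the two sources are visible rather than hidden.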
Wes McKinney: 00:14:02
So I felt really burned by that. I said, “This shouldn’t be so easy, so maybe we can build this data integration, data alignment logic into the data frame data structure itself.” And that’s how pandas’ indexing, like row index and column index concepts came about, and so they provide this automatic data alignment feature, which is super useful if you have that problem. And now, 13, 14 years later, people are cursing me for having this indexing feature in pandas, that it can be a nuisance to people that don’t need it, but to me at the time, it was super useful. And so, I spent several years. I spent a total of three years at AQR and worked really hard to build a foundation of tools that enabled AQR ultimately to transition to Python as a quant modeling language and to build a lot of production systems that now are essential parts of how the business operates, and that experience of working with my colleagues and building tools for them and just working to make everybody productive is how I became passionate about working on data analysis tools in the first place, because I got to see firsthand the kinds of productivity gains that are possible when you go from tools that are a nuisance to use or that are cumbersome or don’t fit the problem that you’re dealing with.
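A minimal illustration of the automatic data alignment that the row index provides (labels and values invented here):

```python
import pandas as pd

# Two series with overlapping but not identical row labels.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Arithmetic aligns on the index automatically: matching labels are
# combined, and non-matching labels produce NaN rather than silently
# pairing values by position.
total = a + b
print(total)
```

Labels present in only one series come out as NaN, which is the automatic-alignment behavior described above: useful when your data is irregular, a nuisance when you don't need it.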
Wes McKinney: 00:15:45
And if you build the ideal tool that has the built-in features where people can think naturally about what they want to do with the data rather than the low-level mechanics of like, “Well, how do I get this to join properly, and how do I deal with this missing data?” things like that. That was the early days of pandas, and AQR graciously allowed me to open-source pandas in the middle to latter half of 2009, which wasn’t something that a lot of financial firms were doing at the time. I think Jane Street was one company that was one of the early movers in open-sourcing their tools in the OCaml community. But yeah, hedge funds are notoriously secretive, so pandas was an anomaly in terms of open source at the time, and I think that it helped open the floodgates of more financial firms releasing their internal tools with the hopes that they would become more widely adopted, and it would influence the Python community or the programming community in the open-source ecosystem in ways that are beneficial over time, but yeah. I never did any R package development. I never published-
Jon Krohn: 00:17:08
Right, right, right.
Wes McKinney: 00:17:10
I’ve never published personally an R package to CRAN, although lately we do have really active collaborations with the R community. So the story about how I got connected with Drew Conway and Jared Lander was I dropped out of grad school in 2011, because I felt really passionately about spending… I initially took leave and I said, “Well, I’ll spend a year and just develop.” I had saved. I didn’t have any college debt, so that was an enormous privilege to not have monthly debt payments, and I had saved as a result of not having any college student loan debt and having lived frugally while I was working in finance. I had saved enough money such that I did the math and said, “Well, I can support myself living a pretty bare-bones lifestyle in New York, and I could spend a year working on pandas and see what happens.”
Jon Krohn: 00:18:17
Cool.
Wes McKinney: 00:18:17
So during that time, I started attending meetups and one of the active meetups at that time was the New York R meetup, and I talked to Drew Conway, and he saw that there was growing interest in Python and he said, “Why don’t we open up this meetup to more than just R?” And at that time, Drew was getting busy with some other things, and I met Jared and Drew said, “Hey, why don’t you and Jared run the meetup?” So Jared and I became the organizers of what was the New York R meetup and became the New York Open Statistical Programming meetup. And yeah, we’ve all become friends and have known each other for a long time now. In effect, because Jared is a really good organizer and really good at finding speakers, and doing logistics and all that, and ordering pizza in particular… Jared is a [crosstalk 00:19:13]-
Jon Krohn: 00:19:12
Yes, ordering-
Wes McKinney: 00:19:14
Jared is the resident pizza expert in the R community and the programming community in New York, and so I mostly just had to show up to the meetups and Jared ended up doing almost all the work. So I was grateful for that. I tried my best, but meetup organization and planning events are not my forte.
Jon Krohn: 00:19:39
All right. Well, you might be being too humble.
Jon Krohn: 00:19:48
Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criterion is the Data Science Insider. If you weren’t aware of it already, the Data Science Insider is a 100% free newsletter that the SuperDataScience team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning and artificial intelligence. The top five, simply five news items. The top five items are handpicked, the items that we’re confident will be most relevant to your personal and professional growth. Each of the five articles is summarized into a standardized, easy-to-read format, and then packed gently into a single email. This means that you don’t have to go and read the whole article. You can read our summary and be up to speed on the latest and greatest data innovations in no time at all.
Jon Krohn: 00:20:53
That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do. I skim the Data Science Insider newsletter every week. Those items that are relevant to me, I read the summary in full. And if that signals to me that I should be digging into the full original piece, for example, to pore over figures, equations, code or experimental methodology, I click through and dig deep. So if you’d like to get the best signal-to-noise ratio out there in data science, machine learning and AI news, subscribe to the Data Science Insider, which is completely free, no strings attached, at www.superdatascience.com/dsi. That’s www.superdatascience.com/dsi. And now, let’s return to our amazing episode.
Jon Krohn: 00:21:43
Well, it’s nice to hear that whole story. It’s interesting. Over the last few months, having had Jared and Drew and then now you on the program, I’ve been able to piece together for myself. And for any listeners that have been listening to those episodes, again Jared was 501 and Drew was 511. We’re getting more and more vantage points on the history of what is such an influential community. So that community, what is now called, as you say, the Open Statistical Programming meetup, it is the world’s largest community for these kinds of open-source efforts, like R and Python.
Jon Krohn: 00:22:20
And yeah, so it’s interesting to hear where it all started and meet the people that were involved in that. Going back to the very beginning of when I started introducing your work with pandas, I deeply regretted as soon as it came out of my mouth that I said R’s heyday was 10 years ago, and so I’m glad that you caught me on that. It is absolutely. It’s so interesting. I think you put your own life experience. You overlay that on the real objective history that’s out there. And for me, 10 years ago, I was using R every day. It was my bread and butter, and now Python is my bread and butter every day. So I have this idea in my head that like, “Oh, that’s how things have shifted in the data science world in general,” but that’s absolutely not true. R is an absolutely massive ecosystem. It’s bigger than ever. And yeah, all of these kinds of tools that you mentioned, the tidyverse that allows us to work with R a lot more easily, is brilliant and what the RStudio ecosystem has brought along.
Wes McKinney: 00:23:33
Yeah. If you look at 10 years ago versus now, I don’t know what the difference in the number of data practitioners… The whole field of data science has undergone an order of magnitude or more-
Jon Krohn: 00:23:56
Oh, for sure.
Wes McKinney: 00:23:57
… of growth. Yeah, the market share of Python and R, it’s largely dominated. So yeah, I think a lot of the work that we’ve had to do has been how do we make these ecosystems more hospitable to millions of users, and so we’ve come out of a period of time where you had to be a hardcore hacker to really be successful doing things, at least doing data work in Python. So a lot of the systems and tools that have been built address how do we enable Python to have as many users as Excel has. When you think about how many users Excel has, it’s most [crosstalk 00:24:48]-
Jon Krohn: 00:24:47
Right.
Wes McKinney: 00:24:48
It is in essence. Excel is the most popular programming language in the world if you look at it a certain way.
Jon Krohn: 00:24:54
Yeah, yeah. It’s unfortunate. Yeah, you feel bad for all those Excel users out there. I think through things like these kinds of podcasts, people do realize more and more, especially as datasets get larger and larger, that there is this big opportunity to be moving away from tools like Excel, where maybe SQL can be your gateway drug to something like R and Python. And Excel certainly, or other spreadsheet tools, like Google Sheets, they’re hugely useful and I do use spreadsheet software for a lot of day-to-day tasks, but it’s great that people like you and places like AQR have been supporting this open-source movement that has allowed things like the Python pandas library to, yeah, allow us to much more efficiently be managing large datasets and be performing a lot of operations that wouldn’t be possible in Excel.
Jon Krohn: 00:25:56
So on that note, if people aren’t already very familiar with how to be using Python for data analysis, you’ve written a book on that, which is an absolute bestseller for O’Reilly. It’s Python for Data Analysis. If you’re watching the video version, I’ve got my copy right here on camera, which is heavily dog-eared. It’s a great reference book for lots of functionality for working with data in Python, so it introduces working with Jupyter Notebooks (back in 2012, IPython Notebooks as I learned about them), which also actually, even that word Jupyter, it’s interesting how it shows how in data science it can be so useful to be familiar with Python and R, and maybe even Jupyter such that they put all three of those words into one.
Wes McKinney: 00:26:52
Julia, Julia.
Jon Krohn: 00:26:54
Oh, did I say Jupyter? Yeah, Julia. They put Julia into Jupyter. It was really, really clever. Julia, R and Python into Jupyter, so you introduce working with Jupyter Notebooks, working with NumPy, and then of course also working with pandas. And in particular, you have chapters on working with time series data and financial data, which makes sense given your history in financial markets. And certainly, a lot of people do work with those kinds of data and they can be some of the most difficult to munge around with. So the first edition came out in 2012. The second edition was in 2017 and I understand that a third edition is in the works, Wes? And yeah, [crosstalk 00:27:39]-
Wes McKinney: 00:27:39
Yeah, I am slowly editing my way towards a third edition, since I keep busy with a lot of other projects and the book is not my principal project, like it was in 2012. I wrote the first version of the book during my one-year self-funded sabbatical. But yeah, so I expect that the third edition will be out in 2022, so next year, so that’ll be about 10 years since the first edition came out. And given that the book has become a mainstay in many classrooms and university courses, and a supplemental text for data science courses, I think it’s important for me to continue to maintain the book, to keep it up to date with the latest and greatest improvements in pandas, to make sure that the code examples still work, and to fix the other things that bit rot over the years. The way that we install and manage Python packages has changed a lot in the last 10 years, and so those aspects have to be kept up to date in the book. So I hope to create better digital resources for readers of the book with this third edition, so look out for more from me on that in early 2022 when it gets closer to the publication date.
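As a flavor of the time-series machinery the book's chapters cover, here is a small sketch with invented dates and prices:

```python
import pandas as pd
import numpy as np

# A year of fabricated daily prices (values invented for illustration).
dates = pd.date_range("2021-01-01", periods=365, freq="D")
prices = pd.Series(np.linspace(100.0, 150.0, 365), index=dates)

# Resample the daily data down to month-end averages,
# then compute month-over-month percentage returns.
monthly = prices.resample("M").mean()
returns = monthly.pct_change()
print(returns.head())
```

Resampling, frequency conversion and shifting like this are the kinds of operations that make irregular financial data far easier to munge in pandas than in a spreadsheet.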
Jon Krohn: 00:29:24
Nice, yeah. No doubt it will continue, edition over edition, to become more and more of an absolutely essential resource for the data scientist who uses Python. And yeah, so if people want to see the early work on this third edition, if you have a subscription to the O’Reilly learning platform, then some of the early chapters of that third edition from Wes are already available for you to check out. All right, so we’ve talked about pandas in detail and about how people can be getting started or getting more deeply involved in using Python for data analysis. The next topic that I’d like to cover is-
Wes McKinney: 00:30:09
If I could-
Jon Krohn: 00:30:09
Oh yeah.
Wes McKinney: 00:30:10
If you could pardon me for a few minutes to talk a little bit about what has happened with pandas in the last eight or-
Jon Krohn: 00:30:20
Oh yeah.
Wes McKinney: 00:30:20
… eight or nine years, because I think one of the unfortunate things about projects like pandas that take place over a long period of time is that often people really focus on the origin story of how the project started, but a lot of people don’t realize that I haven’t been actively involved as an individual maintainer in pandas since 2013, so it’s been more than-
Jon Krohn: 00:30:40
Oh, wow.
Wes McKinney: 00:30:41
It’s been more than eight years now, and so the project has obviously flourished and become a ubiquitous tool that everybody uses, and that work of building out the project and sanding the rough edges, building lots of new features, doing performance optimizations on all of the odd corners of the project, dealing with memory use problems and adapting the library to suit the needs of the modern generation has been carried out by an extremely passionate but also very small core team of developers whose names some people know, because they’re active in pandas’ development and they say, “Who’s reviewing my pull requests and who’s fixing all these bugs that I report?” but pandas has had a total of… Between issues and pull requests, our issue and pull request numbers in the GitHub project are over 40,000.
Jon Krohn: 00:31:45
Wow.
Wes McKinney: 00:31:45
So that’s just an astronomical amount of work that’s been driven for years and years by Jeff Reback and Brock Mendel, Joris Van den Bossche, Philip [Augsberger 00:31:56], and we have a really passionate core team that continues to grow. And so, I try to sing their praises as often as I can, because large open-source projects require a community, and pandas has grown a massive community that’s been really active in recruiting new contributors and lowering the hurdle to open-source development. And so, there are many really successful open-source developers whose first pull request, or first open-source contribution, was to pandas, because pandas has made itself easier to contribute to. And core maintainers, like Marc Garcia, have been organizing international documentation hackathons in Latin America, for example, to bring new people into the project. And so, that’s been just wonderful to see and many organizations have donated money to pandas’ development through NumFOCUS, which has enabled this kind of-
Jon Krohn: 00:33:14
Right.
Wes McKinney: 00:33:15
… this kind of community development that’s helped fund these events and helped sponsor small projects. So $1,000 here or $5,000 there really goes a long way, enabling somebody to spend a month of their time working on pandas rather than working on some other proprietary software contract.
Jon Krohn: 00:33:42
I am so glad that you fit in that response. And yeah, it’s so important to mention how pandas has evolved over the years. I’m glad that we didn’t just skip ahead to what you’re working on now, and then maybe I should also open it up to say, “What has changed over this last decade with pandas itself?” So you’ve highlighted how the incredible open-source community has enabled a lot of growth and change, and actually it’s super interesting to hear how a big, successful open-source project like this, like pandas, goes in and out of being just purely in the digital realm, just on the internet. So I often think of, and I’m sure many listeners often think of, big open-source projects as being anonymous GitHub usernames and people just slogging along, making progress online, but it makes sense that such a successful open-source project like pandas ends up going back and forth between the digital realm and the real world, having conferences where people can be together and having real-world communities where exchange and growth can happen. That sounds like a really big part of it, and so what kinds of things has the open-source pandas community been able to develop or hone about pandas over the years?
Wes McKinney: 00:35:19
One of the biggest things is that pandas has become this essential glue between different types of systems, so many downstream projects depend on it. If you look on GitHub, pandas is used by over a half million other projects. It’s just a-
Jon Krohn: 00:35:40
Wow.
Wes McKinney: 00:35:41
It’s just crazy. It’s crazy.
Jon Krohn: 00:35:42
That’s insane.
Wes McKinney: 00:35:44
It’s a crazy number. And pandas has also become a computational building block that other systems use. Dask, a distributed computing framework for Python, has become really popular, particularly in enterprise use in recent years, and pandas is used as a computational building block for Dask’s distributed data frames. And pandas is being used in many other contexts as an essential computing tool. It’s being used in Spark. It’s being used in other distributed computing projects, like Modin, which is built on top of the Ray project. It’s interesting. Pandas has become this essential infrastructure of the Python data science world and honestly my main regret is that I didn’t take any database systems courses in college, because I think I would’ve made better software architecture and design decisions in the early days of pandas. And I feel like now, a decade and change later, year over year, I’m working on atoning for my sins in the early days of designing pandas’ internals.
Jon Krohn: 00:37:15
Well, it seems to be doing well enough as it is. I’m sure there isn’t that much of a gap. Yeah.
Wes McKinney: 00:37:27
Yeah, yeah, I know. I think where pandas has struggled is at the extremes. And so, if you never work with more than a gigabyte of data, you never have anything to worry about. Pandas is wonderful and that describes the vast majority of users, but the world is collecting more and more data, everyone wants to program Python, and your data keeps getting bigger. And if you’re in a company that has tons and tons of data, what works well for one gigabyte doesn’t work well for 100 gigabytes or a terabyte or more. And so, the struggles that have been faced at the extremes have been memory use challenges and performance challenges, like not effectively utilizing the compute hardware. And so, that’s definitely created some tension. Projects like Dask have alleviated some of that tension by handling the distributed computing with pandas, but there are still quite a few systems challenges for at-scale computing with Python. And so, pandas’ popularity is also a double-edged sword, because everybody wants to use pandas. They want the pandas API. They want to take their code that runs on a small CSV file and run it on a massive distributed dataset, and so that definitely created a tension in the ecosystem where we want pandas.
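One common workaround for the memory limits described here is to stream a file in fixed-size chunks and keep only running aggregates, using pandas' chunked CSV reader. A minimal sketch, with a tiny in-memory file standing in for one too large to load at once:

```python
import io

import pandas as pd

# A tiny stand-in for a file too big to load in one go.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

# Stream the file in fixed-size chunks, keeping only running totals,
# so peak memory stays bounded by the chunk size rather than the file size.
total = 0
count = 0
for chunk in pd.read_csv(csv_data, chunksize=100):
    total += chunk["value"].sum()
    count += len(chunk)

print(total / count)  # mean of 0..999, i.e. 499.5
```

This only helps for reductions that can be computed incrementally; operations like joins or sorts across the whole dataset are where frameworks like Dask come in.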
Wes McKinney: 00:39:28
We want pandas to work for big data, but it’s easier said than done. And so, on one hand, there are some projects, like Koalas for Spark or Modin for Ray. Both are different approaches to scale-out computing in Python, and so those are projects that aim to clone the pandas API really closely, whereas other projects have diverged from pandas. So I’ve been working with some folks from the pandas community for the last five, six years to build a project called Ibis, which is deliberately… It’s a data-frame framework for scale-out processing. You can run it on top of Google BigQuery and run on a petabyte of data, or run it on top of many other data processing backends, but it deliberately is not pandas in its API. The intention was to take some of the good ideas from pandas that make sense in a scale-out computing context, while adding some of the missing features like high-level expressions, essentially building complex compute graphs, like deferred execution graphs, so having everything be lazy and not eagerly evaluated like pandas is. So it’s difficult. It’s hard to make everybody happy in a tool like pandas, but I see the ecosystem moving in a productive direction, and everybody earnestly wants Python to be one of the principal data programming languages of the future.
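Ibis builds real deferred expression graphs against backends like BigQuery; the toy pure-Python sketch below (not Ibis, and not its API) only illustrates the lazy-versus-eager idea Wes describes: record the steps first, execute on demand.

```python
# A toy sketch of deferred execution: building an expression records
# a plan; nothing touches the data until execute() is called.

class Expr:
    def __init__(self, fn, description):
        self.fn = fn                  # how to compute this expression
        self.description = description  # human-readable plan so far

    def filter(self, predicate, label):
        # Nothing runs here; we just record one more step in the plan.
        return Expr(
            lambda rows: [r for r in self.fn(rows) if predicate(r)],
            self.description + f" -> filter({label})",
        )

    def execute(self, rows):
        # Only now does any computation actually happen.
        return self.fn(rows)


table = Expr(lambda rows: rows, "table")
expr = table.filter(lambda r: r["price"] > 15, "price > 15")

print(expr.description)  # the plan, before any data is touched

data = [{"price": 10}, {"price": 20}, {"price": 30}]
print(expr.execute(data))
```

Because the plan exists as a data structure before execution, a real system like Ibis can hand it to a SQL engine or a distributed backend instead of evaluating it row by row in Python.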
Jon Krohn: 00:41:22
Yeah, and I think that’s one of the beautiful things about being opensource is that it allows other projects like Dask to branch off of what pandas has already done. And yeah, I’m sure pandas contributors can continue to refine pieces that allow other tools, like Dask or Spark, to then work very effectively, and so, yeah, this constant evolution is very exciting. So part of the solution to some of the problems that you’ve outlined here, things like being able to have efficient data analysis at very large scale, gigabyte scale or the hundreds of gigabytes of data that you mentioned there, handling not just flat data but also hierarchical data, making the most of the compute resources that you have, CPUs or GPUs, those kinds of problems, it sounds like those could be resolved by another opensource project that you’ve co-founded more recently, the Apache Arrow project. Do you want to tell us about that?
Wes McKinney: 00:42:35
Sure. So the Arrow project arose, Apache Arrow arose, out of a bunch of different needs that all percolated and coalesced at the same time, so I think of it as an interdisciplinary collaboration between analytic database systems, big data computing frameworks and data science tools. So I came at this problem from the data science perspective, which is that, having worked for many years on pandas, some of the biggest problems we were dealing with were that it’s very expensive to load data into memory, and then to perform computations on all of the different kinds of data that we observe in the wild. So in particular, working on numeric data in pandas is pretty efficient, but working on string data and non-numeric data is not. And so, data has gotten more and more complex. We see a lot of nested data, dictionaries and lists and things that arise from JSON and event data that’s being generated by basically mobile phones and websites and things like that, these complex JSON events. And so, with pandas, either all the data that you’re working with is in memory or none of it is. And so, if you can’t fit your problem, with all of the intermediate results, into memory, then you run out of memory and you’re having a bad time.
Wes McKinney: 00:44:20
So Arrow, for me, was building from the ground up a data and computing foundation that could sustain the ecosystem for the coming decades, and so it starts with a standard way of representing data frames and tabular data where you can interact with data on disk in a way that’s really low-cost. So if you have 100 gigabytes or half a terabyte of data on disk, and you need to read one small section of it, you can do that precisely and cheaply without having to scan through the whole dataset. The data structures deal with strings and nested data as first-class citizens, and so the goal was to make the operations, the analytics that you would do on strings and binary and nested data, as efficient for the CPU, or for the GPU, as numeric data. So you wouldn’t have the kinds of performance problems that we’ve experienced in pandas. And so, we’ve built up, over the last five, six years, a stack of technologies to use Arrow in many different programming languages, protocols for connecting together systems and programming languages. And so, now we can build these interoperable high-performance analytic systems in many different programming languages, from C++ to languages that can easily plug into C++, like Python and R.
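The strings-as-first-class-citizens idea can be sketched in plain Python. Arrow-style string columns keep one contiguous data buffer plus an offsets buffer, so element access is index arithmetic rather than chasing millions of separate string objects. This is a simplified illustration of the layout, not Arrow’s actual implementation (which also carries a validity bitmap for nulls and other details):

```python
# Minimal sketch of an Arrow-like variable-length string column:
# one packed UTF-8 data buffer plus an offsets array marking boundaries.
class StringColumn:
    def __init__(self, strings):
        self.data = "".join(strings).encode("utf-8")  # single contiguous buffer
        self.offsets = [0]
        for s in strings:
            self.offsets.append(self.offsets[-1] + len(s.encode("utf-8")))

    def __getitem__(self, i):
        # Element access is just slicing between two offsets -- no scanning.
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")

    def __len__(self):
        return len(self.offsets) - 1

col = StringColumn(["arrow", "pandas", "ibis"])
print(col[1])    # pandas
print(len(col))  # 3
```

Because the layout is just flat buffers, the same bytes can be handed between processes, languages, or to a GPU without any conversion, which is the heart of the Arrow design.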
Wes McKinney: 00:46:07
And so, we have a really active collaboration with the Python and R community, and the Ruby community, on this shared foundation of computing. I gave a talk four years ago at JupyterCon called Data Science Without Borders, and it speaks about this idea of building this really strong shared runtime for data science, a common compute foundation, which can be used portably across Python and R, and hypothetically any programming language. You could use it in Java, because we can build applications with Java and other programming languages that speak Arrow, and you don’t have to pay a high cost to move the data to and from the JVM, like we used to have to in the past. And so, it’s been a deep well and we’ve created these toolboxes for creating analytics applications, data frame libraries, database systems. And we’re working towards an Arrow-connected world where systems can become increasingly Arrow-native, where they can work with very large datasets and very efficiently transport data between machines or between processes on the same machine without paying the high penalties that we saw in the past, and this is all coming at the right time.
Wes McKinney: 00:47:35
So there’s a reason that Arrow happened, which is that people saw disks getting a lot faster and networks getting a lot faster, and they said, “My goodness, our hardware is going to be starved for data.” So problems that were I/O bound 10 years ago are no longer I/O bound. They’re compute-bound, and so we have to make interacting with the data a lot more efficient for computing. A related trend there is that computing hardware is becoming more sophisticated, so not only are CPUs gaining new processing capabilities, higher bandwidth, SIMD vectorization instructions on Intel hardware and ARM hardware. We’re seeing RISC-V, a new architecture for CPUs, coming in the future, but then we also have hardware acceleration with GPUs. And so, GPUs have been put to work really effectively for machine learning and deep learning, and that same revolution is coming for analytics as well. The RAPIDS team at NVIDIA built an analytics engine on top of Arrow that runs on NVIDIA GPUs and showed that you can… Under my desk, I have an RTX 3090 GPU, graciously donated by NVIDIA, with 10,000 cores, and I can use that 10,000-core GPU to process my data frame, so it’s wonderful and all the data is Arrow.
Wes McKinney: 00:49:14
So that’s enabled this really interesting evolution at the systems level, but Arrow is really a tool for other opensource developers and project developers. So our goal is for Arrow to not be something that most average data scientists have to think about. They just think about, “Oh, how do I get access to my data? How do I express my data frame operation?” So in R, we’ve completely hidden the Arrow computing capabilities behind interfaces like dplyr, the dplyr tidyverse interface, and so that enables you to take code that’s written for dplyr and use Arrow-based dataset reading, Parquet file reading and in-memory computing capabilities without having to rewrite your code, which is exactly what we want. And we’re working to enable that same kind of portability of compute in Python as well through the Ibis project, and we’ll be working with the pandas community to help retrofit pandas with improved Arrow-based processing and data representation to enable pandas to become faster and more scalable. We started with strings, and so pandas in a recent release got an Arrow-based string extension type, which made string processing a lot faster and more memory-efficient. So you can work with larger datasets that way without running out of memory and all your operations will run faster, and we expect that will continue to extend to nested data and other kinds of gnarly data that pandas users have historically-
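To see why an Arrow-backed string column is so much more memory-efficient than a column of individual Python string objects, here is a rough, stdlib-only comparison. This is not pandas’ internals, just an indicative sketch: every Python string carries a full object header, while a packed representation needs only the UTF-8 bytes plus one 4-byte offset per string. Exact sizes vary by Python version.

```python
import sys

# 10,000 short identifier strings, as they might appear in a data frame column.
strings = ["id-%06d" % i for i in range(10_000)]

# Object representation: each element is a standalone PyObject with a header.
object_bytes = sum(sys.getsizeof(s) for s in strings)

# Packed, Arrow-style representation: one UTF-8 buffer plus 32-bit offsets.
packed_bytes = (sum(len(s.encode("utf-8")) for s in strings)
                + 4 * (len(strings) + 1))

# On CPython, the object representation is several times larger.
print(object_bytes > 2 * packed_bytes)
```

The gap grows with the number of strings, which is why Wes highlights strings as the first place where the Arrow-based extension type paid off for pandas users.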
Jon Krohn: 00:51:03
Right.
Wes McKinney: 00:51:03
… have historically struggled with. And for me, the most exciting thing is, because I’ve become a database nerd, I’ve been learning about database systems and the 30, 40 years of analytic database development, but now, because Arrow is this cross-disciplinary project, we get to collaborate with database systems people. So we’ve been working with the DuckDB project at CWI in the Netherlands. CWI has spawned MonetDB and a lot of database technology that’s powering many of the systems, like Redshift and Snowflake. These are people who passed through CWI and learned how to build databases in the Netherlands, so to be able to benefit from those types of collaborations is extraordinary, because I think database systems people in the past were disinterested, in a certain sense, in data science tools. But the world of data science, and Python and R in particular, has become so mainstream and so essential that now the challenge is how do we get the Python code running inside the database. How do we get the R code running inside the database? And so, that’s one of the next really interesting frontiers of analytical computing: bringing those worlds together and creating a productive developer experience.
Jon Krohn: 00:52:40
That’s so cool, Wes. Thank you for sharing all of that. It’s mind-blowing to hear all of the different aspects that you have to think about and get right in order for a tool like Apache Arrow to be so effective, database considerations, hardware considerations. And by the way that you speak about those areas, I can tell that we’re just scratching the surface of what you know about it, and so it’s cool to see that this project can have the impact that it can, that it can allow us to be running data science operations, data analytics operations across whatever hardware is available to us, be it CPUs or GPUs. And it’s awesome that you’re doing that in a language-agnostic way and so that it’s as easy as using an R dplyr call, a Python pandas call.
Wes McKinney: 00:53:46
Yeah, because ultimately one of Arrow’s central missions is how do we improve performance, improve efficiency, because you think about it just in terms of how much carbon emissions our data processing is causing and the growth of data centers in the world. We’ve got to accelerate. We’ve got to do more with less, otherwise, at the rate data volumes are exploding, it can become a real problem in the future if we don’t make analytical computing orders of magnitude more efficient than it is now. Obviously we’ve made a lot of progress in the last 10 years, but we have a long way to go.
Jon Krohn: 00:54:42
Yeah, exponentially larger datasets. It’s something like, every 18 months, the amount of data on the planet doubles, and then model sizes are getting bigger even faster than that with our 100 billion weight natural language processing models. Yeah, we definitely need to be making the most of the compute. I hadn’t thought of the environmental social aspect there, which is also cool. All right, so the Apache Arrow project, that’s opensource, but you’re also CTO and founder of Voltron Data, which you’re going to have to explain this to me. This commercial entity, how does it interact with the opensource entity? There must be some revenue model, but it also probably, in some way, enables you to accelerate the impact the Apache Arrow project can have, right?
Wes McKinney: 00:55:38
Yeah, so I’ll give the quick origin story of Voltron Data. So I helped start the Arrow project when I was at Cloudera, working with a bunch of Clouderans working on Impala, Kudu and Spark. We collaborated with people like Jacques Nadeau and Julien Le Dem, who were at Dremio, and we worked with people at Hortonworks, so it was really a multi-company collaboration, an opensource project collaboration. I left Cloudera to join Two Sigma in New York, because Two Sigma saw what the future looks like with Arrow and was a true believer, and so they said, “Come here and let’s work on this together.” And so, we worked and we added Arrow support to Spark, and we made major contributions to growing the Arrow project. And in 2018, with companies piling into the Arrow project, everyone was trying to hire me to work on some part of Arrow, and I said, “Well, I can’t work for you all, so why don’t we forge an industry consortium to develop Apache Arrow?” And that was called Ursa Labs. Its primary sponsor was RStudio. RStudio said, “Hey, we can’t leave R out of this data revolution.”
Wes McKinney: 00:57:18
So we had amazing sponsorship from Two Sigma, NVIDIA, and Bloomberg and a number of other companies, and so that enabled us to really focus in and do community-building, and to build bridges and not walls. And so, after nearly five years of Arrow development, I started to see that to really unlock the value of Arrow in mainstream enterprise computing was going to take a massive investment of engineering, and it wasn’t something that we could do with a team of six opensource developers in Ursa Labs. And so, we decided to spin out from RStudio and launch Ursa Computing. We raised a venture round from Google Ventures and some other investors. And not long after that, I reconnected with the GPU side of the ecosystem, because, while this was all going on, Josh Patterson at NVIDIA had built the RAPIDS organization, a large 100-person team at NVIDIA working on GPU-powered analytics, machine learning, data processing and pre-processing on CUDA, and had shown the massive performance benefits that are possible with NVIDIA hardware and GPUs in general for analytics. Prior to that, it wasn’t known whether GPUs could be used effectively for analytics, and so RAPIDS proved that.
Wes McKinney: 00:59:34
BlazingSQL, a startup, had worked closely with the RAPIDS team at NVIDIA to prove out that RAPIDS could be used as the basis of SQL processing, so for doing analytical SQL, and proved out the performance, cost savings, reduced power consumption and reduced carbon emissions of using GPUs to do heavy, large-scale analytics workloads. And so, the idea struck us that we could bring all of this expertise and experience under a single company and work to be the definitive computing company for our Arrow-native future. We’re calling ourselves Voltron Data, and so we’ve been growing a big engineering team to work towards that goal of an Arrow-native future, especially in the enterprises, unlocking the value of Arrow in enterprise computing, but a lot of what we’re doing, you can see our work on GitHub. Our number of Arrow contributors has been ticking up steadily over time. More than 25 developers from Voltron Data have contributed to the upcoming Arrow 6.0 release, which is about to come out, so we’re really investing serious time and money in growing the Arrow project and hardening it as a cornerstone of the next generation of analytical computing. And so, we see that as existential, and so that’s our primary focus right now. In terms of our product and what we offer to sell, since we are a startup and we have investors who will want a return on their investment, we’ll have more about our product offerings in the coming months.
Jon Krohn: 01:02:10
Cool.
Wes McKinney: 01:02:15
I won’t put a date on it. But yeah, in the next six months, you can expect to learn more. We are hiring very actively. We have 15 or so open roles, mostly in systems engineering, C++, distributed systems, algorithms, and we’re also hiring opensource support engineers, because part of what we need to do is provide a backbone of support for production applications that are running on Arrow. And there’s a lot of engineering that goes into making that possible, so that’s the lifeblood of these types of projects. And so, yeah, it’s a very exciting mission. Yeah, it’s just a really thrilling feeling to be a part of this and to know that these systems have so much tangible impact on the day-to-day lives of data scientists.
Jon Krohn: 01:03:21
Yeah, I bet. And as you were speaking about the origin of Voltron, it became really obvious to me that commercial opportunities from being involved in these kinds of major opensource projects, they’re going to be everywhere, because if there is this opensource project that can do things more computationally efficiently than ever before, there’s going to be people who come around and say, “We have this particular use case. Can you adapt what you’re doing for this use case?” And so, there’s consulting opportunities there, but it sounds like potentially projects in the works too.
Wes McKinney: 01:04:00
Yeah. So yeah, yeah, we look forward to rolling out those pieces as we’re ready to share them, but just based on the numbers, I would say the majority of the engineering work that we’re doing is landing in the Arrow project. So to be able to invest substantial resources, millions of dollars in R&D-
Jon Krohn: 01:04:32
Wow.
Wes McKinney: 01:04:32
… in Arrow, it’s the dream of opensource, so I see it as an immense privilege and opportunity to do good for the computing ecosystem. It is essential to us that we build a sustainable business so that we can continue to make these kinds of investments in open-source infrastructure and building a healthy developer community. You sometimes see toxic patterns emerge when there are commercial entities working in open-source projects, and so we are being very mindful about the steps that we take to engage and build a community so that we are good stewards, building a thriving community that can get 10 times bigger or 100 times bigger than it is now.
Jon Krohn: 01:05:42
Yeah, and so no doubt, not only a great honor and opportunity for you to be contributing to the open-source world like this but also for anybody who gets to work on the Apache Arrow project, whether that’s a part of Voltron Data or not. And so, we’ll be sure to include in the show notes the career page for Voltron Data. I know that you have 17 different job title openings, so even more jobs than that available right now at the time of recording?
Wes McKinney: 01:06:15
Yeah, and I think it’s interesting that the work that we’re doing is in… There are almost entirely disjoint communities of developers in the Arrow project working on very exciting technology. There’s a really passionate group, a growing group, of Rust developers, building Rust-based analytics, analytical computing in Arrow. And I’m very excited about that also and I think the future is going to include a lot of Rust, and so there’s so much interesting stuff going on. There’s Go development. There’s Rust development. There’s JavaScript development, so there’s so many ways to be involved in this project that are, in a sense, working in parallel to the work that we’re doing. These systems are all compatible with each other, which makes for very interesting opportunities in the future to create heterogeneous language, heterogeneous applications and system architecture. So yeah, it’s too many. I have to keep myself focused on just [inaudible 01:07:33] problems at a time, because I think about all these things, and then I get really excited about all the possibilities and all the things that we could build, but realistically there’s only so many hard problems you can solve yourself at-
Jon Krohn: 01:07:48
Yes, no doubt.
Wes McKinney: 01:07:50
… at one time.
Jon Krohn: 01:07:51
Yeah, you’re taking on a lot of them. But yeah, leave some for the rest of us, Wes, but we haven’t talked about the names and one thing that caught me is Ursa Labs. Ursa is Latin for bear, I think, and pandas sounds like a bear, so I don’t know. Is there a relationship there? Where did the pandas name come from and maybe the Arrow name, maybe the Voltron name if those are interesting stories? You don’t need to tell us all of them, but maybe at least the pandas one.
Wes McKinney: 01:08:24
Yeah, well, the pandas name came… I was trying to create some name to capture Python data analysis or something, some concept like that, but I was also working with a lot of econometricians who were talking about panel data all the time. Panel data is a certain kind of dataset that gets collected in statistics and econometrics. And so, in all of my writing down candidate names, I was like, “Oh, there’s a panda living in there.” Somebody suggested and I can’t remember who, suggested making it plural. They’re like, “Oh, it’s cuter if it’s plural,” so hence pandas.
Jon Krohn: 01:09:16
Great. Well, there you go. David Regalado, who is a co-founder at Data Engineering LATAM, he was interested in the name history there, so we got that for him. So I know from our chat just before we started recording that you might not be able to spend as much time down in the weeds, writing software yourself every day now, but I know that you do still spend time on that, and certainly you have a big history in it. So do you have any particular software tools or techniques that you highly recommend for listeners?
Wes McKinney: 01:09:55
When I do development, and I’m doing less and less development these days since I can be helpful to my team and to the open-source project in ways other than writing pull requests, when I do do development, I’m pretty old-school in the tools I use. I’m still an Emacs user.
Jon Krohn: 01:10:20
Nice.
Wes McKinney: 01:10:20
Not a very tricked-out Emacs user. I see people’s Vi or Emacs setups. They have all these fancy auto-completion things. I’ve never managed to set up those sorts of things, which probably means I’m not as productive as I could be, but I’ve managed for many years. And yeah, I really focus on interactivity and exploratory performance testing, and I’m doing more and more C++ these days, so I definitely use perf and GDB and other tools. We’re starting to use things like OpenTelemetry to make it easier to collect performance data from more complex data processing applications. In terms of staying organized, we use Jira in Arrow, and people have strong feelings about Atlassian products, but I do like Jira for project management. We’re big fans of Notion for wikis and more lightweight project organization, project management and building a knowledge base for a team. Critically, we’re a distributed company. We are globally distributed. We have many people in Latin America, Europe, the US and Canada.
Wes McKinney: 01:11:56
So we want to hire the best and brightest all around the world, and that means selecting tools that facilitate asynchronous collaboration, building documents, collecting knowledge and building a written culture where people can find out about things and become less dependent on synchronous interactions, so that we don’t have people with all of the knowledge stored away in their heads and inaccessible to other people. So yeah. Just using tools like Notion has been really helpful in building that knowledge-creation, written culture.
Jon Krohn: 01:12:39
Cool. Another tool that we mentioned just before we started recording, which I had to look up when you were talking about it because I hadn’t even heard of it, is a hardware tool that seems like it could come in handy for a lot of listeners: something called the reMarkable 2. So I was lamenting to you before the show. There, we’ve got it up on video. Wes has it up for the video version. I was lamenting not being able to whiteboard with my team in person, and how, in the last couple of weeks, I’d been able to get in person with my team again in New York, and I was saying, “It’s awesome to get in front of a whiteboard.” And we’re able to talk about ideas in a way that I felt we couldn’t as effectively completely remote, but your reMarkable 2 notepad seems to be doing the trick.
Wes McKinney: 01:13:25
Yeah, well, I’ve always liked pen and paper, and it helps me think to write things down and take notes, but I was burning through a lot of notebooks, and then you end up with this stack of notebooks. You’re like, “Well, I should just throw these away.” So yeah, I love my reMarkable and it’s replaced my paper notebooks, and recently you can use it as a digital whiteboard. It’s an e-paper, e-ink tablet and you can write on it with a special pen, and you can share what you’re writing. So you could have a document that’s your scratchpad or your whiteboard, versus some of the other pen tablets, the Wacom tablets and things like that, where you’re doing this one thing with your hand and looking at the screen, and it’s like, “Where is my hand?” With the reMarkable, you can think, “I’m writing on a piece of paper.” That’s what working on a reMarkable feels like, and then whoever you’re talking to, who could be on the other side of the world, can see what you’re writing on your digital whiteboard. So that’s been a helpful tool.
Jon Krohn: 01:14:51
I love that. I’m definitely going to check it out. I do everything with my iPad, which doesn’t have the same mechanics, and I miss that about notebooks. I had the same situation. I was like, “This is absurd how many notebooks I have.” And then, also you don’t always know what situation you’re going to need one of the notebooks in. I tried historically to keep them organized by subject, but then I’m in the office and I need a notebook from home, and I don’t have it.
Wes McKinney: 01:15:21
Yeah.
Jon Krohn: 01:15:22
So very cool tool. I’m going to have to check that out. All right, Wes, thank you so much for taking so much time with us today. Such rich answers to all of the topics that we’ve covered so far. I’ve got a handful of great audience questions for you here. We’ve got one from Daniel Kapitan. He is wondering, “Will pandas ever support nested columns, like Parquet, BigQuery and other engines do?” And it sounds like, no, not just sounds like, that is something that you’ve explicitly mentioned the Apache Arrow project handles, but maybe that’s something that pandas might have in its future as well?
Wes McKinney: 01:16:05
Yeah, I don’t know where it falls on the pandas roadmap right now, but we have the data structures and the tools that are needed to represent and efficiently process the nested data that you see in Parquet files or that you can process in SQL systems like BigQuery, and so what’s needed there is to develop an extension type for pandas that is a struct type or an array type. And I guess that there are already some issues in the voluminous pandas backlog about that. So if you’re interested in that, that’s an avenue to get involved in pandas. And if you’re at a company that needs that capability, that’s also the kind of project that you could potentially fund in pandas by donating money to NumFOCUS.
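The struct extension type Wes mentions boils down to storing nested records column-by-column (a “struct of arrays”) rather than as a column of Python dicts, so each child field becomes its own flat, typed array. A hypothetical stdlib-only sketch of that idea, not pandas’ or Arrow’s real implementation:

```python
# Nested records as they might arrive from JSON event data.
records = [
    {"city": "Oslo", "temp": 4.5},
    {"city": "Lima", "temp": 21.0},
    {"city": "Pune", "temp": 29.3},
]

# Struct-of-arrays layout: one flat array per struct field, which is how an
# Arrow StructArray stores the same data (each child can then be typed and
# processed as an ordinary flat column).
struct_column = {
    "city": [r["city"] for r in records],
    "temp": [r["temp"] for r in records],
}

# Reassembling row i is cheap index arithmetic across the child arrays.
def row(i):
    return {field: values[i] for field, values in struct_column.items()}

print(row(1))  # {'city': 'Lima', 'temp': 21.0}
```

Because each field is a flat array, filters and aggregations over `temp` never have to touch `city` or unpack dict objects, which is the efficiency win a struct extension type would bring to pandas.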
Jon Krohn: 01:17:15
There you go, Daniel. So either you or your company, or maybe some listener, can pick up the slack on that particular problem. Go for it. We’ve got one here from Doug Eisenstein, who is a financial technology data expert. We also had him as a guest on the show earlier this year. I don’t have the episode number to hand, but easy to find. So Doug is wondering, Wes, where you see the Rust ecosystem a few years from now. He says five years from now.
Wes McKinney: 01:17:47
Yeah, I do know Doug. I haven’t seen or talked to Doug in years. Thanks for the-
Jon Krohn: 01:17:50
Oh, no way.
Wes McKinney: 01:17:52
Yeah, thanks for the-
Jon Krohn: 01:17:53
Oh, cool.
Wes McKinney: 01:17:54
Thanks for the question. I do think that Rust will play an increasingly important role as a systems language in distributed systems and also in streaming data and distributed data processing. So if you look at the work that’s happening in Arrow, it’s already happening now, and what we’re seeing right now in the Rust Arrow project and DataFusion, which is an analytic query engine for Arrow built in Rust, is that the early adopters are already there. They’re building systems in Rust for doing data processing, and so I think that it’ll become more mainstream and increasingly replace things that people used in the past. I think Scala will lose market share to Rust, for example. Rust has a learning curve, and I’ve never written any Rust code, but my understanding is that it’s not exactly a walk in the park. But once you do learn Rust, it has wonderful properties around the ideas of fearless concurrency and memory safety, so providing some of the same kinds of benefits that you see in Go, for example. But yeah, so I’m very excited about that. I’m definitely bullish on Rust.
Jon Krohn: 01:19:38
Super cool. Thank you for that answer, Wes. And then, one last one here, so this is from Brett Tully, with whom I did a PhD in Oxford, so he’s an Australian. He’s worked on all kinds of exciting projects. He’s worked on fusion power, and now he’s the director of AI output systems for Nearmap, one of Australia’s biggest tech companies. And he has a question that is, “I’m super interested in challenges of supporting Python packages with compiled binaries and where Wes sees the community moving for this. For example, TensorFlow is notoriously bad for conda.”
Wes McKinney: 01:20:27
I would say I don’t have great answers, except that I am also concerned about our growing DLL hell in Python and binary dependency management with Python wheels. It’s definitely a problem, because some packages go the route of being completely self-contained. And so, in the fullness of time, there are packages which ship a static, fixed version of Arrow. And so, if you load that package and you load an incompatible version of PyArrow, that will not work. So these problems are already happening. Conda, by making everything a shared library and having dependency resolution, is the clear way around this, so that each project is using the same version of the same shared library.
Wes McKinney: 01:21:48
But yeah, not everyone is playing well with conda, and so I hope that they do. Conda is not perfect by any means, but we have a really passionate developer community building conda-forge, providing open packaging infrastructure for conda. And obviously Anaconda, the company, has provided an extraordinary service to the community over the last 10 years in not only building conda but providing the package hosting infrastructure for conda, Anaconda.org and so forth, and the Anaconda distribution itself. But it depends on the community of maintainers as well. So I don’t have all the answers, but I do worry about it, and I hope that TensorFlow and other projects will become better conda citizens in the future.
Jon Krohn: 01:22:52
Nice, another project for some open-source gladiators to get on top of. Brett, we’re counting on you. All right then, Wes, and then do you have a book recommendation for us, by chance?
Wes McKinney: 01:23:06
Oh, this year, I read A World Without Email by Cal Newport. And since we live in an age of information and notification overload, I think optimizing ourselves as humans and how we interact with others, how we communicate with others and how we organize our work, I think, is one of the principal challenges of our generation and I think that book has some good ideas. It isn’t about getting rid of email but using tools, like email or other notification project management tools, in a way that is better for your brain, and where you can be less stressed and more productive.
Jon Krohn: 01:23:53
Cal Newport, that guy writes books more quickly than I can read them. It’s crazy. I didn’t know about that book. I’m piling up a stack of Cal Newport books. Deep Work is one that I have read, and then now you just keep piling on more and more of them that I feel like I need to read so I can figure out how to be more efficient and have more time to read his books. All right, thank you so much, Wes. It’s been absolutely amazing having you on the show. You’re such a rich resource for understanding the pandas project, the Arrow project, the opportunities that exist in the open-source community and development world, what the future of data analytics could look like, how we can be leveraging hardware more effectively in the future. Clearly, a lot of listeners will want to be able to get more information from you. How should they follow you?
Wes McKinney: 01:24:46
I think follow me on Twitter. Any information I put out is usually there. Sometimes, I post things on LinkedIn, but generally, yeah, Twitter is the authoritative source of information, but don’t be surprised if I go a few months without tweeting.
Jon Krohn: 01:25:08
Yeah, you’ve got to stay focused on the important work and Twitter is usually not it. All right, Wes, thank you so much for being on the program, and hopefully we’ll have the opportunity to have you on again in the future.
Wes McKinney: 01:25:21
Awesome, thank you.
Jon Krohn: 01:25:28
Wow, Wes is a rockstar who did not disappoint. He had endless relevant detail to dig into on any topic that came up and he communicates that detail remarkably clearly. Specifically, in today’s episode, Wes covered how pandas was born from his desire to more easily join data. He talked about the release of the third edition of his Python for Data Analysis book, which is expected next year but can already be accessed as rough drafts via O’Reilly. He talked about the Apache Arrow project, which he’s evolved alongside hardware and database analytics advances to efficiently handle datasets that are larger than can fit into the memory of a single machine, and he had lots of really cool tools to tell us about, including Emacs, perf, the GDB debugger, OpenTelemetry, Notion for project management and the reMarkable 2 digital notepad. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials or tools mentioned on the show, and the URLs for Wes’ Twitter profile as well as my own social media profiles at www.superdatascience.com/523.
Jon Krohn: 01:26:44
That’s www.superdatascience.com/523. If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter, and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. All right, many thanks to Ivana, Mario, Jaime, JP and Kirill on the SuperDataScience team for managing and producing another extraordinary episode for us today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.