SDS 826: In Case You Missed It in September 2024

Podcast Guest: Jon Krohn

October 11, 2024

In this episode of In Case You Missed It, host Jon Krohn gives us a round-up of his interviews in September 2024.

 
These clips showcase the breadth of the month’s data science-centric episodes, and we hear from guests who have helped develop next-gen IDEs and efficiency-boosting open-source Python libraries, who teach a global student body all about the latest developments in data science and AI, and more.

Hear from Posit’s Julia Silge (episode 817) about how the company’s open-source, polyglot IDE, Positron, gives data scientists a fully interactive, exploratory console that takes RStudio’s beloved Environment pane one step further. Luka Anicin (episode 819) puts PyTorch under the microscope to identify how the machine learning library can help data scientists build more accurate and efficient models. Marco Gorelli (episode 815) explains the joys of the Python dataframe libraries Polars and Pandas. And Marck Vaisman (episode 821) tunes us in to why companies’ data science hiring strategies must change.

Podcast Transcript

Jon Krohn: 00:00

This is episode number 826, our “In Case You Missed It” in September episode.
00:19
Welcome back to the Super Data Science Podcast. I’m your host, Jon Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month. My first clip is from episode number 817 with Dr. Julia Silge. As engineering manager at Posit, Julia leads the development of a next generation IDE called Positron that is designed from the ground up for people who write code to work with data. I asked her what makes Positron so useful to data scientists relative to the established IDEs out there. 
00:51
The most exciting thing that you’re working on right now: as an engineering manager at Posit, the company formerly known as RStudio and the makers of RStudio, the project you’re leading the development of is something called Positron, which is described as a next generation IDE, integrated development environment, for data science. So that is what RStudio was many years ago. I mean, I’d been using RStudio since 2007 I can kind of… At least since then, and it was definitely my go-to IDE when I was primarily an R developer back then, an R data scientist. Although I guess I wouldn’t have used the word data scientist in 2007. 
Julia Silge: 01:38
Yeah, right? Right? Yeah. 
Jon Krohn: 01:42
And so, with Positron, what are the gaps or limitations that you’re addressing that aren’t covered by things like RStudio, VS Code, or Jupyter Notebooks, which might be the go-to IDEs for data scientists or software developers today? 
Julia Silge: 01:57
Yeah, yeah, yeah. Yeah, if I was going to sum up the one gap I feel like Positron is working to address, it’s that there isn’t something out there right now that can be one place you go to do all your data science. So Positron is not a general purpose IDE. It is specifically an IDE built to do data science. I come from a science background, and I’ve always been someone who wrote code for my data analysis, but I’ve always really felt that my needs were a little different than someone who was writing general purpose code to build a website or to make a mobile app. People who write code to analyze data are different in some real ways. It’s not that it’s like they’re worse coders or- 
Jon Krohn: 02:45
[inaudible 00:02:49] 
Julia Silge: 02:49
No, no, I really do think that. 
Jon Krohn: 02:50
Take that, data scientists. 
Julia Silge: 02:53
It’s not. It’s not. I don’t think it is that people who write code to analyze data do a worse job writing code. It’s that their needs are different and that they’re writing code in a different way. So folks who have been using VS Code as a data science IDE, for example, have really felt that tension where they’re like, “This is really general purpose, and I’m trying to kind of customize it using extensions to fit my needs.” So Positron is meant to specifically be a data science IDE. A real driving reason why we’ve built Positron the way it is is that it is a multilingual, or polyglot, IDE. A lot of the environments you might download to do scientific computing or data science or data analysis are built specifically for one language. I know all of us have used these. RStudio is an example of one of these, like MATLAB or Spyder. There are a lot of environments in which you would do data analysis that are just built for one language, and increasingly, I just think that’s not how as many people work. 
04:00
Many, many people use multiple languages. Whether it’s one project that literally uses multiple languages, or over the course of a week they pick up different projects that use different languages, or, almost certainly, over the span of years or your career, you use different languages because things change in our ecosystem. Like you said, you started with R, and now you use other languages. There are so many people who use combinations of R and Rust, or they work on projects that are like Python plus front-end technologies, JavaScript, HTML, et cetera, or almost any data science language plus SQL, right? An IDE that is built for one language: for very few people is that really going to fit all of the needs they have over the course of a week, a month, or multiple years? 
04:50
So Positron is built with a design such that the front end user-facing features are about the tasks you need to do, whether that is interactively writing code, whether that’s dealing with your plots, whether that’s exploring your data in a visual way, and then there are backend language packs that provide the engines for those front end features. It’s very early days for Positron. We only made it public about six weeks ago as of the day we’re recording this, so it is currently shipping with support for Python and R, but it is designed in such a way that other data science languages can be added because there’s a separation between the front end features and what is driving them. So we look forward to adding support for other languages as we collaborate with other data science communities or as new, exciting ways of doing data science come up. 
Jon Krohn: 05:55
Nice. So the polyglot IDE part, to me, that makes a huge amount of sense. I get that, especially as a contrast to RStudio. For people who are writing code as a data analyst or a data scientist, people who are working with data, what is different that we need specifically relative to another software developer?
Julia Silge: 06:12
Yeah, so I think one piece that is very different is that the process of writing code is more exploratory, more interactive, and that’s not wrong or bad. That is actually just the fact that instead of getting a spec from a product manager and building a product, that’s not what data scientists and data analysts do. You start with data, and you often don’t know what you can or should do in detail until you start that process. And if you have a code writing process that is more exploratory, you need more support for writing in that interactive, exploratory mode. One thing that supports that is a truly fully featured interactive console. Of course, that does exist in various ways. People get at that in various ways when they use notebooks or, say, a Python REPL.
07:12
If you get to a truly fully featured interactive console where what happens in the console is then reflected in the rest of where you’re working like, say, in Positron, we have what we call a variables pane. If you come from RStudio, you may be familiar with something called an environment pane where you see all the things you… And it updates as you change things, or the plots that you see, you have them all right there. You can scroll through them. If you change and make a new plot, you see it pop up there, and you have that really interactive way of working.
07:47
Some of the other things that I know really make a difference for people: help inside of the IDE where you are working. So you know, you’re working along. You’re like, “Ah, wait. What is the function signature,” or “Maybe I want to look at the docs for this.” So instead of having to get out of a flow state and go somewhere else and read docs on the website, you can open up help right there, copy paste, go right back and forth, and stay in that kind of flow state. 
08:19
Another thing is when you’re building interactive apps. You need a way to have that right there so that it updates as you change your code, versus having some sort of build process going somewhere and looking at a browser. There’s really quite a lot that, if we put it together, can make people more productive. The company that I work for, Posit, the company formerly known as RStudio… Posit, it’s a really fun place to work as someone who likes thinking about the process that people bring to their task because I do think we are huge believers in code-first data science, not no-code solutions, not GUI-based tools. People who do data work should be writing code, and at the same time, their needs are different. Their needs are different. And so, pretty much every single thing my company does is deeply informed by this belief, like how deeply we know that data practitioners are different, and that’s good and fine, and we can make them more productive by building tools that are specifically for the kinds of tasks they need to do.
Jon Krohn: 09:31
Very nice. And yes, so not only do you have Code-OSS as a kind of backbone that’s providing building blocks for Positron and offering that kind of extensibility through all the VS Code extensions, like you mentioned Databricks there, any number of extensions that people might want to be able to import. 
Julia Silge: 09:47
Rainbow tabs. Whatever you want, it’s out there.
Jon Krohn: 09:51
In addition to all those things, the Positron project itself is open source, so if people are listening and they want to be contributing, right now, at the time of recording, there are 27 GitHub contributors, and I can see your face as one of them. Listeners can go, and they can contribute to this developing and very exciting project. 
Julia Silge: 10:12
Yeah, yeah. So Positron is licensed such that it is source available. Anyone can come and look at the source, change it, contribute to it, and it is also licensed such that it is free to use, including for commercial purposes. You can use it, of course, in academia, for personal projects, but you can use it at work. It is licensed in such a way that it is free to use in your work as a data scientist or data analyst. So it’s free to get, and you can read the code. There’s real benefit to that kind of model for building and making software.
Jon Krohn: 10:50
I love what Julia is saying: that the more sophisticated needs in data science require fewer guardrails, making Positron an IDE that’s particularly well-suited to our needs. So while Julia is focused on making tools that make data scientists more efficient, in episode number 819, my guest, Luka Anicin, explains his top tips for how he makes his ML models more efficient as well as more accurate. 
11:14
Luka, what kinds of tips do you have for building really… I guess we never know whether something is truly the best that it can be, but how can we build more accurate or more efficient models in PyTorch? What kinds of tips and tricks do you have? 
Luka Anicin: 11:33
I don’t have any unique tips and tricks. I’m just using whatever works. So I always start with a really simple way of thinking about it: “Okay, start with the simplest model and then build from there.” You tend to think that over-complicated models will work better, or, because you have these easy tools, that you need to put in more layers, different types of layers, bigger layers, bigger networks, and stuff like that. And it does work sometimes, but if you didn’t test much simpler models, then you have nothing to compare against, and you can’t get to the point of knowing what the bare minimum that works actually is. So what is the baseline there? Start from there.
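[Editor’s note: a minimal sketch of Luka’s “start simple” advice in PyTorch. The input size, layer widths, and binary-classification setup are hypothetical, purely for illustration.]

```python
import torch.nn as nn

# Hypothetical setup: 20 input features, binary classification.
baseline = nn.Sequential(
    nn.Linear(20, 1),  # essentially logistic regression: the bare minimum
)

bigger = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Train and evaluate both identically; only keep the extra capacity
# if it actually beats the baseline on held-out data.
```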
12:28
And the other point is not my point. It’s Andrew Ng’s point that you have two ways of thinking about implementing and optimizing your models. That is the model way or the data way, and a lot of people, especially beginners in this area, think about the model way. So, okay, now I have an algorithm. It’s not working properly, so what can I do with this? Can I increase the number of layers? Can I increase the complexity of a single layer? Is the loss function the wrong one, or the hyperparameters? All of the right questions, but you are not thinking about the data part. Sometimes your data is not right, so you can optimize your model as far as it can go, but sometimes you will hit the ceiling, and you will not be able to increase the accuracy of your models or the general performance of everything just by changing the model itself. It’s much easier to think about the model, because it’s a couple of lines of code thanks to these libraries, but in most real world cases, collecting more data or relabeling something is the way to go.
13:42
On one of the projects I worked on a couple of years ago, we reached an amazing accuracy with a model that we designed ourselves, a fully custom model. It was a computer vision task, and we reached a point of 70 to 80% accuracy. Then the researchers on my team, so I had a team of about 20 engineers that I led there, of course thought about, “Okay, can we implement a better algorithm? Can we add more layers to it?” And we did a lot of experiments there. But then a couple of engineers said, “Okay, but is the data part wrong in this case?” And it turned out that 10% of our whole dataset was wrongly labeled. So even if we increased the complexity of our layers or the models, we would reach the ceiling of, “Hey, but this is completely wrong.” 
14:45
So what we want to predict is not aligned with what we have. And sometimes the data view is much safer and, of course, more tedious than the model one. But basically, going that route is safer in the long run, just to eliminate it and confirm, “Okay, the data part is correctly labeled, and here’s whether or why we can’t collect more data points,” and then start thinking about the model itself. 
Jon Krohn: 15:17
I see. That makes a lot of sense. And then, what about transfer learning that could potentially allow us to-
Luka Anicin: 15:22
Of course. 
Jon Krohn: 15:23
… take advantage of a more powerful model that was trained on more data than we have access to? 
Luka Anicin: 15:28
Yeah. For context, for listeners that are not aware of the transfer learning part: transfer learning unlocked a lot of opportunities, and naively, when you start thinking about it, it doesn’t make sense, but it actually works. So what transfer learning is, basically, is large models that were trained by different companies like Google, Microsoft, Amazon, Facebook or Meta. There are a lot of tasks and different models trained on those tasks. They invest millions of dollars just to get a small percentage better than their competitor and to prove on some dataset that they can do better, basically. On paper, it doesn’t make sense, but for us researchers or developers it makes a lot of sense, because they invest a lot of money, so they train the model, and they provide the weights to those models. So it’s trained on a certain task, let’s say the COCO dataset or, let’s say, ImageNet. 
16:32
So it’s a completely open dataset with 1,000 classes, and now you have a specific task for your startup or for the company where you work, where you need to, let’s say, classify between this pencil and that pencil for a factory. In the original way of thinking, you would need to collect thousands and thousands of different data points or samples distinguishing between these two types, but that might be impossible, or you don’t have the time or resources. Because these companies train these large models, what you do is basically take those models, remove the last layer, and say, “Okay. Now, you start predicting a binary classification between these two pencils,” and you freeze the full network except the last layer that you just added. So you are basically optimizing just the last part of the network, while the rest of the weights stay optimized from the different dataset. 
17:33
That’s why it doesn’t make sense on paper, because it was trained on a completely different task, but what they proved is that a lot of these weights are transferable, and they can be used in many, many cases, and it actually works. So instead of collecting thousands, you can collect hundreds of images and, bam, it works like a charm. So in PyTorch and TensorFlow and other libraries, especially Hugging Face these days, you can import these big models with, like, one line of code, and those models are really powerful for image classification, object detection, OCR, and transfer learning for text these days because of LLMs. So you can do transfer learning for basically any task that you can think of, and you potentially have a model built inside of these libraries, so you can just import it and start working from there. 
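[Editor’s note: a minimal PyTorch sketch of the freeze-and-replace-the-head recipe Luka describes, assuming a recent torchvision. The two-class pencil setup and the learning rate are hypothetical, for illustration only.]

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on ImageNet (1,000 classes).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the full network...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the last layer with a new head for our task:
# binary classification between the two pencil types.
model.fc = nn.Linear(model.fc.in_features, 2)  # new layer, trains from scratch

# Only the new head's parameters are optimized; the transferred
# weights stay as they were learned on ImageNet.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```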
Jon Krohn: 18:26
Excellent. Yeah, so to kind of recap back to what you said there, we, as machine learning practitioners, we can take advantage of these huge models trained on huge datasets that might’ve cost millions of dollars to train. In some cases, like the most extreme examples like Llama 3 today at time of recording, they might’ve spent tens or hundreds of millions of dollars on this whole project to create these gigantic large language models trained on billions or trillions of tokens, of pieces of natural language information from the public internet and also from maybe their own proprietary sources. And we can use things like Hugging Face, which you mentioned there, to easily download these model weights and then fine tune them to some specific task of ours. 
19:17
And then, with some relatively small number of data points, like you mentioned, very commonly now just hundreds of data points, you can have an extremely high performing model in natural language processing or machine vision for some specific task that you want to have running for your business or your platform or some personal use case. And that fine-tuning, because you’re only fine-tuning a small number of the model weights from the whole architecture, might cost tens of dollars or maybe even less on those hundreds of data points. So you can see you can have these really powerful models for very low cost. 
19:57
Reduced expenses should always excite a data scientist because this is so often why we are brought into a company. We’re there to solve problems and improve the bottom line. So the fact that we can now lean on LLMs and fine tune them to solve specific business problems is a huge boon for data scientists.
20:14
In episode number 815, my guest, Marco Gorelli, and I grapple with another time- and money-saving tool for data science. We discussed Polars, an open-source library that is anywhere from 10 to 100 times faster than Pandas, the incumbent Python library for working with data frames.
20:30
Back to Polars more specifically. So now we know its Rust background. We know that even the “rs” suffix on it is related to Rust filenames, and that’s clever. We know that you develop in Rust in order to develop the Polars library. For somebody who is a data scientist and isn’t necessarily a software developer like you are, but who wants to be taking advantage of Polars, why should somebody install Polars into their Python environment instead of Pandas? 
Marco Gorelli: 21:05
I hate to give the boring answer of it depends, but that’s often the answer to lots of technology questions. 
Jon Krohn: 21:11
Yes. 
Marco Gorelli: 21:11
So my general advice is if it’s not broken, don’t fix it. If you’ve got an existing Pandas project that works absolutely fine for you, then I think there are probably better things for you to focus on than rewriting it in Polars. But if you’re starting a new data science project, then that’s when I typically recommend people, “Okay, this is a good time to give Polars a go.” I think if you start a new project and you try to think in Polars right from the start, you’ll end up writing idiomatic code, and you’ll have a lot of fun. Something a lot of Polars users say is that it’s surprisingly pleasant to write Polars code, and it’s nice to see what the library does for you. The syntax is very nice. I think the API is one of the major innovations that the library has brought, aside from just a phenomenally good implementation. 
Jon Krohn: 22:03
Got you. So I would’ve maybe assumed that the API, that the syntax would be similar to Pandas, but actually what you’re saying is it’s quite different.
Marco Gorelli: 22:14
That’s right. Yeah, so the idea of trying to re-implement the Pandas API but with a different, faster backend and all of that, it’s been tried with varying degrees of success. With Polars, I think this is a really nice success story. Ritchie just had the courage to try something different, to say, “Well, Pandas, it’s successful. It’s popular. It does what it does. Let’s try doing something different. Let’s try not having row labels. Let’s just not have an index.” I think any of your listeners who are familiar with Pandas, most of them are probably used to having to do reset_index every two or three lines of Pandas code in order to get things to work.
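[Editor’s note: a minimal sketch of the difference Marco is pointing at, assuming recent Pandas and Polars versions; the tiny example frame is hypothetical.]

```python
import pandas as pd
import polars as pl

# Pandas: groupby results are indexed by the grouping key, so a
# reset_index() is often needed to get back to a flat table.
pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
out_pd = pdf.groupby("group")["value"].sum().reset_index()

# Polars: there is no row index, so the result is already a flat table.
plf = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
out_pl = plf.group_by("group").agg(pl.col("value").sum())
```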
22:55
There are Pandas users who use the index very intentionally, and they can make great use of it. You can get performance improvements from using the index very intentionally, but I think the majority of Pandas users, for them, it’s probably more of an annoyance than anything else. And so, I think Polars has really made a good design decision here. Most users don’t need to worry about their rows having labels. A side effect of this is that it makes certain performance optimizations easier, and the company is now working on distributing Polars and- 
Jon Krohn: 23:37
The company Quansight? 
Marco Gorelli: 23:38
Sorry, the Polars company. 
Jon Krohn: 23:39
The Polars company.
Marco Gorelli: 23:40
Yeah, exactly. So when it comes to distributing Polars, then it should be easier to do that if you don’t have to worry about having an index, whereas companies that have tried distributing Pandas like Dask, they do have an index, but it does cause some difficulties. 
Jon Krohn: 23:58
I see, I see. So there is a Polars company that is commercializing the Polars open-source library that anybody can access and install? 
Marco Gorelli: 24:07
Yeah, that’s right. So there’s a company that’s behind the open-source software. Most of the core developers are hired by the company, and the open-source software Polars is and always will be open source, according to Ritchie. However, they’re also going to make some other offerings, like a distributed cloud offering, and these are things that are going to be paid services, and that’s what the company is working on. 
Jon Krohn: 24:40
Makes perfect sense. Another aspect of Polars that I understand: so far, you’ve mostly been talking about Polars being a great choice for people who want to manipulate data frames and kind of have more fun, have an easier time with the syntax than they might in Pandas. But previously, in another interview, you described Polars expressions as functions that only take effect once you put them inside the data frame context. Can you provide an example of how this lazy evaluation benefits data processing, and any concerns people should have as users when they do evaluate in this way?
Marco Gorelli: 25:17
Oh, that’s fantastic. Yeah, expressions are really one of Polars’ innovations. I don’t think it’s something that Polars invented. PySpark had something similar, and some R libraries, I think, had something similar. But the way they work in Polars, I think of an expression as a function from a data frame to a sequence of series. Most users don’t think of it in these terms. Most users just think of it as grabbing a column from a data frame and then doing some operation on it. People usually get an intuition for what expressions do fairly quickly. In terms of advantages, apart from just how nice the syntax is to manipulate, the fact that an expression is just a function means it doesn’t need to be evaluated right away. It means that when you’ve got the data frame context, Polars can analyze the different expressions which you’ve passed in, and it can apply certain optimizations.
26:21
For example, the classic example that Ritchie gives is if you’re taking a column and doing a sort and then selecting the first five elements, then this has got n log n complexity, but you could just do a top-k algorithm, and then the complexity there should be linear. I think it’s something like that. Another example is you might be doing feature engineering. You might be making two features which both start with something very similar, like, I don’t know, take the absolute value of the logarithm of something, and then in one feature you’re doing shift one, and in the other feature you’re doing shift two. People are often making features where part of the calculation is very similar. So then, Polars can do common subplan elimination. It can see that some parts of the expressions are very similar, and it can just assign that to a temporary variable, calculate that once, and then reuse it between the different features.
27:22
Another advantage of using expressions in data frames is that it lends itself very nicely to parallelization. So if you’re just making a single operation on a single column, then it’s often just not worth it to set up the overhead of multi-threading. But if you’re calculating, let’s say, five different features which are independent of each other, then it’s quite natural to say, “Okay, we’ll do these five in parallel.” Yeah, people can often get 10, 20, 100x improvements by writing things in Polars compared to what they might have got with some other frameworks.
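[Editor’s note: a minimal sketch of the lazy, expression-based workflow Marco describes, assuming a recent Polars version; the column name and values are hypothetical.]

```python
import polars as pl

lf = pl.LazyFrame({"x": [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0]})  # lazy: nothing computed yet

query = (
    lf.with_columns(
        # Two features sharing a common sub-expression; the optimizer
        # can compute log().abs() once (common subplan elimination).
        feat_1=pl.col("x").log().abs().shift(1),
        feat_2=pl.col("x").log().abs().shift(2),
    )
    .sort("x")
    .head(5)  # sort + head can be rewritten as a linear-time top-k
)

result = query.collect()  # the optimized plan only runs here
```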
Jon Krohn: 28:07
Wow. That 10, 50, 100x, that includes the parallelization or- 
Marco Gorelli: 28:13
Including everything, so including parallelization, including the query optimization that we get from doing things lazily, just the whole package. It’s going to give you quite a significant advantage both in terms of runtime and in terms of memory. 
Jon Krohn: 28:29
Nice. Let me try to break down that lazy term a bit for listeners who might not know it, maybe in the context of what you just said. So if the code that I was working with was working in an unlazy way, which could be a Pandas data frame: if I have a Pandas data frame with only a hundred rows or a thousand rows and I want to do a sort like you described, taking the top five after the sort, with only a hundred rows or a thousand rows in my data frame, in real time, I’m not going to notice any problems with that kind of evaluation. But if I have a million rows or a billion rows, then with that Pandas data frame, I’m going to be just sitting there for who knows how long while I’m waiting for that sort to actively execute. But with this kind of lazy evaluation that is supported by Polars behind the scenes, it doesn’t actually execute the code until I ask for some kind of output. 
29:39
And when I ask for that output, there are lots of performance optimizations behind the scenes, like you described in much better detail than I could. But the net effect is that if I need that sort on a huge data frame to happen, because it’s not actively executed in kind of a more simple-minded way, it’s lazily executed in a more clever way. And so, lazy meaning that it doesn’t execute until it has to. Because of that lazy, doesn’t-execute-until-it-has-to, performance-optimized, behind-the-scenes execution, you get these huge speed-ups like you described with the sort scenario. To use some computer science terminology, it was a linear increase in compute as your data frame gets larger, as opposed to n log n, which is much, much more computationally expensive when things get larger. Did I do an all right job of trying to recap what you said there? 
Marco Gorelli: 30:41
Yeah, you did. I think you got the spirit of it perfectly. 
Jon Krohn: 30:44
Nice. All right, so another aspect that differentiates Polars from other libraries is that it optimizes string operations in data processing in particular. Do you want to talk about that? 
Marco Gorelli: 31:00
Sure. Right. We need to make a little Pandas and NumPy comparison here, so we need to go back in history a bit. Pandas was originally built on top of NumPy, and NumPy has not traditionally had a string data type. It does since NumPy version 2, but traditionally, if you wanted to store strings, maybe of different lengths and all of that, you were going to have to just use an object data type in NumPy. With an object data type, every element is just a pointer to a string, and that comes with all kinds of performance and memory footguns. So that’s the historical part. 
31:51
Then, in Pandas, this has traditionally been a bit of a weak point, so I think since Pandas 1.5, it’s been possible to leverage PyArrow to use specialized string storage. How that works is there’s like a really long string behind the scenes, and for each string in your series, Pandas records where the string for that particular row starts and where it ends. And like this, it ends up with better performance and memory characteristics compared to just using the classic NumPy object data type. Polars has taken it even further, and they’ve got a whole different kind of string. They’ve written a whole blog post about this, and that enables further optimizations, especially if you’ve got repeated strings. So that’s the deal. Polars makes working with strings really nice. It also does this natively. You don’t need PyArrow installed in order to make use of Polars strings. 
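[Editor’s note: a minimal sketch contrasting the string representations Marco describes, assuming a recent Pandas with the pyarrow package installed and a recent Polars; the example values are arbitrary.]

```python
import pandas as pd
import polars as pl

words = ["alpha", "beta", "gamma"]

# Classic NumPy-backed storage: every element is a pointer to a Python string.
s_object = pd.Series(words, dtype=object)

# Arrow-backed storage: one contiguous buffer plus per-row offsets,
# with better performance and memory characteristics.
s_arrow = pd.Series(words, dtype="string[pyarrow]")

# Polars stores strings in its own optimized format natively,
# with no PyArrow dependency.
s_polars = pl.Series(words)

print(s_object.dtype, s_arrow.dtype, s_polars.dtype)
```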
Jon Krohn: 33:08
And I guess this is of increasing importance with how natural language processing is becoming more and more of data science. 
Marco Gorelli: 33:18
Definitely. 
Jon Krohn: 33:19
So there was a time when Travis Oliphant, who we talked about at the outset of the episode, when he would’ve created NumPy and SciPy, almost everyone who was using Python… I don’t have the stats on this, but just based on my experience and seeing what was happening out there, most of the time, you’re working with tabular data, and those tabular data, by and large, they were numeric. I mean, for sure with NumPy. And Pandas was designed to go a bit beyond that and be able to handle lots of different data types in kind of one matrix structure where you have one column that’s strings, one column that’s numbers and so on. So more like working with the kind of data that are in a spreadsheet that you might have in Excel. But we are now in this era of data science where natural language processing capabilities are so profound thanks to things like large language models, transformers, generative AI. We have so much more interest in natural language processing than ever before, and so it seems to me like having these string optimizations will come in handy. 
Marco Gorelli: 34:25
Yeah, definitely. I mean, even if you’re not working in NLP, if you’re working in traditional data science, you’re probably working with some columns which are strings, like maybe you’ve got a column which tells you the name of your vendor or the name of your supplier and all of that. You can see the difference that this makes with the TPC-H queries, a set of popular database benchmarks. It was originally written for SQL engines, but it’s been adapted to data frames, and you can see the difference between running those in Pandas with just the classic data types, and then in Pandas where the only difference is to use PyArrow strings instead of the classic object data type. Typically, most queries get about twice as fast, even though in those queries you’re not doing anything string-specific: just doing a join that includes string columns, or just comparing two columns for equality. Any operation where strings are there in the middle benefits from this. 
Jon Krohn: 35:27
And from optimizing operations in open-source libraries, we move to optimizing the work landscape for data science. In episode number 821, I speak to Marck Vaisman about what companies hiring data scientists should be looking for.
35:40
Excellent. So armed with this information, armed with all of these categories and definitions, what can our listeners do to make the world a more sensible place for data scientists practicing, for data scientists interviewing, for data scientists hiring? 
Marck Vaisman: 35:59
That’s a loaded question. So for the hiring side, I’d love to see hiring managers or organizations have more reasonable expectations. Really try to map their needs to the skills and competencies that they need, and not make people go through a rigmarole of hoops and take-home assignments and trivia problems and things that are not applicable to the core skill sets that you need for the job. That’s on the hiring side. On the growth side, on the personal side, on the skilling side: you have a roadmap, and you don’t have to check every box, as Jon said before. You really don’t. You can’t, because it’s going to take a long time, and time is finite, but you should have general knowledge of all of these things, right? 
37:00
It’s going to be hard. You can’t be an expert in every single part here. It’s impossible. I mean, I’ve been doing this for 15-plus years, and by all means, there are a lot of holes in my knowledge, and I think it’s unreasonable to expect that level of mastery. Again, coming back to the unicorn, because I feel that’s what sometimes comes across: this expectation that you need to be a master across all of these areas, and it’s just unfeasible. 
Jon Krohn: 37:29
Brilliant. Yeah, that is a great tip. I mean, I agree 100% that it is insane how the standard in data science interviews today has become getting deep into the theory of decision trees or transformers or whatever, when the practice of the job in most roles… And again, there are exceptions for that kind of stuff. 
Marck Vaisman: 37:54
Yeah, of course. 
Jon Krohn: 37:54
Those types of evaluations- 
Marck Vaisman: 37:55
Of course. 
Jon Krohn: 37:57
… make sense if you’re going to be a researcher at Google DeepMind. But for most data science roles, it should be about applications. The kinds of things that make somebody a great data scientist are their grit, their perseverance, their curiosity- 
Marck Vaisman: 38:15
Curiosity is a big one. Yeah. 
Jon Krohn: 38:18
… communication skills. 
Marck Vaisman: 38:19
You can’t teach curiosity. That’s one thing that you cannot teach. I think you can build it, but I think it’s part of those core skill sets, and the reason I say that is because right now, I’m working on some stuff, and I’m looking at data, and I’m like, “This is wrong. This just doesn’t make sense.” And you kind of have to start thinking, “Okay, where is this data coming from? How is it generated? Who’s creating this?” I want to know what the upstream data source is. These are just really small examples, and I think a lot of it’s just driven by curiosity, just wanting… Again, I consider myself an everlasting learner, and obviously, I think that for me, the teaching helps with the learning because I always want to be up-to-date, like a perpetual student, I guess, for lack of a better term. But I’ve always been very curious, and I think that’s a benefit. You can definitely build it by just asking questions and asking questions and asking more questions. 
Jon Krohn: 39:17
100%. All right. Well, fantastic, Marck. This has been an eye-opening episode. Thank you for your work over the past decade defining these kinds of categories and making it easier for us to understand what it means to be a data scientist, for providing frameworks for the kinds of skills that everybody should be learning, and for providing us with a sense that we shouldn’t be overwhelmed, because the kinds of expectations heaped on us by job descriptions or by hiring managers aren’t realistic. But with the kind of framework that you’ve described today, there are some baseline skills that everyone should know, but there are also these domain skills where you can just pick the ones that you dive deep into, find JDs that feature those, and find a company and a hiring manager that is willing to be realistic about [inaudible 00:40:09] 
Marck Vaisman: 40:08
I think that’s the hard problem. Unfortunately, that’s a hard problem. I think some organizations are better than others in that respect, but you can come in armed. Part of the reason why you want to know about all of these things is because you also want to ask specific questions about the job. What does the job really entail? I think we’ve all had our fiascos over the course of our careers where we go in for an interview and they’re looking for X, and what they really want is Y, but they call it X because that’s what everybody else is calling it. If you read Analyzing the Analyzers, there are a couple of stories there about that, and that’s not new. But it also helps you frame yourself to be that problem solver for whatever the need is of the organization. 
Jon Krohn: 41:05
Nice. 
41:06
All right. That’s it for today’s In Case You Missed It episode. To be sure not to miss any of our exciting upcoming episodes, subscribe to this podcast, but most importantly, I just hope you’ll keep on listening. Until next time, keep on rocking it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 