Jon Krohn: 00:00
This is episode number 531 with Dr. Jeroen Janssens, author of Data Science at the Command Line.
Jon Krohn: 00:13
Welcome to the SuperDataScience Podcast. My name is Jon Krohn, chief data scientist and bestselling author on Deep Learning. Each week we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Jon Krohn: 00:43
Welcome back to the SuperDataScience Podcast. Today’s guest is the data science command line virtuoso, Jeroen Janssens. Jeroen wrote the popular book, Data Science at the Command Line, the second edition of which was released by the publisher O’Reilly in October. He’s also the founder of the aptly named Data Science Workshops, which provides, well, hands-on workshops on data science to global organizations, such as Amazon and the New York Times. He’s also the organizer of the Data Science Netherlands Meetup, which has over 3,000 members. He formerly worked in academia as an assistant professor of data science at the Jheronimus Academy of Data Science, and he worked in industry as a data scientist for the publisher Elsevier, as well as the New York-based startups YPlan and Outbrain. He holds a PhD in artificial intelligence from Tilburg University.
Jon Krohn: 01:33
In today’s episode, Jeroen details why being able to do data science at the command line, for example, in a Bash terminal, is an invaluable skill for a data scientist to have. He also talks about how mastering the command line is the glue that facilitates polyglot data science, the ability to seamlessly borrow functions from any programming language in a single workflow. He talks about his PhD research on detecting anomalous events in time series data, why LaTeX is a great typesetting language to consider using, particularly for creating lengthy documents or technical figures that adapt automatically to new data, and he talks about how his company, Data Science Workshops, grew organically out of his success as an author. Today’s episode will appeal primarily to practicing data scientists and related professionals, like software engineers and machine learning engineers, since it largely covers topics relevant to people who use a command line prompt as part of their workday, or would like to consider doing so. All right, you ready to rock? Let’s go.
Jon Krohn: 02:32
Jeroen, welcome to the SuperDataScience Podcast. I’m so excited to have you on. I’ve been looking forward to this episode for weeks, and now you’re here. Where in the world are you calling in from?
Jeroen Janssens: 02:51
Thank you very much for having me. I’m very excited to be here. I’m calling in from Rotterdam, the Netherlands.
Jon Krohn: 02:59
The shipping center of the globe.
Jeroen Janssens: 03:01
Oh yes. There’s a lot of maritime activity over here.
Jon Krohn: 03:08
Nice. We know each other directly through Jared Lander, who was in episode 501. Jared runs the New York Open Statistical Programming Meetup, which has made a huge difference in my life as well as in yours. There’s a common thread for us with this Open Statistical Programming Meetup in that it launched my book-writing career, and it sounds like you have a similar story.
Jeroen Janssens: 03:35
That’s right. Yeah. Jared reached out to the startup in New York where I was working at the time, and I was given the opportunity to give a talk. I was thinking, what should I do a talk about? Well, I was a data scientist at this startup, and I was using command line tools. I was using the Linux command line to do some of my work, and I thought, hey, I have a few things to say about that. While I was preparing the talk, I also wrote a blog post, and this was in 2013 already. Well, that blog post eventually hit number one on Hacker News.
Jon Krohn: 04:17
Oh, wow.
Jeroen Janssens: 04:18
For a full day.
Jon Krohn: 04:20
[crosstalk 00:04:20]
Jeroen Janssens: 04:20
Yeah, I never had a blog post like that ever. Set the bar very high. No, but then I thought, hey, maybe there is something here, and that eventually led me to writing the book. Of course, I got a lot of help from plenty of people along the way, and one of them is Jared.
Jon Krohn: 04:42
Yeah. O’Reilly picked up the book, and they’re probably the most well-known technical publisher on the planet. I guess they saw that you had this incredibly popular blog post and reached out to you?
Jeroen Janssens: 04:55
No, it’s the other way around. I got help from, yes, many more people, including Michael Doer, and he introduced me to Mike Loukides, who has been working for O’Reilly for a very long time. I was able to convince him that we should do a book about data science at the command line.
Jon Krohn: 05:23
Nice. Well, I can see why. For me, it’s a fascinating topic. I try to use command line tools wherever I can, and we’re going to dig into this a lot in your episode. Before we get to that, and we’re going to talk about your book in a lot of detail, you also have a connection to another recent guest, Veerle van Leemput. She was in episode 491. Yeah. Tell us a bit about that connection.
Jeroen Janssens: 05:47
Yeah. I was very pleasantly surprised to find out that she did a podcast. I did my PhD at Tilburg University, and I also taught at that same university, and that’s where Veerle did her master’s degree. I was lucky enough to supervise her master’s thesis, and I was very fortunate to be able to introduce her to some R packages that I had recently learned, including the Tidyverse. Well, it’s obvious that she has picked that up very well and has moved into developing Shiny applications. Yeah, that was great to see.
Jon Krohn: 06:33
Yeah, exactly. That was the topic of the episode. We talked a ton about R Shiny and the Tidyverse.
Jeroen Janssens: 06:36
Yeah.
Jon Krohn: 06:38
Yeah, and thank you for correcting my pronunciation of her name. I butcher Dutch names.
Jeroen Janssens: 06:43
That’s all right.
Jon Krohn: 06:43
Oh, my goodness. But-
Jeroen Janssens: 06:44
We are so much more than names, of course.
Jon Krohn: 06:50
I’m trying to get better. Back to data science at the command line. That ended up being the title of your book. You had a first edition that came out in 2014, and now you have a second edition that just came out. It came out in October, and it’s actually freely available online. It’s not exactly the same as the version published by O’Reilly. You can order printed copies from O’Reilly, or you can go to O’Reilly.com and get into their learning platform for the digital version. The free online version has slight differences, in that it has your own screenshots or sketches, but all the content is there. That’s something that listeners can check out risk free at no cost, and then hopefully, if they like it, they’re willing to buy the physical copy or access the digital copy so that you … I think that’s the right thing to do in those kinds of circumstances. But it’s an amazing book. It covers how to clean and explore data at the command line. It covers how to model at the command line, of course. It covers how to create your own command line tools, which is actually something that you spoke about recently at the New York R Conference. We were both speaking at that, and that’s where we actually recorded a SuperDataScience episode. It was the first ever SuperDataScience episode live in front of a studio audience, number 511 with Drew Conway, whom you probably also know. Yeah. [crosstalk 00:08:25]
Jeroen Janssens: 08:24
We’ve interacted a couple of times.
Jon Krohn: 08:26
I am unsurprised. Yeah, creating command line tools is a big focus, and the book has a new chapter that wasn’t in the first edition, on polyglot data science, about having many, many languages in your data science toolkit. I imagine there’s something there on how command line tools allow you to bring all those together and make the most of everything.
Jeroen Janssens: 08:54
That’s right.
Jon Krohn: 08:55
Yeah. Tell us about the book beyond the high-level summary that I just gave.
Jeroen Janssens: 09:01
Yeah. Okay. The book is indeed freely available at datascienceatthecommandline.com. Once I have updated those images, it will actually be the same as the book.
Jon Krohn: 09:13
Oh really? Oh, wow.
Jeroen Janssens: 09:15
Yeah. Yeah, yeah, yeah. Yeah, I was very happy that O’Reilly allowed me to put [crosstalk 00:09:21]
Jon Krohn: 09:21
Wow, yeah.
Jeroen Janssens: 09:24
I like that you mentioned risk free because, next to this book, I have created a Docker image that allows you to try all these tools out risk free, right? Because it’s an isolated environment. The command line, when you’re first starting out, can be very intimidating and you can indeed … Well, it’s not without risks. You can very easily wreck your system. When you are in this isolated environment … I used to have a virtual machine for this back in 2014, but now I’ve switched to a Docker image. Then of course, once you run it, it becomes a Docker container. I’m still thinking about, what’s the easiest way to get people started? Because I discuss over 100 different tools, and if you were to install them all manually, it would take you a good portion of the day, and that’s of course not a fun way to get started. I am thinking about this a lot, about how to make it easy for people to get started with this. Because while it’s been around for over 50 years, this technology, the so-called command line, the Unix or Linux command line, there are still … Yeah. I think that a lot of people would benefit from learning a trick or two from this environment.
Jon Krohn: 10:56
I wholeheartedly agree, and I love the way that you’re doing it. Having it in a Docker image, that for me is today the easiest way that you could be conveying this to the audience. Yeah, there’s a few steps, maybe even at the command line, that somebody will have to execute to get that Docker image turned into a running container on their system. But then, as you say, it has all of the dependencies that you need. In your case, like you’re saying, over 100 different software tools already pre-installed, and everything is inside that container. It’s separate, and like you say, risk free, from all of the rest of your system. The worst-case scenario is that you mess up something inside the Docker container, but the rest of your computer will be absolutely fine. Yeah, I think you’re doing things absolutely correctly there.
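To sketch just how short that setup is, here’s a minimal, hedged example. The exact image name is an assumption on my part; take the real one from the book’s website.

    # Pull the book's Docker image (if needed) and start an interactive container.
    # The image name below is an assumption; use the one listed on the book's site.
    docker run --rm -it datasciencetoolbox/dsatcl2e
    # --rm removes the container when you exit; -it gives you an interactive shell.
    # Anything you break in there stays in there; your own system is untouched.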
Jeroen Janssens: 11:45
Having a Docker image was also for myself, because I wrote the entire book in R Markdown, which takes all the Bash code and all the pipelines that I had written, compiles them, and then stitches the output from each command back into the Markdown source, which in turn is converted to a PDF and an ebook and a website and so forth. What I wanted to avoid is that the code and the output from that code got out of sync. That’s why I decided, let’s do this properly, have a Docker image so that others can also benefit from this.
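As a minimal sketch of that setup, with a hypothetical file name: a Bash chunk lives inside the R Markdown source and is executed at compile time, so its printed output can never drift from the code.

    # A minimal sketch (hypothetical file) of the book's build idea: knitr runs
    # each Bash chunk when the document is compiled and stitches the output back in.
    cat > demo.Rmd << 'EOF'
    ---
    title: "Demo"
    output: html_document
    ---

    ```{bash}
    seq 5 | awk '{ total += $1 } END { print total }'
    ```
    EOF

    # Requires R with the rmarkdown and knitr packages installed.
    Rscript -e 'rmarkdown::render("demo.Rmd")'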
Jon Krohn: 12:30
Awesome. Yeah, that makes a lot of sense. Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criteria is the Data Science Insider. If you weren’t aware of it already, the Data Science Insider is a 100% free newsletter that the SuperDataScience team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning and artificial intelligence. Simply, five news items. The top five items are hand-picked, the items that we’re confident will be most relevant to your personal and professional growth. Each of the five articles is summarized into a standardized, easy-to-read format and then packed gently into a single email.
Jon Krohn: 13:34
This means that you don’t have to go and read the whole article. You can read our summary and be up to speed on the latest and greatest data innovations in no time at all. That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do. I skim the Data Science Insider newsletter every week. Those items that are relevant to me, I read the summary in full. If that signals to me that I should be digging into the full original piece, for example, to pore over figures, equations, code, or experimental methodology, I click through and dig deep. If you’d like to get the best signal-to-noise ratio out there in data science, machine learning and AI news, subscribe to the Data Science Insider, which is completely free, no strings attached, at www.superdatascience.com/dsi. That’s www.superdatascience.com/dsi. Now, let’s return to our amazing episode.
Jon Krohn: 14:34
Yeah, for listeners who aren’t already familiar with Docker containers and Docker images, this sounds like a great way to get started with that as well. When you get into using data science in production systems, at companies, Docker containers and Docker images are the standard today for allowing your machine learning models to scale up across many different servers and execute efficiently. Definitely a skill worth having if you don’t have it already. All right. Once a reader of your book has their Docker environment set up, then what do they do? You must teach them some Bash before getting into R and Python, or tell us about the flow through the book.
Jeroen Janssens: 15:19
Yeah. The book starts with an introduction to … Well, first of all, it explains of course what data science is according to, well, some. Right? If you ask 12 different people what data science means, you get back 13 different answers. By the way, the book leans on the OSEMN model by Chris Wiggins and Hilary Mason. After it has explained why the command line-
Jon Krohn: 15:49
What do you mean by that? The OSEMN model? It’s a template for book writing?
Jeroen Janssens: 15:54
No. No, it’s O-S-E-M-N, which you could pronounce as “awesome”, where each letter stands for a task that you could do in data science, or that you would need to do, and they are obtaining data, that’s the O. The S stands for scrubbing data. Then you have exploring data, then modeling data, and lastly you have interpreting data. But then it’s the N which is capitalized.
Jon Krohn: 16:25
Interpreting.
Jeroen Janssens: 16:26
[crosstalk 00:16:26] this acronym, which I don’t discuss, because I think that interpreting data, right? What do your results mean? How do you communicate that to your peers? That’s not so much an activity you do at the computer. That’s very much a human activity. But having said that, the other four steps each have a chapter in this book. There’s a chapter on obtaining data where I talk about how you can open up Excel files or just CSV files, how you can download data, how you can query RESTful APIs. Then there’s a chapter on cleaning that up, of course, because rarely is the data that you get in a format that allows you to immediately work with it. This could be anything that you could also do in Python or R, but sometimes you need to do a little bit of cleaning before you can even get that data into pandas or into the Tidyverse. Right?
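For instance, here’s a hedged sketch of those obtaining and scrubbing steps. The URL and file names are placeholders, and curl, jq, and csvkit’s in2csv and csvlook are just a few of the tools in this space:

    # Hypothetical examples of "obtain" and "scrub"; URL and file names are
    # placeholders. curl, jq, and csvkit are real tools; your inputs will differ.
    curl -s 'https://api.example.com/v1/items' | jq '.items[]'   # query a RESTful API
    in2csv sales.xlsx > sales.csv                                # convert Excel to CSV
    csvlook sales.csv | head                                     # peek at the result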
Jon Krohn: 17:41
Mm-hmm (affirmative) Does that involve things like awk and sed, by chance? [crosstalk 00:17:47]
Jeroen Janssens: 17:47
Yeah. Awk and sed are two popular command line tools that can do a lot of data cleaning there. Yeah. Yeah. Awk has to be one of my favorites, but I’m not sure if that’s really fair, because it’s a programming language on its own, albeit a very restrictive one. It’s not like Python or R. But it allows you to do a lot with your data. It reminds me of a story, a project that I once did for a client in Norway. They had this two-gigabyte JSON file in which they just needed to change a little thing, but it had big implications. They tried to use a tool called jq, which is another command line tool that allows you to work with JSON data. jq is one of the younger ones. Yeah? They couldn’t do it. They just ran out of memory all the time. This person, he had followed my workshop, Data Science at the Command Line, and he reached out and he said, “Hey, could you perhaps help us?” I tried a bunch of things, and in the end, I was able to fix the problem using awk, a tool and language from, if I had to guess, the ’80s, and-
Jon Krohn: 19:11
I think so. Yeah. [crosstalk 00:19:13]
Jeroen Janssens: 19:13
… I was able to solve this in under a second, and it used hardly any memory. That just keeps reminding me that we often don’t need bright and shiny new tools. Right? A lot of tools are already available that have been optimized throughout the years, and this is just one of those personal success stories there. But yeah, awk and sed, and then there are about 100 others. Yeah. You don’t need to know all of them, just as you don’t need to know every single function from the Tidyverse-
Jon Krohn: 19:50
Sure.
Jeroen Janssens: 19:51
… in order to be effective.
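The client’s actual fix isn’t something I can reproduce here, but a hypothetical stand-in, renaming one key throughout a huge JSON file, shows why awk copes where loading the whole document does not:

    # Hypothetical stand-in for that kind of fix: awk streams the file line by
    # line in roughly constant memory, rather than parsing the whole JSON tree.
    awk '{ gsub(/"oldKey":/, "\"newKey\":"); print }' big.json > fixed.json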
Jon Krohn: 19:53
Sure. But a book like yours is perfect for giving people an overview. Maybe they don’t need awk or sed today, but-
Jeroen Janssens: 20:00
Nope.
Jon Krohn: 20:00
… by being introduced to that and the 100 other tools, you see, okay, there’s another possible tool in my toolkit. They can do a little bit of playing around safely inside of a Docker container that you provide them, and then they can … Okay. That’ll stick in their memory, and maybe a year or two later, you’re like, “Oh, perfect opportunity for me to make use of one of these other tools.”
Jeroen Janssens: 20:25
New tools are being written every day, to this day. I don’t know about the majority of tools, right? I still find new tools on a regular basis. But what I think really matters is that after a while you get accustomed to the environment that you’re in. Tools come and go, but it’s the environment, the shell, the command line, also sometimes called the terminal, if you will, although there are slight differences between these concepts. But once you are comfortable in what is sometimes a stark and unforgiving environment, once you’re used to that, the tools themselves don’t really matter, because you have a bunch of them and you know how you can stitch them together, and then you solve your task and move on to the next one.
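As a minimal sketch of that stitching, with a placeholder URL, here is a classic pipeline that counts the ten most frequent words in a text, each tool doing one small job:

    # A classic example of stitching small tools together with pipes.
    # The URL is a placeholder; point it at any plain-text file.
    curl -s 'https://example.com/some-text.txt' |
      tr '[:upper:]' '[:lower:]' |   # lowercase everything
      grep -oE '[a-z]+' |            # one word per line
      sort | uniq -c | sort -rn |    # count and rank the words
      head                           # keep the top ten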
Jon Krohn: 21:17
Yeah, being comfortable with the command line and a programming language like Bash is a career-long investment as a data scientist. You will always be able to come back. Unix isn’t going away. Jupyter Notebooks aren’t going to be around forever. Something else is going to come along. Any of these point-and-click user interfaces are going to come and go over the decades that you are a practicing data scientist, but it is unlikely-
Jeroen Janssens: 21:46
Yeah. No, it’s-
Jon Krohn: 21:47
[crosstalk 00:21:47] command line is going to be [crosstalk 00:21:48]
Jeroen Janssens: 21:49
… example of the Lindy effect. Right? It’s been around for 50 years, so we can expect it to be around … Well, for the rest of our lives, at least. Yeah. It’s a worthwhile investment. Exactly.
Jon Krohn: 22:02
Definitely. All right. All right. We’ve covered cleaning and exploring data a little bit. Do you want to tell us a bit more about creating your own command line tools, which was the topic of your New York R Conference talk in the fall?
Jeroen Janssens: 22:16
Yeah. Yeah, absolutely. After a while, you realize that whatever you type at the command line can also be turned into a command line tool. It’s conceptually similar to writing a function in a programming language, right? You’re able to abstract away some code and treat it as a black box. But the great thing about the command line is that it doesn’t care about the language that a tool is written in. On any given day, I use tools that have been written in Perl, C, R, Python, JavaScript, even. As long as they take in text and produce text. That sounds a bit odd right now, but text is the universal interface in Unix. But then you’re able to combine those, and then, yeah, the possibilities are limitless.
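As a minimal sketch, with a hypothetical tool name and assuming ~/bin is on your PATH, turning a one-liner into a named, reusable tool takes only a few lines:

    # Turn a pipeline into a named tool. "sumcol" is a hypothetical name,
    # and ~/bin is assumed to be on your PATH.
    mkdir -p ~/bin
    cat > ~/bin/sumcol << 'EOF'
    #!/usr/bin/env bash
    # Sum the numbers arriving on standard input, one per line.
    awk '{ total += $1 } END { print total }'
    EOF
    chmod +x ~/bin/sumcol

    seq 100 | sumcol   # prints 5050

And because the shell only sees text in and text out, the same script could have been written in Python, Perl, or R without the caller ever noticing.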
Jon Krohn: 23:19
Yeah. It’s the ultimate melting pot across all of the programming languages. Whether you today happen to be a kind of data scientist who uses Python more, R more, or SQL most of the time, it doesn’t matter. At the command line, you can use all of these equally, and as you add more languages, more tools to your toolkit, they can all blend together in these kinds of scripts, which you so elegantly describe as a way to then abstract away things that you do repetitively and do everything more efficiently.
Jeroen Janssens: 23:51
Yeah. One more thing about that. You’re indeed expanding your own toolbox because of this, but you also have the ability to expand other toolboxes, the toolboxes that others have. If someone is not familiar with, say, Python or R, just to name a few, it doesn’t matter. If you provide a little bit of documentation on how to use the tool, then they should be able to install a few prerequisites and use it. The interface is very familiar because it’s all on the command line. That is one approach you can take when you want to be a polyglot data scientist.
Jon Krohn: 24:36
Nice. Yeah. Then we’re onto that new chapter. Is there anything else about that in particular that you want to focus on right here? I absolutely love this idea. In the same way that R and Python and SQL are the most popular programming languages in data science today, again, that’s going to change. We don’t know what they’re going to be in five years or 10 years. Those three languages will still be around in five years and 10 years, but who knows what else will come up? Again, being able to understand the command line means that it doesn’t matter, whatever comes along, you’re going to be well prepared for blending that together with the tools you already have. I don’t know. Anything else from you?
Jeroen Janssens: 25:19
Yeah. Yeah, of course, we mention Python and R quite a lot because they are the two most popular open-source languages when it comes to doing data science. But also something like Spark, Apache Spark. It can take in … It has a pipe method, right? Where you can pipe all the items in an RDD to a command on the command line. Then once you realize that, again, there are so many possibilities. But what really matters here as well is the mindset, that you realize that you don’t have to do everything in a single language. A lot of people, when they go into data science, right? They are of course overwhelmed, just like with everything else when you go into a new field. But then they’ll see, there are all these languages and all these tools, which one should I learn? I think it’s important to know you can’t go wrong with either of them. If you, at a certain point, see a package that really solves your problem, a very specialized package in another language, then it’s not a big deal. You’ll still be able to use [crosstalk 00:26:43] that package and remain inside your preferred language. Yeah.
Jon Krohn: 26:50
Nice. That is a beautiful summary, Jeroen, of why this polyglot data science is so valuable. All right. Let’s move on to your business. Your business, Data Science Workshops, came about as a result of your first book. The first edition, as I mentioned earlier, came out in 2014, and then by 2017, you had a full-time business of your own, Data Science Workshops, that came out of the book. I’d love to hear about that. Data Science Workshops does data science training and coaching, of course offering workshops as the name would suggest, but also in-company courses, hackathons, meetups, and even what you bill as inspiration sessions. Those could be shorter sessions that provide anyone, whether they are hands-on data scientists or not, they could be managers, with an idea of what’s possible with data science: transforming your organization into a more data-oriented, machine learning-oriented organization. Yeah, you offer all these different kinds of formats, and you offer them in a wide range of languages, which, given everything that we’ve talked about so far, should not be surprising. You do workshops in Python, in R, in Bash, and there’s even some JavaScript in there. Tell us more about your business beyond what I’ve just introduced.
Jeroen Janssens: 28:30
Yeah. Well, it sounds so good when you say it like that, but I didn’t plan for this. When the book came out, I was contacted by a company in Barcelona, Spain, asking me whether I wanted to give a one-day workshop to their team of engineers, because the person who contacted me wanted them to get more exposure to the command line and didn’t want to do this himself. They had some specialized command line tools for obtaining data from their databases, which I really liked. In fact, I liked giving the workshop so much that I decided to do this more often, and not just about the command line, but also R and Python, indeed. Of course, languages are only tools, right? A means to an end. I also give workshops that are more focused on data visualization and machine learning. But yeah, it grew out of that. I started giving more and more workshops at different companies, until I decided in 2017 that it was time to just do this full time, because I enjoyed it a lot, and yeah, it allowed me to put food on the table. Here we are, four years in, and I’m still enjoying it very, very much.
Jon Krohn: 30:07
Nice. Yeah. It’s wonderful to be able to find something that stimulates you so much, that you enjoy so much and that you can also do to put food on the table. It’s a very nice situation to be in, Jeroen. Yeah. Prior to running this Data Science Workshops company, you had a very strong technical background. You did a PhD at Tilburg University, and then you were an assistant professor at the Jheronimus Academy of Data Science. It’d be interesting maybe for our listeners to talk a bit about what your PhD research was, how that led you into this data science path.
Jeroen Janssens: 30:51
Yeah. Yeah, yeah. We can go one step further back: I did a master’s in artificial intelligence, and stumbled into doing a PhD, which was more about machine learning, and specifically detecting anomalies. I spent five years working on algorithms that would detect anomalies. Yeah, five and a half years, a long time. I think we spoke about LaTeX before the show. I think I spent about six months of those five and a half years perfecting the layout of my PhD thesis, which I really enjoyed doing, but design is not a skill that’s particularly appreciated, or at least it’s not really used to advance your career within academia. I decided to leave academia right after, or even before, my PhD. I was lucky enough that most of the things that I had learned were all of a sudden being used in this role called data scientist. This was back in 2011. This title didn’t exist in the Netherlands. There were no vacancies with this title. But on the other side of the pond, as they say, in New York, there were a lot of roles. Yeah, I was very lucky that I could start working as a data scientist in New York City.
Jon Krohn: 32:39
Yeah. Outbrain, YPlan, Elsevier. Elsevier was back in the Netherlands. By that point, they’d caught up. You could be a …
Jeroen Janssens: 32:48
That’s right.
Jon Krohn: 32:49
Yeah, you finished the PhD in 2013. Well, I guess you-
Jeroen Janssens: 32:55
No. That’s right.
Jon Krohn: 32:56
… finished defending in 2013. It’s very much the same timeline as me. I left Oxford, where I did my PhD, in 2012, started working, but I didn’t have my dissertation submitted and defended until 2013, or 2014? I think it was ’13. But I started work, and I had the same experience. When I started work, I worked as a trader at a hedge fund, and I did that for two years, and I had never heard of this title, data scientist. I didn’t love being a trader. I had a hard time staying motivated about making money for its own sake. I left being a trader with the intention of actually … I’d done my PhD in England. As a trader, I was between New York and Singapore, and when I quit, my intention, and it was literally in my resignation letter, was that I was going to move back to Canada and study medicine, which is something that I’d always intended on doing in a way. But then at the same time, in this period, I had a month left on my rent in New York, and socially, through a friend, I met somebody who worked at another company in New York that I’m sure you’ve heard of from your time here, Zocdoc, which is a marketplace for matching people looking for medical services with doctors. This person had the title data scientist, and I was like, “Whoa! It sounds like I could do that job.”
Jeroen Janssens: 34:32
It sounds very sexy.
Jon Krohn: 34:33
Yeah. They were like, “Well, as it happens, there are a lot of people looking for someone like you in this field,” and I was like, “Wow, all right. Maybe I should look into this instead of going on a 10-year path before I’m earning an income again down the medical route.” I love data science, in case it isn’t obvious. Anyway, yeah. I’m sorry, I’m cutting off your story by just talking about myself.
Jeroen Janssens: 34:57
Oh, no.
Jon Krohn: 34:58
But it is interesting parallels there, I think.
Jeroen Janssens: 34:59
Definitely, and it’s a shame we didn’t bump into each other in New York City. But yeah, in the Netherlands, when you do a PhD, you get a salary, but only for four years. After four years, you’re on your own, and I wasn’t quite finished, mainly thanks to the LaTeX. Well, and a bunch of other reasons. But I had to find a job. I had to put food on the table, again. That’s when New York came around. Exactly. That’s also when I made the switch from MATLAB. I remember you started out with MATLAB as well. But-
Jon Krohn: 35:36
Wow. You’ve listened to a lot of episodes. Yeah, that’s right.
Jeroen Janssens: 35:40
I like your podcast. I do.
Jon Krohn: 35:44
That’s great. Yeah. In my undergrad, I was primarily using MATLAB, and then over the course of my PhD, more so into R, and then as a professional data scientist now, mostly in Python. But yeah, that’s right. MATLAB.
Jeroen Janssens: 36:00
MATLAB was great, but it was also really expensive. Being a Dutch person, I decided, of course, that I don’t want to pay for these things. I moved to a free and open-source alternative that was called Python.
Jon Krohn: 36:17
Octave.
Jeroen Janssens: 36:17
I remember I was using pandas version 0.4, and still Python 2.6, I think. Yeah. Good old days.
Jon Krohn: 36:27
Yeah. It makes a lot of sense. MathWorks, who create the MATLAB software, provide it for free to university students. That gets people hooked. But then once you move into the professional world, the licenses for MATLAB are expensive. The nice thing about paying a lot for software like that is that it’s really tidy. Everything works together exactly as you’d anticipate, and everything’s very well documented. The open-source world is obviously a bit more of a wild west, but it’s free and it has all of the latest things. Somebody publishes a paper, you see that paper right away on arXiv, and then, oh, you go to the associated GitHub repo to get access to that modeling approach right away. I think that’s why we, as an industry, have gravitated towards open source. But yeah.
Jeroen Janssens: 37:27
It can go two ways. I’ve also been using commercial software, a commercial database. Because it was so expensive, there wasn’t a large community. When [crosstalk 00:37:40] you were in trouble, it was very hard to get answers to your questions. It goes both ways.
Jon Krohn: 37:47
Yeah. That’s a really good point too, that if you’re using more popular open-source libraries, you can just type the error that you’re getting into a Google search and you get back exactly the Stack Overflow page with the exact command that you need to fix your problem. Yeah. That’s a really good point. It’s interesting. I wanted to come back to you talking about LaTeX and spending six months of your PhD perfecting the layout. LaTeX, for listeners who aren’t aware, is a typesetting language. I wrote my book in LaTeX, I wrote my PhD thesis in LaTeX. It allows you, actually at the command line if you’d like to, to write beautiful PDFs, and it scales up nicely. If you’re writing something that is only a few pages long, doing that in a real-time editor like Microsoft Word or Google Docs isn’t too cumbersome. But when you have documents that are hundreds of pages long, it makes it nice if you can have your chapters or sections broken out into different text files, and then those all compile together, and it makes the whole process a lot more manageable. When books are ultimately published, they’re often done in LaTeX. It also can give you the crispest layout possible.
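As a minimal sketch of that chapter-per-file structure, with hypothetical file names and assuming the per-chapter .tex files exist, the whole thing compiles from the command line:

    # A minimal sketch (hypothetical file names) of a long document split into
    # per-chapter files; the chapter .tex files are assumed to already exist.
    cat > thesis.tex << 'EOF'
    \documentclass{book}
    \begin{document}
    \include{chapters/introduction}
    \include{chapters/methods}
    \include{chapters/results}
    \end{document}
    EOF

    # latexmk reruns pdflatex as often as needed to resolve cross-references.
    latexmk -pdf thesis.tex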
Jon Krohn: 39:18
The reason why I’m talking about LaTeX so much is that it’s interesting that the very next episode that I’m recording, which I anticipate will be episode 533, and will probably be released the week after listeners are hearing this episode, is with someone named Brett Tully. Brett Tully and I did our PhDs at the same time. We were roommates while we were doing our PhDs at Oxford together, and he made this amazing LaTeX template for his PhD, and I just took it and used it. It conformed to all of the formatting guidelines that we had at Oxford. For me, it’s interesting that one of the reasons why I then loved LaTeX so much is that it meant that I didn’t futz around with formatting. Prior to getting into LaTeX, I would primarily use Word or Google Docs. When I do that, I spend a crazy amount of my writing time futzing with how this particular page looks or how this particular image looks, as opposed to just getting the content in there. When you’re working with LaTeX, particularly if you do it at the command line, you’re just hammering out the content, and I wasn’t worrying about formatting at all. But anyways, two very different experiences with LaTeX.
Jeroen Janssens: 40:37
You took the smart route.
Jon Krohn: 40:41
I don’t know. I just got lucky. If Brett hadn’t been there, I don’t know, I would’ve been in the same situation as you, and I would’ve needed another year to finish my PhD.
Jeroen Janssens: 40:49
Oh yeah. Having said that, I also did my figures in LaTeX. There is a package called TikZ that I used, which I don’t know how to use anymore. Forgotten. But it produced one … If you’re listening to this and you ever need to produce a good-looking figure programmatically, then TikZ might be worthwhile looking into. T-I-K-Z.
Jon Krohn: 41:19
Yeah. That’s a really good point. Actually, something that we haven’t even mentioned yet: one of the really useful things about LaTeX is writing math equations, which is how a lot of people get into it. It is the most sophisticated framework that I’m aware of for being able to write equations however you’d like. Then, as you say, you can use TikZ for creating your figures. You can create beautiful figures programmatically at the command line, if you’d like to, or in a text file, in whatever text editor you want to use. All of these things can be seamlessly woven together. Yeah. There’s a bit of a learning curve. But if you’re making large documents, or if you’re going to be wanting to automatically generate a lot of the same style of figure, it’s definitely something worth investing in, and there’s Beamer for presentations.
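To make the equation point concrete, the markup for, say, the ordinary least squares estimator is just a couple of lines:

    \begin{equation}
      \hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
    \end{equation}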
Jeroen Janssens: 42:19
But also if a figure or graph is based on data, it’s really nice that that graph gets updated as soon as the data changes.
Jon Krohn: 42:30
Totally.
Jeroen Janssens: 42:31
Maybe you’re doing some experiments, right? They produce some results that you have, I don’t know, in CSV format, and then it’s really nice. Of course, you can also do this with R and with Python. There are many ways to do this, but just being able to reproduce that is really nice. Yeah. Yeah.
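As one sketch of that idea, outside LaTeX itself: assuming a hypothetical results.csv with two numeric columns, a tool like gnuplot (one of many options; I used TikZ for mine) can regenerate the figure on demand whenever the data changes:

    # Hypothetical sketch: rebuild a figure from results.csv whenever it changes.
    # The file name and columns are assumptions; gnuplot is just one option here.
    gnuplot -e "set datafile separator ','; set terminal pngcairo; set output 'results.png'; plot 'results.csv' using 1:2 with lines title 'metric'"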
Jon Krohn: 42:51
Totally. Yeah. Yeah, didn’t expect to go off on this LaTeX tangent, but hopefully it has been interesting for some of our listeners. Jeroen, something that I ask all of the guests on the show is for a book recommendation, and I know that you have one for us. Do you want to tell us about it?
Jeroen Janssens: 43:09
Oh yeah. Yeah. I have here Masters of Doom by David Kushner. He writes about how Doom and Quake and the team behind them, id Software, came into existence. It’s a fascinating read, not just … It’s of course related to games, but it’s still related to technology, and of course, there’s lots of drama in it.
Jon Krohn: 43:42
No kidding.
Jeroen Janssens: 43:44
Yeah. I still play these games regularly. I’m also creating a map for Quake, the first one, the game that came out in ’96, if I’m not mistaken. Yeah, it’s not a data science book, but it is a nice one if you’re looking to read something in a different [crosstalk 00:44:06]
Jon Krohn: 44:05
Yeah, it blends technology and history in an interesting way, especially if there’s drama. That can be really interesting to read.
Jeroen Janssens: 44:13
Yes.
Jon Krohn: 44:15
Yeah, those games, when they came out in the ’90s … I’m getting the impression from the things we talk about that we’re very much the same age. But I remember when Doom came out, there had been some 3D games before, like Wolfenstein 3D, but Doom was a game changer in terms of being able to have a 3D video game experience, so much so that I remember insisting that my grandmother come with me to the bookstore so that I could get the strategy guide for how to play Doom. But we didn’t even have a computer. We didn’t have computers at home. I didn’t have access to a computer at school at that time, but I just knew that this super cool thing was going on, and so I learned about all the different kinds of guns and monsters from the guidebook. Anyway, that’s a really cool-
Jeroen Janssens: 45:04
[crosstalk 00:45:04] death match, at some point.
Jon Krohn: 45:10
What does that entail? It sounds painful.
Jeroen Janssens: 45:13
Oh, a death match is just a term that was coined by John Romero. Right? One of the founders of id Software. It’s a type of game where you just try to kill each other in the game as often as possible. Yeah. This can be-
Jon Krohn: 45:34
Got it.
Jeroen Janssens: 45:34
… one on one, or one versus many, team versus team. Yeah. No, that’s a different name. That’s a different type of game actually, but yeah. Many-versus-many death match, and it used to be all the rage at LAN parties. Yeah.
Jon Krohn: 45:54
Yeah. Yeah. Prior to widespread internet when you were on dial-up modems. You’d literally physically network [crosstalk 00:46:02]
Jeroen Janssens: 46:02
We’re showing our age here, Jon. Yeah.
Jon Krohn: 46:06
Yeah. I’d say very cool, but I’m not sure that this conversation has been very cool.
Jeroen Janssens: 46:13
Well, to make it relevant again, John Romero’s wife, Brenda Romero, who’s also a game designer, she recently had a tweet about … Okay. She was wondering how many game designers were actually using Python and pandas in order to analyze their data.
Jon Krohn: 46:33
No kidding.
Jeroen Janssens: 46:34
Or were they still using Excel and sorting these columns? Of course, I immediately reached out. Still waiting to hear back from her. But it’s all tied together. There’s data in every field, and that’s what makes data science so interesting, right?
Jon Krohn: 46:52
Yeah, and so broadly applicable. Cool. Yeah, definitely an application that I didn’t anticipate discussing today, but absolutely loved doing so. Jeroen, we’ve learned so much from you today, not only on LaTeX and Doom, but also on the command line and how it can be so useful in data science. How can people stay in touch with you, follow your work? What’s the best way to stay up-to-date on the latest on working at the command line?
Jeroen Janssens: 47:20
Right. Two main websites. The one for the book is datascienceatthecommandline.com, where you can read, well, both the first edition and the second edition. I can recommend the second edition, although I am a bit biased. If you want to get a hold of me or follow whatever it is I am doing, I am on Twitter and LinkedIn, and the easiest way to get to me is through my company’s website, which is datascienceworkshops.com. I always love to hear from others, whether they are stuck with something or whether they have a suggestion for some cool tool that they’ve found. Yeah.
Jon Krohn: 48:05
Awesome. All right, Jeroen, this has been such a fun episode to film with you. I’ve really enjoyed this. I’ve learned a lot, and yeah, hopefully-
Jeroen Janssens: 48:15
Likewise.
Jon Krohn: 48:15
… we can have you on the show again sometime in the future.
Jeroen Janssens: 48:18
I would love that. Thank you so much for having me, Jon.
Jon Krohn: 48:26
I loved filming this episode with Jeroen today. He’s so effortlessly cool while explaining relatively complex technical topics. He seems like a genuinely, exceptionally happy individual who would brighten anyone’s day. In today’s episode, the beautiful ray of sunshine, Jeroen, covered how both his book, Data Science at the Command Line, and his associated Docker image, with its hundreds of software dependencies, are freely available online. He talked about how the newly released second edition covers how to do everything a data scientist would need to at the command line, including creating command line tools, modeling data, and using the command line as the polyglot glue between different languages like Python, R, JavaScript, anything. He talked about the OSEMN, O-S-E-M-N, approach to data science defined by Hilary Mason and Chris Wiggins, which involves obtaining, scrubbing, exploring, modeling, and interpreting data. That’s the approach that he follows in his book. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Jeroen’s LinkedIn and Twitter profiles, as well as my own social media profiles, at www.superdatascience.com/531. That’s www.superdatascience.com/531.
Jon Krohn: 49:47
If you enjoyed this episode, I’d of course greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter, and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. All right, thanks to Ivana, Mario, Jaime, JP and Kirill on the SuperDataScience team for managing and producing another fun and informative episode for us today. Keep on rocking it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.