Kirill: This is episode number 61 with data science speaker Daniel Whitenack.
(background music plays)
Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, data science coach and lifestyle entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.
(background music plays)
Hello everybody and welcome to the SuperDataScience podcast. Very excited about this new episode. Today we’ve got an interesting guest, Daniel Whitenack. Daniel is a Data Scientist at Pachyderm, but in addition to his normal role, he is a renowned speaker in the world of data science. Daniel presents at dozens, literally dozens, of conferences per year, and he talks about things like Python, Pachyderm, the Go programming language, machine learning, data science and machine learning workflows, and lots of other topics.
So today we have a great opportunity to have Daniel on the podcast, and we spoke about quite a few things. We talked about Pachyderm and what the mission of the company is, and we also talked about data science workflows. I think that’s a very powerful and still-developing part of data science, and if you want to prepare yourself for the future of the field, this can be a very valuable set of skills to look into. In this podcast you will find out exactly what data science workflows are, how you can get more involved in learning about them, and how to get up to speed with what is going on in this space.
Then we also talked a bit about the Go programming language. We talked about mentorship and what role it plays in the careers of both mentees and mentors. And of course, towards the end, we talked a bit about Daniel’s experience in data science.
So it’s a very packed podcast. You’ll learn quite a lot, especially if you are interested in understanding the end-to-end data science process, not just ad hoc data analytics, but becoming a well-rounded data scientist. You will get some very valuable tips from today’s show.
And on that note, I can’t wait to get started. So without further ado, I bring to you Daniel Whitenack, a data scientist at Pachyderm.
(background music plays)
Welcome everybody to the SuperDataScience podcast. Today I’ve got a very exciting guest, a multi-time speaker at data science conferences, Daniel Whitenack. Hi Daniel, how are you going today?
Daniel: Hi, I’m doing really well. Thank you for having me.
Kirill: Thank you for coming, it’s a great honour to have you on. I’ve seen a couple of your YouTube videos and actually we crossed paths at the ODSC conference. Tell us a bit about where you’re calling in from. Where are you located?
Daniel: Yeah, I’m in Lafayette, Indiana, which is about 2 hours from Chicago and an hour from Indianapolis. It’s where Purdue University is, if anybody is familiar with that.
Kirill: That’s very interesting. But it’s a rare case for you to be at home, right? As I understand, you’re at all these conferences, always presenting. Tell us a bit more about that. Why are you so passionate about presenting at conferences?
Daniel: Yeah, I really found that where I learn the most is when I’m in community with other data scientists, with other engineers. I really thrive off of things like code reviews and other interactions like that; I think that’s really been where I’ve learned. So I’ve gained an appreciation for the tech community, and I want to be involved as much as I can in that community, give back content, get feedback on the projects that I’m working on, hear about what other people are working on, and, yeah, just be plugged in in that way.
Kirill: Gotcha, gotcha. And what do you present on most of the time? Or is it a variety of topics that you talk about usually?
Daniel: Yeah, so there’s a few different areas, I guess. I talk pretty frequently about Pachyderm, the open source project that I work on full time. Then I also sometimes talk about Go. I’m a little bit of a Go programming language nerd, and sometimes talk about that and some of the interesting data things going on there. And then the third area: I’m really passionate about data science workflows and sustainable data science workflows. As we develop the field of data science, what are the best practices that we need to be following, and those sorts of things? That’s a really interesting topic for me.
Kirill: Wonderful. And I hope we can cover that off and go into a bit more depth on the podcast about that. I think that’s a very exciting topic as well.
Daniel: Yeah!
Kirill: Ok, beautiful. So tell us a bit about what you do. Who is Daniel Whitenack, what are you currently doing? You mentioned you were working at Pachyderm, yeah?
Daniel: I’m working full-time with Pachyderm. They’re an open source project, like I mentioned, but there’s also a company around it, so similar to Docker. Docker is a company and open source project. It’s a project that does data pipelining and data versioning, so I work full-time with them as a data scientist. Most of the rest of the team are really brilliant distributed systems and backend engineers that have really built up the core of the system, and I came on as a data scientist because really where the system is relevant is for data scientists and for data engineers. So I’m kind of like a liaison to that community. I also work with a lot of our users that have machine learning workflows and other data science workflows and I help them utilize the system. And then I also produce a lot of tutorials, demos, work on the docs, and then occasionally talk at different events. So I’m a little bit all over the place, but generally kind of data science guy.
Kirill: Okay, gotcha. And tell us, how are the product and the things that you create at Pachyderm used by the users, companies and other people who pick them up?
Daniel: Sure, yeah. So, one of the things that I worked hard on is really trying to understand I guess a sustainable production scale machine learning workflow, kind of like a template that people could use for their various machine learning workflows, whether that’s in Python or R or TensorFlow or whatever it is. So I kind of created this template and then wrote up some docs around it and then when users come in with those sorts of workflows, then there’s an R example, a Python example, a TensorFlow example that I’ve written, to point them to and show them how they can start developing a data pipeline with this kind of template as a model.
So that’s kind of one example of something that I’ve worked on, but right now I’m working on some streaming examples, streaming analysis, and kind of figuring out what are some of the best practices with our system around streaming workflows, and what’s the easiest and best way for a data scientist or an analyst to start streaming analysis in Pachyderm, starting not from nothing, but from a template that’s a good foundation.
Kirill: Okay, gotcha. Is my understanding correct that somebody has a data science project and then they want to turn that into something that’s repeatable, something that can be taken from end-to-end, not just as an ad hoc analysis but can be done many times, then they use Pachyderm to create that workflow and then they just have to input the data, tweak the parameters and they get their output? Is that what Pachyderm is for?
Daniel: Exactly, yeah. It’s a production environment for data science workflows. For example, you can create a data pipeline that includes model training and pre-processing and inference and have all those steps connected and also be scalable and, like you said, be repeatable as well. So it’s a very holistic view of “These are all the steps of my processing. I’m going to connect them all so I know what’s going into where, and then I’m also going to version and track my data so I know what data was coming into and out of the various stages, and so I have reproducibility and tracking, or provenance, of all these things that are happening.”
Kirill: Gotcha. I understand. Until you gave me that sticker at ODSC—remember you gave me the sticker for Pachyderm with the elephant?
Daniel: Yeah. It’s a pretty good sticker. (Laughs)
Kirill: Yeah. And until you gave me that sticker, I honestly didn’t know about Pachyderm. So what would you say to people who are listening to this podcast who don’t know about Pachyderm? Why would you say that it can be an advantage for them to go and learn about Pachyderm and start applying that in their work and see how they can bring that into their organizations?
Daniel: Yeah. I really think, you know, in the data science community, there’s a lot of struggle right now around giving data science teams ownership of their data pipelines. And there are a lot of inefficiencies in data scientists actually pushing things into production, a lot of times because maybe the only distributed or production-scale frameworks that they know about are Java/Scala-type frameworks like Hadoop and Spark. And maybe the data science team, as a lot of data science teams are, they don’t really like working in Java/Scala, or maybe they have a bunch of different tools that they’re using, you know, R and Python and other things.
So Pachyderm is a really great solution for those data science teams to be able to create a cluster very easily that’s able to run production scale things, but also able to run the tools and the frameworks and the languages that they’re used to and that they already like, so that they don’t have to waste time re-implementing things in Java/Scala. They can have a stage in R, a stage in Python, they can run a batch script in one stage. And all of that is tied together in a very unified and very sustainable way, but it also allows them to maintain that simplicity and ownership over the tools and the frameworks that they really like using.
Kirill: I understand. And probably a big advantage to that as well is that in a lot of organizations, data science is seen as kind of an additional function and once you’ve done the analytics, you pass on the models and everything to the business intelligence group. I’ve been in situations like that where you have to pass it on to the BI team and then they re-implement all of that in SQL. And in this case you’re kind of taking ownership of that so you don’t have to tell somebody how to re-implement that, there’s no middleman, you have full ownership and full control of the data science models and things and products that you develop for the organization.
Daniel: Yeah, I totally agree. And in doing that, really what you’re doing is you’re setting yourself up hopefully for a little bit easier management of your processes. Because when you have that sort of hand-off scenario, whether it’s re-implementing within the business intelligence team, or you’re giving it to Java/Scala engineers to implement on big data infrastructure, what you end up with is data scientists who don’t understand the implementation and engineers who don’t understand the modelling or the analysis, so then when things go wrong, or when you need to update things, there’s a lot of inefficiencies with understanding “Who should I contact? Who has the right knowledge to fix this?” and there’s also inefficiency, of course, in the re-implementation.
Kirill: Gotcha. I understand. I totally agree with that. Yeah, that’s really cool. I already learned so much. I might as well just end the podcast here. I think that was so much value. I’m joking, of course. We’ve got so much more to cover. Are you happy to go into a bit of data science workflows and tell us a bit about that?
Daniel: Sure. I would be happy to.
Kirill: Okay, cool. So what is a data science workflow, and what are the components of a data science workflow?
Daniel: Data science is very diverse, so it’s hard to pin down one workflow for everything because data science now, when we talk about producing data-driven processes, that’s happening at all levels of a business from optimizing what servers you spin up in AWS all the way to optimizing your sales pipeline. So it’s kind of at all levels of the business, both impacting internal processes and external processes. But in general, I would say a data science workflow often involves some type of stage where you’re doing some cleaning of your data, you’re gathering data, maybe you’re aggregating it to produce some sort of dataset that you’re working off of. And then you’re doing some sort of arithmetic or processing on that dataset, maybe that’s statistics or maybe that’s a convolutional neural net, it’s some sort of arithmetic, and then there’s some sort of conveyance of those results or maybe post-processing of those results.
And these different stages could themselves involve various different stages. You know, you might have to transform your dataset or aggregate your dataset in five or six or seven different stages or ways before you actually get a feature set that you want to use in a model. So there’s these workflows that are by their very nature multi-stage, and one of the problems that I’m really seeing in data science and I think will probably resonate with the listeners very well is that you get all these stages and then you very, very quickly start losing track of what’s going into what. And especially if you’re handing off things to other members of your team to review or to build on, they have trouble understanding the whole workflow that you intended, especially if you leave the company and then they have to build on it.
So there’s this real problem around workflows. And then if you add on top of that not only that you’re doing all these multi-stage things, but that eventually you actually have to deploy some of that stuff into your company’s infrastructure, that’s kind of overwhelming on top of things. There’s just this giant overwhelming elephant in the room that is this workflow thing. And that’s really, like I said, where a lot of my passion is. I think these are the problems that the data science community is really dealing with a lot right now. You know, we’re developing very sophisticated modelling techniques, which is really cool and we should be doing that, and I think we will continue to do that, but in a lot of business scenarios the problem is not that you’re not able to be sophisticated enough on the modelling side, the problem is you actually need to put together this workflow in some way that it produces value within a business. So I think that piece has still got some kinks in the data science world.
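(To ground the multi-stage picture Daniel describes, here is a toy sketch in Go of those three broad stages: gather/clean, compute, convey. The stage functions and the input records are hypothetical, and a real workflow would be far messier and more multi-stage; the point is only that even a tiny analysis already has a pipeline shape.)

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Stage 1: gather/clean. Tidy raw records into a usable dataset.
func clean(raw []string) []float64 {
	var data []float64
	for _, r := range raw {
		v, err := strconv.ParseFloat(strings.TrimSpace(r), 64)
		if err != nil {
			continue // drop malformed records
		}
		data = append(data, v)
	}
	return data
}

// Stage 2: the "arithmetic" step. Here just a mean, but in a real
// workflow this could be statistics or a convolutional neural net.
func process(data []float64) float64 {
	sum := 0.0
	for _, v := range data {
		sum += v
	}
	return sum / float64(len(data))
}

// Stage 3: convey the result in a form someone can consume.
func convey(result float64) {
	fmt.Printf("average of cleaned records: %.2f\n", result)
}

func main() {
	raw := []string{"12.5", " 13.1", "oops", "11.8"} // hypothetical input
	convey(process(clean(raw)))
}
```

(Pachyderm’s pitch, as Daniel describes it, is to lift exactly this kind of chain out of one script into connected, versioned pipeline stages, so you can still track what goes into what.)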
Kirill: Okay, gotcha. That’s a great description, and I can totally imagine that Pachyderm is a great tool that helps solve that problem. But is Pachyderm sufficient on its own to solve that problem, or does the person that’s faced with this challenge need to have some sort of knowledge and understanding about workflows? What I’m getting at is: what kind of tips, hacks or tricks can you recommend and suggest to people who are faced with a challenge like that?
Daniel: Yeah, what I recommend goes even outside of Pachyderm. Like you said, at Pachyderm this is kind of our passion; this is why we’re building our system, and at least we think that it addresses many of these issues. But in general, whether you ever look at Pachyderm or not, one of the best practices that I think we should be pushing in our workflows first is celebrating and striving for simple solutions to problems, the simplest solution that still provides value.
So if I’m on my data team and someone brings a solution to me and I’m supposed to review and it’s some very un-interpretable model that has a lot of complication and we could have solved the same problem with k-Nearest Neighbours or linear regression or something like that, then I’m going to have a big problem with that. Because if we can find those simple interpretable solutions, we should be using them. We should celebrate the fact when we’re able to solve a problem with a simple solution, because that by its very nature is going to be easier to maintain, easier to deploy, it’s going to bring more value to the company because we’re actually going to be able to accelerate that to production faster. So I think that in general is a good tip.
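(To make that tip concrete: below is a minimal sketch, in Go, of the kind of simple, interpretable baseline being described, an ordinary least squares fit using the Gonum stat package mentioned later in the episode. The data is invented for illustration.)

```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

func main() {
	// Hypothetical data: marketing spend (x) vs. weekly sales (y).
	x := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	y := []float64{2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1}

	// Ordinary least squares fit: y ≈ alpha + beta*x.
	// nil weights means unweighted; false means fit an intercept.
	alpha, beta := stat.LinearRegression(x, y, nil, false)

	// The coefficients are directly interpretable: beta is the
	// expected change in y for each unit increase in x.
	fmt.Printf("intercept = %.3f, slope = %.3f\n", alpha, beta)

	// R² shows how much variance the simple model explains, which
	// tells you whether you even need anything fancier.
	r2 := stat.RSquared(x, y, nil, alpha, beta)
	fmt.Printf("R^2 = %.3f\n", r2)
}
```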
I also really recommend to people that, outside of everything else, we shouldn’t be making excuses around reproducibility. Sometimes when I talk about reproducibility, people give me a little bit of pushback because they say, “Oh, well, now we have these random processes and non-deterministic models and other things.” And I push back on them on that, because I come from the physics world. I worked in quantum mechanics, and the electron may be over here when you measure it, it may be over there when you measure it, it’s not always in the same place, but that doesn’t mean that there’s not a very strict theory around it. We know exactly the distributions that we should be expecting.
In a similar way, if you have a non-deterministic model or you’re producing more complicated models, you should, before you ever consider actually putting things into production or having your models influence people, whether that’s decision makers in your company or your users, you should have a very good understanding of the range of results that you expect, whether that’s very repeatable results, or whether that’s a range of results, or a distribution. You should have that understanding because it’s very important, as you put those things into your workflows, that you have a very good understanding of how they should be behaving. Because we’re now at a point where the things that data scientists are producing, they’re not just reports for salespeople or something. We’re deciding which way your self-driving car should turn and we’re deciding whether you should or shouldn’t get an insurance policy and other things that are really directly user-impacting. I think having that responsibility and really taking ownership of that responsibility, having a little bit of empathy for the end user and the people that are consuming our models is a really good practice to get into.
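(To make that reproducibility point concrete, here is a small sketch in Go: a stand-in stochastic model run many times from a fixed seed, so you can characterize the distribution of results you should expect before anything reaches production. The model and numbers are invented for illustration.)

```go
package main

import (
	"fmt"
	"math/rand"

	"gonum.org/v1/gonum/stat"
)

// noisyModel stands in for any stochastic step in a workflow,
// e.g. a model with random initialization. Purely illustrative.
func noisyModel(rng *rand.Rand) float64 {
	return 10.0 + rng.NormFloat64()*0.5
}

func main() {
	// Fixing the seed makes the whole experiment repeatable,
	// which is the first step towards reproducibility.
	rng := rand.New(rand.NewSource(42))

	results := make([]float64, 1000)
	for i := range results {
		results[i] = noisyModel(rng)
	}

	// Characterize the range of results you should expect, before
	// this step ever influences decision makers or users.
	mean, std := stat.MeanStdDev(results, nil)
	fmt.Printf("expected output: mean=%.3f, stddev=%.3f\n", mean, std)
}
```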
Kirill: Yeah, gotcha. So the tips are: build simple solutions, to make them easier to deploy and maintain and get them to production faster, and create reproducible solutions, so you know what to expect. And that kind of ties into what you talk about in one of your YouTube videos: when you expect your model to do one thing, but it’s doing something else, people lose confidence in the reliability of the results that they’re getting.
Daniel: Exactly, yeah.
Kirill: Okay, gotcha. Thanks a lot for that. Moving on, I wanted to touch a little bit on Go. I think through your videos I found out about this programming language and I never knew that it was such a huge community. Tell us a bit about Go. What is Go and how does it compare to programming languages like Python for data science?
Daniel: Go is a programming language that came out of Google. It initially was used for a lot of infrastructure projects, so listeners have probably heard of things like Docker or Kubernetes, etcd, maybe InfluxDB and some other things. And it’s been really useful in these projects, mainly because it’s a statically typed language, but it’s very simple and you can be very productive in Go. Engineers have found that they can be very productive in Go while maintaining strict typing, with the efficiency and integrity advantages of that, and it’s very easy to deploy. So you can compile down a Go program into a single statically linked binary and just throw that wherever you want and it will behave exactly the same anywhere.
If the machine you deploy to doesn’t have Go installed, that’s fine. The binary will just run on whatever architecture you built it for, so that makes it very, very easy to deploy. I would say, in general, people kind of cling onto it because they can be very productive in the language, it’s very simple and readable, easy to deploy. It just has a lot of nice built-in things. For example, with Python, I think one of the reasons why I initially came to Go is that doing asynchronous programming in Python is a challenge. You might have to pull in things like Python Twisted and things like that, which get kind of hairy. And in Go, you know, I think I learned Go and implemented my application in Go in the same amount of time as it took me to learn Python Twisted. So I think that’s a testament to just how productive you can be in the language.
Kirill: Gotcha. And can you explain for us what is asynchronous programming?
Daniel: Yeah. So, in modern times, we might have a bunch of micro-services in our organization, and maybe we’re consuming off of a queue, like RabbitMQ or Kafka or something, rather than pulling from a database or having very static files that we’re processing. So in those cases you need to be able to process requests or to process events concurrently. And by its nature, Python is single-threaded, so you can implement certain frameworks – I mentioned Twisted – that kind of help you with this problem, but Go just naturally has these primitives that allow you to handle concurrency very easily, so it’s very quick to implement those things.

And this is being utilized in packages too. There’s a neural network package called Gorgonia, which utilizes these primitives in the modelling context, and then there’s packages like Gonum. Listeners are probably familiar with NumPy; Gonum is kind of similar for the Go world, a lot of numerical computing stuff, and they utilize these primitives. It’s still a relatively new language, so there’s not a really great consolidation and standardization of everything data science in Go yet. There’s great neural network stuff, there’s great modelling stuff, there’s great numerical computing stuff, and there are even data frames and a kernel for Jupyter, but all of these things are scattered about the internet. They’re gradually coming together, and people are starting to talk to one another about “We need to decide on formats for this and that.” So some of those things are happening and there’s a lot of momentum in that area, but right now it’s definitely still developing.
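(For listeners who haven’t seen Go’s concurrency primitives, here is a minimal sketch of the goroutines and channels Daniel is referring to, with a channel standing in for a stream of events from a queue. The events and worker count are made up; the point is how little ceremony concurrent processing takes compared with pulling in a framework.)

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

func main() {
	// A channel stands in for a stream of events, e.g. messages
	// consumed off a queue like Kafka or RabbitMQ.
	events := make(chan string)

	var wg sync.WaitGroup

	// Start a few workers; each goroutine processes events
	// concurrently as they arrive on the channel.
	for w := 1; w <= 3; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for e := range events {
				fmt.Printf("worker %d processed %q\n", id, strings.ToUpper(e))
			}
		}(w)
	}

	// Feed some illustrative events into the stream.
	for _, e := range []string{"signup", "click", "purchase", "logout"} {
		events <- e
	}
	close(events) // no more events; workers drain and exit

	wg.Wait()
}
```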
Kirill: Okay. Some very insightful thoughts. And just finishing up the conversation on Go, for somebody who’s never heard of Go before and who’s aspiring to be a data scientist and build a career in this space, at which point or in which circumstances would you recommend looking into learning Go?
Daniel: Yeah, so I would say for a very, very new data scientist who’s maybe looking for their first position, pretty much all the data science positions will want you to know Python or R, so it’s probably not great for you to just learn Go and say, “I know Go.” But for data scientists that are already in companies, or maybe looking for new positions, a lot of new companies are implementing their entire stack in Go. I’ve seen this countless times, where people are building their applications on top of Kubernetes and these other infrastructure projects, and maybe a whole company’s stack is written in Go, or a lot of it. That’s a really great case where you can say, “Now we want to build some data science applications on top of this, we want to build in some modelling, some analytics. Let me dip into this world of Go.” And I can definitely say that you won’t have a hard time, because there’s pretty much everything you need to do productive data science in the Go world. Like I said, some of it is still developing, but if you’re in that scenario, I would say take a look at the Go world, see if it has what you need, and try out Go, because I think you’ll be very pleased with it, and your company will be very pleased that it fits very well within the direction they’re going and the engineering efforts in the organization.
Kirill: Gotcha. Thank you. And then, once you’ve done that, you can call yourself a ‘Gopher’, is that right?
Daniel: Exactly, yeah. A Gopher, yeah. So if our listeners are interested, there is a website gopherdata.io. Hopefully someday this will be something like PyData is for Python, but right now it’s mostly just a site with links and some blog articles. But visit that site, there’s a link to resources there, and a really great listing of Go data science projects and different modelling packages and data frames and Jupyter and all that stuff is listed out there.
Kirill: Gotcha. And the Go community is very big, right? Like thousands and thousands.
Daniel: It is. There’s a public Slack team now. If you just search for ‘Gopher Slack,’ you’ll find it. Actually, I have it pulled up here right now. Let me get the last count. Right now there are 16,220 people in the public Gopher Slack. I mean, it’s active 24 hours a day, it’s a very vibrant community. And those are just the ones that are on Gopher Slack.
Kirill: Gotcha. Understood. Okay, thanks a lot for that excursion. And now I’d like to move a bit to your background. Our listeners are always interested in understanding how people came to data science, and your journey is a very interesting one because it started in physics; you have a PhD in physics.
Again, in one of your YouTube videos you were talking about how you were analysing how electrons bump into atoms and things like that, and what’s going on there, and how you then started thinking of applying the same principles to data science. Walk us through a little bit of that. How did you start, and how and why did you transition from what you were studying in your degree to data science?
Daniel: Yeah, I started out in physics. Originally I wanted to be a research professor. I kind of got disillusioned by some things in academia and decided to go into industry. At the time I went into industry, six or seven years ago now, data science wasn’t quite as big as it is now. It had existed for a while, but it wasn’t as hyped back then. And I really didn’t know what I could do in industry as a physicist. The urban myth amongst the grad students was that some people somehow ended up in finance and made a lot of money, but that was kind of the only thing I knew about in industry.
So I looked around for a while, and I got the first job with basically whoever would hire me. So I got a job with an IP firm as a ‘technical specialist,’ so basically what I did is I worked with a lot of data scientists and engineers and researchers kind of translating a lot of their math-y stuff into normal human speak for the lawyers. I felt very fortunate that I landed there because actually I saw just a ton of different stuff that was going on in industry all at once, so I was flooded with all of these really brilliant people doing all of these things at Google and at Wolfram and at other companies that were doing really interesting modelling, and a lot of them maybe had physics backgrounds or science backgrounds and I thought, “These people are doing amazing, interesting stuff. I should just do that. That sounds like a lot more fun.”
So that’s kind of where I started exploring getting into data science. It seemed like my math and a little bit of computing background would play well with that field. So I started looking around for a first data science position. I did some online training. I went through the Thinkful data science course, which existed at the time, and was paired with a mentor that helped me figure out what data scientists were doing in industry and the different techniques they were using. That was useful for me not so much in terms of learning new techniques, because actually a lot of these techniques I already knew. It was more just learning jargon, learning processes that people wanted to hear in interviews. I think that’s one of the real problems for people coming out of academia: it’s “Oh, I don’t understand what all this regression stuff you’re talking about is,” but then, “Oh, you’re just doing ordinary least squares. I’ve done that like 3 million times in grad school.”
Figuring out all that jargon stuff can be daunting, but that really helped me with that and then I got a data scientist position with a start-up in Chicago. I worked there for a number of years. Then I got into consulting as a data scientist. I worked on a bunch of projects, eventually worked on a project with ‘The New York Times’ and ‘The Washington Post’ doing analysis of comment data and then finally ended up with Pachyderm, working on that project with them. So that’s kind of been my journey.
Kirill: That’s really cool. And it’s interesting you mentioned that you had a data science mentor, because from your LinkedIn I can see that you yourself became a data science mentor at Thinkful in 2015. Can you tell us a bit more about that?
Daniel: Yeah, I did. I think the program that they have has changed slightly since then, but basically the goal of that was to pair data scientists in industry one-to-one with students who are trying to get into data science, to really give them a real-world perspective on data science and also help them build up some of the skills in Python and modelling and SQL and statistics to get them prepared for hopefully a junior data science position. I did that for a number of years. I had a bunch of different students, and that was really a great thing as well.
This is one thing I recommend. Maybe some of your listeners are new data scientists, but they have some experience – I would highly recommend, whether it’s formal or not, that you develop some mentoring relationships where you’re able to pour into other people. Because like I said, it’s very daunting for people coming out of academia and other fields to kind of figure out these data science worlds. So, just giving them some pointers, maybe meeting with them once or meeting with them once a month or once a week – that can be a huge impact on their life.
Kirill: Yeah. And I can attest to that, that through those processes, through those meetings and conversations, you also learn yourself. Like, you learn a lot for yourself and cover off the things that you thought you knew but you actually don’t or the person that you’re speaking with, they always have something to offer you back. Even if they don’t know about it, you learn a lot in the process.
Daniel: Yeah, definitely. Everyone has a different background and the questions that every different person asks are very different. They make you think about problems in so many different ways that it’s really useful for you.
Kirill: And for somebody who is looking for a mentor, what would you say is the best way to go about that? Maybe Thinkful or maybe there’s other platforms or other tips that you can give to people who are looking for mentors in data science?
Daniel: Yeah, I would say there’s definitely the online programs, Thinkful and a number of other online peer-to-peer mentorship programs. I would really recommend, maybe as a starting point for people, to get plugged into the local meet-up scene, find a good vibrant data science or machine learning meet-up in your area. You’ll go to like two or three meet-ups, maybe people just care about the pizza and beer that are there and not really about the topic, but then that third one that you go to, the discussion will be really good, people will be really into the topics, so get plugged into that particular one and just put yourself out there and start talking to the different people there, get plugged in locally. That’s maybe going to get you a good local mentorship from one or more people in that community, but it’s also going to really help you as far as developing connections in your local community as you’re trying to get into data science as well.
Kirill: Okay, gotcha. That’s been very helpful and I think a lot of people will find that useful in their aspirations for data science. Now I wanted to move on to a bit of an interesting question that I formed as we were going through the podcast. You moved from industry and from consulting into Pachyderm, and now you are mostly developing these really cool solutions and products that people can use to make their lives easier. But how do you feel about moving away from actually doing data science yourself? Like, in consulting and in the startup, you were actually performing data science to solve business problems. Are you doing any of that now, or do you feel any nostalgia about that?
Daniel: Yeah, that’s a really good question, actually. I think this question generally applies to, you know, data science positions where maybe you work for a company that provides a data science product and you’re working with a whole bunch of users, but maybe you’re not actually involved in the different companies, versus being plugged in every week, all the time, to the specific problems of a single company. You know, it’s gone in different seasons for me.
At first, like you said, I started out in the latter. I was plugged into these companies and I was working on maybe one specific project for a very long period of time and really trying to develop good solutions for that project or maybe a handful of them at a particular company. And I think there are certain advantages and certain learning that goes along with that as far as really understanding a problem very deeply and understanding specific types of data very deeply.
And then there is advantages to the former. If you’re working with a lot of different users, like I am now — like, yesterday I was working with image data and today I’m working with log data — it’s a lot of context switching, but you learn a lot about the different scenarios that people are actually dealing with across the landscape of data science.
Maybe there’s some bit of nostalgia, but I think for me it’s a different season and it’s a different type of learning. I don’t doubt that later on in my career, I will probably switch back to a season of very intense focus on a small set of problems relevant to a certain company and then I’ll probably fluctuate back. I think for me, that kind of flow back and forth is good because it triggers different parts of my mind and helps me learn different things and it also gives me diversity in what I’m working on. So, yeah, I think it’s just kind of a back and forth for me.
Kirill: Okay. And for somebody who is starting into data science, like completely new or who just wants to change their current career and be more focused on data science and move into data science from a different type of world, maybe it’s BI world, or a database role, or statistics, or even arts, anything, the question I have is, what’s the best place for them to start? Should it be something that’s heavily focused on a specific company or on a specific type of work, like you said you had previously, or should it be something that’s more diverse that gives them exposure to more different tools and more different scenarios like what you’re doing now?
Daniel: I would say probably the former. I would recommend that people starting out, they do get plugged into a company that is working on a specific handful of problems. I think that for me it was really useful to get plugged in to a start-up where I was initially part of just a two-person data team or maybe it’s like a three or four-person data team, and I think those scenarios are really good for people starting out because, by your very nature in those positions, even though you’re focused on the specific problems at a specific company, you have to wear a lot of different hats. So you’re going to be forced to deploy some things yourself, you’re going to be forced to deal with the data pipelines, you’re going to be forced to learn some ETL, you’re going to be forced to learn database stuff, all of that in addition to whatever statistical or modelling analysis that you’re doing.
So in that scenario, for a new person, I think those are really the essential things that you really want to build up. You want to become a well-rounded person that’s comfortable with the end-to-end data process. Probably you’re never going to become the world’s greatest expert on MongoDB or something, but it gives you a little bit of confidence to where if you go into another position or something and maybe it’s not Mongo in that case and they say, like, “Hey, we’re using Cassandra,” and you say, “Oh, I’ve dealt with different databases before. I’ve dealt with different infrastructure. I have at least an exposure that makes me a little bit more comfortable jumping into this scenario.” I think that’s what’s important. And then later on, it’s fine to jump into those consulting and very diverse sets of positions, but I think also on those positions you want to have that confidence in a variety of tooling first, because you’re going to be thrown into a lot of different scenarios and you have to really quickly pick up on those different scenarios.
Kirill: Gotcha. And I’m really glad you mentioned that, because those who have been following me for a while will know that I have a different opinion on that question. I personally always recommend for people to start in consulting if they’re getting into data science, because that’s the path I took. Yes, I did some work before that, but my biggest learnings were when I was at Deloitte for two years, and I was just thrown into different tools, into different industries, into different scenarios. And as you correctly pointed out, you don’t get the opportunity to develop that well-roundedness of an end-to-end data science approach, but on the other hand you get exposure, you see all these different areas of data science, and I feel you understand better where your career might go. At the same time, I totally respect your opinion on that. I think it’s just two different approaches to how you would go about starting in data science.
Daniel: Yeah. And to follow up on that, I think you’re very right. I wouldn’t necessarily disagree with that. I think the important thing is that when you’re starting out you’re in a scenario where you’re learning a lot about a diverse set of things. Like you said, in the consulting world, you’re going to be thrown into a lot of scenarios. You’re going to be forced to learn in kind of a smaller team environment. Like I mentioned, you’re going to be wearing a lot of different hats so you’re going to be forced to learn.
I think the scenario that is maybe different than both of these is like a very large data science organization where you’re thrown onto a team and really your job is to just produce the next time series model or something like that. And you don’t get quite an overview, like you would in either one of these other two scenarios. Big companies deal with that in various ways, like having a boot camp sort of program or on-boarding sort of program, and some of those work very well. Yeah, that was my comment there, I guess. (Laughs)
Kirill: Gotcha. And I agree, whichever approach you take, the last thing you want to get is pigeonholed, right? You won’t learn anything if you’re just doing the same thing.
Daniel: Yeah, that’s a great way to put it.
Kirill: Okay, cool. Thanks a lot for that. I’ve got some questions about your experience with data science, kind of like rapid-fire, but totally feel free to go in-depth on them. Are you ready for this?
Daniel: Yeah.
Kirill: Okay, let’s do this. We already talked about the tools a little bit, so I won’t ask the one about what tools you use on a daily basis. Next one is: What’s the biggest challenge you’ve ever had as a data scientist?
Daniel: I would say that probably the biggest challenge that I faced is more mindset-related than tooling or modelling-related. When I was first starting out as a data scientist, as is often the case with data scientists, I was working very closely with the CEO and COO of the company that I was working with. There was very much this mindset of, like, “Let’s be data-driven,” and also the mindset of, we have come to this scenario and “Wouldn’t it be cool if we could predict this?” And in my curiosity as a scientist, I would say, “Yeah, that would be really cool,” and immediately I would jump into two months of work to predict that thing, and then I would bring it to the CEO and COO and they would say, “Oh, it’s cool that we predicted that.” And then everybody sat around and was kind of like, “What do we do now?” And there was nothing we could do with that prediction, let’s say.
I think the biggest challenge for me has been to kind of rein in that mindset and really focus when a new problem comes up on asking the right questions when I’m starting out a project, you know, “If I produce this result, will it have actionable consequences? Will it produce value within a company? What form do I need to put this result in such that it produces that value?” I think that’s been the biggest challenge for me.
Kirill: Gotcha. Okay, that’s a really cool one. I like that. Next one is: What’s a recent win that you can share with us that you had in your career as a data scientist, something that you’re proud of?
Daniel: Good question. I think that on the Pachyderm side, we’re seeing a bunch of wins with our users in actually proving things out at production scale, like data scientists actually being able to produce solutions that scale to production-sized data. So on that side, I think it’s really great and valuable just to see users actually using what you’re producing in a way that makes them happy. That’s extremely fulfilling and I think that’s a big win.
Also, like you said, I’ve been talking at a bunch of places and when I first started talking about some of these workflow things and reproducibility and Docker and Go and other things, I was a little bit of the odd man out, no one really knew what I was talking about. And now — we met at ODSC — and this isn’t a win on my part, I would say, because I didn’t do this, but through the efforts of many people talking about these things, you know, people are talking about these issues now and that just really excites me. People are talking about how do we solve these problems of reproducibility and workflows.
Kirill: Gotcha. Yeah, that’s really cool. And just speaking of Pachyderm, do you know why the logo is an elephant?
Daniel: Yeah, so “pachyderm” is a word from a now-defunct classification of animals. At some point, people thought it would be great to classify animals based on the thickness of their skin, which turns out to be an incredibly poor way to classify animals, but the pachyderms were the thick-skinned animals, which included elephants and hippos and that sort of thing. So in one way, it’s kind of a poke at the Hadoop ecosystem, whose logo is an elephant, but it’s also that, in having a data platform or a distributed processing platform, it’s probably a good thing to be thick-skinned and robust. So that’s kind of the dual meaning.
Kirill: Gotcha. Thanks for that. And you may have answered this question in your previous answer about your recent win, but nevertheless I’ll still ask: What is your one most favourite thing about being a data scientist?
Daniel: For me, probably this: when I was in the academic world, I really thrived on the creativity of the research process. I loved it when there was an elegant and creative solution to a problem. In data science you get that too. But in the research world, oftentimes the end goal is that you write a paper, it gets published, and then maybe nothing comes of it.
But in the data science world, you get to have that satisfaction of creating a creative elegant solution but then in addition, you get the satisfaction of seeing – if you actually implement it – value being produced very immediately from that solution within the company either for your users or internally. And I think that dual benefit is really what is my favourite thing.
Kirill: Fantastic. I can totally attest to that. It’s just super great that you get so many intrinsic motivators in data science to continue doing the work you’re doing and wake up in the morning. Okay, question of the day: Where do you think the field of data science is going? From everything you know, from all the things you see, from all the conferences you go to, where do you think this field is going and what should our listeners prepare for to be ready for the future?
Daniel: One thing that we’re starting to see is data science influencing every single part of a business. Even if you’re not a machine learning company, you’re thinking about data science in terms of your infrastructure, in terms of your sales pipeline, in terms of your marketing, in terms of your recommendations for your users. It’s impacting every single part of the business. That’s one thing I see.
The other thing that I really see right now is — traditionally, for the past number of years, we’ve seen one side of your organization be data engineering, developing pipelines, doing streaming and other things, and then one side of your organization being data science, doing analysis and modelling and that sort of thing. In my mind, one trend that I see in the industry is really a pressure to push those sides together. You know, I might be a little bit biased because Pachyderm sits in the in-between of those two sides, but I think in general – and this was displayed at ODSC because everywhere you turn there was another data platform solution or a solution to help data scientists deal with infrastructure or deal with their GPUs or whatever it is – I think there is a pressure to push these two sides together and I hope that this will end up with a scenario where data scientists will have real ownership over not only the solutions that they’re creating, but how those are implemented within an organization so that there’s this seamless kind of flow between data science and solutions that are actually impacting business very quickly.
Kirill: Gotcha. Thank you very much for that. That’s a great insight, meaning that if this is a new area of data science that’s evolving, people should start looking into that early and start thinking about it, kind of preparing and anticipating the future. If that’s going to be something that’s super big in two or three years, why not start learning that stuff now? Why not start getting into it? And then when it comes, you are ahead of the curve and you get the good jobs and you get the high salary and you’re leading that industry rather than just following the trend.
Daniel: Exactly.
Kirill: Cool. Thank you so much, Daniel, for coming on the show. It’s been really great hearing first-hand from you about all this, all these insights. How can our listeners contact you or follow you or find you or find out which conferences you’re presenting at if they would like to learn more from you?
Daniel: Sure. So, I’m @dwhitena on Twitter, so you can find me there. Additionally, if you join the public Pachyderm Slack channel (there’s one of those; you can go to pachyderm.io and there’s a link to it), or if you’re on the Gopher Slack, I’m also dwhitena on those Slack teams, so you can message me there. I’m dwhitena on GitHub too, and you can find out what projects I’m working on there. There’s a repo under there called ‘slides’. I should probably update the name, but for now it’s called ‘slides’, and the readme there lists out where I’m going to be at different conferences coming up this year.
Finally, like I mentioned, there’s the GopherData site, gopherdata.io, if you’re interested in Go data science. There’s a bunch of good resources there and links on how to get involved in that community. So there’s a few different ways.
Kirill: Gotcha. And you also have a blog, datadan.io.
Daniel: Oh, yeah. Thanks for the reminder. I knew I would miss something. So if you go to datadan.io, I have a blog there where I have a bunch of articles and I also write for the Pachyderm blog and a couple of other places like Intel and YC and some others.
Kirill: And is it okay for people to connect with you on LinkedIn?
Daniel: Definitely, please do. I just ask that, if I don’t know you, shoot me a message. That way I kind of know the context of what you want to talk about and we can connect.
Kirill: Gotcha. Just to summarize it for everybody, because you just listed a huge amount of resources, the best place to go, I think, is datadan.io because all of the links are listed on the left – Twitter, GitHub, LinkedIn. You won’t miss anything if you just go there and you’ll see it on the left right away. Okay, thank you very much. I just have one more question for you. What is one book that you can recommend to our listeners for them to become better data scientists?
Daniel: I thought about this a while, and there’s a bunch of great books out there. Actually, there’s a bunch of great free books out there, but the one thing that I would recommend, and I really think that people should take a look at this, is called “Rules of Machine Learning.” It’s actually not even a book, it’s just a document that’s online, so if you search for “rules of machine learning Google” or something, you’ll find it. It’s a document that Martin Zinkevich put out. He’s a research engineer at Google, and the premise is that these are the rules of machine learning, or best practices for ML engineering, at Google, the kinds of things that they emphasize there, distilled down for a general audience. I would highly recommend it. It talks about your first pipeline. It talks about things that I mentioned: choosing a simple, observable and attributable metric, interpretable models. It talks about feature engineering and skewed distributions and complex models. I think it’s a really great resource with a bunch of nuggets of truth in there.
Kirill: Fantastic. Thank you very much for that. So, that’s “Rules of Machine Learning” by Martin Zinkevich from Google. And on that note, thank you so much, Daniel, for coming onto the podcast and sharing all these valuable insights with our listeners. I think it was great and I think a lot of people will learn so much from what you shared today. Thank you so much.
Daniel: Excellent. Thank you.
Kirill: So there you have it. That was Daniel Whitenack of Pachyderm, a very renowned speaker on data science. So if you’re interested to hear more from Daniel, definitely check out his upcoming speaking events, which you can find on GitHub or through his blog, where all of his links are listed.
In terms of today’s podcast, I hope you enjoyed it. For me personally, the most valuable takeaway (there were a lot of great things mentioned) was probably the data science workflows and how they are becoming a growing area of data science as the field matures. We all know this is happening. Like, seven or ten years ago, data science wasn’t even a popular thing. People weren’t talking about it as much, and you wouldn’t be able to get a degree in data science. But now it’s becoming a field of its own, alongside things like physics, chemistry and biology. It’s being taught at universities, companies are applying data science more and more, and the field is slowly maturing. And as that happens, it makes sense that data science workflows are going to become more and more important.
So it’s a great insight that Daniel shared with us. If you’re looking to build a career in data science, then looking at data science workflows could be a valuable thing, because even though they might not be as popular and as pronounced right now, in two or three or five years’ time, data science workflows are most likely going to be an essential skill that you will need to have as a data scientist. So, as they say in ice hockey, you could be skating to where the puck is going to be, anticipating it and catching it in advance rather than chasing it. You could be doing the same thing here: learn about data science workflows now, so that when they do become big in the world, you already know a thing or two about them.
So that’s what we talked about today. And of course, if you’d like to connect with Daniel, check out his blog. Once again, it is datadan.io. You can get all of the links to his other profiles there as well. Also, we’ll put links to all of the resources for today into the show notes, which you can find at www.superdatascience.com/61. And if you enjoyed today’s podcast and you’re listening on iTunes, then take a minute to rate this show and help us spread the word about data science into the world.
And on that note, thank you so much for being here. I really appreciate you taking the time and listening to this podcast. I can’t wait to see you next time. And until then, happy analyzing.