Podcasts SDS 285: Bringing Dev & Diverse Communities Into Data Science

76 minutes
Business, Data Science

A very real and very insightful chat with Jon Skeet about implementing best practices in data science and making sure this young industry is more aware of communication, community, and diversity.

About Jon Skeet

Jon is a software engineer working for Google from the London office, building .NET libraries to access Google APIs. He’s best known for answering a lot of questions on Stack Overflow, primarily on C#. He’s also interested in date and time APIs, and is the author of the Noda Time library for .NET.

Overview

Jon Skeet is the number one contributor on Stack Overflow, with over 34,000 questions answered on the site. For nearly 10 years he has visited Stack Overflow every day, and his profile has been viewed over 1.8 million times. While he is thrilled to have helped so many people, he notes that many other contributors have helped huge numbers of people too. Despite his “mythological reputation” as a perfect programmer, he writes bugs like everyone else and absolutely needs to consult documentation.

Jon has alternated between Java and C# professionally, and keeps up with C# in his free time. I’ve played around with C# before, when I helped my brother put together a Sudoku, although my favorite languages are C and C++. Before the podcast started, Jon and I sat down to discuss what would be most beneficial for listeners. Jon is an expert in C#, but his work in the community is possibly the most important part of what he does. So we focused on the importance of community and communication in data science, a field that is still young compared to software development.

On that topic, versioning is not something that data scientists seem to be particularly focused on when publishing work. Jon follows Semantic Versioning, which ensures not only that versions progress but also that anything built on a previous version can upgrade to a newer release within the same major version without breaking. There are nuances to it, however, such as source versus binary compatibility and conflicting dependencies. In data science it’s important to remember that you need to version not only the code but also the data it runs on. It’s an important organizational habit as you work through data, and you need to be aware of the dependencies in your libraries and within your code. I’ve always thought of data science as a creative and exploratory practice, where you build sand castles or work with clay to create your models, while programming feels more like working from a blueprint. Jon makes the argument, however, that languages like F# allow for an exploratory approach, and that F#, rather than C#, might work better for data scientists as a way to bridge that philosophical gap.

Jon’s expertise is diagnostics. Given a problem specified precisely enough for him to reproduce it, he can often find the solution. He notes that he could do that one hundred times, but he’s more interested in teaching one hundred people to diagnose problems so they can go off and help others. He calls divide and conquer the silver bullet of diagnostics: reduce things to the smallest possible size to isolate the problem, and don’t lose track of your goal.

Community is varied and vast. There are educational, somewhat impersonal communities like Stack Overflow, and then there are user groups where you can socialize with fellow developers, share information, and get to know other people. One point Jon stresses is inclusivity: toward people from all walks of life and toward newcomers. Community shouldn’t be competition.

In this episode you will learn:

Jon’s status at Stack Overflow [7:33]
Intro to community in data science [12:00]
Versioning [16:08]
Interpreted languages [24:43]
Diagnosing problems [36:00]
Jon’s “secret sauce” to preventing problems [45:30]
Community and Diversity [48:58]

Podcast Transcript

Kirill Eremenko: This is episode number 285, with top contributor on Stack Overflow, Jon Skeet.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur, and each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now, let’s make the complex simple.

Kirill Eremenko: This episode is brought to you by SuperDataScience, our online membership platform for learning data science at any level. We’ve got over two and a half thousand video tutorials, over 200 hours of content and 30-plus courses with new courses being added on average once per month. You can get access to all of this today just by becoming a SuperDataScience member. There are no strings attached. You just need to go to www.superdatascience.com and sign up there, cancel at any time.

Kirill Eremenko: In addition with your membership, you get access to any new courses that we release plus all the bonuses associated with them. Of course, there are many additional features that are in place or are being put in place as we speak, such as the Slack channel for members, where you can already today connect with other data scientists all over the world or in your location and discuss different topics such as artificial intelligence, machine learning, data science, visualization and more. Or just hang out in the pizza room and have random chats with fellow data scientists.

Kirill Eremenko: Also, another feature of the SuperDataScience platform is the office hours, where every week we invite valuable guests in the space of data science and interrogate them about their techniques, about their methodologies in the space of data science and you actually get a presentation from the guests and you get an opportunity to ask Q&A at the end.

Kirill Eremenko: In some of our office hours, we just present some of the most valuable techniques that our hosts think are going to be valuable to you. All of that and more you get as part of your membership at SuperDataScience, so don’t hold off. Sign up today at www.superdatascience.com. Secure your membership and take your data science skills to the next level.

Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen. Super excited to have you back here on the show today, and I’m super humbled by our today’s guest, Jon Skeet.

Kirill Eremenko: Jon has submitted almost 35,000 answers on Stack Overflow and his advice has reached an estimated 276 million people worldwide. That’s 276 million. Quite an insane number, if you take a second to think about it.

Kirill Eremenko: I just got off the phone with Jon and the podcast you’re going to hear is going to be very interesting. We had a great discussion and it is going to be a different perspective today. The reason for that is that Jon is not a data scientist, he’s a C# developer, an expert in C# and also some other programming languages.

Kirill Eremenko: Don’t let that scare you away, because, a couple of reasons. First of all, there’s a lot of similarities between data science and development. Both use programming and things like versioning, and diagnosing problems are common between the two, so we can learn quite a lot of things from Jon. The other reason why this is very relevant is because data science is more and more coming closer to product development, is being integrated more and more into products. Before, data science was just, let’s get some insights, let’s do some predictions.

Kirill Eremenko: More and more, we see that companies are integrating analytics, machine learning, artificial intelligence, data science, into their products. It’s highly likely that in your career, especially if you go and work in startups, for startups, or start startups yourself, you will encounter situations where you need to combine your data science knowledge with development knowledge in order to productionize data science. Therefore, already in this podcast, you can get a head start and understand how these two worlds meet and where they intersect.

Kirill Eremenko: Finally, the third reason is, maybe you are coming to data science from a world of development. Maybe you have some experience in programming languages like C# or compiled languages. It will be interesting for you to see Jon’s perspective on the world of data science.

Kirill Eremenko: All in all, a fantastic podcast, I really enjoyed our conversation. You’ll hear a lot of very valuable technical topics that we covered and also, at the end, we actually talked about the importance of community. What it means to be part of a community and how communities grow, which you can do as a data scientist, to make our community be more inclusive, more welcoming and prosper further. I think this is valuable, these are valuable insights for somebody who’s been heavily involved in the development community. These are valuable insights for data scientists and for us all to grow much faster and better and stronger as a community.

Kirill Eremenko: On that note, I can’t wait for you to check out today’s exciting podcast. Without further ado, I bring to you the top contributor on Stack Overflow, Jon Skeet.

Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen. Super excited to have you on the show, because I have Jon Skeet on the other line.

Kirill Eremenko: Jon, how are you going today?

Jon Skeet: Very well. Thank you. Very well.

Kirill Eremenko: Very, very nice to talk to you. Could you please remind me, what city are you calling from, from the UK?

Jon Skeet: I’m in Reading, which is just to the west of London.

Kirill Eremenko: Just to the west of London. Very cool. You said you’re having some fantastic weather these days?

Jon Skeet: Yeah, it’s been really nice recently. A few occasional downpours, but generally, we’re escaping from the normal British wet weather of a summer, so it’s very fine. The only downside is, by the end of the day, the shed from which I usually work is pretty warm.

Kirill Eremenko: That’s a good problem to have in the UK.

Jon Skeet: Yeah.

Kirill Eremenko: It was so cool to see your drums. That’s so awesome. That is very exciting. I wish you could … Maybe one day you can play something, and we can …

Jon Skeet: I think it’ll be quite a while before I’m even slightly good enough to play for anyone else. I only bought the drum kit a week and a bit ago. I’m practicing hard, but I’ve got a long way to go.

Kirill Eremenko: Fantastic. Well, so you are in Reading. How long have you been in Reading for?

Jon Skeet: Just over 20 years, actually.

Kirill Eremenko: Twenty years.

Jon Skeet: Straight out of university, I ended up working for Digital Equipment. That was in Reading and moved to my first house in Tilehurst, which is the sort of village near Reading. It’s a bit bigger than a village, but we tend to call it a village. I’ve moved within Tilehurst, but stayed basically there, even from before I was married.

Kirill Eremenko: Wow. Fantastic. You’re married now?

Jon Skeet: Yes. We celebrated our 20th wedding anniversary fairly recently. Yeah. Very, very happily married.

Kirill Eremenko: Wonderful. That’s so cool. Congratulations.

Jon Skeet: Thank you.

Kirill Eremenko: Jon, what fascinates me is that from a … Would you say Reading is a little place or a big place?

Jon Skeet: Reading is a very large town sort of bordering on being a city. It’s not officially a city, but I wouldn’t be surprised if in the next five or 10 years’ time it was given the official designation of city. It’s quite close to London and there are really good rail links that are improving over time, actually, so while a lot of people do commute from the outskirts of Reading into Reading, an awful lot of people also go from Reading into London to work in London.

Jon Skeet: But it’s great because it’s nice and close to London, so I can get to the office when I need to, and also go to see plays and musicals, which I love doing. But also, it’s really close to the countryside, so house prices are bad, but not awful, and I can get to the countryside nice and easily, and get to other places in the UK easily. It’s a really nice place to be.

Kirill Eremenko: Oh, fantastic. That is exciting. What I find very interesting is that from an almost city-sized town of Reading, which is very exciting that it’s growing, from the town of Reading, you have done something extremely unfathomable. You are the number one contributor to a little website called Stack Overflow. You have answered over 34,000 questions and you have reached over 276 million people. If I was wearing a hat, I would take it off for you right now. That is huge. Congratulations on that.

Jon Skeet: Thank you very much. Thank you, but it doesn’t feel that big a deal, because it’s sort of just something I’ve been doing for whatever 10 years now. I answer fewer questions than I used to, because there are fewer questions that sort of seem like they are appropriate for me to answer, but I do still … I go on there every day. I think it’s probably been nearly 10 years since I last missed a day on Stack Overflow, because I take my laptop on holidays and things.

Kirill Eremenko: Oh wow.

Jon Skeet: I manage to disengage from main work, but I do always like to keep an eye on what’s going on Stack Overflow.

Kirill Eremenko: That’s very impressive. Your profile has been viewed over 1.8 million times and it’s just incredible how you’ve contributed to so many people, such a great cause. How does that make you feel?

Jon Skeet: Obviously, I’m thrilled to have helped lots of people, but I think it’s worth bearing in mind that there are lots of other people who have helped huge numbers of people as well, and huge numbers of people who’ve helped just a few people. So, the cumulative effect is enormous. Now, I am privileged that being number one draws a certain amount of, potentially, undeserved praise. There is the sort of myth of Jon Skeet as this perfect programmer who never needs to consult any documentation.

Jon Skeet: In fact, just over the weekend, there’s been an interesting Twitter thread where someone, I believe a venture capitalist, gave his impression of a 10x software engineer who never needed to consult documentation, knew every line of code that had been deployed into production, and various things that I actually thought weren’t particularly positive for really empowering a team, a whole team, rather than one person to drive forward a product.

Jon Skeet: But there is this myth of me never writing a line of code that’s incorrect and all kinds of things, which I hope most people understand just is not reality at all. I am a pretty regular guy. I make bugs just like everyone else does. I kick myself after losing an entire day or two over something that turns out to be a tiny typo. I happen to have just gained this mythological reputation by just contributing a bit more than other people have on Stack Overflow. So yeah, it definitely doesn’t reflect reality, but I enjoy it at the same time.

Kirill Eremenko: Got you. Thank you. For our listeners, we’re going to set the scene. Jon, you’re an expert programmer in C#, correct?

Jon Skeet: That’s right. Yes.

Kirill Eremenko: C#.

Jon Skeet: I have loved C# since it started. I think I played with some of the betas before it went to a general availability, 1.0, in 2001, 2002. I’ve been working with it, either professionally or on an enjoyably amateur status, ever since then sort of. I’ve alternated between working with Java and working with C# professionally, but whenever I’ve been working just with Java professionally, I’ve kept up with the C# in my free time.

Kirill Eremenko: Fantastic. I also have played around with C#. I was helping one time, my brother created a Sudoku for [inaudible 00:12:52] Salmons in C#, which was fun. I totally love … My favorite language is C, I would say and then C++, because of its object-oriented nature. C# is fantastic as well, although I don’t know it really well.

Kirill Eremenko: What I wanted to do is, before the podcast, this works better for our listeners, Jon and I sat down and actually discussed what we were going to be talking about, because as you can imagine, while C# can be relevant to some data scientists and can be used to deliver, deploy, develop data science applications in some cases, in most cases it’s not our language of choice. You might be surprised, what are we going to be talking about with Jon if he’s an expert in C#?

Kirill Eremenko: If you hear some notes about C# in this podcast, and you are interested in C#, that is awesome. That is for you, but at the same time, what we’re actually going to be focusing on with Jon is the importance of community and importance of what it is like to be in a tech profession. Because there are lots of similarities between development and data science, and through his work on Stack Overflow and through his exposure to the community of developers on Stack Overflow, and this is, in general, a community that’s helping each other out, it will be very interesting to gain some insights. Because the data science community, as far as I know, is not that old, not as old as the development community, so maybe there are some takeaways that we can apply to the data science community and to how we interact with each other. That’s what I expect we’re going to focus on, but you never know how the conversation is going to go.

Jon Skeet: Absolutely.

Kirill Eremenko: It’s going to be fun.

Jon Skeet: My experience is that, well, certainly podcasts [inaudible 00:14:33] tend to meander away from what we expected [crosstalk 00:14:36]. Often, including things around versioning or dates and times, which are two other topics that I’m fairly passionate about. I suspect that we’ll find, in the course of this discussion, that there are various touch points where the problems that the data science community face are similar to the problems that the more regular programming community faces. There will be various similarities and, hopefully, a few differences we can note and sort of learn from each other, new approaches that we can take.

Kirill Eremenko: Totally. Totally, even this one that you mentioned, the versioning, that is such an important thing. In data science, we don’t have … maybe in the silos and in certain companies, maybe there are certain frameworks that are coming out where there’s very rigorous, methodical system for versioning. But, overall, when somebody starts the data science, that’s the last thing they probably learn.

Kirill Eremenko: They learn about machine learning and so on, but they don’t have this habit of versioning files. Like I, through my work at Deloitte, where they have very specific ways to version anything, like I even version my, I don’t know, tax documents, PowerPoints, they all have like version 1.1, version 1.2, 3.7. Everything I create almost always has a version. Whereas in data science, I don’t think that’s the case. Tell us about the importance of versioning in development.

Jon Skeet: Within the .NET community in particular, we’ve adopted SemVer, Semantic Versioning, which is not .NET specific and is fairly widely used within programming where an artifact, whether that’s a library, an application, whatever, but probably something that other people will depend on, they need to know how it’s going to behave, that gets a three-part version number. It has a major, minor, and patch version and also, potentially, some other information like a -beta01 or whatever that says, “This is pre-release and can change sort of arbitrarily.” But then, if you say, “I’m following Semantic Versioning,” that means that, if I’ve published a 1.0.0 of something, then if I publish something else within the same major version number, then it should be backward compatible. If I publish a 1.1.0, then anything that was previously using 1.0.0 should be able to upgrade to 1.1.0 without being broken.
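A minimal Python sketch of the upgrade rule Jon describes here, together with the patch-level rule he adds a little later. The names `Version`, `parse` and `can_upgrade` are made up for this illustration and do not come from any real library; the regular expression only covers the simple `major.minor.patch[-prerelease]` shape.

```python
import re
from typing import NamedTuple, Optional


class Version(NamedTuple):
    major: int
    minor: int
    patch: int
    prerelease: Optional[str] = None


# Hypothetical, simplified pattern: three numbers plus an optional "-label".
_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


def parse(text: str) -> Version:
    """Split a version string such as '1.1.0-beta01' into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a semantic version: {text!r}")
    major, minor, patch, prerelease = match.groups()
    return Version(int(major), int(minor), int(patch), prerelease)


def can_upgrade(current: str, candidate: str) -> bool:
    """Return True if moving from current to candidate should not break callers."""
    old, new = parse(current), parse(candidate)
    if old.prerelease or new.prerelease:
        return False  # a pre-release may change arbitrarily
    if old.major != new.major:
        return False  # a new major version may contain breaking changes
    # Same major: a higher (or equal) minor version must stay backward compatible.
    # The patch number is ignored because patches should work in either direction.
    return new.minor >= old.minor


print(can_upgrade("1.0.0", "1.1.0"))         # True
print(can_upgrade("1.1.5", "1.1.4"))         # True
print(can_upgrade("1.1.0", "2.0.0"))         # False
print(can_upgrade("1.0.0", "1.1.0-beta01"))  # False
```

The check only encodes the promise a publisher makes by following Semantic Versioning; whether a given release actually keeps that promise still has to be tested.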

Kirill Eremenko: It should be able to use 1.1.0 without you changing the code of that thing that’s using these?

Jon Skeet: Exactly. There are different levels of compatibility. One thing would be, and this depends on your programming language and environment and things, but in something like C#, which is compiled, so there’s a separate compilation step that happens long before execution, there can be different things. You may say, “Well, it’s source compatible,” so your code that built against 1.0.0 can still build against 1.1.0. There’s the other aspect of binary compatibility, which is, “Well, I compiled this code against 1.0.0, but actually at execution time for whatever reason I am loading 1.1.0 of the library,” and that should still be okay as well. Then you get patch versioning, where you should be able to go backwards or forwards in time. If I could build against 1.1.5, I should also be able to build against 1.1.4, so it’s sort of forward compatibility as well as backward compatibility.

Jon Skeet: Then you get into really difficult problems, where you’re writing an application, and you depend on one library that depends on another library at version one, but you want to depend on that same library at version two, and those aren’t necessarily compatible with each other at all. There could be all kinds of differences and, certainly, in .NET that causes a problem, because while some aspects of the execution environment can handle multiple versions being loaded at the same time, a lot of the tooling doesn’t support it. I wrote a blog post on that fairly recently, saying, “Hey, we need to get better at this.” I don’t know how many dependencies and what level of that sort of versioning problem you have in data science. I would say the most important thing isn’t even versioning in terms of making sure that everything has a number, but at least keeping a versioned history of things, whether that’s in Git or in Subversion or some other source control, so that you can get back to, “Oh, I know I had a working version a few days ago. Let me have a look at that.”

Jon Skeet: I’ve been to some machine learning talks and done sort of workshops, but don’t have significant experience. I can definitely imagine the importance of keeping a log of, “Well, I tried this and this was the result.” That sort of goes on to another topic that I’m absolutely passionate about within programming, which is, how do you diagnose problems? A lot of that is making sure that you can keep a log of exactly what you did and exactly what the result was, and being clear enough about that without spending hours and hours doing it. I would imagine that’s a skill that data scientists sort of pick up naturally, because it feels like it’s probably closer to one of your core competencies. I would love it if the data science community could try to teach the programming community about how to keep good logs of what happened when you tried things.

Kirill Eremenko: That’s fantastic. Before we dive more into diagnosing problems, I wanted to also mention that for data scientists, there’s a very specific component that also needs to be remembered. That is, you not only need to version the code that you’re creating, but you also need to version the data that you’re using to train.

Jon Skeet: Absolutely, yes. The same data set behaving differently under different versions of your code versus different versions of the data behaving differently under the same version of your code.
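One way to keep track of which data a result came from is to store a fingerprint of the data file next to the code version in every run record. The sketch below is only an illustration under stated assumptions: `fingerprint` and `run_record` are hypothetical helper names, the tiny CSV is made up, and the scores are placeholders rather than real results.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_record(code_version: str, data_path: Path, score: float) -> dict:
    """Bundle the code version, the data fingerprint and the outcome of one run."""
    return {
        "code_version": code_version,
        "data_sha256": fingerprint(data_path),
        "score": score,  # placeholder value, not a real measurement
    }


with tempfile.TemporaryDirectory() as folder:
    data = Path(folder) / "train.csv"

    data.write_text("x,y\n1,2\n2,4\n")
    first = run_record("1.0.0", data, 0.91)

    # Someone changes the data; the code version stays the same.
    data.write_text("x,y\n1,2\n2,5\n")
    second = run_record("1.0.0", data, 0.87)

print(json.dumps(first, indent=2))
print(first["data_sha256"] != second["data_sha256"])  # True
```

With both fields recorded, a changed result can be traced to a change in the code, a change in the data, or both. A fingerprint only detects that the data changed; going back to the old data still needs a stored copy.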

Kirill Eremenko: Exactly, exactly. That’s another moving part in the equation.

Jon Skeet: Right.

Kirill Eremenko: I love that. There is that similarity of versioning, but there’s a difference in that data is such a crucial part of what data scientists do. Even more, you need to not just say what kind of data it was but preferably have a backup of that data, because maybe that data is not in your control. Maybe you’re getting it from a server where somebody might change it and then you’re completely stuck. You know?

Jon Skeet: Right.

Kirill Eremenko: You have no way … It’s important, right, in versioning to be able to go back to the previous version in case the new version is broken.

Jon Skeet: Yeah, and to know which version you did things against. We seem to be, whether I am driving it to topics that I’m interested in or not, there’s something similar in programming: many people are unaware that they’re depending on versioned data with time zones.

Kirill Eremenko: Oh wow, it’s a good one.

Jon Skeet: Many people assume time zone rules just stay the same forever: “Well, you go into daylight saving time at this date and then you come out of daylight saving time at this other date, and the rules are set.” But no. The rules change several times a year, and I don’t mean because things go into or out of daylight saving time, but a country might decide, “We’re not going to have daylight saving time anymore.”

Kirill Eremenko: Yeah.

Jon Skeet: In fact, the European Union at the moment is deciding, I think in principle it has been agreed that from 2020, I think, countries will have the option of no longer using daylight saving time, so everyone who has recorded some data that is time zone aware in some form or other, they have recorded it with, presumably, the current version of time zone data that they were using at the time. But I’d be very surprised if more than 1% of developers actually recorded, “Yes, I was using IANA time zone data 2016j, or whatever it is.” The rules that we knew about at the time, which predict future and past things, are just an aspect of versioned data that people don’t expect to be versioned.
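To make the point concrete, here is a small Python sketch in which the same recorded local time maps to two different instants depending on which set of rules is applied. Everything in it is hypothetical: the zone name `Example/City`, the labels `rules-A` and `rules-B`, and the fixed offsets stand in for two releases of real time zone data, which would normally come from the IANA database rather than a hand-written table.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule sets for one zone on a summer date:
# under rules-A summer time is still observed, under rules-B it was abolished.
RULES = {
    "rules-A": timezone(timedelta(hours=2)),
    "rules-B": timezone(timedelta(hours=1)),
}


def to_utc(local_wall_time: datetime, rules_version: str) -> datetime:
    """Interpret a naive local time under the given rule set and convert to UTC."""
    zone = RULES[rules_version]
    return local_wall_time.replace(tzinfo=zone).astimezone(timezone.utc)


# The stored record keeps the rules version next to the local time.
event = {
    "local": datetime(2021, 7, 1, 9, 0),
    "zone": "Example/City",
    "tz_rules": "rules-A",
}

as_recorded = to_utc(event["local"], event["tz_rules"])
under_new_rules = to_utc(event["local"], "rules-B")

print(as_recorded.isoformat())      # 2021-07-01T07:00:00+00:00
print(under_new_rules.isoformat())  # 2021-07-01T08:00:00+00:00
```

Because the record says which rules were in force when it was written, a later reader can tell whether the one-hour difference is a real change in the data or only a change in the rules used to interpret it.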

Kirill Eremenko: Yeah, I totally agree. Even Russia had this a few years ago when they stopped using daylight saving times for a few years and now they’ve started back using it or something like that and try keeping track of all those things. That has a massive impact. Your analysis can be completely wrong, especially, you’re doing something, I don’t know, for example, on the data relating to financial markets. Bam, all of a sudden it’s not 8:00 AM, it’s 9:00 AM or it’s not 7:00 AM, it’s 8:00 AM.

Jon Skeet: Absolutely. Yes, yes, it really matters. It also matters how quickly you can get updated data, because some countries don’t give much warning at all that they are changing their rules. Literally, sort of, there have been countries that were about to go into daylight saving time and announced the day before, “No, we’re not going to do that.”

Kirill Eremenko: Wow.

Jon Skeet: I had colleagues who were going through airports and half the monitors in the airport said one time and half the monitors said a time an hour later. Of all the places that you really, really want to be sure what the time is, an airport is absolutely one of them.

Kirill Eremenko: Oh, that’s crazy. That’s crazy. Okay. You mentioned a really interesting topic, which I love. Diagnosing problems.

Jon Skeet: Right.

Kirill Eremenko: Code is code, whether you’re coding in a … Oh, by the way, can you tell us quickly, you mentioned C# compiled language, Python, on the other hand, interpreted language. What’s the difference?

Jon Skeet: Yes. I believe that even with Python, there can be compiled-ish versions, but to be honest, I don’t know very much about Python. Where I give opinions on Python at any time in this podcast, please treat them with a grain of salt, a very, very large grain of salt, but with a compiled language like C#, you take the source code and you feed it, and any libraries that you depend on, into the compiler and the compiler outputs a file, which contains a binary representation. Now, for compiled languages like C and C++, that compiled representation is pretty much machine code that can be executed directly.

Jon Skeet: For C#, it’s something called intermediate language, which is roughly equivalent to Java bytecode, if any of your listeners are familiar with that. Again, Java is a compiled language: you get out class files that are in this bytecode format that the JVM, the Java Virtual Machine, knows how to run. It gets even more complicated, because both Java and .NET almost always take that compiled binary format, not your original text source code, and then do something called JIT compilation, which is Just In Time compilation.

Jon Skeet: They take that sort of nearly machine code and turn it into actual machine code, so they don’t need to go through that translation step several times. That happens at execution time, but there’s this first bit where you get to check that all your source code actually makes sense.

Kirill Eremenko: Got you. Got you. In summary, in some cases, such as C and C++, the code compiles straight to a sort of file that can be run. In the case of Java and C#, it is first compiled to an intermediate file, and that helps find any errors at compilation, among other benefits, of course.

Jon Skeet: Right.

Kirill Eremenko: Then a second step, Just In Time compilation, is required, so you can run on multiple architectures, again, in addition to other benefits as well.

Jon Skeet: [inaudible 00:26:59] efficiently. There are ways of compiling, certainly, C# and I believe Java with ahead-of-time compilation, which is sort of doing the JIT compilation bit of bytecode into machine code, doing that ahead of time instead of when it started to run as well, so there are lots of different options.

Kirill Eremenko: Got you, got you. On the other hand we have interpreted languages such as Python. Any comments on that, what’s the difference?

Jon Skeet: In theory, if you take your very simplest idea of an interpreted language, you have this interpreter just like you have a Java Virtual Machine, but instead of working with the bytecode, it’s working with the source code, so it runs, and it maybe reads your whole source file into memory, but then it looks at one line at a time and says, “Right, what does this line mean? I will execute the code that’s in that line, and by execute I have to understand what it means.” If it’s something like, let X = Y + Z, then it needs to parse all of that and understand what it means, and then say, “Okay, now I need to load the value of Y, load the value of Z, add them together, save them in variable X.”

Jon Skeet: Now, the very simplest kind of interpreter, if you have that code in a loop, would be looking at that line saying, let X = Y + Z every single time, and have to understand it. Now, that is massively inefficient. Everything would run far too slowly to be useful. More modern interpreters might store something almost like the IL or the bytecode representation of that, so that it doesn’t have to do the textual parsing every time, or they might actually do something like the JIT compilation. Even though it’s sort of interpreted, I think very few languages are genuinely interpreted the whole time these days, because we’ve got good at doing things in, well, that’s JavaScript here, V8, et cetera.
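The difference between re-reading the text on every pass and keeping a parsed form can be sketched in Python itself, using the built-in `compile` and `eval` functions. This is only an illustration of the idea, not a description of how any particular interpreter is implemented; the function names `naive` and `precompiled` are made up for the example.

```python
source = "y + z"  # the right-hand side of "let X = Y + Z"
namespace = {"y": 2, "z": 3}


def naive(times: int) -> int:
    """Hand the raw text to eval on every pass, so it is parsed again each time."""
    total = 0
    for _ in range(times):
        x = eval(source, namespace)
        total += x
    return total


def precompiled(times: int) -> int:
    """Parse the text once into a code object, then only execute it in the loop."""
    code = compile(source, "<expr>", "eval")
    total = 0
    for _ in range(times):
        x = eval(code, namespace)
        total += x
    return total


print(naive(1000) == precompiled(1000))  # True
```

Both functions compute the same result; the second simply avoids repeating the textual parsing step inside the loop. Timing the two with the standard `timeit` module is one way to see how much that step costs on a given machine.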

Jon Skeet: The difference between static versus dynamic languages and compiled versus interpreted, they are different things, but often go hand in hand. Static languages tend to be compiled, dynamic languages tend to be interpreted. But the difference in execution time between those two sort of extremes has gone down massively over time, because we’ve got a lot better at dealing with interpreted languages.

Kirill Eremenko: Yeah. One of the differences that somebody programming with these languages would see, and this is quite important by the way for data scientists, because more and more data science is becoming not just, “All right, let’s do some analytics,” it’s becoming more product-oriented. Like in certain startups, data science is embedded into the product, so you will encounter-

Jon Skeet: Absolutely.

Kirill Eremenko: Yeah, you will encounter times when you will, especially if you’re going into the startup world or developing new products, you will encounter situations where you will need to work with compiled languages. The difference in what you observe can be that, if you’re typing some code in Python and then you run it, it will run. For instance, you have, let’s say, have 100 lines of code and you have an error in line 50, it’ll run the first 49 lines and actually execute them. When it gets to line 50 it will give you an error.

Kirill Eremenko: In a compiled language, when you try to compile that, it will give you an error and none of those lines will be run, so it’s important to understand that if you have some, for instance, data manipulation, data cleaning, some pre-processing in the first 49 lines of code, in one case they will be executed and your data will change. Whereas another case they won’t be executed, because you won’t be able to compile the file. I think that’s quite an important, quite radical, difference for people to understand that, not only it’s about what you see, but also the effect that it can have in the background on anything that you were doing before that error occurred.
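
The side effect described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anything from the conversation: `run_script` is a made-up helper, and the two short scripts stand in for the 100-line file. One nuance worth hedging: Python compiles a whole file to bytecode before running it, so a syntax error stops everything up front, while a run-time error such as an undefined name only surfaces once execution reaches that line.

```python
def run_script(source, data):
    """Run source against data and report how far it got."""
    try:
        # Python checks the syntax of the whole script before running any of it.
        code = compile(source, "<script>", "exec")
    except SyntaxError:
        return "rejected before running"
    try:
        exec(code, {"data": data})
    except NameError:
        # Earlier statements have already run and changed data.
        return "failed part-way through"
    return "ok"


# A run-time error on the second line: the sort on the first line still happens.
runtime_error = "data.sort()\ndata.append(missing_name)\n"

# A syntax error on the second line: nothing runs at all.
syntax_error = "data.sort()\ndata.append(\n"
```

With `data = [3, 1, 2]`, the first script leaves the list sorted even though it fails, while the second leaves it untouched, which is closer to what a compiler for a statically typed language would give you for both kinds of mistake.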

Jon Skeet: Absolutely. Personally, I would like to see more support for static languages and, obviously, I would love to see C# used more in machine learning and data science in general. If I knew more about data science, I might be in a place to help that along. As it is, I’m almost entirely ignorant, so I don’t know how we would do that, but it does come back partly to the aspect of interpreted languages are often also used interactively. My understanding is that a lot of data science is done via Jupyter notebooks and the like, where you’re exploring things as you go. It’s not like you write all of your 100 lines of code and then run it and then find that there’s the problem, but you’ve built that code up over time by trying things interactively, and that’s where statically typed and compiled languages tend to, and this is always sort of caveat of, tend to, there are exceptions, tend to not deal very well with being done interactively.

Jon Skeet: You tend to have to do things by creating your source file beforehand. Now, that’s not always the case, and this may be how we build C# support for data science, or data science support for C#, depending on which way you want to think about it, is by allowing C# to be run more interactively. There are definitely projects available for that sort of thing already. C# scripting, approaches and ways of running C# in a browser and the like. Maybe there will be really good Jupyter notebook support in the future. I am sure that there have been some projects that explore Jupyter within C# already, but they haven’t gained the sort of traction that we’d need to see more mainstream support. But I think the benefits, as you were saying, of not running those first 49 lines of code before you find the error at line 50, there are significant benefits of that, so I would love to see more support for C# within data science. I just wish I could help with it, but I just don’t have the knowledge to do so.

Kirill Eremenko: Yeah. It’s really a difference in philosophy, isn’t it? For me, when I think of data science versus programming and compiled languages, data science, like you said, is very explorative in nature and even if you’re not just looking for insights, you know you want to build the model, it still requires exploration of different approaches during the process. For me, the way I imagine it, the analogy, is like building a sandcastle. You are trying this out, this falls over, you put a new tower on, then you put the wall and then the water washes it away. You build it again and so on. It’s like always you’re playing with clay or sand, this type of creative approach.

Jon Skeet: Right.

Kirill Eremenko: Whereas in programming, especially in the world of C# and more in the compiled languages, it was a long time ago for me, so you’re much better placed to draw the analogy here, but does it … It almost feels like you have a blueprint of what you want in advance. It’s like you’re building a castle, not out of sand, but of little bricks or a Lego piece.

Jon Skeet: To some extent. To some extent. With more test-driven approaches, it’s often, well, you write the test and then you make sure that that’s implemented. I don’t want anyone to get the impression that you write the whole application and then you can run some of the code. You can still do things iteratively, but it’s less interactive iteration. Now, it can feel somewhat close to it if you get a really tight test run, write some code, run the test again, et cetera, when you can get that loop fairly tight, it can be pretty good. I would want to mention F# at this point.

Jon Skeet: F# is a functional language which is still compiled to IL. You can inter-operate between F# and C# and other IL languages, VB, et cetera, but F# was designed from the start to support this interactive exploratory mode. My understanding, not as an F# developer, is that a lot of F# work does happen in that exploratory mode like data science, so maybe actually thinking about what would be a good language for data science in a compiled statically typed way would be F# rather than C# and maybe we can build on that F# work to also support C# over time. My understanding is some data scientists do already use F#.

Kirill Eremenko: No, I’m very interested. I didn’t know about F#. That’s very exciting. Okay. Let’s move on to something you touched on, how you diagnose problems. Are there any best practices of problem diagnosis that, like in code, that you can share with data scientists, because code is code. Even though it’s interpreted or compiled, whatever, it’s still a very, in any country, this was what I love about coding.

Kirill Eremenko: You can know how to code in Europe, you can know how to code in Africa, you can go then to China and code there. Code is pretty much the same around the world. Are there any best practices, something you’ve developed throughout your career, that you can share on how to find those errors in the code? Sometimes, errors, they don’t even pop up as an error, but it’s there.

Jon Skeet: Right. Yes. There are so many different sort of categories of error. There are errors that you find at compile time that you don’t understand why this doesn’t work, why it won’t compile, and they’re relatively easy, generally. There are errors where things go bang, with exceptions, at execution time and they can be reasonably easy to find and fix. There are errors where, “My code all runs, it just produces the wrong output,” and that’s where things start getting harder. Then it gets really hard with, “My code runs and produces the right outputs on my machine, but the wrong output in production,” and that’s fairly hard.

Jon Skeet: Then it gets even harder with, “The code runs and always produces the right output on my machine and 99% of the time it produces the right output in production as well, but just occasionally it’s very, very slightly wrong.” Diagnosing those errors can be really hard. We should probably timebox this almost, because I can talk about diagnostics for a very, very long time and I’m hoping, eventually, to write a book about how to get into diagnostics, because this feels like the silver bullet that I’m being, without trying to be too immodest, I’m pretty good at diagnosing things and that is the reason that I am able to help people on Stack Overflow.

Jon Skeet: You give me a problem and so long as you have given it to me in a sufficiently well specified way, ideally so that I can reproduce the problem, then I can apply the diagnostic steps and help you get to an answer. Now, I can do that, but if I have to help 100 people that way, then I have to go through the diagnostic steps 100 times. Obviously, it’s far more efficient if I can improve the level or help improve the level of diagnostic skills throughout the community and then each of those 100 people could have diagnosed the problem themselves and fixed it themselves. Now, that’s-

Kirill Eremenko: Got you. Well, this is a great opportunity to try out what you are going to share in the book.

Jon Skeet: Right. That’s a simplification. There are times where you need more knowledge than you have, et cetera, but I think the silver bullet in diagnostics is divide and conquer. You have a whole application that might involve some XML parsing, producing some JSON, it’s interacting with a database, it’s interacting with a web service. It’s got a web front-end. All these things and something is wrong. The first thing to do, in my experience usually, is try to isolate where the problem is.

Jon Skeet: If you can reproduce it without doing any XML parsing, then the problem, probably, isn’t in the XML parsing. It may well be that that has problems as well. One of the fascinating areas in diagnostics is where you happen to notice other problems as you’re tracking down one major thing, but it’s really easy to be sidetracked by, “Oh, it turns out that’s broken, and the third thing is broken. Everything’s broken,” and trying to work out-

Kirill Eremenko: Or when you have two problems that cancel each other out.

Jon Skeet: Yeah, that’s even worse. You think, “Well, why is that code doing that? I will fix that as I go along. Oh no, that’s now caused another problem,” and making sure that you keep some kind of record of, “Well, I need to go back at some point and fix those things,” but without losing track of where you are on your main, “This is the problem I’m trying to fix.”

Kirill Eremenko: Divide and conquer.

Jon Skeet: Try to reduce things … Sorry?

Kirill Eremenko: Divide and conquer.

Jon Skeet: Yes. Divide and conquer. Try to reduce things to the smallest program that you can find that demonstrates the problem. Not just the smallest in terms of source code size, but the smallest in terms of environment and the friendliest in terms of environment. I’m a big, big fan of console applications, not necessarily for running them. I do like building tools that are console applications that do something useful, but if I’m trying to diagnose a problem in a web application, but I don’t think the problem is in the webiness of it, but yeah. Say, something I will be diagnosing today is trying to make calls to a diagnostic service, a Google API diagnostic service from a web application.

Jon Skeet: Now, that probably is going to depend on the webiness of things, because it goes into the logging framework, but if in the same application I found that I had a problem talking to our speech API, for example, then that wouldn’t be specific to the web application. I’ll probably pull that out into a separate console application, hard code the data rather than taking it from the user, because it’s then easier to share the program and easier to reproduce without having to type in the inputs every time and then I’ve got a console application that’s 30 lines long or something that doesn’t behave as I want it to. That’s a really small amount of code to debug.

Jon Skeet: I can launch a debugger for a console application really easily without having to work out all of the intricacies of setting up the debugging for a web application, which … It’s not very hard these days, but it’s just extra steps, so you’re trying to get this as small as possible, as simple as possible. If you think about the amount of code that’s involved in a console app compared with a Windows GUI or a web app or a mobile app, all of these things, if you can get it into some really simple form, then it makes things so much easier to work with. Sometimes, you will still need to debug into it and you try lots of different things, but often just having simplified things, having separated out all the 99% of code that doesn’t matter, so that you can focus on the 1% that does matter.

Jon Skeet: Suddenly, it becomes really obvious and you think, “Well, why didn’t I see that before?” Well, because you had all this other code that was potentially wrong but turned out not to be. So yeah, it’s isolating the problem and knowing how to say, “Okay, for the moment, I will hypothesize that the problem isn’t in the XML. I will just hard code the data that would normally be parsed from the XML. Maybe I will hypothesize that it’s not in the formatting of the JSON, so I will hard code the JSON output,” or whatever it is. Getting rid of those dependencies is the big thing for diagnostics within programming.
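
The divide-and-conquer idea can be sketched in Python, the language most of this audience works in; Jon’s own version would presumably be a small C# console application. Everything here is a made-up illustration: the three-stage pipeline, the function names, and the sample ages are assumptions, not code from the conversation.

```python
import json
import xml.etree.ElementTree as ET


def parse_ages(xml_text):
    """Stage 1: read ages out of XML such as <people><person age="30"/></people>."""
    root = ET.fromstring(xml_text)
    return [int(person.get("age")) for person in root.findall("person")]


def average(values):
    """Stage 2: the calculation we suspect is misbehaving."""
    return sum(values) / len(values)


def to_json(mean_age):
    """Stage 3: format the result as JSON."""
    return json.dumps({"mean_age": mean_age})


def full_pipeline(xml_text):
    # The real application: all three stages chained together.
    return to_json(average(parse_ages(xml_text)))


def isolated_check():
    # Hypothesis: the XML stage is fine, so hard-code what it would have
    # produced and skip the JSON stage entirely. Only average() is exercised.
    hard_coded_ages = [30, 40, 50]
    return average(hard_coded_ages)
```

If `isolated_check()` still gives the wrong number, the bug is in the calculation; if it gives the right one, the hypothesis about the XML or JSON stage was wrong and that stage is the next thing to isolate.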

Jon Skeet: Now, I don’t know how well that sort of transfers into data science. Maybe you can simplify your data sets. If you’ve got an enormous dataset and it’s giving you some strange results, I can see there being significant problems in saying, “Well, I will take a much smaller part of the dataset.” Whether that’s fewer data points or removing half the features and saying, “Well, I’ll only concentrate on the people’s name, address and age or something and see whether I still get the same results.” I would imagine that everything’s so intricately bound in data science models, that you could even easily take the wrong steps there, but, hopefully, some of it transfers.

Kirill Eremenko: No. No, definitely. It’s very valuable advice to divide and conquer. The way I see is, basically, you have lots of degrees of freedom of where the error could be, try to cut as many of them off as possible.

Jon Skeet: Absolutely.

Kirill Eremenko: To lock them in. In terms of data science, again, I wouldn’t remove the features, but in terms of limiting a dataset, that is actually a quite common practice to develop a model with 10% of your dataset and then only expand to 100% or whatever you have, 50%, later on. That could, totally, be applicable.

Kirill Eremenko: Of course, every situation is different and it’s case-by-case basis, but having this philosophy in mind of, “Okay, I have an error, it’s quite a large code. All right, let’s hard code certain things into it. Reduce the degrees of freedom and try, and reproduce the error,” I think you’re totally right. That’s your silver bullet.
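
As a rough sketch of developing against a fraction of the data, the helper below draws a reproducible 10% sample using only the Python standard library. The name `sample_rows` and the fixed seed are assumptions made for this illustration; in practice a data-frame library would offer its own sampling method.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset containing about `fraction` of rows."""
    rng = random.Random(seed)  # fixed seed so the same subset comes back each run
    count = max(1, round(len(rows) * fraction))
    return rng.sample(rows, count)


rows = list(range(1000))          # stand-in for a real dataset
subset = sample_rows(rows, 0.10)  # develop and debug against 100 rows
```

Fixing the seed matters for diagnosis: if the strange result shows up on the subset, it will show up on the same subset every time you rerun the code.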

Jon Skeet: Yeah.

Kirill Eremenko: Awesome. Okay. Thank you. That’s a great tip on diagnosing problems. Any other comments? How do you not make the errors in the first place?

Jon Skeet: Oh-oh.

Kirill Eremenko: Or how do you … When you’re coding-

Jon Skeet: If I knew how to do that [inaudible 00:45:14].

Kirill Eremenko: Maybe there’s some best practice, like you code 10 lines or 100 lines and then you review them or you … I don’t know. What’s something that you’ve developed? What’s your secret sauce?

Jon Skeet: The two best practices that spring to mind are very far from secret sauce and are pretty widely used these days are code review and tests. If I come to a code base that doesn’t have any tests, that scares me, because I don’t know … If I make a change, I don’t know whether that’s going to have some adverse effect on some other bit, but a well tested code base, ideally, with different kinds of tests. There’s a certain amount that you can do with unit tests, where you’re not interacting with anything external. You try to test one piece of code in isolation from everything else. Something that’s only got unit tests, okay, that’s not so bad, but I also generally like to see integration tests.

Jon Skeet: Where in unit tests you tend to fake out external dependencies, I will assume that my database behaves in this particular way, and so I’ll fake out the interaction between my code and the database. Well, that’s fine so long as I’m right about the assumption. I want to see some integration tests as well, which use the actual database to say, “Well, what happens when I really, really try to do this against the database,” or against the web service or whatever it is? Those tend to be more expensive, either in terms of resources and time to set up the test, time to execute the tests. They may be actually financially expensive if you have to call some API.

Jon Skeet: I might have tens of thousands of tests that are unit tests, because they’re essentially free, but if I’m calling some translation API and I want to call that 10,000 times within my tests, then that’s going to take a long time, because it takes a lot longer to make any kind of network call than to do stuff in memory, and if I’m doing that an awful lot, then I may be billed for those translations that actually don’t end up being useful to me other than to verify the code. I would want to be able to run all the tests frequently, even if I haven’t made changes in half the things, so to some extent, I’m not getting much value for all those API calls that I’m making.

Jon Skeet: That’s why you want to have a good balance between unit tests that are free but sort of limited usage, versus the more expensive in time or billing or whatever it is, integration tests that test far more of the system. I would say, integration tests when they fail, you’ve then got a significantly bigger job to diagnose why they’re failing, because you’ve got a lot more surface area, you’ve got more degrees of freedom as you put it. Whereas, the unit test is generally only testing one of those degrees of freedom. If something goes wrong, you immediately know, “Well, it must be in that bit.” There are different pros and cons for those different kinds of tests, but they definitely help to reduce how many bugs will get into my code, as well as code review.
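
The unit-test side of that balance might look like the following Python sketch. The translation service is entirely hypothetical: `FakeTranslator`, its `translate` method, and the `greet` function are invented here to show the shape of a fake, not the interface of any real API.

```python
class FakeTranslator:
    """Stands in for a billed translation API (hypothetical interface)."""

    def __init__(self):
        self.calls = 0

    def translate(self, text, target):
        # No network call and no bill: return a predictable marker string.
        self.calls += 1
        return f"[{target}] {text}"


def greet(translator, name, target):
    """Code under test: builds a greeting and asks the translator for it."""
    return translator.translate(f"Hello, {name}", target)


def test_greet_unit():
    # Unit test: exercises greet() alone, assuming the API behaves like the fake.
    fake = FakeTranslator()
    assert greet(fake, "Ada", "fr") == "[fr] Hello, Ada"
    assert fake.calls == 1
```

An integration test would pass a real client into `greet` instead of the fake. It would catch a wrong assumption about how the service behaves, at the cost of network time and possibly a charge per call, which is why you would run far fewer of them.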

Kirill Eremenko: Fantastic. Thank you so much. This part of our discussion, I think, has been very useful, especially for data scientists who want to get an edge in terms of being prepared for product development and integration of data science into product. I think that’s a big thing to put on your resume.

Jon Skeet: Right.

Kirill Eremenko: [inaudible 00:48:49] to this. They can totally do that. In the interest of time, I only have about 10 minutes left. I do want to talk a bit about communities.

Jon Skeet: Absolutely.

Kirill Eremenko: This is a very important topic. We actually ran a survey recently among our students. We have about 850,000 students on Udemy studying data science and close to 100,000 on SuperDataScience. This survey, I think, went out just to the SuperDataScience community, or SuperDataScience students. One of the biggest, I was actually talking to our business development manager today and he said one of the biggest insights that we got from the survey is that people want a community. That, right now, the way data science is structured, the way people are learning it, it’s a very hot topic. Data scientists want to learn, data scientists are needed in the job market and people are learning.

Kirill Eremenko: But one of the things that still lacks at the moment is that, whilst there are courses and the knowledge is out there, and we are one of the providers of this knowledge, one of the things that we could do better is we could create a community for people to interact to have some kind of feedback loops, have some friends, have some buddies, have some conversations, interesting talks with people and so on and learn from each other, support each other, mentor each other. That’s a big thing. In the development world, as we discussed at the start, community has naturally evolved, and it’s already been around for longer than in the data science world. Tell us a bit about the development community. What is it like and why do you enjoy being part of it so much?

Jon Skeet: The first thing to say is that, it’s not like there’s one development community. I’m sure you know that already, but it’s important to emphasize because, not only are there lots of different communities, there’s the C# community, the Ruby community, the Java community, et cetera, and even within that, there are different sort of sub-communities that are massively overlapping and different ways that those communities come together. Just some examples where I’m involved in community, Stack Overflow, obviously, it’s not trying to be a social network.

Jon Skeet: In some ways, it’s impersonal, but it’s a community of learning. I think it’s important to think about each community, what you’re hoping to get and think about whether you want people to be taking personal time and getting to know other people in the same sort of area, because something like Stack Overflow only does that marginally, because that’s not its aim. Its aim is to have questions and answers. Whereas, at the other end, you’ve got user groups. I go to the Reading C# user group or Reading .NET and I’ve spoken at many other user groups, and often there’s much more of a feeling of, “Hey, I want to talk to other people in the same space.”

Jon Skeet: That’s partly for sharing information, just getting to know people, because it’s always nice to get to know people. We are sociable people, finding out about job opportunities, learning more information, et cetera. That’s typically in-person and much more of a social aspect. Then somewhere in between, there’s conferences where during the conference talks, obviously, that’s mostly one-to-many. The conference speaker speaking to the room and getting some feedback, but it’s relatively rarely discussion oriented. But then in the halls between talks or if you’re brave and [inaudible 00:52:43] say, “Well, I’m not going to get into any talk now, I’m just going to hang out in the lobby and talk with other people,” and that’s a totally valid thing to do, it can feel a bit odd to start with, but once you get used to it, it’s a great thing.

Jon Skeet: If there’s no talk that you’re particularly interested in, just chat with people and you’ll get loads out of it. There are more organized bits of communities. I’m on the board for the .NET Foundation, which tries to be a sort of community hub in terms of supporting various .NET projects, acting as, to some extent, a bridge with Microsoft, so we can represent .NET users in a cohesive way. Yeah, the .NET Foundation is still finding out what needs it needs to meet and how to meet them, but it tries to be an online community enabler as much as a community in its own right. There are all these different ways that you can be community and I would certainly encourage the data science community to think along similar lines of, it’s not like there needs to be one big player.

Jon Skeet: Arguably, it would be better for there not to be a single big player that, if it doesn’t work for some data scientists for whatever reason, then they feel they don’t have a community to be part of. Having lots of smaller self-organizing communities, I think, is generally better. To sort of anticipate a potential question, make sure right from the very, very, very start that those communities are as inclusive as possible right from the start. Just do not tolerate any discrimination, whether that’s in terms of the obvious aspects of race and gender, sexuality, et cetera, but also in terms of, be friendly to newcomers. There are some communities that have a reputation for being really hostile to people who are just trying to get into that community.

Jon Skeet: Why would you want to do that? If you’re passionate about cookery, then why would you want to discourage other people from becoming better at cookery? Even if they’re saying, “Well, currently I’m trying to boil an egg and it keeps just cracking, because I haven’t got any water in there.” Oh well, you don’t say, “Well you’re so stupid for not putting water in there,” you say, “Well, okay, let’s take a step back. Yeah, you need water. Let’s see what other things you might be doing that could be improved.” Try to bring people along rather than being some sort of “I’m better than you” kind of community. Community shouldn’t be competition. There can be competitive elements, if you enjoy some competitive coding competitive … I know that there are data science competitions, and that’s fine, but that needs to be one aspect and not the whole idea of the community to prove who’s best. That doesn’t really help anyone.

Kirill Eremenko: Totally. I totally agree with you. We actually have a conference in San Diego that we run every, what is it, this year or September? At the end of September this year, it’s called DataScienceGO, and one of the things that somehow just happened and now we are very happy about it and we’re promoting it in the sense we’re supporting it as much as we can is that it’s a very diverse community. We have, sort of last year we had like 350 people attend from all different walks of life, all different backgrounds, countries.

Kirill Eremenko: We had a much larger representation of female data scientists and aspiring data scientists. In the data science world right now it’s about 10%. I think we had over 20 or 30% at the conference. What we’re trying to do to promote this is, we were trying to showcase and specifically invite speakers from minority groups or female speakers in data science. Not to reverse discriminate against male data scientists, but provide a platform for anybody to show that it can be done, that you can achieve success and create these role models for people to look up to and to learn from. I think-

Jon Skeet: I hope you’re inviting them to talk about data science, not to talk about-

Kirill Eremenko: Of course, yeah. Of course.

Jon Skeet: It’s like, “Oh, what’s it like being a woman in data science?”

Kirill Eremenko: No.

Jon Skeet: Maybe somebody may want to talk about that, but the main thing is recognizing that there are some awesome women in data science and, in fact, most of the conferences … I go to various developer conferences and there are often machine learning topics, and I would say 90% of the conference talks that I’ve been to on machine learning have been given by women who have been awesome and really, really good.

Jon Skeet: I have to say, I’m disappointed to hear that the data science community is only 10% women, because I’d heard great things about it being nearly 50-50 at various user groups and things. Maybe there are pockets of data science community which already have discovered whatever secret sauce it is. I should caveat that any secret sauce like this is likely to involve a lot of hard work.

Jon Skeet: Being inclusive is not just a matter of saying, “Well, I’m not going to be nasty to anyone.” You need to be actively inclusive and watching out for any problems that will put people off. But I would’ve thought data science, even more than development, really needs to be diverse, because you’ll be dealing with data that is diverse and I suspect I’m preaching to the choir here when I say that, if your data is biased, then your results will be biased.

Jon Skeet: I think having a diverse community has to be part of trying to ensure that you don’t have biased data and that you can challenge assumptions that come in all through the process. Whether that’s collecting the data, working out what data to collect to start with, and then how you process the data, et cetera. It feels like if the data science community isn’t diverse, you really face an uphill battle there.

Kirill Eremenko: Yeah, I totally, totally understand what you mean. As you mentioned before the podcast, there’s quite substantial dangers of having a homogenous community, whether it’s in the development space or the data science space. There’s been studies-

Jon Skeet: It’s a danger, not only to the results, but also to how enjoyable it is. There are several reasons to be diverse, to encourage diversity, and there are obvious moral reasons: discouraging people, excluding people, however implicitly, is just plain wrong. But it’s also not as fun for the people who are in the community. Diverse communities are more interesting to be part of. As well as getting better results and all kinds of things, there’s only benefits, pretty much.

Kirill Eremenko: Yeah, yeah. I was about to say that there’s studies to show that in diverse teams when you have people from all sorts of minority groups or from male and female representatives and from as many nationalities as possible, diverse groups like that get much better results.

Jon Skeet: Right.

Kirill Eremenko: It’s a question of why, maybe it’s the difference of opinions, difference of backgrounds, the different sort of communication and how people challenge each other and things like that. There can be multiple, like millions of different reasons for that, but the fact is the fact. If you measure the results, diverse groups, whether it’s in teams of developers or executive teams or data science teams, diverse groups get, in general, on average, get better results.

Jon Skeet: It correlates with other aspects. If you have teams where psychological safety is valued, so that people feel that they can voice minority opinions and you can have a minority opinion, even if you’re generally a homogenous group. I could be sitting in a group of other cis, straight, white males and still disagree with them, but being in a team where that is okay to do and is encouraged and supported and people don’t get shouted down, then that in itself is great.

Jon Skeet: Psychologically safe teams are likely to be more attractive for members of minority groups who might feel deliberately excluded from other places. There are benefits that sort of bounce off each other. You’re more likely to get a diverse team the more you encourage diversity, which in itself is useful even in a non-diverse team, encouraging the kind of practice that is attractive for diversity.

Kirill Eremenko: Why, is it because like a self-fulfilling prophecy almost.

Jon Skeet: Almost. Yeah. Yeah, and it takes work. I want to keep emphasizing this. It’s not something that happens because you say, “Okay, from now on we’re not going to be jerks.”

Kirill Eremenko: Yeah.

Jon Skeet: That needs to be step zero, but it needs to be, “Okay, well, I’m going to watch myself and others for behavior that I think might’ve excluded people or just not encouraged … not got the best from everyone.” That’s what it’s really about, is making sure that everyone is contributing as well as they possibly can, so even if you never told that junior team member to shut up, the fact that all the senior team members were constantly speaking and interrupting each other, may have made there be no space for that junior team member to speak up. So you’re not getting the benefit from them. Why have them there, but not get the value from them?

Jon Skeet: It’s watching very consciously and thinking, “How can we do better?” Sometimes, that will be a case of calling out yourself and saying, “Okay, I’ve realized that I’ve been interrupting and I need to stop doing that.” Sometimes, it can be calling out other people and that aspect of psychological safety that it’s okay to call out other people and it’s okay to be called out and your reaction needs to be not a defensive, “Well, I didn’t do that …” but so, “Okay, I will acknowledge that. I will think about that and I will try to improve.” That more positive environment takes work and we mustn’t underestimate how much work it takes, but also the benefits are so enormous.

Kirill Eremenko: Yeah. That totally echoes, for example, what Naval Ravikant says, he’s the founder of AngelList, that even if you’re selfish, even if you just want the ultimate best career for yourself and just … like, you only think about yourself, even in that hypothetical case of an extremely selfish person, it’s in their best interest to maximize the output that anybody on the team can give regardless of their background. Because that way, you’re exploiting their talents, which are inevitably … Everybody has different talents.

Kirill Eremenko: There can be major differences, there can be minor differences, but there are different talents and there’s no two people that are the best at exactly the same thing. You want to exploit other people’s talents the most, so that the team gets the most out of it, so that you create amazing products, you change the world, you do crazy cool things and that will allow you to grow, allow you to get the best benefits, allow you to progress your career as fast as and as quickly as possible.

Jon Skeet: Absolutely.

Kirill Eremenko: I totally agree with you. It’s all about creating this, I love how you put it, psychologically safe space for people to feel that they can speak up and share their opinions.

Jon Skeet: And it’s playing the long-term game instead of the short-term game. To take out any sort of identity politics that some people might feel difficult about, or might find difficult. Let’s suppose I wanted to make sure that the industry only rewarded people with a surname beginning with S, because my surname begins with S. I could see immediately that that gives me a massive pay rise, I’m massively in demand, it’s all great. Unfortunately, that means that all the people who would be contributing great features, like the lead designer of C#, it’s Mads Torgersen. His surname begins with T rather than S. Therefore, I won’t benefit from anything that he would be contributing to the C# language.

Jon Skeet: All the C# compiler authors, whose name doesn’t begin with S are suddenly excluded and I don’t get a decent C# compiler. So yeah, I might be well paid right now, but I have to work with crummy tools. I don’t get the best hardware. I don’t get the best software. I don’t get the best, all the rest of the environment, because I’ve decided to exclude people. Realistically, I don’t think most people do want to exclude people. They just don’t … they can sometimes feel, “Well, if I widen my circle, then it means I get a smaller slice of the pie.” It’s really about making that pie ever bigger and bigger and bigger.

Kirill Eremenko: Fantastic. Jon, that’s very great examples or like, I think people now listening to this podcast have, if they weren’t agreeing before, now definitely are on board with the implications and the theory behind it. What are some practical steps that people [inaudible 01:06:51] take? You already mentioned the importance … Sorry, you already mentioned a step of the start. Could you repeat that one? Maybe there’s some other steps that people can take, practically, to help create these safe environments and safe spaces.

Jon Skeet: I sort of notionally said step zero of saying let’s not be jerks. To some extent that is the first step, but actually it’s learning about things. I started getting into feminism about four years ago and found just how much I needed to learn. The same is true in terms of all kinds of aspects of discrimination, which you might sort of discount as being, “Well, why do we need feminism? Now women have equality, they have equal rights. There can’t be a gender pay gap, because the law says that there isn’t allowed to be.” Well, it’s more complicated than that. Just take a humble approach to learning, and if you actively engage in, there’s so much material out there to find out where our industry has gone wrong and the situation we’re in and there’s no point in trying to be part of a solution without understanding part of the problem to start with.

Jon Skeet: The other sort of warning I would give is the temptation to solutionize, to use jargon. A speaker called Rhonda Bergman put this really, really well. I don’t think it was her analogy to start with, but she expanded on very well, about talking about the difference between knights and allies. That a knight goes in and wants to solve a particular situation and wants to end up being the hero, and that should never be the point of it.

Jon Skeet: An ally will go in and take a broader sweep of, “Okay, what’s wrong here and how can I contribute to the solution for the sake of the solution rather than to be some sort of hero?” It generally addresses the causes rather than the immediate symptoms. Obviously, immediate symptoms if someone’s being bullied or harassed or whatever it is, they need to be addressed, but just addressing those, and saying, “Right, job done now,” without addressing the underlying causes, ends up not being nearly as productive in the long term. So it’s really about educating yourself and trying to find out where you can be of help rather than where you can lead the charge, because chances are someone else is already leading the charge and could do with your support rather than could do with someone else sort of diluting the message.

Kirill Eremenko: Yeah, that’s a very valuable piece of advice, that you can’t have, like in any undertaking, you can’t have everybody be a leader. You need leaders and followers and it’s totally fine to be a follower. Even if you can contribute in a minor way to this, that’s already going to be a massive benefit. Jon, thank you so much. In the interest of time, please could you share with us, where can our listeners follow you, find you, read more about your work or ask you a question maybe? That would be so fantastic.

Jon Skeet: Sure. I’m on Twitter, my Twitter handle is just Jon Skeet. I have a blog of blog.Jonskeet.uk and codeblog.Jonskeet.uk. My non code blog is mostly around feminism, although this weekend I posted a recipe for Tiramisu and Tiramisu ice cream. It’s anything non-codey, and the code blog is what you’d expect it to be, and, obviously, on Stack Overflow.

Kirill Eremenko: Fantastic. One last question before we finish up. Is there any book you can recommend to our listeners that has impacted your life?

Jon Skeet: Yeah. It sort of fits in very well with data science and what we’ve been talking about. I’d like to recommend Everyday Sexism, compiled by Laura Bates. The Everyday Sexism Project is a project that Laura started when she had a terrible experience on a bus once and speaking to other women found that her experience was not uncommon, but wasn’t being talked about. This book has lots of data points in it.

Jon Skeet: Lots of citations of reports and concrete data, so if you think that sexism isn’t a problem, then read the book and see concrete evidence about it. It’s sort of simultaneously inspiring and terrifying, I find. There are lots of other books around feminism and particularly in the tech industry, there’s a book called Brotopia, which looks at sexism in Silicon Valley, but yeah, but my main recommendation would be Everyday Sexism, compiled by Laura Bates.

Kirill Eremenko: Thank you very much. Everyday Sexism, Laura Bates. Everybody, check it out. Jon, I just want to say thank you so much. I became a fan when I saw how much you contribute to community and now after this conversation we had, I’m a huge admirer of what you’re doing, both in the space of helping community and your expertise, unquestionable expertise in the domain of coding and how passionate you are about building community and making it accessible to everybody and making sure that all minorities are respected and everybody has … that equality is there. Thank you so much for spearheading this space in the world.

Jon Skeet: Thank you for having me on the podcast.

Kirill Eremenko: Thank you everyone for tuning into the SuperDataScience podcast. Super appreciative of you being here. That was Jon Skeet, the top contributor on Stack Overflow joining in for today’s conversation. How amazing is that? I completely enjoyed our conversation about the technical aspects at the start and about the community and being all inclusive towards the end. I hope you got a lot out of this. One thing to always keep in mind is that indeed data science is becoming ubiquitous and, eventually, you’ll see it be embedded all over the place.

Kirill Eremenko: It won’t be just creating models, providing business advice, but you’re already seeing this as being embedded into products and that includes products where development is required, websites and apps and different … Basically, programs run everything we see around us, whether it’s Alexa that’s in your kitchen or whether it’s a washing machine or an airplane, it is code that’s running that. As data science and machine learning, AI, gets more and more integrated into that, we will need to understand better the world of developers and developers will need to better understand the world of data science. If you want to get ahead of the competition, if you want to have a significant advantage or an additional significant advantage on your resume, in your career, this is definitely something to look into.

Kirill Eremenko: Data scientists who understand what development is all about, understand these differences that we talked about, such as compiled versus interpreted languages. What is versioning, how that affects developers, how that affects data scientists, what kind of problems you want to diagnose. What is the silver bullet in code diagnostics? The divide and conquer principle, what are code reviews, what are tests? All the things that we talked about, taken out of context, might seem that they are too far-fetched for data scientists. Actually, it’s a massive advantage you can add to your career. I hope you enjoyed this podcast and got a lot out of it.

Kirill Eremenko: Make sure to follow Jon. His Twitter handle is @jonskeet and spelled J-O-N S-K-E-E-T without the H. J-O-N-S-K-E-E-T. Make sure to follow Jon. He already has 638,000 followers and that means he’s doing something right and, obviously, sharing very valuable knowledge. As always, you can get the show notes for this podcast, including all the materials mentioned and including the book that Jon mentioned at www … Well, you can get a link to the book, not the book itself, of course, at www.superdatascience.com/285. That’s www.superdatascience.com/285. There, you can also find the transcript for today’s episode.

Kirill Eremenko: On that note, thank you so much for being here today and I look forward to seeing you back here next time. Until then, happy analyzing.
