Kirill Eremenko: This is episode number 345, with Senior Machine Learning Engineer at Twitter Cortex, Dan Shiebler.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in Data Science. Thanks for being here today. And now, let’s make the complex, simple.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen and everybody. I’m super pumped for today’s episode. But before we proceed, I’ve got a question for you: Are you an advanced data scientist? Have you been listening to this podcast for a while? Well, maybe it’s time to come meet in person. Did you know that at DataScienceGO in the USA, on October 23, 24, 25 in Los Angeles, which we are running on the UCLA campus, we’re going to have a special track, a unique track dedicated to advanced practitioners in data science?
Kirill Eremenko: If you come to this track, you will hear from people like Dan Shiebler, who you’ll meet on today’s podcast and who works at Twitter Cortex. That will already be an exciting thing, and he’ll be talking about task engineering. Dan has presented at DataScienceGO twice already, this will be his third time, and this time he will be focused on advanced practitioners. So you’ll learn about task engineering. And what is that? Well, that’s a very interesting thing, and you’ll hear more about it on this podcast. Basically, it’s designing your machine learning task with an understanding of what you will do with your model, and what happens when your model actually affects the underlying population which is used to train that model. Very interesting, kind of like an inception type of scenario, but it happens, and it happens at Twitter Cortex. You will find out what you can do, in a very hands-on workshop for advanced practitioners.
Kirill Eremenko: Also, Morgan Mendis will be flying in all the way from Haiti to talk about Airflow and how to build data science pipelines. Sinan Ozdemir will be coming to talk about Flask, Django, Docker containers and Kubernetes, or, we’re still deciding which topic, it may instead be BERT, Transformers, and architectures and use cases for NLP. So as you can see just from these three examples, I’m personally hand-picking the advanced practitioners to come talk with you. So if you are interested in skyrocketing your data science career as an advanced practitioner, head on over to datasciencego.com and get your ticket today.
Kirill Eremenko: And now, on to today’s episode. So the guest for today’s episode is Dan Shiebler. It’s been over two years since he last came on the podcast, in June 2017. Since then, a lot has changed in his career. He left TrueMotion and now works at Twitter Cortex, and there are going to be a lot of interesting components to this podcast, both on the technical side and the career side. So you’ll definitely learn about Dan’s career, how he makes his choices about moving, how he selects the companies he works for and why, and why he’s doing PhD research in parallel to his career and how that affects the lens through which he sees things. Very interesting conversation, and his PhD, by the way, is on category theoretic machine learning. Boom. Blows your mind. Blew my mind for sure.
Kirill Eremenko: You’ll learn about his work at Twitter Cortex, in fact, you’ll learn about it quite intimately, at least the things we could discuss, of course. You’ll learn about vectors, embeddings, nearest neighbors, techniques and methodologies that he uses in his work. So in a nutshell, a podcast packed with value. Can’t wait for you to hear it, so without further ado, let’s welcome to the show Dan Shiebler, Senior Machine Learning Engineer at Twitter Cortex.
Kirill Eremenko: Welcome back everybody to the SuperDataScience podcast, very excited to have you on the show, because today we’ve got a very special guest, returning for the second time, Dan Shiebler calling in from New York.
Kirill Eremenko: Dan, how are you going today?
Dan Shiebler: I’m doing great Kirill. How are you?
Kirill Eremenko: Very good as well. And it’s crazy, like you said just now, that there was no snow in New York the whole winter.
Dan Shiebler: There was a week when I was away, and I heard that there was snow during that week, but I haven’t seen any at all the past few months.
Kirill Eremenko: That’s insane. Hopefully things get better. This global warming thing is getting quite out of hand, it was really hot in Australia the whole summer. Strange. Anyway Dan, it’s been a while. I looked, our previous podcast was June 2017, that’s two and a half years ago. And last time we caught up in person was at DataScienceGO 2018, which was one and a half years ago. How have you been since then?
Dan Shiebler: Doing great. Lots of really interesting projects I’ve been working on, but great things happening.
Kirill Eremenko: And you changed jobs, congrats. You were at TrueMotion before, and now you’re at Twitter Cortex, you’ve been there for over two years. That’s awesome, congratulations man.
Dan Shiebler: Thank you. It’s been great.
Kirill Eremenko: What’s the best thing about Twitter?
Dan Shiebler: Well, it’s a really fast-growing platform, but it also has a really established niche. There are a lot of people who consider Twitter their favorite social network. It balances really high quality information and lighthearted commentary. I think that it’s in a real sweet spot as far as social networks go. I’m really proud to be working here.
Kirill Eremenko: And it’s interesting that it’s managed to stay that way for, what is it, a decade now, or more. I remember I was at university, this was around 2012, or maybe 2011, and we had one lecturer who was teaching an unrelated topic, it was about finance. But he was talking about theories of how things change in the world, and how there are cycles of certain phenomena. And he was giving the example of social media: Facebook, Twitter. I remember his comments about Twitter specifically. He said that Twitter was not going to be around in four years, that was the prediction for 2015. Because, simply, naturally, we see that with a lot of things, with a lot of other platforms that used to be popular. Facebook was very popular back in the day, but now there are so many alternatives. I haven’t been on Facebook for years now.
Kirill Eremenko: But, it’s very impressive that Twitter has managed to adapt and stay afloat. What are your comments on that? Why do you think it’s so successful in that way?
Dan Shiebler: I think we have a good understanding of our user base. And we have a willingness to change, but also a really deep understanding of what are the things that made Twitter popular in the first place. I think knowing our strengths and knowing our users, gives us a real advantage.
Kirill Eremenko: Yeah. Okay. What would you say is a way that Twitter has changed, even since you’ve joined? You’ve been there, what, just over two years. What’s a way you’ve seen Twitter adapt to the changes in user needs over that time?
Dan Shiebler: We recently launched a topics product that allows users to follow individual topics, rather than just following accounts. While following accounts is an excellent way to consume information for many Twitter users, some users have difficulty finding all of the information that they want on topics that they’re interested in. And topics is designed to better serve those users. I think this initiative and this product and family of products, really came out of an understanding of what are the challenges that some users have with the platform. And how can we better serve the users who we’re not serving as well right now. And bring them to see all the things that are great about Twitter.
Kirill Eremenko: Mm-hmm (affirmative). Fantastic. That’s a really cool thing. I use Google Alerts, I don’t use them that often, but I’ve set up a Google Alert for data science. And once a week I get an update on the most trending data science topics from Google. So, something like that, right?
Dan Shiebler: Yeah. It’s very similar. We’re still, every day, trying to make it better. And really understand how we can best serve users on the product surface.
Kirill Eremenko: Okay. Got you. And apart from Twitter, which we’ll definitely get back to, you’ve got so much going on in your life. When I looked at your LinkedIn, it was so exciting to see. You finished up that research at Brown University on deep learning, but you seem to never allow yourself to slack off or get bored. You started a PhD, that is so cool, man. That’s awesome.
Dan Shiebler: Thank you.
Kirill Eremenko: Why’d you decide to start one?
Dan Shiebler: I really enjoyed my research at Brown. I think that we really produced a lot of high quality research, and it was very exciting to me to have a pursuit, separate from my work, that allowed me to see things from a more academic mindset and with more academic incentives. And I feel like that has really helped me grow. And doing this PhD is really something that lets me continue to do that in a more formal way, and at a more intense pace. I’m very excited about it, really enjoy it.
Kirill Eremenko: What’s the topic of your PhD? Well, I see it on LinkedIn, but if you can share for our listeners.
Dan Shiebler: The topic is category theory and machine learning. It’s really about defining category theoretic constructions for discussing, researching, and understanding the links between different kinds of machine learning, and different fields that are closely related to machine learning. Category theory is a branch of mathematics that has shown a lot of applications in unifying previously disparate areas of mathematics. And there’s been a good push recently in applied category theory, in taking category theoretic ideas and trying to apply the same kind of unification powers to more applied fields, like game theory, or biology, or physics. There’s been some really great research on quantum physics that’s come from category theoretic perspectives. And I’m trying to utilize these same tools to increase our understanding of machine learning.
Kirill Eremenko: Wow, really cool. What’s an example you can give of unification in mathematics through category theory?
Dan Shiebler: There are some aspects of algebraic topology that were previously separate from similar concepts in analysis, and similar concepts in set theory, where when we take a category theoretic standpoint and zoom out, and look at these things at a higher level of abstraction, we can see these individual constructions as particular instantiations of a higher order structure. It allows us to say, oh, these different kinds of transformations, they all satisfy these key properties that make them this particular kind of category theoretic transformation.
Dan Shiebler: And then, when we’re operating at that level of abstraction, we can simultaneously prove theorems about each of these different subareas of mathematics. We’re talking about things at this higher level, and we just get many theorems for free, which is one of the key tenets of category theory. That’s similar to how a programmer might structure their programs so that core components only need to be implemented once, rather than multiple times. Category theory allows us to have that same degree of abstraction over multiple fields of mathematics, or ideally, applied fields as well.
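To make that programming analogy concrete, here is a minimal, purely illustrative sketch in Python (the monoid example and all names are this transcript’s addition, not Dan’s): one generic routine is written once against an abstract algebraic structure, and every concrete instance of that structure inherits it, much like a theorem proved at the categorical level applies to every instantiation.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class Monoid(Generic[T]):
    """An associative combine operation with an identity element."""
    identity: T
    combine: Callable[[T, T], T]

def fold(m: Monoid[T], items: List[T]) -> T:
    # One generic implementation, written once, valid for every monoid.
    result = m.identity
    for item in items:
        result = m.combine(result, item)
    return result

# Two concrete "instantiations" of the same abstract structure.
int_sum = Monoid(identity=0, combine=lambda a, b: a + b)
string_concat = Monoid(identity="", combine=lambda a, b: a + b)

print(fold(int_sum, [1, 2, 3]))                            # 6
print(fold(string_concat, ["theorems", " for", " free"]))  # "theorems for free"
```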
Kirill Eremenko: Wow. Fantastic. I love your tagline, theorems for free. You should put that in your subtitle on LinkedIn or something.
Dan Shiebler: I’ll consider it. I didn’t come up with it, but it does do a good job of describing what it is.
Kirill Eremenko: When you were talking about that, a few light bulbs went off in my head, because it’s been a while, but I had algebraic topology back in high school, and I had set theory in my Bachelor’s degree. And I remember feeling these two are very similar, this set theory stuff, I’d seen it somewhere, it really looked a lot like what we did in high school and stuff like that. Again, it’s been a while, so I wouldn’t be able to come up with examples, but I can see the value in that.
Kirill Eremenko: And that’s a really cool, very abstract though, thing to be doing. I’m just curious, it’s not really linked to your work at Twitter, is it?
Dan Shiebler: There are links, but the links I would say are at a very high level, birds-eye view. And I would say the fact that the day-to-day work is very different, was largely by design. It’s very challenging to have a full day at a job, even a job that you love, and then go home and do lots of other work that feels very similar, with similar frustrations and similar problems.
Dan Shiebler: One of the benefits of focusing my PhD research on something so different, and so much more theoretical than the concerns that I focus on at work, is that when I do one, it doesn’t exhaust me for the other. Each one allows me to work a different part of my mind, and allows the other parts to rest.
Dan Shiebler: That really is a nice way to balance things in a more holistic fashion.
Kirill Eremenko: Love it. And that’s similar to what you did back when you were at TrueMotion and doing research at Brown University. At TrueMotion you were doing machine learning, but at Brown University you were doing research on deep learning, as far as I remember.
Dan Shiebler: Yes. My research at Brown was far more focused on standard deep learning for image modeling, whereas my work at TrueMotion was much more focused on a signal processing perspective on machine learning. And machine learning for signal processing applications. It was very similar in that the two main pushes had overlap, but felt very different.
Kirill Eremenko: And it’s not like you have to do this research. It feels like you’re doing it more as a hobby. Is there any other motive for doing research? Or maybe somebody listening to this podcast might think, oh, wow, maybe I should get into research too. What’s the reason for continuously doing research? First at Brown, now your PhD at Oxford. Any comments on that?
Dan Shiebler: I think there are a lot of reasons. I would say that for me, I genuinely enjoy it. I enjoy it at a very deep level. And I don’t think it would be the right thing for someone to do who didn’t really genuinely enjoy it. But I think that there are a lot of significant benefits beyond just my own enjoyment. Really giving myself the opportunity to look at research from the perspective of a researcher, as opposed to the perspective of a practitioner, allows me to see ideas on a deeper level than just, is this the right tool for me to use for this job right now? And more to think about things in terms of their deeper applications and the other work that they might open up, other avenues of exploration.
Dan Shiebler: That kind of perspective, I think, it makes me smarter, it makes me more creative and it really gives me the ability to learn new things much more easily. It’s far easier for me to pick up ideas on some other team at Twitter who I haven’t worked with. Really understand what they’re doing. And understand the kinds of problems that they face, because I’ve drilled myself to be able to understand really complex topics, really quickly, through my research experience.
Kirill Eremenko: Wow. Very distinct explanation. There you go. If somebody listening to this is passionate about a topic, I agree, maybe consider research in that space-
Dan Shiebler: Absolutely.
Kirill Eremenko: It has its advantages like that. Let’s get to your work at Twitter. I read the description of your role on LinkedIn, very cool description. You develop systems and models to improve the performance and efficiency of machine learning at Twitter. It’s like you’re doing machine learning on machine learning, almost. Tell us about that.
Dan Shiebler: I think in order to really give it context, I can define the role of Cortex in general, and then how I fit into that. Twitter’s engineering is split across a large number of product teams, serving different kinds of product surfaces: the timeline, the advertisement products, the notifications products, the email products. Cortex is an organization within Twitter that develops machine learning systems and components of machine learning models that are incorporated into the modeling pipelines of each of these different things.
Dan Shiebler: And our work is at a level of abstraction similar to the notion of category theory, where we are developing things that fit into multiple different products. Often what we will do is assemble a couple of different product surfaces that seem like they can be solved with a particular modeling approach, that seem to share similar restrictions on their current performance, and identify different ways that we can develop models that would serve each of these product surfaces. My team in particular focuses on models that utilize embeddings and nearest neighbors to serve products where we need to match users, or other things, mainly users, with large sets of possible candidate content, like a large set of Tweets, or a large set of potential notifications.
Kirill Eremenko: Interesting. So embeddings and nearest neighbors, let’s talk about nearest neighbors for a second. For nearest neighbors, or for any kind of categorical machine learning, I would expect you need a range of fields or a range of columns to be working with. What kind of columns are you working with at Twitter? Because there’s mostly just the Tweets that people have.
Dan Shiebler: In this case, if we’re serving nearest neighbors, the nearest neighbors are defined over an embedding space. The columns in this case are the embedding dimensions.
Kirill Eremenko: What is embedding? Sorry, can you get me up to speed please, what is an embedding space?
Dan Shiebler: An embedding in this case is just a vector representation of some kind of entity. For instance, it could be a 300 dimensional vector, or a 1000 dimensional vector, that represents a user, or a Tweet. The embedding plus nearest neighbors approach for recommending content involves constructing embeddings for users and constructing embeddings for content, such that the distance, or angle, between two embeddings is indicative of some notion of affinity or similarity, where users will be close in embedding space to Tweets that they might like. We can utilize this to place these users and all of these Tweets in the space, and then find nearest neighbors in the space to recommend content.
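As a minimal sketch of that idea, with random made-up vectors rather than anything Twitter-specific, cosine similarity over L2-normalized embeddings is one common way to score user-to-content affinity and pick the closest candidates:

```python
import numpy as np

rng = np.random.default_rng(0)
user_embedding = rng.normal(size=300)              # one user, 300-dimensional
tweet_embeddings = rng.normal(size=(10_000, 300))  # 10,000 candidate Tweets

# Cosine similarity = dot product of L2-normalized vectors,
# so a smaller angle means a higher score.
user_unit = user_embedding / np.linalg.norm(user_embedding)
tweet_units = tweet_embeddings / np.linalg.norm(tweet_embeddings, axis=1, keepdims=True)
scores = tweet_units @ user_unit

top_k = np.argsort(-scores)[:5]  # indices of the 5 nearest Tweets
print(top_k, scores[top_k])
```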
Kirill Eremenko: Got you. And then other teams can use the embeddings you’ve created to run their machine learning.
Dan Shiebler: Indeed. They could use the embeddings we create as features, or they can use the pipelines that we build to create the embeddings in the first place to create new embeddings that are optimized for their surface. And sometimes these new embeddings are constructed on top of other embeddings and everything will feed into each other.
Kirill Eremenko: Wow, that’s so crazy. If you’re able to share, because I’m sure there are parts which you can’t disclose due to proprietary information, but if you’re able to share… The embeddings that you create, if two of these vectors, these 1000 dimension vectors, are close, have a very small angle between each other, then that means the users are close in their behavior or in their characteristics. If two vectors for content are close, that means maybe somebody who likes this content will like that content as well. I can see the implications of that. What goes into the embeddings in the first place? And going back to the question of what are the original features? Apart from the Tweet text, the messages they send or people they follow, there’s no transactional data like on Netflix or Amazon, where this person purchased these items and those are basically their specific interests. Is it all to do with NLP? I’m just curious. What goes into the embeddings in the first place?
Dan Shiebler: Great question. To start, I would say that NLP, while very important for many use cases, is actually a very myopic view on the structure of a Tweet for the purpose of recommendation. And the reason for this is that Jack Dorsey Tweeting a single-word Tweet, like hello, or something like that, has a very different set of users who I might want to recommend it to than if I Tweeted hello, or something like that.
Dan Shiebler: Often the most useful information, when we look at a Tweet from the Tweet perspective, can be the people who have interacted with the Tweet, and the dynamics of the author of the Tweet. There’s a huge amount of information on Twitter that’s represented in terms of the engagement graph and the follow graph. The follow graph is just the relationship between all of the users, based on who follows who. The engagement graph is the relationship between users and Tweets, as well as users and users, based on users choosing to like Tweets.
Dan Shiebler: Users choosing to like, or reply to, or retweet Tweets. And these kinds of behaviors incorporate an enormous amount of information. A Tweet that has 100 likes, all from machine learning focused people, really gives a very strong indication that the Tweet is about machine learning. And we can drill down very deeply into content utilizing this kind of social or contextual information, which we often refer to as [inaudible 00:26:38] to collaborative filtering.
Kirill Eremenko: Wow. Okay. Got you. Basically, you can even extract information that this Tweet is about machine learning, based on the likes it’s getting, from whom it’s getting those likes, without even digging into the processing of the text within that Tweet.
Dan Shiebler: Yeah. There are of course limitations to this. Sometimes a Tweet might be about a social issue that is popular among machine learning researchers, or a personal issue related to a popular personality in machine learning, and it may end up with a very similar like profile to one that is about machine learning at a more core level. But often, for recommendation use cases, understanding which communities of people are interested in something, and representing something from that perspective, can be the richest way to get the kinds of information that the model needs to know. There is a lot more information that can be derived from the Tweet text itself, and we do of course extract and utilize this. But in general, if you had to choose between just the collaborative information or just the Tweet text information, the collaborative information would win without a doubt.
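As a toy illustration of that collaborative signal (all names and data here are invented), a Tweet’s topic can be guessed purely from the interests of the users who liked it, without touching the text:

```python
from collections import Counter

# Hypothetical data: per-user interests and per-Tweet likers.
user_interests = {
    "alice": "machine learning",
    "bob": "machine learning",
    "carol": "cooking",
    "dave": "machine learning",
}
tweet_likers = {"tweet_123": ["alice", "bob", "dave"]}

def infer_topic(tweet_id: str) -> str:
    # The majority interest among likers is a strong topical signal.
    counts = Counter(user_interests[u] for u in tweet_likers[tweet_id])
    return counts.most_common(1)[0][0]

print(infer_topic("tweet_123"))  # "machine learning"
```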
Kirill Eremenko: Oh, very interesting. And what kind of tools do you use for this?
Dan Shiebler: We have stacks constructed in Scala and Python at the language level. Our modeling is almost entirely done in TensorFlow, from the perspective of neural networks and such. We do have a number of in-house matrix factorization-style tools written in Hadoop or Spark that are used for some applications. And we have very big deployments that use a piece of software that we open-sourced, called Scalding, which is Scala-based Hadoop tooling. That works quite well for constructing really large Hadoop jobs that can operate at Twitter scale.
Kirill Eremenko: Okay. Is that a good description of what a machine learning engineer does, that you prepare machine learning tools for other people and departments to use?
Dan Shiebler: I would not say that it’s just the construction of tools. Machine learning engineers at Twitter fall into a couple of different categories. Myself, I would not say that my job would be described in that way. My work is more around the construction of instantiations of the embedding pipeline. My team often partners with product teams, and we have our own set of tooling and our own set of systems that are really designed for constructing these kinds of embedding nearest neighbor pipelines. We will actually construct the models, and help other teams construct these models, in a more consultancy fashion.
Dan Shiebler: But there are engineers within Cortex whose role is to create the deep learning model deployments or the platform tooling for analyzing data or scheduling model reruns. There’s a spectrum of these more core engineering tasks and more direct modeling and machine learning model creation tasks.
Kirill Eremenko: Mm, okay. It’s like a variety of things. Got you. And you mentioned team a couple of times, how big is your team? And which part of the team are you working in?
Dan Shiebler: Cortex as a whole is about 150 people, and my team is in the sub-organization focused largely on platform. Our team is about 10 people. My role is really more focused on model construction and understanding the relationship between models and business value. My team has some people who are more focused on the optimization of our nearest neighbor pipelines, which are highly optimized and state of the art, and some people who are more focused on the core software development as well.
Kirill Eremenko: Got you. When you say nearest neighbor, does that mean you went and consciously selected the nearest neighbor algorithm for your characterization? Or is that just a broad way of saying we’re finding the nearest neighbors? Because there are other methods of clustering that could be used to group users into groups, or for doing this collaborative filtering, as you said. Just a question around that.
Dan Shiebler: We actually use an approximate nearest neighbors system. And the reason why we selected that is scale. The reason is that we’re not simply grouping users together, we’re trying to find the nearest content for each user. We’re in a situation where we have 300 million users and half a billion Tweets on a particular day, and we’re trying to match, for each user, the best Tweets. Exhaustively looking at each user-Tweet pair is completely not scalable, that’s 300 million times 500 million operations, and many standard strategies would require something like that. Our approximate nearest neighbor systems allow us to dramatically optimize this, by constructing these graphs of Tweets that the user’s embedding essentially traverses. That’s a whole topic that’s very interesting, the construction and optimization of these algorithms for really allowing the user-to-content pairing process to scale.
Dan Shiebler: There are other solutions of course that can solve the same thing, clustering is one that you mentioned. This one is nice because of the amount of flexibility that we have in the construction of the embedding. The embedding itself can be constructed such that the distance relationships are a model output, and any kind of machine learning technique can be utilized there, and deployed, essentially, at scale for free. That’s a very attractive aspect of that family [crosstalk 00:34:32].
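Twitter’s production system is its own in-house ANN stack, so as a hedged stand-in, here is the same idea sketched with the open-source Annoy library: build an index over the Tweet embeddings once, then answer each user’s query by traversing the index instead of exhaustively scoring every Tweet.

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

dim = 64
rng = np.random.default_rng(0)
tweet_embeddings = rng.normal(size=(100_000, dim))  # made-up Tweet vectors

index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity
for i, vec in enumerate(tweet_embeddings):
    index.add_item(i, vec.tolist())
index.build(10)  # 10 trees; more trees = better recall, bigger index

user_embedding = rng.normal(size=dim)
# Each lookup traverses the index rather than doing 100,000 comparisons,
# which is what makes the users x Tweets pairing tractable at scale.
neighbor_ids = index.get_nns_by_vector(user_embedding.tolist(), 10)
print(neighbor_ids)
```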
Kirill Eremenko: Wow, very cool. I was just thinking, when you gave the example of 300 million times 500 million operations, do you think if quantum computing picks up, that you’d be able to solve it completely differently? Just look at all the pairs and find the-
Dan Shiebler: I think there’s all kinds of optimizations that we do right now, that would be unnecessary if we had access to quantum computers. And in the deployment of machine learning models, certainly, but in the training as well. There’s many things we’re not able to do because of scale restrictions in terms of data collection, pipelines and such, that would be completely overhauled in the presence of really effective quantum computers.
Kirill Eremenko: That’s very cool. Have you looked into quantum computers quite a bit?
Dan Shiebler: Not as much as I’d like. There’s a research group at Oxford that does a lot of interesting research in the intersection of category theory and quantum computers. Utilizing category theory to make some quantum computing ideas much simpler and easier to build on top of. But, I can’t say that I’m familiar with it more than on a surface level.
Kirill Eremenko: Okay. Wow. It’s a very exciting topic and I can’t wait to see what happens when the quantum computers come. Thanks for such an interesting description of your role at Twitter. It’s very exciting and I can see how you’re super pumped to go to work every day [crosstalk 00:36:23] and come back and do your research. That is really cool.
Kirill Eremenko: I wanted to shift gears a little bit. For our listeners, we have an exciting announcement. Dan was at DataScienceGO 2018 and you are coming back this year, very excited to have you back. How are you feeling about that?
Dan Shiebler: I’m feeling great, looking forward to it.
Kirill Eremenko: And as we discussed, this time, we’re considering aiming for a more advanced practitioner talk for Dan. Out of all the things that you’ve talked about, if you had to pick a topic for your talk right now, what’s the first thing that comes to mind? In a hands on type of workshop, what is it that you would be passionate to share with the audience?
Dan Shiebler: Absolutely task engineering. The process of creating a machine learning task where a model that does well on that task, will actually drive business value. The creation of models that can tie closely to core value, I think is something that is a real science that I’ve continued to learn about. And I think is one of the most important areas in machine learning and data science for people to understand at a deep level.
Kirill Eremenko: Interesting. Can you give an example? What’s an example of task engineering in business?
Dan Shiebler: At Twitter, for example, when we train our models on Tweets or users, or any sort of data, we need to be very careful about how the models that we deploy affect the data that we’re training on. A model that’s already trying to show users content that it thinks they’ll like is corrupting the quality of the training data that feeds back into the model, in that the distribution is shifting. The task that we are constructing for the model, when we retrain it, is now worse than it was originally. The awareness of these kinds of issues, and the construction of the model task and the pipelines that support it in a way such that increased model performance will continue to increase business metrics, is a really deep science that has enormous applications.
Kirill Eremenko: Wow, that’s very cool. That’s a very good description, because the data you’re dealing with, you’re dealing with users. And now that I think about it, that would apply across most business cases. The only situations where that wouldn’t be relevant are when, for instance, the data you’re analyzing is a national cohort, or a global cohort, like a massive sample of people, you’re analyzing census data, or even daily data, but in a much greater ecosystem than your own company, and then you’re applying it to your company. Then what happens in your user base, with your company, doesn’t really affect the world that much. But, for instance, if you’re analyzing stock prices and then you go and buy some Tesla stock, or you sell something else, Apple stock, that’s not going to affect the world. You can keep analyzing the same way you were analyzing before. But in your case, you’re directly impacting the whole user base with your model.
Dan Shiebler: Absolutely. These kinds of decisions, and how these decisions can change user consumption, are really critical. Things like, if we start sending bad notifications to users, and users start opting out of the notifications, then we’re in a situation where we no longer have data coming from the users who really didn’t like their notifications. And a model that now starts performing well in this new setting and this new world, where we don’t have data from the users who didn’t like notifications, is not actually the best model. And the understanding of that, as we construct a task, such that the best model on that task is actually the best model for deployment, is really critical.
Kirill Eremenko: Wow, that’s such a cool teaser. I want to come to this workshop now, this is exciting. Awesome. Thank you very much. This is going to pique people’s interest in the event and also specifically your talk. If you want to learn about task engineering, check out Dan’s talk at DataScienceGO 2020, 23rd, 24th, 25th October.
Kirill Eremenko: And now, I wanted to jump into something really cool. I’m not sure if you saw, but 24 hours ago, I posted a question on LinkedIn, I was about to say Twitter, on LinkedIn. That’s where I hang out more for some reason, it’s happened that way. And I posted a question for our followers or our audience to post questions for you and see what they want to ask you on the podcast. We’re going to go through these.
Kirill Eremenko: Are you ready for some rapid fire questions?
Dan Shiebler: All right. See what I can do.
Kirill Eremenko: All right. Here we go. Deepa asks, how are unsupervised models improved over time and what are the metrics you track to measure them?
Dan Shiebler: Great question. They’ve improved over time in terms of scale, certainly. But in terms of our understanding of them and the development of them, there are many kinds of really deep unsupervised models, of course, that have come a very long way in the face of improved computation. I think that tracking the performance of an unsupervised model is something that’s extremely application dependent. If we’re training a feature extractor, then the performance of the model that is utilizing those features would be the sort of thing that we would be tracking. If we’re tracking something that’s going to be used for visualization, some sort of clustering or generative model, then it’s much trickier. There are heuristics we might be able to apply, but we may actually need human evaluation in order to really effectively compare models.
Kirill Eremenko: Okay. And does that change between unsupervised and supervised models?
Dan Shiebler: Supervised models tend to have a more built-in performance metric, in that there’s a goal in mind, some sort of prediction goal that we’ve constructed. For classification, it might be how well this model is actually completing the classification task. But of course, as I mentioned a moment ago with task engineering, this problem is not automatically solved for supervised models, because we have these situations where the task we’re training our model on is not actually what we’re interested in having it do.
Kirill Eremenko: And over time things might change as well. Previously, in one company, I had a situation where a classification model was built maybe, I don’t know, 18 months before I joined. And everything was great, but then the population behavior changed, because of, I don’t know, the aging of the population. And sometimes behaviors of consumers, especially in retail, change. And the model was no longer working, even though originally it had that supervised training.
Dan Shiebler: Absolutely. I think that’s a problem that many companies face. That’s certainly a problem that we grapple with.
Kirill Eremenko: And how do you deal with it?
Dan Shiebler: Regular retraining is one of the basic hygiene techniques that we utilize, but of course, when we’re in situations where our own model is corrupting the data stream, even that alone is not enough. Things like setting aside certain populations, deploying different models on different groups of users, and trying to avoid these kinds of self-contamination effects, can go a long way.
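One hedged sketch of the “setting aside certain populations” idea (the bucketing scheme below is illustrative, not Twitter’s actual mechanism): hash each user ID into a stable bucket, serve a small holdout bucket with a baseline policy, and retrain on engagement from that bucket, since its data has not been shaped by the production model.

```python
import hashlib

def user_bucket(user_id: str, n_buckets: int = 100) -> int:
    # Stable assignment: the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def choose_policy(user_id: str) -> str:
    if user_bucket(user_id) < 5:  # 5% holdout population
        return "baseline_policy"  # un-personalized recommendations
    return "model_policy"         # current production model

# Engagement logged from the holdout bucket gives training data that the
# production model has not contaminated with its own recommendations.
print(choose_policy("user_42"))
```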
Kirill Eremenko: Got you. Next question is from Linda. What emerging technologies should we be paying attention to, and which industries will they impact the most?
Dan Shiebler: I think that improvements in hardware have really come a long way in terms of the types of machine learning models that can be used, and the kinds of applications that we build on top of them. And I think that one of the reasons why compute hardware, things like GPUs and TPUs, and up until a few years ago improved CPUs, has become so important in terms of what gets built, is that it’s a feedback effect. When a new kind of hardware is shown to be really powerful for a particular application, more things get built utilizing that hardware and for that application, which then spurs additional research into that kind of hardware.
Dan Shiebler: One of the reasons why machine learning conferences are so completely swamped right now with super deep networks, rather than more rule-based or symbolic kinds of approaches, is that the sorts of hardware that we have access to, the best, most powerful kinds of hardware, are really well suited for deep networks.
Dan Shiebler: And that’s a result of the self-supporting process of deep networks encouraging more research on these kinds of hardware, which then encourages more research and better results from deep networks.
Kirill Eremenko: Got you. Would you agree with what I’ve seen in the news recently, over the past half year or so, that Moore’s Law is dead? That we’ve come to a limit in terms of how small our integrated circuits can be and how many transistors can fit on them, and that’s it. From here, the amazing exponential benefit that we were getting is over, and now it’s all going to flatten out.
Dan Shiebler: In some ways, I definitely think that’s true. I can’t say that I’m an expert in transistors, or the necessary limitations on how small we can make them. But I can say that improvements in our ability to parallelize computations, and improvements in the construction of specialized hardware, have allowed us to maintain exponential growth in terms of the computations we’re capable of. Certainly these effects seem like they have limits and ceilings that are much lower than the seemingly unbounded limitations of Moore’s Law. But it’s certainly possible that, as innovations continue, we’ll be able to find new ways to utilize other kinds of tricks to continue to improve computation. I don’t think that the speed of computation will necessarily never be able to increase at an exponential rate simply because we can’t make transistors smaller right now.
Kirill Eremenko: I agree. I completely agree. I think we’ll find a way. It’s been so good. The next one is from Oscar, who is asking for some insights into how Twitter is using machine learning to detect bots, or bot accounts, or bot farms. And, what are scalable solutions that are being implemented for cyber security and/or fraudulent account detection? Anything you can share on that?
Dan Shiebler: I can’t talk about specifics on this, also because I don’t work on those teams, and so I don’t have an intimate understanding of the specifics. But I will say that there are multidisciplinary teams combining machine learning techniques, heuristics, and really rigorous research into and understanding of the sorts of adversaries in the field, and of user behaviors, the diversity of all kinds of healthy user behaviors as well. That’s understood not only at an engineering and machine level, but also at a very human level, to combat these kinds of issues.
Kirill Eremenko: Okay. Perfect. Next one is from Nikhill who’s saying, how much time is realistically spent on data to get it ready for model development?
Dan Shiebler: I think it really depends on what state the data is beginning in, and the expectations of the model. Of course, it’s very easy to go to scikit-learn and train a logistic regression on the Iris data set. There’s really not much data [inaudible 00:50:07] at all. But accessing data, for example, if you’re a data scientist who works at the Federal Reserve, it may take you years to be able to complete all of the necessary documentation, and track down all of the data in all of the different places under each of the different permission walls, and then process it into a form that will realistically work. I’d say somewhere between 10 seconds and multiple years, depending on your application. Realistically, for a more useful answer, I’d say in general probably at least 80% of modeling time would be spent on some sort of data related task.
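The “10 seconds” end of that range really is this short; here is the scikit-learn example Dan mentions, where the bundled Iris data needs essentially no preparation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # held-out accuracy
```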
Kirill Eremenko: Yeah, like out of the whole, right? The modeling would be 20% of your time spent on the whole thing.
Dan Shiebler: Yeah.
Kirill Eremenko: For instance, at Twitter, when you’re developing some new model or something, I assume you already have some data pipelines prepared. But, if you were to create a new data pipeline, how long would that take you?
Dan Shiebler: Even for creating new data pipelines, a lot of our tooling is very well developed for exactly that purpose, for the process of creating new data pipelines and for the process of maintaining the data pipelines that we already have. I think the most time consuming problems at Twitter are really understanding model behavior, and understanding how a new source of data will allow us to construct better models, and less the actual engineering work itself, or the modeling work, both of which are very well supported by tools. It’s the decision making and analysis and understanding that can often take most of the time.
Kirill Eremenko: Isn’t that amazing, you don’t need to process data. This is one of the rarest cases in data science, where you just have the luxury of, all right, I’m going to think about creative stuff all day long. Well, of course, there are some more mundane tasks, I’m sure, but you’ve created an environment such that you can just do the fun stuff all the time. It’s so exciting.
Dan Shiebler: I think that it’s supported by the really serious investment of Twitter into making modeling easier and making modeling more scalable. I will say that there are of course tradeoffs to having so much of the pipeline already exist and already be buildable and adaptable, in that when we want to build modeling strategies that break some of the abstractions that are in place, it can be very challenging to understand the pipelines that have been built up over years, by many different teams. And there’s a very real learning curve to the depth of Twitter’s infrastructure and Twitter’s modeling pipelines, that I think can be intimidating for people who start.
Kirill Eremenko: Was it already in place when you started two years ago?
Dan Shiebler: It has certainly changed very significantly. But a very serious amount of this infrastructure was definitely in place. I remember having difficulty in the beginning really wrapping my head around the pure scale of what exists. Very common for me, at the beginning, was to build 80% of a solution, only to find that some other team in London or Boston had a solution that was far better than mine, that they’d spent the last several years on, and that completely obviated the need for any of my work in the first place. Often, understanding what’s been done previously in a space, really at a deep level, and what can be exploited from the work that’s previously been done, can be more valuable than trying to write a half-baked solution. Even if it can be more fun to write the half-baked solution.
Kirill Eremenko: Got you. And it’s interesting, because from this it sounds like it’s a big investment and a big bet for Twitter to bring you, or someone, on board and have them spend a few months getting their head around these things. They’re investing their time and their efforts into this new person that’s joining the team. They want to be sure that you’re going to stay for long enough to create some stuff of your own and bring your ideas to the team.
Kirill Eremenko: We didn’t speak about this, but I’m curious, how did your interview with Twitter go? Was it very clear at the start, okay, this is a perfect match? Or were you still thinking, or were they thinking? How did they know that you were the right person? It’s only a 10 person team, and by adding you to it, they’re betting you’re going to bring a lot of value to the company.
Dan Shiebler: Well, the team that I’m a part of now didn’t really exist when I started. But when I started, Cortex, the entire organization, was only about 15 people. Like I said, it’s almost 10 times larger now. I don’t think that when they hired me they were thinking about the way things would be right now, in this position. I think they were more considering the possible ways that Cortex might develop and Twitter might develop, and how I could help and fit into these different possible developments. And I think one of the reasons why Twitter has managed to remain relevant, and be a really important social network in the world, is that there’s a lot of attention paid to the kinds of people that we hire.
Kirill Eremenko: Got you. And another question popped to my mind. The team, Cortex, has grown 10X, from 15 to 150, you’ve been there two and a half years. Any thoughts, is it in your plans to become a data science manager, or you prefer to do the hands on work and develop your skills there?
Dan Shiebler: I definitely feel that I’ve grown significantly as a leader over the course of my time at Twitter. I’ve been tech lead for a number of projects and I’m continuing to lead various sorts of initiatives. I do think at some point, perhaps in the not so distant future, I will switch to management. Because I do really enjoy leadership and thinking about things from a higher level. At the moment I’m invested in making the technical projects that I’m a part of be successful. Whether that’s through direct technical involvement, mentorship, or leadership on a more macro scale.
Kirill Eremenko: Got you. Okay. Let’s do one more question. There’s a lot more. But this one got the most votes, people actually voted for the questions.
Dan Shiebler: Oh, cool.
Kirill Eremenko: Here we go. This is from Oren. Oren asks, how much of computer science topics, like algorithms and data structures, does a non-computer science data scientist need to master in order to advance from a build-a-model-and-present-your-report type of data scientist to a machine learning engineer who normally deals with production processes?
Dan Shiebler: I would say that there are a couple of ways of looking at that. In one sense, I do think that it’s quite possible to really advance as a serious machine learning engineer without ever thinking super deeply about some of the core data structures and algorithms. But I do think that somebody who does that is at a disadvantage, because there are many concepts that are critical in terms of the structure of different sorts of systems and the interplay of different kinds of components. And the elegance of different sorts of techniques can feel very unified, clear, and easy to understand when you understand these key topics to begin with, but can feel more jagged, or harder to wrap your mind around, or harder to have that sort of solution be your first attempt, if you’re coming at things by learning each fact individually, rather than really developing an understanding of these kinds of fundamentals.
Dan Shiebler: That said, I will say that there are situations when these fundamentals themselves can help directly. Not too long ago, I found myself in a situation where a suffix tree, which is a classic intro-to-data-structures-and-algorithms data structure, was exactly what I needed in order to build a feature importance algorithm that would run efficiently. And implementing it yielded a 10X speed up over the next best solution. And I certainly never would have come to that had I not taken a data structures and algorithms class back in the day. But the fact that this is a single anecdote from six months ago, and I certainly can’t think of another one from the past year, I think probably says that the knowledge itself is not incredibly important.
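Dan doesn’t describe the feature importance algorithm itself, so as a hedged illustration here is just the data structure in its simplest form: a naive suffix trie (O(n²) construction, versus O(n) for a real suffix tree built with Ukkonen’s algorithm) that answers substring queries in time proportional to the pattern length.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a nested-dict trie."""
    root: dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie: dict, pattern: str) -> bool:
    """Check whether pattern occurs anywhere in the original text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"))  # True
print(contains(trie, "nab"))  # False
```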
Kirill Eremenko: Focusing on fundamentals and structures of systems, you gave that one example of the suffix tree, which I’d be curious to learn more about, but I’ll do that in my own time. What’s another example? Not of an application like that, but of how thinking about the fundamentals can help somebody advance their career?
Dan Shiebler: There’s a lot of times when the construction of a system can take different roles in terms of its interaction with different interfaces. There’s a degree of abstraction that comes in, in the creation of software systems. The assembling of pipelines that deal with different sorts of data sources, different kinds of modeling infrastructure. The different ways that we can structure the sorts of software pipelines that touch on each of these different kinds of systems. When they’re well-structured in ways that make bugs difficult to introduce, make systems easy to adapt and add to and redesign, this can yield enormous improvements in model quality and pipeline quality over time. Especially when operating as part of a team. I think that one of the largest applications is in the construction of data generation pipelines, and the model training code as it interfaces with these pipelines, and having those constructed in a principled way is really valuable.
Kirill Eremenko: Okay. In a nutshell, the answer would be, rather than going for quantity of topics in computer science, go for the fundamentals and the structure of systems, and think things through holistically. Then the follow-up question I would have is, how does one go about learning this kind of stuff? Do you have any books you can recommend, or sources online? Or even just specific topics to look into, for somebody who’s serious about following this pathway but just doesn’t know where to get started.
Dan Shiebler: Yeah, absolutely. I do think that there’s value in going through core algorithms, data structures textbooks, for the purpose of understanding these concepts. I personally like Algorithms by Dasgupta for that. But I would say that would be more of a second order strategy.
Dan Shiebler: I think that the first order strategy, in terms of the fastest way to really develop this intuition on a deep level, is to simply be part of large software projects. For somebody working at home, this would mean contributing to open-source projects, ideally in a way that you would be able to get feedback on the code that you write, through code reviews or through a community of people who are contributing to a large project. Or, for somebody who’s working as a data scientist in a company, trying to get an understanding of the kinds of systems that software engineers are working on. And if you can, even be part of one of those projects for a little while, and understand these things from the perspective of the software engineers, who write code that gets reviewed by multiple people and is part of really large, complex, multi-tenant infrastructures, and the kinds of concerns involved there. There’s really no better way to learn these sorts of issues than by simply working on them on a day to day basis.
Kirill Eremenko: And if you’re stuck at home, you don’t have access to something like this at work, or you’re still learning and things like that, you can just go to Github, open a recent development in machine learning or deep learning, whatever you’re interested in and read through how it developed. What is version one, what is version two, what was fixed, what was changed, what bugs came up, what bugs were removed. What features were added, what were the user complaints and so on. And just by doing that you can understand better the intuition, as Dan here pointed out, the intuition that went into all this. And the motives that were driving these changes.
Dan Shiebler: Yeah, absolutely. And as you become more comfortable, being part of it and contributing to it yourself, and feeling the pain of these bugs, I think is a really exceptional way to grow.
Kirill Eremenko: Mm-hmm (affirmative). Got you. Well on that note, we’re coming close to the end of this podcast, been super exciting. How did you enjoy your second appearance on this show so far?
Dan Shiebler: It was great. Excellent. Lots of fun.
Kirill Eremenko: Great. I loved chatting to you. Great insights. Any parting thoughts? Any things you’d like to wish our audience on their way to becoming machine learning engineers and data scientists?
Dan Shiebler: I would say to really keep your mind open with respect to learning things. It can be very easy to fall into the trap of only reading about the very latest, highest-scoring-on-benchmarks sorts of architectures and really focusing on that. But there’s real value in a deep understanding of how machine learning got to where it is: understanding what machine learning was like in 1990, what were the people then thinking? I think approaching things from a temporal perspective is an excellent way to develop the kinds of intuitions that make somebody an exceptional machine learning engineer and machine learning researcher. I would encourage people to really think about how to develop that understanding as deeply as possible.
Kirill Eremenko: Fantastic. Great advice. Well on that note, Dan, what are the best ways for our listeners to get in touch with you, or follow you, contact you? Just see how your career develops from here.
Dan Shiebler: My LinkedIn, Dan Shiebler, works. Also my email, if anyone has any questions for me, I’m happy to answer them: danshiebler@gmail.com, or dshiebler@twitter.com if it’s Twitter related.
Kirill Eremenko: Mm-hmm (affirmative). Got you. Fantastic. Well, once again, thanks so much. And you mentioned one book, but before I let you go, I wanted to see, do you have any other books that you can recommend that have impacted your career personally?
Dan Shiebler: Absolutely. I have two books, actually, that I’ll recommend. The first one is something I read very early on. It was probably the first actual textbook I read that had anything to do with programming, and it’s Coding the Matrix by Philip Klein. It’s actually a book on linear algebra, and I’d recommend it for somebody who is either a data analyst or a software engineer who doesn’t necessarily feel super comfortable with linear algebra. Because of the ideas introduced in this book, there are many that ended up being really pivotal in my understanding of machine learning. And I think it’s just written from a great perspective, for somebody who wants to understand how each of these different algorithms, that deal with matrices and deal with vectors, play together in a way that makes sense to someone who’s used to programming.
Kirill Eremenko: Got you. It’s interesting, Klein, for a second I thought it was the Klein that developed that abstract mathematical concept. What was it called? The Klein bottle, or something like that, but obviously not. Probably not, it’s a more recent guy.
Dan Shiebler: It is. He is a little bit more recent. But he’s also a very abstract mathematician, who does some very interesting abstract research on graph theory.
Kirill Eremenko: All right. And then the second book?
Dan Shiebler: The second book is something I read more recently. It’s An Introduction to Computational Learning Theory by Michael Kearns. This is definitely a far less applied book, and not necessarily one that I’d recommend to someone who’s looking for a book that will immediately change their career. But it’s written from the perspective of the state of the art of machine learning, and the theory behind machine learning, in 1994. And it introduces a lot of fundamental ideas, some of which have really gone on to take off, and some of which were largely forgotten. But understanding things from that perspective, and in the theoretical framework that’s discussed in it, has, I think, given me a lot of context in learning new things about machine learning, and in understanding which ideas last and which ideas end up disappearing.
Kirill Eremenko: Fantastic. Exactly what you mentioned before.
Dan Shiebler: Yeah.
Kirill Eremenko: Study the history of something. Yeah. Very cool. It’s interesting you mentioned it, because in the FiveMinuteFriday episodes that I do on the podcast, literally as this episode goes live, there are going to be five episodes about the history of data science. It doesn’t go into the details of algorithms and things like that, but historically, how the field of data science has been progressing. Because I was also curious, I had the same thought. In fact, actually, the team suggested this. And I was like, wow, this is a really cool idea. Knowing the history of something allows you to understand better what the future will be like.
Dan Shiebler: Absolutely agree.
Kirill Eremenko: Yeah at least the fundamentals, right?
Dan Shiebler: Yeah. I totally agree. That sounds great.
Kirill Eremenko: Awesome. Well, once again Dan, thanks so much for coming on the show. Looking forward to seeing you at DataScienceGO 2020. Can’t wait for your talk, it’s going to be epic.
Dan Shiebler: Absolutely. Looking forward to it. Thank you.
Kirill Eremenko: So there you have it everybody, that was Dan Shiebler, Senior Machine Learning Engineer at Twitter Cortex. What was your favorite part of the discussion? For me it was definitely the whole talk about Dan’s PhD, this whole conversation about category theoretic machine learning and algebraic topology brought memories rushing back from my university years. So it was really good fun listening to that, but I’m sure you had your own personal favorite part of this talk. If you would like to meet Dan in person and be part of that advanced practitioner workshop, an exclusive track for advanced practitioners, make sure to secure your seat today. Head on over to datasciencego.com and click the option for Los Angeles, he will be there. We are running in two cities this year, Berlin and Los Angeles. You want the Los Angeles option, for 23rd, 24th, 25th October. Get your ticket today and you’ll be part of that advanced practitioner group, and you’ll learn from Dan in a hands-on workshop, personally from him. So once again, the website is datasciencego.com.
Kirill Eremenko: And as usual, you can get all of the show notes and materials mentioned in this episode at www.superdatascience.com/345. You’ll get the transcript there, plus any links and materials we mentioned, including the URL to Dan’s LinkedIn, where you can connect with him and follow him, and any other places on social media where you can catch up and follow him as well. So, that is at www.superdatascience.com/345. That’s also how you can share this episode with your friends and colleagues. Just send them the link www.superdatascience.com/345 so they can get up to speed with all the amazing topics we talked about today, including vectors, embeddings, nearest neighbors, the different techniques and methodologies Dan uses in his work, plus how to think about your career and why to maybe even do a PhD in parallel.
Kirill Eremenko: So there we go, hope you enjoyed this episode. Can’t wait to see you on the next one and until then my friends, happy analyzing.