SDS 571: Collaborative, No-Code Machine Learning

Podcast Guest: Tim Kraska

May 3, 2022

In this episode, Einblick co-founder and MIT associate professor Tim Kraska joins Jon Krohn to discuss no-code collaboration tools for data science. The duo also uncover the clever database and machine learning tricks under the hood of Einblick and talk about how no-code will shape the future of the industry.

About Tim Kraska
Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT’s Computer Science and Artificial Intelligence Laboratory, co-director of the Data System and AI Lab at MIT (DSAIL@CSAIL), and co-founder of Einblick Analytics. Currently, his research focuses on building systems for machine learning, and using machine learning for systems. Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the VLDB Early Career Research Contribution Award, the VMware Systems Research Award, the university-wide Early Career Research Achievement Award at Brown University, an NSF CAREER Award, as well as several best paper and demo awards at VLDB, SIGMOD, and ICDE.
Overview
Tim Kraska began working on the visual, no-code tool Einblick (meaning ‘one view’ in German when written as two words) back in 2013 after being inspired by shared digital whiteboard tools that were increasing in popularity. The tool’s primary goal is to “bring people together with technical backgrounds [and] have everybody in the same room and make discoveries on the fly.”
Using a progressive approximation engine, Einblick delivers near-instantaneous machine learning results that are refined gradually in the background. For any operation, Einblick takes a sample of the data and runs the computation on that sample to return speedy results, often in under a second. Meanwhile, in the background, it makes the sample larger and larger until it converges to the final answer. This process allows several people to work on a project together while avoiding the “awkward” waiting periods that occur on highly technical projects.
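The sampling loop Tim describes can be sketched roughly as follows. This is a toy illustration of the general idea, not Einblick’s actual implementation; the starting sample size and growth factor are arbitrary choices made for the example:

```python
import numpy as np

def progressive_mean(data, start=1_000, growth=4):
    """Yield successively refined estimates of the mean:
    compute on a small sample first, then on larger and
    larger samples until the full dataset is used."""
    rng = np.random.default_rng(0)
    n = start
    while True:
        n = min(n, len(data))
        # Sample without replacement and compute on the sample
        sample = rng.choice(data, size=n, replace=False)
        yield n, float(sample.mean())  # quick approximate answer
        if n == len(data):
            break  # the last estimate used all the data: it is exact
        n *= growth  # grow the sample in the background

data = np.arange(1_000_000, dtype=float)
for size, estimate in progressive_mean(data):
    print(f"sample={size:>9,}  mean≈{estimate:,.1f}")
```

Each yielded estimate is usable immediately, and the final iteration runs over the full dataset, so the last value is exact rather than approximate, matching the "converges to the final answer" behavior described above.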
The combination of AutoML and Einblick’s progressive approximation engine brings more people together to make greater discoveries and allows those who may not code to take over more technical responsibilities. This shift frees up technical talent to focus on more complex problems and drastically increases overall team productivity. But by how much? Across the complete data science lifecycle, their user study saw productivity grow by up to 50% and capability increase by a factor of two.
With such great results, Jon asked whether (and when) no-code data science would become the mainstream option. The answer isn’t so simple, Tim said. At the moment, business people are rarely using no-code tools, and any move toward them will require organizational change. Ultimately, Tim noted, pure no-code will never exist. Instead, he expects a “code-last” approach to succeed: teams start visually and drop into code only when needed.
Next, Tim outlined the critical steps that organizations should take if they want to become more data-driven: 
  1. Education for non-technical business people is essential.
  2. The right tools and integrations are essential, which is where a platform like Einblick comes in handy.
  3. A proper incentive structure that encourages employees to access data or data models on their own, instead of badgering data scientists, must be implemented. 
As an associate professor at the globally revered Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (CSAIL), it comes as no surprise that Tim considers academia a great “breeding ground” for innovative ideas that transform industries. In fact, his own research is precisely how Einblick grew into what it is today.
Tune in to this episode to learn how ML applied to databases enables them to be faster and more efficient, and discover the future trends that are bound to shake up data science tooling and database optimization.
In this episode you will learn:
  • The inspiration behind Einblick [2:45]
  • Einblick’s progressive approximation engine [6:43]
  • How no-code tools impact productivity [17:18]
  • The critical steps to become more data-driven as an organization [24:30]
  • How research universities like MIT support high-risk, long-term research [38:37]
  • How ML applied to databases enables them to be faster and more efficient [42:03]
  • How real-time collaboration environments like Google Docs are likely to become more widespread for data science tasks [49:24]
 
Items mentioned in this podcast:  
Follow Tim:

Follow Jon:
Episode Transcript


Jon Krohn: 00:00

This is episode number 571 with Professor Tim Kraska, co-founder of Einblick, and Associate Professor at MIT. 
Jon Krohn: 00:12
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple. 
Jon Krohn: 00:43
Welcome back to the SuperDataScience Podcast. We’ve got a deep, innovative one for you today, with Professor Tim Kraska. Tim is an associate professor in the globally revered Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, or, if you’d like to say that the quicker, more common way, CSAIL out of MIT. Based on his research at CSAIL, Tim co-founded Einblick, a visual data computing platform that has received $6 million in seed funding. Previously, Tim was a professor at Brown University, a visiting researcher at Google, and a post-doctoral researcher at UC Berkeley. He holds a PhD in computer science from ETH Zurich in Switzerland. 
Jon Krohn: 01:24
Today’s episode gets into technical aspects here and there, but will largely appeal to anyone who’s interested in hearing about the visual, collaborative future of machine learning. In this episode, Tim details how a tool like Einblick can simultaneously support folks who code, as well as folks who’d like to leverage data and machine learning without code; how this dual no-code/Python-code environment supports visual, real-time, point-and-click collaboration on data science projects; and how it was inspired by Hollywood films like Minority Report. He also talks about the clever database and machine learning tricks under the hood of Einblick that enable the tool to run effectively in real time, how to make data and data models more widely available in organizations, and how university research environments like MIT CSAIL support long-term innovations that can then be spun out to make game-changing commercial impacts. All right, are you ready for this fascinating episode? Let’s go. 
Jon Krohn: 02:22
Tim, welcome to the SuperDataScience Podcast, it’s so great to have you here today. Where are you calling in from? 
Tim Kraska: 02:29
Thanks Jon, really great to be here. I’m currently in Cambridge in my MIT office. 
Jon Krohn: 02:36
Nice, yes. And so you’re an associate professor at the iconic MIT Computer Science and AI Laboratory, or CSAIL for short. But you also recently co-founded Einblick, and Einblick seems to be doing really well for these early days. You’ve already raised six million in seed funding. So in this episode, we’re going to talk about both of these things. We’re going to talk about your research, and we’re going to talk about how it’s being applied to Einblick. You have a paper, for example, called “Northstar: An Interactive Data Science System,” and we’ll be sure to put a link to that paper in the show notes. And that paper outlines the motivations and key requirements behind what later became Einblick. So, the system is inspired by futuristic visions of highly collaborative visual environments, like those in Bond movies, and in Minority Report. So, it’s fun that that’s where the inspiration comes from, and that you mentioned that in the paper. Einblick, which later came out of that, it’s a German word. It means kind of one view, right? You speak German better than me, so tell us where the name came from, how it came out of your research, fill us in. 
Tim Kraska: 03:48
Sure, I’m more than happy to. Einblick actually means one view, if you write it as two words. And the reason why we did that is because we are a very visual interface, and if you do something with Einblick, there’s one view where you should see everything necessary to get your insight. At the same time, if you write Einblick as one word, how we have it on our webpage, it actually means insight in German. So it’s a very clever wordplay, which of course almost nobody gets except when you’re from Germany. But a little bit more about the background, how this whole project actually came together, and you mentioned in the beginning Minority Report. The project started a long time ago, like in 2013 we started really thinking about it. 
Tim Kraska: 04:43
And around ’14 and ’15, we had the first prototype. And the main motivation was that we saw these large, interactive whiteboards appearing on the market, like the Microsoft Surface Hub, the Google Jamboard, the Samsung Flip. There were a whole bunch of them, and they’re all around, and they’re getting more traction than ever. And essentially how you can think about them is, it’s a large, touchable TV you put on a wall to have a better video conferencing experience. And we saw these devices appear on the market, but they were mainly used to have a shared whiteboard between two locations, and we thought they could be so much more. And a little bit of the inspiration was like, if you watch any movie, just pick your favorite, you never see a person coding Python in these movies, if it’s anything with data. 
Tim Kraska: 05:37
It’s always this visual interface where somebody does some very quick discovery in a team with other people around. But it’s never that somebody’s coding Python in front of a PC and other people are watching over his shoulder; this simply doesn’t happen. And we were wondering, what would it actually take to make this vision we see in movies actually true? To use these touchable, interactive whiteboards to create an environment where people can actually work together and do discoveries on the fly. And then of course, because of the pandemic people were not in the same room anymore, we had to rethink everything again in the context of remote collaboration. But the key vision behind what we are doing actually stayed the same. We want to bring people together with technical backgrounds in data science, as well as the domain expertise from the business people, have everybody in the same room and make discoveries on the fly, create insights, create models that can improve the business overall. 
Jon Krohn: 06:40
Super cool. So, part of what drives, what allows Einblick to be such a powerful tool is that it’s based on something called a progressive approximation engine. So, that sounds pretty fancy, Tim. What does that mean, and how does it work? 
Tim Kraska: 06:57
Yes. Actually, this came as an afterthought when we started that project. So, I have a systems background, and I mainly work at the intersection of machine learning and systems. And normally what we tend to do is, we develop the system first, and then we think later on about the user interface on top of it. In the research project Northstar, which is now the company Einblick, we actually did it exactly the other way around. We started with what we consider to be a good user experience, and then, based on what we wanted, we developed the corresponding system to support that interface. And one key thing we discovered, particularly in this context of collaboration, is that you cannot have people wait for results. 
Tim Kraska: 07:45
So, let’s assume like you work alone on a Python notebook, you start your AutoML tool, and it tells you it runs for half an hour or an hour. It’s normally not a big problem, because you just say like, “Okay, I get a coffee, I walk away, in half an hour I check back in again, and I have my models built, and then I continue from there.” But if you are in a meeting and an operation takes like a minute, or five minutes, and now you have your manager next to you, staring with you on a screen with a wait icon. That’s pretty awkward, and if this happens once maybe. But if it happens a second time, everyone will immediately say, “Okay, let’s take that offline,” right? 
Jon Krohn: 08:27
Right, right. 
Tim Kraska: 08:27
And the collaboration is over. So, nobody wants to wait in a team, it’s just an awkward experience. And so, we needed to figure out a way to overcome that, and the answer, or our answer to the problem, was this progressive approximation engine. So, how that actually works is, for any operation you do in our interface, it first takes a sample of the data and runs the computation over the sample so that you get a very quick result, normally in sub-seconds. And then in the background, it makes the sample larger and larger, until it eventually converges to the final answer. So, everything in Einblick stays interactive regardless of the data size, regardless of the complexity of the operation, so that you can have several people work on it together, and make discoveries on the fly. 
Tim Kraska: 09:19
But at the same time, if you wait long enough, everything will converge to the final answer, and you don’t have these uncertainties of approximate query processing. Otherwise, if everything is approximate and stays approximate, you’re wondering, “Can I trust the result?” Using our technique, we actually bridge the gap: we get fast, approximate response times in the beginning, and in the end you have the same final answer you would have gotten if you had walked away for half an hour and waited. 
Jon Krohn: 09:46
Super cool. So, that sounds like some ingenious tech, you can see how this is such a transformative tool. And when you first think of this you think of, “It’s just a UI that they’ve had to be clever with.” But it isn’t, it’s under the hood in order to have, like you say, that UI work effectively for collaboration when you have machine learning happening in the background. Yeah, people want to see results right away, there’s no point. You can’t every few minutes be saying, “Okay, let’s grab a coffee,” [crosstalk 00:10:17]. 
Tim Kraska: 10:16
Right, exactly. And it’s just awkward. And I think this was, research-wise, such an exciting project because of it, because we considered everything, starting from the user interface, down to the system design. And then it also made us design a whole range of new algorithms. For example, we have an AutoML tool, as many other platforms do by now as well. And our AutoML tool is designed, again, for interactivity. So, it memorizes which models run fast and give you good results early on, before trying more complex models which take longer to run and which it should probably try later in the cycle. It’s all about how you can help the end users make faster discoveries, as well as bring more people together to actually work on a problem as a team. 
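The cost-ordered scheduling Tim describes can be sketched roughly like this. It is a toy illustration of the general idea, not Einblick’s implementation: the candidate “models,” their fit functions, and their expected costs are all made up for the example, and a real system would learn those costs from past runs rather than hard-code them:

```python
import time

# Hypothetical candidate "models". The time.sleep calls stand in
# for training time; the returned values stand in for fitted models.
def fit_fast(data):    # e.g. a cheap baseline
    return sum(data) / len(data)

def fit_medium(data):  # e.g. a small tree ensemble
    time.sleep(0.01)
    return sorted(data)[len(data) // 2]

def fit_slow(data):    # e.g. a large hyperparameter search
    time.sleep(0.05)
    return max(data)

def interactive_automl(data, candidates):
    """Run candidates in ascending order of expected cost so the
    user sees a usable model almost immediately, while better
    (slower) candidates keep arriving in the background."""
    for name, expected_cost, fit in sorted(candidates, key=lambda c: c[1]):
        start = time.perf_counter()
        model = fit(data)
        yield name, model, time.perf_counter() - start

candidates = [
    ("slow", 3, fit_slow),
    ("fast", 1, fit_fast),
    ("medium", 2, fit_medium),
]
for name, model, secs in interactive_automl([3, 1, 2], candidates):
    print(f"{name}: result={model} ({secs * 1000:.0f} ms)")
```

Because the generator yields cheap candidates first, a UI consuming it can show a first model within a fraction of a second and replace it as slower, stronger candidates finish, which is the interactivity property Tim describes.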
Jon Krohn: 11:10
Yeah. So, let’s dig into that AutoML piece a little bit more, which I think is a really cool aspect of it. So, when you’re using Einblick, you could be doing it by yourself, or you could be doing it with somebody else, and you can grab functions that are on a two-dimensional plane and kind of link them together into workflows. And if you want to do some machine learning with some of your data, then you can choose to do AutoML. So, automatic machine learning, we’ve talked about it on earlier episodes. But for listeners who haven’t listened to those episodes or aren’t familiar with the term, it’s this idea of allowing an algorithmic approach to selecting the appropriate model for the problem that you’re tackling, as well as appropriate hyperparameters for the model. 
Jon Krohn: 11:57
So, the configuration of the particular modeling approach. And so, there’s a really cool no-code way that you’ve implemented, where, if I were writing Python code, I would have a large number of arguments when I, say, select some AutoML function. And the way that you do it in Einblick is it’s a series of questions. So it’s like, are you using this kind of data, or that kind of data? Are all the variables as important as the others? And so by asking these kinds of prompts, anybody, whether they have coding experience or not, can be guided through an AutoML selection procedure. And then in the background, using your progressive approximation engine, you’ll start to get results right away. And then, as you mentioned, they will be more and more refined as more complex algorithms can be used behind the scenes. 
Tim Kraska: 12:56
Yes, I think you really got to the point. I think when we designed the interface, what we always had in mind was that we want to enable more people to do more work with data on their own. And at the same time, also have the right environment for the hardcore data scientists. And the key aspect was always, let’s try to figure out a way that we can bring the two together. And so what the platform actually does is, for the user who just gets into it, or the business expert who doesn’t know how to code in Python yet, or took like one Python class maybe, we offer this assistant, which, as you said, walks you through and tells you how to configure an operator in a very guided, easy way. 
Tim Kraska: 13:46
On the other hand, if you are a pro user, you can not only skip the guide, like the assistant, entirely, but you can even use a code-only approach in the same platform, and you can also create your own operators. And so, this creates this environment where you can mix and match visual operators, which can be used by everyone, with code, and then the code being packaged as new operators for the more advanced users. And so, now you have an environment where you actually can bring the two together. So if you have a data science team, and let’s say a product manager, and they are trying to figure out, “Okay, what features of the platform are actually used the most? What feature is used by what type of audience? And depending on where the user came from, what is he most likely to use next? And then, what’s the likelihood that he gets stuck at a certain point?” All of these require that the product manager works together with a data scientist, and now we have an environment where they can do that, and over time the product manager can do more and more by himself because he sees how it’s done, right? 
Jon Krohn: 14:57
Right, right. 
Tim Kraska: 14:57
And you train them on the fly. And so in the end, the data scientist can really focus on the things which are more complex, and more advanced, whereas the product manager takes all these tasks he needs for his day-to-day operation, instead of having that split between the two roles. So, I think these systems help with that, and there are a bunch of other things we built into the platform to facilitate that. 
Jon Krohn: 15:22
Eliminating unnecessary distractions is one of the central principles of my lifestyle. As such, I only subscribe to a handful of email newsletters, those that provide a massive signal-to-noise ratio. One of the very few that meet my strict criterion is The Data Science Insider. If you weren’t aware of it already, The Data Science Insider is a 100% free newsletter that the Super Data Science team creates and sends out every Friday. We pore over all of the news and identify the most important breakthroughs in the fields of data science, machine learning, and artificial intelligence. The top five, simply five news items, are hand-picked: the items that we’re confident will be most relevant to your personal and professional growth. 
Jon Krohn: 16:10
Each of the five articles is summarized into a standardized, easy-to-read format, and then packed gently into a single email. This means that you don’t have to go and read the whole article; you can read our summary and be up to speed on the latest and greatest data innovations in no time at all. That said, if any items do particularly tickle your fancy, then you can click through and read the full article. This is what I do. I skim The Data Science Insider newsletter every week; those items that are relevant to me, I read the summary in full. And if that signals to me that I should be digging into the full, original piece, for example to pore over figures, equations, code, or experimental methodology, I click through and dig deep. So, if you’d like to get the best signal-to-noise ratio out there in data science, machine learning, and AI news, subscribe to The Data Science Insider, which is completely free, no strings attached, at Superdatascience.com/dsi. That’s Superdatascience.com/D-S-I. And now, let’s return to our amazing episode. 
Jon Krohn: 17:17
Wow, so cool. Do you have an idea, Tim, about how collaboration and no-code environments like Einblick actually impact productivity? 
Tim Kraska: 17:30
Yes, we do. In some of them it’s anecdotal, in some we actually try to get quantitative measures. So one thing, we actually ran a large user study comparing Python coding versus like a traditional visual workflow engine, like Alteryx, KNIME, you can pick one of them, and we used one as a placeholder for them. And then Einblick, and what we looked into is just like, we gave a set of users a bunch of different tasks, and asked them to solve them in the different platforms. So like, there was a set of tasks for Python Notebooks, a set of tasks for one of these visual environments, and then a set of tasks for the Einblick platform. And then at the same time for… We only recruited people who had Python experience, or who said about themselves that they’re actually data scientists.
Tim Kraska: 18:28
But we gave them a one-hour introduction to either Einblick or one of these visual interfaces. The results were actually interesting. So, what we found is that the visual environments, and also a traditional workflow engine, can increase your productivity over plain Python notebooks, but only if you have done the tasks already a few times in the platform. And actually, it was less than what we expected, at least for the workflow engine. And looking into why this was the case for the workflow engine revealed that, in these traditional workflow engines, you have the problem that you put the operators together into a flow of operations, but you don’t see how the data actually flows through and how it’s modified. 
Tim Kraska: 19:17
So, if one of the operators by mistake, for example, filters almost everything out, or you have a typo in there and it does the wrong thing, you sometimes only notice that after you press play for the entire workflow, and in the end nothing comes out. Because these operators hide all the flow of information. Whereas in a Python notebook, you only write the little snippet of data frame operation or whatever you do, and you immediately see a sample output on the next line. And so, you really notice how the data’s manipulated, and what you are doing with it. So, this again motivated something we did in Einblick, where we combined aspects of a workflow engine with aspects of a no-code visualization engine. 
Tim Kraska: 20:09
So, instead of just having operators you put together which don’t provide you visual feedback, in Einblick every operator is actually visual by definition. It’s almost like its own plot; like the sample snippets you get from Python in between, all of them appear immediately on the screen. And so, you see how the data flows through. And then on top of that, we have a whole range of ready-to-go operations, which make it very easy for you to cover the entire data science life cycle from data wrangling, to model building, to visualizations, to what-if analysis. Everything is built in, and so you can very quickly explore the data, but then also do the more complex operations. And so, the end result of the user study was, we saw a productivity increase of up to 50%, and also a capability increase by a factor of 2X. 
Jon Krohn: 21:02
Cool. 
Tim Kraska: 21:03
So, it’s a pretty good result. Anecdotally, we see the collaboration to be one of the key aspects. That’s something many companies just struggle with: the moment you want to kick off a new data science project, you actually need to bring the people together to get to a common vocabulary about the problem, and really understand the data. And in this initial phase when you kick something off, we provide a huge productivity gain, particularly in all these stages to get the first model, and then help to understand it. And then productionization, we are actually less focused on; there are other tools for that. But in this prototyping aspect, we see huge improvements in the time it takes to get something production-ready. 
Jon Krohn: 21:55
Awesome. Today, I think it’s safe to say that the norm in machine learning is to be using Python. Certainly, there are a lot more people writing code than using no-code tools. So, when do you think no-code data science will become the mainstream option? 
Tim Kraska: 22:15
That’s an excellent question, and there’s not an easy answer to it. On one hand, I think we see that no-code solutions actually are becoming more popular. So, if you go to any large enterprise right now, normally you already find one or the other solution. So, people are playing with that. At the same time, it’s interesting who is using them right now, and it’s rarely the business people. So, I think the almost more interesting question is, when are all the business people ready to use these tools? And I think this requires much more of an organizational change to really make that happen. And there are attempts at it, I know large companies are working on that and everybody wants to do it. They are not completely there yet. The other answer to that one is, I think pure no-code will never exist, because there’s always this corner case you need to solve, right? 
Jon Krohn: 23:14
Right. 
Tim Kraska: 23:14
And then if the no-code operation doesn’t have the right operator for it, or they haven’t thought about that corner case, yeah, then you’re out of luck. So, actually, I claim the answer is code-last: start with something which is visual and fast for prototyping, but don’t exclude code for the experts when it’s needed. So, our take is that code-last will succeed, and probably in the next two to five years, I expect much more happening in that area. And I know that a lot of companies are working on it, to roll out these solutions to larger user groups. 
Jon Krohn: 23:54
Nice. That does sound like the right combination to me as well, smart that you allow people to use code last, as you say. Making organizations more data-driven and using tools like this, that’s probably something that a lot of companies would like to be able to do. As you mentioned, business people as opposed to technical people being able to take advantage of all the larger and larger amounts of data that are flowing into organizations, having more and more data-driven companies. Given this assumed desire for companies to be more data-driven, what do you think are the critical next steps for that to actually happen? 
Tim Kraska: 24:41
Yes. At least based on the experience we had so far with working with large enterprises, I see there are three things a company needs to do. The first one is clear, education. So if you just have a business person, and he has never used, or heard anything about a classifier, and what it means to have a target variable, and features, or what feature extraction or an embedding is, it’s very, very hard to get them ready to really build a model, and also understand later on what the model actually does. And so there’s something about data literacy everybody should have. And I think at least in academia, people recognize that. So, you see that most big universities, regardless of the subject you are studying right now, you have to take some data class. 
Tim Kraska: 25:36
So, they want to have everybody exposed to some basic data understanding, and then a little bit about statistics. And I think in large enterprises, they simply need to offer that education to get everybody on the ground level that they have a basic understanding of these tools. The other aspect is that you need to have the right tooling available, and that means like, a platform like Einblick for example, obviously we believe that’s the right choice. But also the right integration, so you need to make it easy for the people for example to connect to the right data sources, nobody should be wrangling around this ODBC or JDBC drivers. Like, there’s a click of a button, they should be able to connect to the right data source. 
Tim Kraska: 26:19
And then just have Einblick ready, so the overhead to get started is as low as possible. And so, tooling and integration is the second aspect. And then the last one, which is equally important, is you need to have the right incentive structure and ease people in. And what I mean by that is, let’s assume you have somebody on the business side, and he’s just used to sending his requests to the data scientist, and at some point gets an answer back. What is his incentive to actually do work by himself? Because it’s like he has a personal assistant, right? 
Jon Krohn: 27:02
Yeah, nobody wants to do work, everybody wants to push it downstream. 
Tim Kraska: 27:07
Exactly. And it creates a very unhealthy environment, because the data scientist doesn’t want to deal with all the same requests again, and the business users normally have to wait on them. But you need to overcome that model, and we actually saw a company struggling with that, because they were so used to doing it in a certain way that you need to provide the right incentives to change it. And there are different ways to do that, from incentive structures you put out, to joint coding sessions so that they learn how to do it themselves, or training on the fly, where you say, “Okay, if you want to solve this, don’t just send it to the data scientist; at least you have to watch how he does it so next time you can do it yourself,” right? 
Jon Krohn: 27:52
Right. 
Tim Kraska: 27:52
And the moment you see how something is made, there’s a higher incentive for you to do it yourself the next time- 
Jon Krohn: 28:00
That’s right. 
Tim Kraska: 28:00
… in particular if it’s easier to do it yourself than explaining to somebody else what you want, and then understanding the result again. So, I think those three aspects need to be addressed, and then these initiatives to become a data-driven company can actually be successful. 
Jon Krohn: 28:19
Awesome, yeah. That makes so much sense. The tools have got to be available, and the incentives have got to be right. I hadn’t thought about that incentives one before; that is a clever angle. As a technologist, I’m too quick to think about a technological solution, and fail to think of the structural, the social issues, and so yeah, [crosstalk 00:28:44]. 
Tim Kraska: 28:43
Yeah, I also didn’t expect the incentive one to come up, but it was very, very clear from the very first customers we had that if you want to broadly roll out something like Einblick in an organization, those three things really need to be aligned. 
Jon Krohn: 29:01
Yeah. I think the vast majority of people, whether they’re technical or not, fear the unknown. And when you don’t know how to do something, you worry about getting it wrong, you worry about breaking something. And so, you don’t even take those first steps to experiment; you’d rather just ask somebody else and feel like it’s not your responsibility than learn something new, which is a funny thing about people. And maybe listeners to this program and data scientists in general are as far from that kind of profile as humans get. But I think even for me, with everything I do, there’s this hurdle of taking the first step, like, “Okay, we’re going to do this.” 
Tim Kraska: 29:43
Yeah, yeah. 
Jon Krohn: 29:46
This episode is brought to you by SuperDataScience, our online membership platform for learning data science at any level. Yes, the platform is called SuperDataScience; it’s the namesake of this very podcast. In the platform you’ll discover all of our fifty-plus courses, which together provide over 300 hours of content, with new courses being added on average once per month. All of that and more you get as part of your membership at SuperDataScience. So don’t hold off, sign up today at www.superdatascience.com. Secure your membership, and take your data science skills to the next level. 
Tim Kraska: 30:25
There’s also something else interesting which we found by working with some of the customers. We always thought that the business users would start first in Einblick, and then ask data scientists for help when they get stuck. But the problem is, and this is again a psychology issue, you don’t want to look stupid. So if they start out and need to ask for help, they don’t want to do that, particularly if the manager or somebody can figure it out. And so, you need to again change the way people work together from the beginning, so that they don’t run into psychological issues like, “Oh, asking for help is the wrong thing to do.” Asking for help should always be a good thing to do. And yeah, we also observed that, and now we do things a little bit differently when we roll out Einblick, to address that. 
Jon Krohn: 31:22
Cool. You kind of mentioned this already, and we’ve been dancing around it anyway: there are these different kinds of users, business users and data science users. So, do you have stats on who currently is using Einblick? And then can you explain a use case where using Einblick allowed someone to solve a problem that they otherwise might not have been able to solve? 
Tim Kraska: 31:48
Yes. Interestingly enough, in the beginning we had a lot of data scientists on the platform. This is changing rapidly, so we see more and more people from other areas, particularly users with domain expertise actually using the platform directly. And this is a wide range, from academics to, let’s say, HR people. I mean, we have several people from manufacturing on it. So, it’s really a broad spectrum, and product management is another one which is standing out, and finance. Use cases, interestingly, vary a lot as well. Something we often observe right now is that people want to do more with event data. So, a lot of problems are around events. I’ll just mention one which is off the top of my head: we were working with an HR team in a large car manufacturing company. And the goal this person had was to predict if one of their manufacturing workers is at risk of leaving, right? 
Jon Krohn: 33:03
Mm-hmm (affirmative), mm-hmm (affirmative). 
Tim Kraska: 33:04
And so, as the base data, what they had were events about the manufacturing workers. So for example, when did he clock in, when did he clock out, how many mistakes in the manufacturing process did he make, what training did he go through? All of these are event data, so each of them is an event with a timestamp: some action the worker did, or something else that was measured around him. And so, the task now is to take this event data and wrangle it in a way that you create a feature vector for each particular worker. And then based on the feature vectors and past data, you can now build a model which gives you a likelihood that somebody’s at risk of leaving, right? 
Jon Krohn: 33:48
Sure. 
Tim Kraska: 33:49
And so, this was a really cool use case because it required the HR people to work together with the data scientists to come up with this feature vector. The HR person alone would never have been able to do that, at least in the beginning, without any training. But with the Einblick platform, we had them work together and made sure that they really understood what the data means. Because all these events they had were coded in a weird way, and at some point some code changed. And so, you need to consolidate everything first, and you really need to go over the data, look at it with the domain experts together, and then build this feature vector. And then very quickly iterate on the feature vector, until you have a model which actually works. 
Tim Kraska: 34:35
And so this was an HR use case, but we now see the same pattern coming up again in other areas. Like I mentioned before with this product conversion thing: you get the user onto a platform, and you’re trying to predict which features this user might use, when he will get stuck, when you should reach out, all these user life cycle decisions you want to drive for marketing and service. In the end they’re very, very similar: you have a bunch of events, you need to wrangle them into a feature representation for each user, and then make a prediction out of it. So we see this use case more and more. Apparently it’s harder to work with this event data on other platforms, and in Einblick we make it very easy for you. 
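The event-data pattern Tim describes, rolling each worker’s event stream up into a single feature vector, can be sketched in a few lines. This is a minimal illustration: the event types, IDs, and counts below are made-up assumptions, not Einblick’s actual schema or the car manufacturer’s data.

```python
# Hypothetical event records: (worker_id, event_type, timestamp).
# Event types here are illustrative assumptions, not the real HR schema.
from collections import defaultdict

events = [
    ("w1", "clock_in",  "2022-01-03 07:58"),
    ("w1", "defect",    "2022-01-03 11:30"),
    ("w1", "clock_out", "2022-01-03 16:02"),
    ("w2", "clock_in",  "2022-01-03 08:15"),
    ("w2", "clock_out", "2022-01-03 15:45"),
    ("w2", "training",  "2022-01-04 09:00"),
]

def featurize(events):
    """Aggregate each worker's event stream into one feature vector."""
    feats = defaultdict(lambda: {"shifts": 0, "defects": 0, "trainings": 0})
    for worker, etype, _ts in events:
        if etype == "clock_in":
            feats[worker]["shifts"] += 1
        elif etype == "defect":
            feats[worker]["defects"] += 1
        elif etype == "training":
            feats[worker]["trainings"] += 1
    return dict(feats)

vectors = featurize(events)
print(vectors["w1"])  # {'shifts': 1, 'defects': 1, 'trainings': 0}
```

From here, the per-worker vectors plus historical labels (left vs. stayed) would feed any standard classifier; deciding which counts and time windows actually matter is exactly where the domain expert’s knowledge comes in.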
Jon Krohn: 35:20
Cool. And one of the things that we’ve alluded to, and that you kind of described in the situation you just walked through, which I think is worth highlighting as a really cool thing that Einblick does: I’m mostly familiar with these kinds of tools being used for descriptive analysis. But when we’re using machine learning here, we’re not using it just to gain some insight into the data. We can actually be using it predictively; we can actually be flowing in new data, like you were saying in that attrition model you were describing, where you’re predicting based on historical data whether somebody’s likely to leave or not. The HR manager doesn’t have to remember everything themselves. You could have some kind of dashboard that then uses the feature weights that the AutoML in Einblick came up with to make predictions on, I don’t know, a daily or weekly basis. 
Tim Kraska: 36:23
Yep, yep. And in this use case again with the HR person, this is a problem where the model provides a lot of value for the HR team. But it’s not a problem that you’d have a 10-person team working on. You wouldn’t get a team together which costs hundreds of thousands of dollars a year to solve this case, because there’s not enough value in it, right? 
Jon Krohn: 36:48
Right. 
Tim Kraska: 36:49
But if you can get a model built quickly, yeah, this does provide value for the HR team. And I think there are a lot of use cases in companies right now, from startups to large enterprises, where these small solutions, like a bit of [inaudible 00:37:07] model building and exploring the data a little bit to understand a certain thing, are not done because the capacity is not there. And if we enable the right people to do more on their own, we can actually get to them and improve the overall process, right? 
Jon Krohn: 37:22
Totally, yeah. We hear this kind of buzzwordy phrase used a lot to describe what you’re describing, the democratization of AI, where the HR manager can build their own ML model and doesn’t rely on the data science team. 
Tim Kraska: 37:40
The talks I gave back in like 2015, just about the research prototype, were always titled “Democratizing Data Science.” And I gave them all over the place in the academic context, so yeah. 
Jon Krohn: 37:52
There you go, you’re a contributor to that fad for sure, which is peaking now no doubt. 
Tim Kraska: 37:58
I think Gartner has a different term for it, citizen data scientist, which I never really liked that much. But it’s also, yeah, a term they definitely use. 
Jon Krohn: 38:10
Yeah, that’s a funny one, because “citizen” is like the opposite of “military” or something. Of the two options, democratization is definitely the better one, I think. So, since the beginning of the episode we’ve been talking about how you’re an associate professor at MIT, and how this work is related to the research that you do at MIT. So, with Einblick as a university spinoff, how does that work at MIT? What are the advantages of doing it that way? 
Tim Kraska: 38:44
I think academia in general is a great place to explore ideas which are further out there. And so, I’m very certain that if we had just tried to raise, let’s say, capital based on the idea that we want to create a Minority Report user interface- 
Jon Krohn: 39:00
Yeah, like, you’ve seen the movie Minority Report, right? I mean, come on, this is a no brainer. 
Tim Kraska: 39:06
I think people would have been very, very skeptical. And so I think what academia is really excellent at is trying things out which are high-risk, with a high chance of failure, but at the same time high reward if they work out. And I think it’s a great breeding ground for cool ideas, and in the end also companies. So, I did my postdoc at the AMPLab, and I think the leading example there is Spark, and now Databricks, which essentially followed the same model. They recognized that Hadoop leaves a lot of performance on the table, so let’s put everything into main memory and create an open-source framework around that. It came out of academia, and it became a super successful company. And Ali Ghodsi is the CEO, and we actually used to go for coffee almost every single week. So, again, another of these examples, and now there are even more companies coming out of there, yep. 
Jon Krohn: 40:06
I wasn’t super familiar with the Databricks story until this week. At the time of recording it is the beginning of April, and I had just spent the last few days at a conference in New York called ScaleUp:AI, and Ali was one of the opening speakers. On the first day, the first morning, he did an opening talk, a fireside chat where he was interviewed. He was supposed to be at the conference in person, but I think his child was being born that day or something like that, so he couldn’t fly out. And yeah, it is an amazing story how Spark came out of academia, and the success is incredible. It was mentioned in his interview that Databricks has now gone over a billion dollars in ARR. 
Tim Kraska: 40:56
Yeah, yeah. They did a fantastic job of hosting their platform; it’s really, really great. The funny thing is, in the beginning I wasn’t even the biggest fan of Spark as a framework. But I completely misjudged it, because it was so clear later on that they really cared about the usage and the open-source impact it could have, rather than just publishing papers about it. And so they did this basic groundwork, and then eventually they also did amazing research. It wasn’t clear in the beginning; it felt like, “Oh, these ideas have been explored in the past.” But then they had this whole open-source framework, it was super useful, and then even more interesting papers came out of it as a research thing. So it’s an inspiration for me as well. They did a fantastic job, and Ali is a really, really good guy. 
Jon Krohn: 41:52
Cool. Well, it’s nice to make that connection. Let’s talk about some other academic research that also involves industry. You recently had a research collaboration with Google, and in it you tackled the limitations that modern databases face with the static nature of components like index structures, filtering methods, and partitioning schemes. So, can you explain what these problems are, and how they can be improved with machine learning? 
Tim Kraska: 42:25
Yes, certainly. The original work came from an idea I had at some point while I was at Google. I was doing a sabbatical there; I took a leave from Brown University back in the day, and spent a year at Google in a team which had nothing to do with systems or databases, or what I normally had been doing. I was there with Ed Chi, who runs a big research team around recommendation engines. And so, everything I was hearing was all this stuff about neural nets, and it was a completely different world. And yeah, while I was there, at some point I had this idea that we could potentially change entirely the way we find data. So, to give you a more tangible use case here, let’s assume you go to the library. 
Tim Kraska: 43:26
If you want to find a book by its title, what you can do is go to the catalog, and then find the right index card which tells you where the book is. Normally every library has a bunch of drawers, the whole catalog, which is alphabetically ordered. With your title in mind, let’s say Harry Potter, you first have to go to the drawer with H, and then you find the right index card that points you to the location. And this is essentially how every single database system right now implements the way to look up data. At least if you support something that just [inaudible 00:44:06]; there are some other index structures, but this way of traversing a catalog is one of the key access methods every single database has. 
Tim Kraska: 44:18
However, if you go to the library, you probably wouldn’t go to the catalog first. The first thing you would do is ask the librarian and say, “I’m looking for the Harry Potter book. Can you tell me where it is?” And he would roughly point you in the right direction. He wouldn’t tell you exactly where it is, but he would probably say something like, “Oh, look in the right corner, where all the kids’ and young adult books are. And then, I think it’s somewhere on the right or left, in the middle.” That’s normally what a librarian would do. And so I was wondering to myself, “Can we not do the same thing for a database system?” So instead of having this catalog structure, which you have to build and maintain and which is very expensive, replace it with a model which is approximate, and then do some localized search in the end which guarantees that you find the same data. And it turns out that if you do it the right way, this can actually be much, much faster. And then it turns out that you can use the same idea to replace other data structures as well, so not only [inaudible 00:45:29] trees, which are the typical way of implementing this catalog, but you can potentially also do bloom filters with that, right? 
Jon Krohn: 45:36
Mm-hmm (affirmative). 
Tim Kraska: 45:37
You can use the same idea to improve sorting, and there are a bunch of other things. The core principle behind them is always the same: you learn a model over the data, normally the data distribution, and then you leverage that model to design a better algorithm which is more efficient. So, that’s one category. And then it turns out that you can actually expand that even further, when you take other things, like the workload, into consideration. And this is what [inaudible 00:46:11] call algorithms with an oracle. So, you take a model as an oracle that gives predictions, and now you design a better algorithm with this model in mind. This is useful, for example, for scheduling, or for query optimization, and there are a bunch of other algorithms or components of a database system which you can integrate these models very, very deeply into. 
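The librarian idea can be made concrete with a toy learned index: fit a simple model from key to position over a sorted array, record the worst-case prediction error while building, and then answer lookups with a bounded local search around the predicted slot. This is a simplified sketch of the concept, not the design from the actual research.

```python
# Toy learned index over a sorted array: a linear model predicts where a
# key lives, and the recorded max error bounds the local search so that
# lookups are still guaranteed to find existing keys.

def build(keys):
    """Fit position ~ a*key + b by least squares; return (a, b, max_err)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    num = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    den = sum((k - mean_k) ** 2 for k in keys) or 1.0
    a = num / den
    b = mean_p - a * mean_k
    # Worst-case error observed at build time guarantees lookup correctness.
    err = max(abs(i - (a * k + b)) for i, k in enumerate(keys))
    return a, b, int(err) + 1

def lookup(keys, model, key):
    """Predict the slot, then scan the small window the error bound allows."""
    a, b, err = model
    guess = int(a * key + b)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    for i in range(lo, hi):  # bounded local search
        if keys[i] == key:
            return i
    return -1  # key not present

keys = sorted([3, 7, 11, 19, 23, 42, 57, 91])
model = build(keys)
print(lookup(keys, model, 42))  # 5
```

The correctness guarantee comes from the recorded error bound: the true slot always lies within `err` of the prediction, so the local search never misses a key, and when the model fits the key distribution well, the search window stays tiny compared to a full tree traversal.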
Tim Kraska: 46:35
Overall, the end result eventually will be what we now call an instance-optimized system. If you take a traditional system, its architecture is designed for a range of potential use cases, workloads, and data. As a result, it works for a whole range of applications, but it probably doesn’t provide the best possible performance for each one of them, because of its general nature. But now, if you have this machine learning mechanism deeply embedded in the system, we hopefully, or this is what we are working on, can build something which we call an instance-optimized system. The system self-adjusts all its components based on the workload and the data it observes, to provide unprecedented performance, like orders of magnitude faster. And we are working heavily on that; we actually have a prototype, and it’s very, very promising. 
Jon Krohn: 47:34
Super cool. So, these kinds of techniques, like learned Bloom filters or learned indexes, allow databases to be more efficient by learning the characteristics of the particular data inside them. I guess the main impact is that they make using databases faster for practitioners, right? Are there other benefits as well, or is that the main thing? 
Tim Kraska: 47:54
It’s faster and easier. A traditional system normally has a bunch of knobs. So if I design, let’s say, a database or a key-value store, pick your favorite system, I normally not only make a bunch of decisions about the workloads I expect to arrive at the system, and design the system with those workloads in mind; often, I also put a bunch of knobs in there because I know certain parameters need to be tuned the right way. So now, if you go to instance-optimized systems, the hope is that we can not only configure the entire system in ways which were not possible before, but also take the burden of tuning all these knobs away from the end user or the administrator, right? 
Jon Krohn: 48:41
Mm-hmm (affirmative). 
Tim Kraska: 48:41
And so, as an end result it’s easier to use, because you don’t have to tune the knobs anymore. Plus, you potentially get way better performance, orders of magnitude faster than what’s currently possible. 
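The learned Bloom filter Jon mentioned can also be sketched in miniature: a cheap learned model predicts membership, and any true member the model would wrongly reject goes into an exact backup structure (in real designs, a small classical Bloom filter), so there are never false negatives. The even-number predicate below is a stand-in assumption for a trained classifier, and the data is made up.

```python
def make_filter(members, model):
    """Learned Bloom filter sketch: a model plus an exact backup set
    holding the model's false negatives, so true members are never missed."""
    backup = {k for k in members if not model(k)}
    def contains(key):
        return model(key) or key in backup
    return contains

# Hypothetical data: members are mostly even, so an "is it even?" predicate
# plays the role of the learned model and captures most members cheaply.
members = {2, 4, 6, 8, 7}
model = lambda k: k % 2 == 0
contains = make_filter(members, model)

print(contains(7))   # True: caught by the backup set
print(contains(4))   # True: caught by the model
print(contains(10))  # True: a false positive, which Bloom filters permit
print(contains(9))   # False: rejected by both model and backup
```

The model shrinks the backup structure down to only its false negatives; like a classical Bloom filter, the combination answers “definitely not present” or “probably present,” trading a few false positives for space.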
Jon Krohn: 48:54
Wow, wow, cool. All right. That’s super cool, another strand of research that you’re doing beyond Einblick. So, clearly the work that you do across academia and industry is transformative for the way we can work with data, devise models, and build better data-driven organizations. I’d love to hear what you think the trends are. Where do you think we’ll be five years from now with data science tooling and database technology? 
Tim Kraska: 49:35
That’s a really good question. On one hand, I think instance optimization and ease of use are becoming more important than ever. If you look at database providers like Snowflake, they are so successful not only because they run in the cloud, but mainly because they made it so much easier to get an instance up and running, right? I think ease of use is a key aspect, next to performance of course. So, I’m a strong believer that we will see the first instance-optimized systems actually appear on the market, and they will not only be faster but also much easier to use. On the other hand, I’m also a big believer that we’ll see a change in how people actually work together, and this is a little bit inspired by what we are doing at Einblick, as well as by the pandemic. So, if I think about how people worked on text in the past, it was mainly people using Word, sending Word documents around offline, making changes in the Word document, and then sending the changes around again. And quickly it became a total mess. Lawyers actually still work that way, right? They still like to- 
Jon Krohn: 50:55
Oh, I know. It blows my mind when you’re working on contracts with people and they don’t track changes. You’re like, “Oh, my God, how could you not? [crosstalk 00:51:04],” yeah. 
Tim Kraska: 51:04
Right, it’s just like, it’s awful. And now you have these online tools like Office 365, Google Docs, and they make it arguably so much better, right? 
Jon Krohn: 51:15
Yeah, no question. 
Tim Kraska: 51:15
Like, people see the changes immediately. We see the same trend going on in other areas; take design. It used to be the case that you used Photoshop or Illustrator, and you sent PDFs around as the final outcome, right? 
Jon Krohn: 51:31
Right, right, right. 
Tim Kraska: 51:31
And then you could [crosstalk 00:51:32] one of the PDFs. Now, with Figma, people again have this real-time online collaboration, not necessarily in person but remote as well, to come to a conclusion. And I think every other aspect will end up like that, so I think data science will be impacted in the same way. And I don’t think that simultaneous, real-time coding in a Python notebook is the right solution there, because if somebody is at the bottom of it and somebody else changes something at the top, everything breaks down. It’s just not the right paradigm. So, given this trend of real-time collaboration, I think we need the right interfaces for it, and I think we will see it happen for data science. Einblick is one of those solutions, but maybe there will also be others in the future, and it’s always good to have competition. But I really do see that as a trend. 
Jon Krohn: 52:24
Nice, yeah. Super cool. I was going to say, what you’re describing there does sound familiar. It sounds like I know a tool that could probably [crosstalk 00:52:30]. 
Tim Kraska: 52:29
Oh yeah, that’s like, [crosstalk 00:52:31] if you won’t believe in it. 
Jon Krohn: 52:36
Amazing, all right. Now that we’ve talked about Einblick so much, there’s probably people itching to try it. How can they try it? 
Tim Kraska: 52:44
They should just go to our website, Einblick.ai, and sign up, and you can immediately try it out. There’s a free version which will never expire, and it has the full functionality. But just to make sure that everybody gets the word Einblick right, it’s E-I-N-B-L-I-C-K.ai. 
Jon Krohn: 53:06
Nice. Easy for the German speakers out there, probably harder for everyone else. But, a great name, I love the etymology of it. And then in addition to that, for people who want to dig really deep into understanding no-code, you have a professional MIT course coming up, right? 
Tim Kraska: 53:25
That’s correct. This summer we have a professional no-code data science course coming up. Registration is still open. What is the word for a course that’s, like, live but virtual? So it’s real-time lectures, but it’s virtual. 
Jon Krohn: 53:44
Right, cool. That’ll be cool to check out. And then, do you have a book recommendation for us? It doesn’t necessarily need to be about no-code, it could be about anything. But, I ask that at the end of every program. 
Tim Kraska: 54:00
Yeah, I thought about it a little bit. The last book I really, really enjoyed reading was Project Hail Mary, actually, by Andy Weir. I’m actually not a big fiction fan, but he has this art of mixing science with a story, and I think this book is just excellent. 
Jon Krohn: 54:23
Oh, nice. That sounds cool. I do love a great fiction recommendation, so thank you. And then clearly Tim, you’re a brilliant professor, you’re a talented entrepreneur, you’ve got exciting things happening now and will continue to in the years to come. So, how should people best follow you or Einblick online? 
Tim Kraska: 54:47
We have a Twitter account, and I also have my personal one, so you can just follow us there. I also always welcome LinkedIn connections, so you can find me on LinkedIn as well. Essentially the usual channels. We are also about to launch a new no-code data science course on YouTube, so if you go to Einblick’s YouTube page and follow it, you will be informed about whatever is coming on that front. 
Jon Krohn: 55:16
Nice, sounds great. All right Tim, thank you so much for this fun and informative episode, it was great having you on. Maybe we can check in again in a few years and see how all the research and entrepreneurial progress is coming along. 
Tim Kraska: 55:31
That would be great. Thanks again, Jon, for having me. 
Jon Krohn: 55:40
What a great episode with a brilliant computer scientist and entrepreneur. In today’s episode, Professor Kraska filled us in on Einblick’s progressive approximation engine, which delivers instantaneous machine learning results that are then refined gradually in the background, and on how incentives within an organization need to be set up so that anyone who wants access to data or data models knows how to fish themselves, as opposed to the common modern scenario of badgering data scientists to provide fish for them. Tim also talked about how research universities like MIT support high-risk, long-term research that, once incubated, can be spun out to transform industries, and how machine learning applied to databases, through techniques such as learned Bloom filters and learned indexes, can make databases faster and more efficient. 
Jon Krohn: 56:23
And, Tim talked about how in the coming years, real-time collaboration environments that are like Google Docs are likely to become more widespread for data science tasks. As always, you can get all the show notes, including the transcript for this episode, the video recording and any materials mentioned on the show, the URLs for Tim’s social media profiles as well as my own social media profiles, at Superdatascience.com/571, that’s Superdatascience.com/571. If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app, or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter, and then tagging me in a post about it.
 
Jon Krohn: 57:02
Your feedback is invaluable for helping us shape future episodes of this show. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you, and thanks of course to Ivana Zibert, Mario Pombo, Serg Masis, Sylvia Ogweng, and Kirill Eremenko on the SuperDataScience team for managing, editing, researching, summarizing and producing another fascinating episode for us today. Keep on rocking it out there folks, and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 