Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach, and Lifestyle Entrepreneur. And each week, we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now let’s make the complex simple.
Kirill Eremenko: This episode is brought to you by our very own data science conference, DataScienceGO 2019. There are plenty of data science conferences out there. DataScienceGO is not your ordinary data science event. This is a conference dedicated to career advancement. We have three days of immersive talks, panels and training sessions designed to teach, inspire, and guide you. There are three separate career tracks involved, so whether you’re a beginner, a practitioner or a manager you can find a career track for you and select the right talks to advance your career.
Kirill Eremenko: We’re expecting 40 speakers, that’s four, zero, 40 speakers to join us for DataScienceGO 2019. And just to give you a taste of what to expect, here are some of the speakers that we had in the previous years: Creator of Makeover Monday Andy Kriebel, AI Thought Leader Ben Taylor, Data Science Influencer Randy Lao, Data Science Mentor Kristen Kehrer, Founder of Visual Cinnamon Nadieh Bremer, Technology Futurist Pablos Holman, and many, many more.
Kirill Eremenko: This year we will have over 800 attendees from beginners to data scientists to managers and leaders. So there will be plenty of networking opportunities with our attendees and speakers, and you don’t want to miss out on that. That’s the best way to grow your data science network and grow your career. And as a bonus there will be a track for executives. So if you’re an executive listening to this, check this out. Last year at DataScienceGO X, which is our special track for executives, we had key business decision makers from Ellie Mae, Levi Strauss, Dell, Red Bull, and more.
Kirill Eremenko: So whether you’re a beginner, practitioner, manager or executive, DataScienceGO is for you. DataScienceGO is happening on the 27th, 28th, 29th of September 2019 in San Diego. Don’t miss out. You can get your tickets at www.datasciencego.com. I would personally love to see you there, network with you and help inspire your career or progress your business into the space of data science. Once again, the website is www.datasciencego.com, and I’ll see you there.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies and gentlemen. Super excited to have you back here on the show. And today you’re in for a treat. We’ve got such a fun and energetic episode coming up for you. I just got off the phone with Manasi Vartak, who is the founder and CEO of Verta.AI, a company which has developed a unique tool for data science. It’s around deploying your models and versioning them, developing them, keeping track of them, maintaining them. So it’s a really cool new thing that hasn’t existed before in data science, and you’ll learn all about it in this podcast.
Kirill Eremenko: Plus this podcast is full of really interesting insights. So, we’ll talk about model tracking and versioning and maintenance and what Verta.AI is. And in fact, Manasi will share a special code that lets you join the Beta testing absolutely free if you’re listening to this podcast, which is very cool. Also, you will learn about data science maturity: what it is and what it means. We talked about the three areas that top-tier data science teams are investing in. And you definitely want to know those if you’re part of a data science team or you want to be in a top-tier data science team. We also talked about why there will be a machine learning boom, a machine learning revolution, in the next three years. Definitely a very cool discussion, and something that is going to impact all of us. So there were some very useful insights there.
Kirill Eremenko: And finally, you’ll also find out what you can do to prepare for this future, this incredibly different and fast-paced future that is rapidly approaching and is going to disrupt everything we do: what you can do for your career and for your business as a data scientist to prepare for that.
Kirill Eremenko: On that note, there is not much else to say. Amazing podcast coming up. Can’t wait for you to check it out. Without further ado, I bring to you founder and CEO at Verta.AI, Manasi Vartak.
Kirill Eremenko: Welcome to the SuperDataScience podcast, ladies and gentlemen. Today I’ve got a super exciting guest on the show with me, Manasi Vartak, calling in from Palo Alto. Manasi, how are you going today?
Manasi Vartak: I’m well. How are you? Thank you for having me.
Kirill Eremenko: Very well. And very excited to have you on the show. How is the weather in Palo Alto? It’s getting into summer slowly.
Manasi Vartak: Warm and sunny as always.
Kirill Eremenko: Yes. You mentioned like it gets up to 90 degrees, which is about 30 in Celsius. That’s pretty hot. Like, do… What do you do in those days?
Manasi Vartak: Usually, a lot of the buildings are air-conditioned. I actually like it pretty warm. So I hang out outside a bunch.
Kirill Eremenko: Interesting. Like, I find for me out of warm and cold, like I do like the warm, but in the cold, my… Like I think better like I think faster. Do you have that?
Manasi Vartak: Not really. So I grew up in a pretty warm place, and then I lived in a pretty cold place. So I think either one works honestly. I think I’m happier when it’s warmer outside.
Kirill Eremenko: Okay. All right. Got you. So you’ve traveled quite a bit around the world.
Manasi Vartak: Yes.
Kirill Eremenko: Got you.
Manasi Vartak: Yep.
Kirill Eremenko: Okay. And why, why Palo Alto?
Manasi Vartak: So we are… I moved to California after I graduated from MIT, and if you’re starting a company in data science or machine learning, California, and in particular Silicon Valley, is kind of the place to be. So I ended up moving here, and we’re working out of our investor’s office right now, which is in Palo Alto.
Kirill Eremenko: Got you. Was it a long time ago that you moved to Palo Alto?
Manasi Vartak: Not really. I think it’s been about a year, nine months to a year now, yes.
Kirill Eremenko: Okay. Okay, cool. I was at MIT in 2013, not that I like studied there. I sat in on one lecture.
Manasi Vartak: Oh, nice.
Kirill Eremenko: But yes, it was interesting. But, what I remember was the infinity corridor. That was very cool.
Manasi Vartak: Yes. The infinity corridor. Yes.
Kirill Eremenko: It’s like one of those, that I didn’t understand the name at first, but it’s like you have to walk through it for ages. Is that why it’s called Infinity Corridor?
Manasi Vartak: Pretty much. There are also interesting things that happen there. On a particular day of the year, the light gets aligned in such a way that the corridor becomes super bright. There’s a whole story about it on the internet that you guys should read up on. For me, what is really cool about the Infinite Corridor is that as you’re walking through there, you’re learning about everything that the students at the school do, whether it’s extracurricular activities, student government, or research. And the first time I walked through there, the energy and the excitement really drew me, and I was like, this sounds awesome. This is a place that I want to learn more about.
Kirill Eremenko: Nice. But, what’d you mean you learn about these things that they’re learning there?
Manasi Vartak: So as you walk through the Infinite Corridor, both sides have posters on the walls. And these are posters about events happening that day. There might be some political event going on that someone wants to learn about, or there might be dance classes. Student government is a big thing. There are also research labs on the Infinite Corridor, so you learn about that. You see in a very short time span all of the amazing things that happen at MIT, and so that was one of my favorite parts of campus.
Kirill Eremenko: Okay, got you. How was it, in general, being at MIT? Because, you know, we hear about MIT a lot in movies and shows and everything. It sounds like this really hardcore yet magical place. So tell us a bit about that. What is life at MIT like, studying there?
Manasi Vartak: I loved it. I miss it every day now that I’ve graduated. I think the best part of MIT from my perspective was two things. One is the people that you end up working with or interacting with. They are some of the smartest and nicest people. They’re more than happy to help you out with something, to teach you, or you can collaborate on a particular project. So the people, I think, are the primary thing. The secondary is the kind of resources that you might be able to get at MIT. I think it’s very hard to get them elsewhere, whether that’s access to particular kinds of compute resources to run experiments, or getting to work on the most cutting-edge problems of the day. And I think that’s very unique, and you get a lot of freedom to explore both your interests and problems that are important in the broader world.
Kirill Eremenko: Wow. Amazing.
Manasi Vartak: It’s super fun. I miss it.
Kirill Eremenko: And you actually did… You did, did you do just Ph.D. there or you did like your full study there?
Manasi Vartak: I did my Masters and Ph.D. there.
Kirill Eremenko: Got you.
Manasi Vartak: Oh, it’s a two-plus four-year program. Yes.
Kirill Eremenko: Well, how was doing a Ph.D. at MIT, was it challenging?
Manasi Vartak: I think it was… A Ph.D. is a long haul anywhere. I really enjoyed my time there. It was… It’s challenging in terms of you’re always pushed to learn new things, publish papers, but you learn so much about yourself as well as your field that I can’t recommend it enough.
Kirill Eremenko: Got you. And what I find very interesting about your story is that you took your Ph.D. project and turned it into a business. Is that right?
Manasi Vartak: That’s right. Yup.
Kirill Eremenko: That is fascinating. Like, I don’t know many people who actually do that, and I think it gives you a massive head start. You know your field inside out. So tell us, how did you pick that project in the first place? Did you know it was going to turn into a business eventually, or what was driving your decision?
Manasi Vartak: Right. Not really. I didn’t know it would turn into a business at all. So, a bit of background. My undergrad was in math and computer science. I worked for a bit and then decided to come back to do my Ph.D. in the area of databases. Big data was catching on at that point. And because I had this mixed background of software plus math or applied math, I wanted to do something with data where I was helping people understand their data better, operate on it better, and so on.
Manasi Vartak: And I ended up working on a few different projects. Over the course of a Ph.D., I worked on, say, three pretty large chunks of work. My initial work was on automated data visualization. Given a question, can you identify the graphs that would help answer that question instantly? And then, can you generate those graphs in an efficient manner, and so on? It’s called SEEDB in case anyone’s interested. There’s a paper out there. So while that was interesting-
Kirill Eremenko: See as in, see with your eyes. S-E-E right?
Manasi Vartak: Exactly. Yes. S-E-E-D-B. And it’s a system that will automatically find charts that can answer a particular question you ask.
Kirill Eremenko: Awesome. Sounds like an exciting problem that, so-
Manasi Vartak: Very much so.
Kirill Eremenko: Yes. It is a common issue, like, which chart do I pick? And you experiment with whether it can be done automatically. That is really cool. And we’ll put a link to the paper in the show notes if anybody’s interested.
Manasi Vartak: Sure. That sounds great. Thank you.
Kirill Eremenko: Okay.
Manasi Vartak: Yes, so that was the early work and around that time, I wanted to do more math and data because I was missing that. Also machine learning-
Kirill Eremenko: One of the few people I know who misses Math after graduating.
Manasi Vartak: I do. Oh, it’s super fun. Like, I think it takes a while to get into it.
Kirill Eremenko: Which part of Math?
Manasi Vartak: So, I used to do a bunch of both graph theory and linear algebra, and the linear algebra part played out well for machine learning, because there are a lot of matrices and matrix operations that you do in math, so that helped me there for sure.
Kirill Eremenko: Yes. Well. So, those are the two parts of math that you were best at when you were studying?
Manasi Vartak: So I double majored, so I was doing a bunch of software too. But those were the ones that I enjoyed and remember the most. I don’t quite remember how I did in my classes, so I can’t say.
Kirill Eremenko: Yes. My favorite one was set theory. And yes, I don’t know. I just loved it. Like I got so into it and all those different levels of continuums that you can have and things like that. That was really fun.
Manasi Vartak: Yes.
Kirill Eremenko: Yes. Okay. All right.
Manasi Vartak: I think Math is great.
Kirill Eremenko: Yes. Okay. So what happened next? So you did SEEDB and-
Manasi Vartak: Right-
Kirill Eremenko: And then you finish that, or you decide to change?
Manasi Vartak: So, SEEDB was great. And then yes, I wanted to do more math. I wanted to do more ML. And so I ended up switching my focus a little bit to building tools again, but for machine learning in particular. And that was about the time when the big papers in deep learning were coming out. The AlexNet papers. We were seeing TensorFlow. Deep learning and machine learning was the hottest thing. It still continues to be, but that was just the initial stages of that cycle.
Manasi Vartak: I got super interested in how people build their models and what their workflow is like. Can we build some tools around there so that they could do their job more efficiently or more productively? For example, as most listeners who have done data science know, modeling is a very iterative process. You’ll build hundreds of models before you identify one that works. However, there’s no good way to track them. And I was at Twitter, where I used to work on the feed ranking team, and we ended up doing a hyperparameter search. We, I don’t know, trained multiple hundreds of models. And it was difficult for me to look back and see, well, what all have I tested? My best solution there was to have a folder structure that would try to mimic my experimentation, and at scale that’s just inadequate. You end up losing a lot of work that way, or just recreating work that was already completed.
Kirill Eremenko: Mm-hmm (affirmative). Yes, that’s very true. Okay. So there wasn’t a tool at the time-
Manasi Vartak: Mm-hmm (affirmative).
Kirill Eremenko: That would facilitate that tracking.
Manasi Vartak: Yes, exactly. Facilitate that tracking. And once you get beyond tracking, you also realize that model reproducibility is a pretty big issue. It goes beyond just tracking, because if you think about what a model is, a model is the data that you trained on, any preprocessing that you applied to the data, and the particular model specification, including architecture, hyperparameters, and any random seeds too. You want to capture all of the sources of variation so that you can reproduce that model as is. And so we ended up building a system called ModelDB, which is a versioning system for models. It will help you track all of these pieces so that you can recreate a model at a future time.
Kirill Eremenko: Okay. Now, wait a second. I know the answer because we chatted about this before the podcast. But for the people listening to this, the first thing popping into their minds is, what about GitHub? GitHub tracks your code and progress. How come that’s not good enough?
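The list of ingredients above (data, preprocessing, model specification, hyperparameters, random seed) can be sketched as a single versioned record. This is a minimal illustration only, not ModelDB's actual API: the `ModelVersion` class, its field names, and the `make_version` helper are all hypothetical, and the training data is assumed to be available as raw bytes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of everything that can make two training runs differ."""
    data_sha256: str        # hash of the training data contents
    preprocessing: tuple    # ordered names of the preprocessing steps
    architecture: str       # model specification
    hyperparameters: tuple  # sorted (name, value) pairs
    random_seed: int

    def fingerprint(self) -> str:
        # Serialize deterministically so identical inputs give identical IDs.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_version(data: bytes, preprocessing, architecture, hyperparameters, seed):
    """Build a ModelVersion from raw training bytes and run settings."""
    return ModelVersion(
        data_sha256=hashlib.sha256(data).hexdigest(),
        preprocessing=tuple(preprocessing),
        architecture=architecture,
        hyperparameters=tuple(sorted(hyperparameters.items())),
        random_seed=seed,
    )


v1 = make_version(b"a,b\n1,2\n", ["dropna"], "logistic_regression", {"C": 1.0}, seed=0)
v2 = make_version(b"a,b\n1,2\n", ["dropna"], "logistic_regression", {"C": 1.0}, seed=1)
print(v1.fingerprint() == v2.fingerprint())  # the seed differs, so the IDs differ
```

Two runs share a fingerprint only when every recorded source of variation matches, which is the property you need before you can claim a model is reproducible.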
Manasi Vartak: Yes. Absolutely. I think Git is great for code. There are two things that you don’t get with Git-based versioning. The first one is how do you keep track of your data? A lot of times either you’re reading from a particular file. And then you change that file in some way, or you move it. And then you have no idea what data it was trained on.
Manasi Vartak: If you’re reading from HDFS, you need to store what directory, or if you’re querying a database, then you need to keep track of the query. So that’s one piece. Git is great for code, but once you start trying to track the data, you need to augment the Git-like versioning system with data. The second piece that comes up is that if you look at a plain Python script, it’s hard to pick out the sources of variation there. So if you’re doing a hyperparameter search over, say, 20 combinations of hyperparameters, they’re going to be in the same file that’s checked into Git. You’re not going to know what particular variation of those hyperparameters led to the best model. So, that’s where you need a way to extract the pieces from your code that lead to variations in your model and track them separately-
Kirill Eremenko: Okay. So, there is [crosstalk 00:18:44] history of the hyperparameter [inaudible 00:18:50]
Manasi Vartak: Yes, exactly. And the final one is also the environment, particularly because a lot of data science is in Python. Were you using pandas version ABC or DEF? That will end up making a big difference in whether you can reproduce your model.
Kirill Eremenko: Okay. Very interesting. All right, so you saw this problem or this gap in the world of data science and modeling, and you decided to research it further. The research that most people are familiar with, myself included, is like, I don’t know, doing some tests, finding things out. Your research sounds like it was more centered around developing a tool, a product to solve a problem. Is that right?
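As a rough sketch of the last two points, the snippet below logs one record per hyperparameter combination and attaches the environment to each record, so the winning combination is not lost inside a single script checked into Git. Everything here is an assumption for illustration: `capture_environment`, `run_search`, and `toy_score` are hypothetical names, and `toy_score` stands in for real training.

```python
import itertools
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Record interpreter and package versions; packages not installed map to None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


def run_search(grid, train_and_score, packages=("pandas",)):
    """Try every combination in `grid` and keep one log record per run."""
    env = capture_environment(packages)
    names = sorted(grid)
    runs = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        runs.append({
            "params": params,
            "score": train_and_score(**params),
            "environment": env,
        })
    return runs


def toy_score(learning_rate, depth):
    # Stand-in for real training: best at learning_rate=0.1, depth=4.
    return -abs(learning_rate - 0.1) - abs(depth - 4)


runs = run_search({"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4]}, toy_score)
best = max(runs, key=lambda r: r["score"])
print(best["params"], best["environment"]["packages"])
```

Because each run carries its own parameters and environment, the best model can be traced back to the exact settings that produced it, rather than to a script that only shows the search grid.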
Manasi Vartak: Yes, so I was actually in the databases group at MIT. And we’re all about building systems that others can use. So column stores, there’s something called Vertica that listeners might be aware of that came out of my lab. There is also a GPU based visualization company that also came out of the same research group. So, we are very hands on. We’re focused on building systems, which is not to say that I didn’t do the experimentation type of research. All of the papers that we wrote were… Even when you build a system, there are a lot of knobs that you can tune. And then the thing that you’re measuring is how much did a query performance change or can I use my resources more effectively if I change the architecture of my system. So those are the kinds of experiments that we ran a whole bunch.
Kirill Eremenko: Okay. Gotcha. But the advantage that you get… Insights, you can start a business with that system. So yes, how did that happen? Like, you finished your Ph.D., and you’re like, all right, now this is going to be my… Like, I’m just going to continue and turn this into a product.
Manasi Vartak: Yes. So what happened was that we released ModelDB, which is the system I was describing earlier. So let me recap.
Kirill Eremenko: Yes.
Manasi Vartak: A model version consists of a few different things. It consists of your data version. It consists of your code version, your hyperparameter settings or any other settings that cause variability, and then your environment. And when I started all of this research, there wasn’t a discipline around how we track that information in an effective manner. So, we ended up building a system called ModelDB, which was released open source, and its whole purpose is to help version and reproduce models.
Manasi Vartak: So this is the open source product that got adopted at a whole bunch of places, including banks and Fortune 500 companies, where data scientists ended up using it to version their models, keep audit trails, and also just to track all of their hyperparameter searches. And that adoption gave a pretty convincing signal to us that there was a need for a product that would solve this exact problem. And so, I ended up starting Verta.AI, where we build tools such as ModelDB, as well as others that I’m sure we’ll touch upon in this podcast, that help optimize the process of modeling itself.
Kirill Eremenko: Okay. And how’s that been going for you? So you’ve been running this business for how long now?
Manasi Vartak: So, I’ve been running the business for about six months now. We’re pretty early in. And we actually have a Beta that I’m happy for any of the listeners to try out too.
Kirill Eremenko: Yes. You mentioned that. Please share that. Guys, girls listening, Manasi has a very cool, unique offer for listeners of this podcast. Manasi, if you don’t mind sharing, that will be awesome.
Manasi Vartak: Yes, absolutely. So like I mentioned before, we build a tool for versioning models, and we would love to open it up to any listeners of the podcast. We’re currently in a closed Beta, but if you’re interested, if you go to www.Verta.AI, there’s a little bot at the bottom where you can get a code. And if you mention that you listen to the SDS podcast, we’ll make sure that you’re at the top of the list for that code. We would love to have people try it out and provide their feedback.
Kirill Eremenko: Nice. Very cool. So the website guys is a V-E-R-T-A.A-I, verta.ai. And this is absolutely free. Right, Manasi?
Manasi Vartak: Yep. That is right.
Kirill Eremenko: Awesome. Very cool. So you’ve got the Beta now? When do you think you’re going to… Like do you have a deadline when you want to launch this thing?
Manasi Vartak: So it’s already launched in that we’re on customer sites right now. But it’s Beta because we’re constantly iterating. I think the Silicon Valley mantra, I guess, is that every day is day zero. So we want to get it right before we make it generally available.
Kirill Eremenko: Yes. Okay. Understood. So, how was this transition from doing Ph.D. work to starting a business? Like what were some of the challenges along the way there?
Manasi Vartak: I think they’re very different. And I’ve been learning about it as I go as well. When you’re doing research, I think you consider a problem. Or even when you’re a data scientist at a company, you’re looking at a problem from a single lens, which is yours. You are a technical person who’s trying to solve a hard technical problem. However, once you switch into the business mode per se, you need to think about it holistically, which is what is the business problem that the user is going to solve? What is their horizon for solving this problem?
Manasi Vartak: Do they have the resources, whether that’s monetary resources or human resources, to use your solution? And most importantly, if everything goes well, then what is the best outcome that you can promise this user? So, there’s a lot to think about with a product in terms of how it fits into a user’s or customer’s workflow, and how you can build it, as well as provide it to them, in a way that’s easy to adopt and make their own.
Kirill Eremenko: Okay. And what about building a team around this? I’m assuming you are not doing this by yourself?
Manasi Vartak: Yup. We are a team of five right now. And those who live in California know that recruiting can be quite hard. We are pretty fortunate in that this is an area where a lot of folks are excited to learn about ML. A lot of engineers are excited to work in this space, and our previous work on ModelDB and other research projects in that area gives us sort of the credibility where folks trust us enough to work with us, although we are pretty early stage.
Kirill Eremenko: Yes, makes sense. You have not just a proof of concept. You have something that four years of research has gone into, or years of research [inaudible 00:26:20], something that companies like the Fortune 500 companies you mentioned are already using. What other evidence is needed for somebody to see that this is going to be big? Because it is awesome.
Manasi Vartak: We hope so.
Kirill Eremenko: Yes. That’s very cool. Okay. And so why would you say that companies jumped on this? Usually, it’s quite hard to get these kinds of products into businesses, especially at early stages. Where do you think these organizations see the most value for them?
Manasi Vartak: Right. So I think we’re going through this interesting shift in data science right now, where we have been focusing on data science in what I think of as a static sense, where we’re building these models maybe on our laptops. We might be writing papers about them or creating a PowerPoint that shows trends that we’re seeing. But if you look at the best uses of ML or data science, whether it’s Twitter or Google or Uber, the most return on investment for data science comes from actually integrating data science into products.
Manasi Vartak: Whether it’s better recommendations, better notifications or automated loan approvals or testing how well you’re driving. All of that comes to fruition only when you can take a model and then deploy it into production and service it in a sustainable way. So, as you go into production ML, you need a host of tools that will enable that jump from research to production. And we’re at that juncture now, and a lot of companies are realizing that “Hey, I haven’t thought about this problem. I don’t know how to go through the shift effectively and efficiently.” And that’s where we come in as Verta, where we have a platform that’s tailored for that purpose, and we can get them up and running in just a matter of days.
Kirill Eremenko: Mm-hmm (affirmative). Okay. Got you. So basically, they can see that if they invest in this efficiency and effectiveness, they’ll get the return on investment. So it’s a no-brainer.
Manasi Vartak: Yes. Although, the other thing that we’ve found is it also depends on the data science maturity of an organization. If an organization is ready to deploy a model or has already deployed one or two models, they have felt the pain of all the challenges involved in production ML. And so we find that those companies are the ones that immediately get our value proposition and want to try out our products.
Kirill Eremenko: Yes. They can relate better to it. Right? They’ve already had that problem. Okay. Very cool. Apart from how many models the company has deployed, what else goes into your definition of data science maturity?
Manasi Vartak: Oh, yes. This is my take on it, clearly. What I’ve seen a lot of top data science teams do is get good processes around data science. And this is partly the reason why I ended up starting Verta. When you think about what differentiates the Googles and Ubers from the rest of the world, it is not just data. It’s also that they have good infrastructure and good processes around building and deploying these models. So, for example, say an insurance company has tons and tons of data, and they’re doing data science already. They completely have the capacity to be as good as Uber or Google at pushing out ML. I think the only thing that’s missing from that equation right now is dependable infrastructure and ways to quickly integrate ML into every product that they have.
Kirill Eremenko: Yes, true. A lot of Google’s work is publicly available. You can just go and download those research papers and copy those models. But yes, you’re right. There’s this other barrier.
Manasi Vartak: Right. Yes. And I think that’s the shift as well that comes with going from offline to online: data science is becoming more software oriented too. The most effective use of data science that I’ve seen integrates some model that is super complex or super interesting into a real product. And if there was one thing I would say to the listeners, it’s that I think data scientists are going to have to pick up software best practices quickly, so that their work can have the best and largest impact.
Kirill Eremenko: Okay. Very cool. And so, you… One of the topics you suggested for this chat was the top three areas where top tier data science teams are investing into. And I’m assuming we’ve already covered a couple of them. Maybe you could summarize that and add if we’ve missed anything.
Manasi Vartak: Absolutely. So, the three things that top data science teams are doing, I would say, are the following. The first one, and the one that I believe in the most, is having better tooling that goes with modeling. So, how do you version your models, how do you deploy them, how do you bring the best software engineering practices into the modeling process? So that’s the first. The second is getting data into the right format and integrating the data sources. Because, at the end of the day, a model is nothing if there’s no data backing it. And this is a pretty hard problem; we have been trying to solve that for a while.
Manasi Vartak: So, investing there is going to pay dividends. And then the third one is getting product buy-in. Whatever model you’re building, there is a product or business process where it’s going to be integrated. So it’s getting buy-in from that particular business stakeholder, that product manager, and understanding what success means in their case. Does it mean they want to increase the daily active users somewhere, or they want to get more revenue? And then really tailoring your metrics to that. I think with tools, data, and product buy-in we can go a long way.
Kirill Eremenko: Interesting. So, we’ll get back to the data part in a moment. I understand how a team can invest in tools and buy a tool, or invest in getting data in the right format. You spend some time, energy, and effort to hire some people to get your data in the right format, or maybe buy that data or do something with it. But what do you mean when you say that teams invest in getting product buy-in? What does investing in product buy-in mean?
Manasi Vartak: I think that investment is more of a human capital investment. Let me give you an example. So suppose you are a bank and you want to produce a model that would recommend financial products to your financial advisor. So, I’m a financial advisor, a client comes in, and I want to recommend to them the best products that we have. In that case, as a data science team, my consumer of the model is this financial advisor.
Manasi Vartak: So you need their buy-in, and you need to understand what it’s going to take for them to completely trust and use your product. That might mean that your model needs to come with some sort of explainability. It might mean that they need to see some data on back testing that demonstrates that your model works as expected. And then you might also need to work with the actual product teams to show the results of your model in a meaningful fashion-
Kirill Eremenko: Mm-hmm (affirmative). Okay.
Manasi Vartak: So, yes.
Kirill Eremenko: So, it’s not like an external investment of funds. It’s more of an investment of time and care to make sure that it’s not just like dumping this model onto the department. That you actually spend the right amount of time figuring out what they actually want. Asking the right questions, presenting in the right way. And, like the whole… More like soft skills associated with data science.
Manasi Vartak: A hundred thousand percent, I would say. Because your model in this particular case is going to be used by another person on your team or another team and you want to make sure that they actually understand what your model is doing, and they’re going to trust whatever you’re producing.
Kirill Eremenko: Okay. Understood. And let’s get back to number two then: getting the data in the right format. So how do top-tier data science teams invest into that?
Manasi Vartak: So I can give a few examples. I think everyone’s data setup is very different. So for example, at Twitter, it’s well documented that the data processing is performed via something called Scalding. And the data engineers, they work very hard on writing Scalding jobs or pipelines that are reproducible, so that if a data science team needs some sort of data, then they can write the jobs that will create the data in a timely fashion, and it will be available to the data scientists when they want to query it offline and online. There are also ways to save and visualize the data quickly, and make sure you can run some quality checks on it and so on. So, it’s that whole infrastructure that helps with getting the data into the right input format so that the model can consume it.
Kirill Eremenko: Okay. How does a more conventional organization, not like a tech giant but a more conventional organization with a relational database, but with… the usual problems of missing data, corrupt data, incorrect data, and various dispersed data sources. How do they go about this problem of getting their data in check?
Manasi Vartak: It’s very bespoke. Right now, there are a few, like, data integration providers out there, and maybe those sorts of tools are being used. The other one in a lot of cases is that there is a data engineer who’s writing giant SQL queries that are going to create dumps of this data. And then the data scientist is going to spend a whole lot of time cleaning them up and then producing versions of that data that are: clean one, clean two, clean with features, all of that fun stuff. The key part there is we still need to track how that data is getting transformed throughout these stages, because we’re going to have to reproduce that the next time that we build a similar model or that model has to get deployed into a real product. We have to apply that transformation to live data. So if we don’t know how we did it the first time around, we’re not going to be able to reproduce it at all.
Kirill Eremenko: Okay. That’s where your tool comes in, right? When [crosstalk 00:37:42]
Manasi Vartak: Yup.
Kirill Eremenko: Okay.
Manasi Vartak: And to be clear, I think that’s a very bespoke process in most organizations, and the one sort of takeaway I have there is, if you can specify your data transformations in a declarative manner, which is: I’m taking this particular data set as input, this is what I’m applying to it, this is my output. Once you have it in that declarative fashion, then it’s way easier to do the same thing over and over again versus if it’s something that’s ad hoc and not documented. And that’s what we’re hoping to make simpler with our product.
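To illustrate the declarative idea described here (this input, these steps, this output), the following is a minimal Python sketch. It is not Verta's API or any real library: the names `Step`, `Pipeline`, `drop_missing_age`, `add_is_senior`, and the sample customer rows are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One named transformation: takes a list of rows, returns a new list."""
    name: str
    apply: Callable[[list], list]


@dataclass(frozen=True)
class Pipeline:
    """Declares the input data set, the ordered steps, and the output name."""
    input_name: str
    steps: tuple
    output_name: str

    def run(self, rows):
        # Apply every declared step in order; nothing happens ad hoc.
        for step in self.steps:
            rows = step.apply(rows)
        return rows

    def describe(self):
        # A human-readable record of how the output was produced.
        chain = " -> ".join(step.name for step in self.steps)
        return f"{self.input_name} -> {chain} -> {self.output_name}"


def drop_missing_age(rows):
    return [r for r in rows if r.get("age") is not None]


def add_is_senior(rows):
    return [{**r, "is_senior": r["age"] >= 65} for r in rows]


# Hypothetical pipeline: raw customer rows in, cleaned rows with a feature out.
clean_with_features = Pipeline(
    input_name="raw_customers",
    steps=(
        Step("drop_missing_age", drop_missing_age),
        Step("add_is_senior", add_is_senior),
    ),
    output_name="clean_with_features",
)

raw = [{"id": 1, "age": 70}, {"id": 2, "age": None}, {"id": 3, "age": 30}]
result = clean_with_features.run(raw)
print(clean_with_features.describe())
print(result)
```

Because the pipeline is a plain description rather than a sequence of one-off commands, the same object can be rerun on a new data dump or on live data, and `describe()` doubles as documentation of how the data was transformed.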
Kirill Eremenko: Yes. Like, completely agree. So when you’re… I wonder, when you’re experimenting with doing some data exploration, like, you do things ad hoc, but then once you’re actually developing a model, I find, and like I worked at Deloitte for two years, and they drill it into your head that it has to be documented, like so many SOPs, like standard operating procedures.
Kirill Eremenko: There are so many checks and so many quality assurance steps that have to be taken. For instance, one thing that really helps in this, something that our listeners have seen in the Data Science A-Z course, is using SSIS for these types of things. It’s painful. It takes time to set up these, like, upload scripts, but once they’re set up, you just click one button and your data gets re-uploaded, and you have all those data cleansing checks in place as well. And if you need to change anything, it’s all documented. So-
Manasi Vartak: So that’s fantastic.
Kirill Eremenko: Yes, that’s a really… There are some tools out there that are really cool for that. It’s just a matter of having that discipline and doing it, because it might seem faster to just do it ad hoc now and be like, let’s get the result. But guaranteed, if you’re going to be using this model for like at least two or three months, you’re going to forget what you did at-
Manasi Vartak: Yes. I agree. I completely agree.
Kirill Eremenko: Yes.
Manasi Vartak: And I would say that if you’re doing that in a consistent manner, you’re way ahead of the game. Yes. I think any data scientist who’s doing that regularly will be highly valued.
Kirill Eremenko: Very cool. Yes. True. Totally true. So the other thing I wanted to ask you is with… Sorry, Verta. With Verta.AI. So your software, your solution. Does it allow for model maintenance? So that’s a massive problem that a lot of data scientists don’t kind of encounter until a much later stage. So you’re going to be, like, good at data science, you’re good at developing models, but then at one point in time, once you deploy a model successfully and it stays, you’re going to hit that problem of, all right, what do I do with it six months, 12 months, 18 months later?
Kirill Eremenko: I was in an organization once where they deployed a model; external consultants came in, deployed the model, no, this is not Deloitte. So an organization came in, deployed a model. Very cool model, it was doing, I don’t know, 70% accuracy or something like that. For customer churn or some other customer-related segmentation. And then the consultants left. And then 18 months later, the same model is working. It’s in production. It’s running every night. It is producing results. But nobody is looking at it. Like, the organization didn’t have a data science team. So nobody’s looking at it. And once we actually went in and looked at it, like I had a look at it with one of my mentors, its accuracy was at 48%, so it was-
Manasi Vartak: Oh, dear.
Kirill Eremenko: Better to flip a coin than to use that model. So model deterioration is a big thing. How does Verta fit in there?
Manasi Vartak: Absolutely. We think that’s super, super important. So, once a model is deployed in production, it’s sort of a real living thing. It’s going to take in new inputs, and it’s going to produce new outputs. And if the data that you’re feeding to the model no longer looks like the data that it was trained on, it’s going to start producing pretty bad results. And that’s called concept drift, data drift, model decay, a lot of things. So, in our system, we actually provide model monitoring out of the box. So if you version your model using Verta, and then you deploy it using Verta, we start tracking the data that your system is getting. If you log the training data when you version the model, then we know what the training distributions are that you were expecting. Yes.
Kirill Eremenko: Interesting. So when you say you track the data, it is not just like the columns and rows, the names of the columns for instance. Like the format of the data might be the same, but you’re actually tracking the distributions of the values, which is what like, so for instance, if there’s a shift in the audience, like their age changes or their preferences change or there, I don’t know, like time of the year changes and things happen to the customers or whatever, you’ll see that in the distributions.
Manasi Vartak: Exactly. Because we know what the training distribution is like, and at test time, we can compare whether the test distribution is similar to the training one. And if it’s not, then we can set an alarm saying, “Hey, there seems to be something off. Someone needs to take a look at it.” So in your previous example, the system would have alerted way before that accuracy fell.
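The comparison described here, checking whether live inputs still look like the training data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Verta implements monitoring: it uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) as one possible distance between distributions, and the function names, the sample ages, and the 0.3 alert threshold are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(train, live):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    train_sorted, live_sorted = sorted(train), sorted(live)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in train_sorted + live_sorted:
        cdf_train = bisect_right(train_sorted, x) / len(train_sorted)
        cdf_live = bisect_right(live_sorted, x) / len(live_sorted)
        gap = max(gap, abs(cdf_train - cdf_live))
    return gap


def drift_alert(train, live, threshold=0.3):
    """Return True when the live data has moved away from the training data.

    The threshold is an arbitrary assumption for this sketch; in practice
    it would be tuned per feature or replaced by a proper significance test.
    """
    return ks_statistic(train, live) > threshold


# Hypothetical feature: customer age at training time vs. in production.
train_ages = [25, 30, 35, 40, 45, 50]
live_ages_similar = [26, 31, 36, 41, 46, 51]
live_ages_shifted = [60, 65, 70, 75, 80, 85]

print(drift_alert(train_ages, live_ages_similar))  # no alarm expected
print(drift_alert(train_ages, live_ages_shifted))  # alarm expected
```

The point of the sketch is that the alarm depends only on the input distributions, so it can fire before anyone has labelled outcomes with which to measure accuracy.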
Kirill Eremenko: Yes.
Manasi Vartak: Because these models, churn in data science is very common. Someone might move to another company, but that model needs to live on beyond their tenure at that company. That’s where the maintenance and monitoring really becomes key, where we’re keeping a check on it. All right, this is working as expected. It’s working as expected. Suddenly something’s weird, and you’re going to alert the right person.
Kirill Eremenko: Amazing. I love that. This is so cool.
Manasi Vartak: Oh, that is okay.
Kirill Eremenko: Oh my God. So we have a guest on the show sometimes. He’s been on the show twice. Mike Segala, he’s actually from Boston as well.
Manasi Vartak: Nice.
Kirill Eremenko: MIT in Boston. So, and he says that from this podcast, he… Like, a lot of very large-scale organizations have contacted him to do consulting work. I’m just saying this to all those CEOs and directors of data science from those organizations listening right now. You guys, you ladies and gentlemen, should really consider Verta.AI. Like, that is amazing. Like, if you can get alerted when your model is deteriorating, how much cost is that going to save you from, like, doing those checks yourself?
Manasi Vartak: Exactly. Right?
Kirill Eremenko: What a genius idea! Like how… Like, I guess for you it was natural. Like, being in this space, it was like an essential thing. Like, why hasn’t anybody come out… And the question is, like, why hasn’t anybody before you come up with this? Have you ever thought of that?
Manasi Vartak: So data drift is pretty reasonably studied in the literature. So, like, from a research perspective, you’re just computing distances between distributions. So, the statisticians are going to say, “Well, it’s obvious. Just take a distributional test.” I think what’s missing, again, not to go back to the same point, is the infrastructure.
Manasi Vartak: Someone needs to be tracking your training data. Someone needs to be computing these distributions, and then there needs to be another piece that’s going to be watching your model. And if a data science team were to write custom code to do all of that, it would be super hard and not re… You cannot do that over and over again. So that’s why I think it’s more about, is it easy for someone to implement this? That’s, I think, the biggest question. And the other piece is also that not as many people are putting models into production yet, but in the next couple of years, we’re going to see an explosion in the need for these kinds of tools.
Kirill Eremenko: Very interesting view. And I agree with you. I think that this is going to explode. But tell us like, why now? What’s changed?
Manasi Vartak: Absolutely. I think there are two things that have changed. The first one is that until about 2013, 2015, I would say, was the age of big data. Everyone was collecting data; they were building data lakes. They were getting all of their data strategy.
Kirill Eremenko: It was so, it’s so funny, when I heard that term for the first time, I was like, lake? What lake?
Manasi Vartak: Yes. And there was a joke in the data science community that it’s actually a data swamp and not a lake. Because no one is spending good.
Kirill Eremenko: Yes, that’s right. Like when you just throw everything in there and it, like, just rots, yes.
Manasi Vartak: Exactly. And then you have no idea, but people spent a lot of time and effort in getting their data together. And then the last few years have been spent on a lot of pilots of ML technologies. Can I build a chatbot, can I use ML for churn prediction? The example that you were mentioning. And what we’re seeing now is the companies that have successfully integrated ML into their products; they’re seeing advantages, like, multiple-fold. For example, if you think about, again, to pick on the Twitter case, if you think about all the places where models are being used, whether it’s the news feed, the ads, it might be who to recommend for following, or there’s something called Moments.
Manasi Vartak: So there are whole types of products that are enabled due to ML, and a lot of companies are realizing that that’s also the reason why a lot of startups are disrupting traditional industries. To give you an example, in the lending industry, there are a host of small startups that are using alternative data sources and machine learning to make better decisions about lending than have traditionally been made. So right now we’re seeing the amazing potential of ML, and the reason why we’re going to see an explosion in production ML and monitoring is that the larger companies realize, in order to stay competitive I need to be inserting ML into this, that, and the other. So let me figure out how to do that. And that’s why we’re going to see it. Anything that you touch in the future will be intelligent, and it will be powered by these production ML processes.
Kirill Eremenko: So in a nutshell, the competitive pressure has caught on, it took some time to build the momentum, but now you either ML or die.
Manasi Vartak: Yes, I like that.
Kirill Eremenko: Pretty much it, right? It’s kind of like the whole story of electricity that Andrew Ng quotes, that AI is the new electricity. A hundred years ago, how many businesses had electricity? Now can you name even one business anywhere in the world doing anything that doesn’t use electricity? No. Maybe some shamans in the middle of Brazil, but that’s not even a business. Yeah. So with that exciting future dawning upon us, a very interesting, somewhat philosophical question I have for you is, what should our listeners look into to be best prepared? Because with this kind of change, radical drastic change happening and all these tools flying into your face and all these different concepts, machine learning algorithms, there’s so much happening in the data science space and machine learning space, people get lost. I get lost sometimes. I don’t know what to create a new course about.
Manasi Vartak: Yeah. 100%.
Kirill Eremenko: So, what are your thoughts? Is there any guidance or advice you can provide to those listening on how they can, what can they do that’s going to be best for their careers in the next three, five years?
Manasi Vartak: That’s tough.
Kirill Eremenko: Tough one. I know.
Manasi Vartak: Tough one. I’ll throw out my ideas. I think that there are two things one can do as a data scientist, and these are just very practical tactical things. One is to develop domain expertise in a certain area or a certain kind of data, whether it’s in banking, these are the kinds of datasets and here are the tools and techniques that are most helpful in the banking industry, and sort of specialize there.
Manasi Vartak: The other one is what I mentioned before, which is incorporating more software engineering into your data science practice, because I think that modeling, while important, is going to be limited by your ability to integrate those models into products. So if you come into an org and you say, “All right, I don’t just do data science, but I understand what it’s going to take for you to really deploy my work widely. So here’s what I can do to help you become successful.” I think that’s very powerful.
Kirill Eremenko: And it doesn’t mean that you will have to be the one deploying those or doing the software engineering. It just means that you will have that perspective, that you can develop your models in a way that they’re deployable. Or you can talk to the software engineering teams, and help them, guide them, coach them. It’s like a whole new skill set that you bring to the table. It doesn’t mean you have to be the one coding those models into products; it just means you have that capacity to guide the organization in the right direction.
Manasi Vartak: 100%. And that’s where I think tools such as Verta or others can help: data scientists still stay data scientists, but they have an understanding of what it takes to deploy it. So while infrastructure like Verta can do that, they still need an understanding of, once I’ve built the model, here are all the things that have to happen before it hits the customer. So let me understand that. Let me speak that language.
Kirill Eremenko: True. And with your permission, I’ll add one more from what you said before, actually. What the top-tier data science teams are investing into, and I really liked that you said getting product buy-in. Essentially, let’s sum it up as soft skills. I think that’s going to be super important. What are your thoughts? Will soft skills still be important in the future?
Manasi Vartak: Very much so. You can write code in isolation, but if you want to have impact, then those soft skills are going to get you there.
Kirill Eremenko: Fantastic. Totally. All right. So, domain expertise in a certain area, incorporate software engineering skills, and look at your soft skills. On that note, Manasi, so happy, great chat. It was amazing having you on the show.
Manasi Vartak: Likewise. I had a lot of fun. Thank you.
Kirill Eremenko: It’s been a good one. Before you go though, what are some of the best places for our listeners to catch you, to learn more about Verta and your career, maybe get in touch with your team and things like that?
Manasi Vartak: Yeah. So a few different ways. On Twitter I am @datacereal, d-a-t-a-c-e-r-e-a-l.
Kirill Eremenko: Interesting, I’ve got to know the story behind that.
Manasi Vartak: I like cereal, and I like data.
Kirill Eremenko: Nice. And it was available. Why not?
Manasi Vartak: Exactly. A lot of decisions in life, [inaudible 00:53:36]. So that’s me. Also, you can look me up on LinkedIn; I’m Manasi Vartak on LinkedIn. Our website is www.verta.ai. We post Medium articles there. There’s also a careers page, and we are in beta, and if you would like to try out our product, just on that little chatbot at the bottom, mention that you listened to this podcast and we will make sure to get you a code right away.
Kirill Eremenko: Fantastic. Thank you so much, Manasi. Super, super exciting chat, and I hope to catch up sometime soon. Good luck with Verta. It’s going to be a blast.
Manasi Vartak: Absolutely. Thanks so much for having me. This was great.
Kirill Eremenko: Thank you, ladies and gentlemen, boys and girls, for tuning into the SuperDataScience podcast today, and I hope you enjoyed our chat with Manasi and of course all the amazing insights she shared about the world of models in data science and where we’re going in this space. I’ve got some exciting additional news for you. We chatted with Manasi after the podcast, and it’s quite likely that she’ll be joining us as a panelist for DataScienceGO 2019. So if you’re coming to DataScienceGO, you’ll get to meet her there. If you’re not coming, if you haven’t got your tickets yet, then you can get them at www.datasciencego.com.
Kirill Eremenko: Don’t miss out on meeting Manasi and many other exciting speakers, panelists and attendees. We’re going to have about 600 to 800 data scientists attending this year. And as usual, you can get all the show notes for this episode at www.SuperDataScience.com/267, that’s SuperDataScience.com/267. There you can get the transcript for this episode, any materials that were mentioned, plus the URLs to Manasi’s LinkedIn, Twitter, and Verta.AI. Don’t forget you can enter, talk to the chatbot and tell the chatbot, what a crazy world we live in, right? You can talk to a chatbot and tell the chatbot that you listen to the SuperDataScience podcast, and they will prioritize you in the beta testing participants, so you can get early access to Verta.AI and test it out for yourself and see how that can impact your career.
Kirill Eremenko: On that note, I hope you had fun. If you know anybody in the space of data science who’s interested in data modeling or is struggling with some of the problems that we identified on this podcast and that could learn something from this episode, make sure to send them this episode. Just forward it to them. Spread the love. It’s absolutely free on all podcasting mediums. You can find it on SoundCloud, you can find it on iTunes and Stitcher, you can find it on SuperDataScience, wherever you like. And I’m sure there’s lots of people out there that can benefit from these insights. And once again, on that note, thanks so much for being here, can’t wait to see you back here next time. Until then, happy analyzing.