Kirill Eremenko: 00:00:00
This is episode number 417 with the CEO of Elephant Ventures, Art Shectman.
Kirill Eremenko: 00:00:12
Welcome to the SuperDataScience Podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let’s make the complex simple.
Kirill Eremenko: 00:00:44
Welcome back to the SuperDataScience Podcast everybody, super excited to have you back here on the show. Today I spoke with Art Shectman, who is the CEO of Elephant Ventures, a company that helps other companies with their data engineering and data product development. This was a very cool conversation. So in this podcast, you’ll hear about what data engineering is and what data product development is, and how an experienced company, with, I think, about 17 years of experience of helping other companies in the space, goes about it, what kind of pipeline they’ve prepared and how they put it all together. One thing I want to warn you about right away is, if you choose to listen to this podcast, then you need to commit; it took me about 30 minutes to build this picture in my head. You’ll actually hear me in the audio say, wow, now I finally understand how interesting this all is. It took me about 30 minutes to build it all into one picture, and then the conversation took on a whole new level, and I saw how it all comes together.
Kirill Eremenko: 00:02:06
So if you do listen, make sure to commit and get at least past that 30-minute mark. What exactly will you hear about? You will hear about business value, data quality engineering, data pipelines, clarity of purpose and capturing value through data, the cone of reality dispersion, the definitions of data engineering and data product development, what kind of technical skills are required, how Art’s company looks at all this philosophically, and also how they help companies implement these things. And who is this podcast most valuable for? This podcast is most valuable if you not only want to do data science, but also want to understand the infrastructure that is required in order for companies to do data science rapidly and do rapid innovation with data.
Kirill Eremenko: 00:03:03
So if you want to just add that to your data science toolkit and understanding, or whether you are an executive at an enterprise or you want to build a startup and you want to have those things a bit more clear in your mind, this podcast is for you. Very interesting conversation coming up, can’t wait for you to check it out. So without further ado, I bring to you Art Shectman who is the CEO of Elephant Ventures.
Kirill Eremenko: 00:03:37
Welcome back to the SuperDataScience Podcast everybody, super excited to have you back on the show. And today we’ve got a special guest calling in from Long Island, New York, Art Shectman. Art, how are you doing today?
Art Shectman: 00:03:49
Great. Thanks for having me on, this is going to be fun.
Kirill Eremenko: 00:03:53
Excited to have you on and you’ll be, I don’t know, maybe surprised to know you’re the first guest who is doing the podcast standing with me. We’re both standing up right now, this is awesome.
Art Shectman: 00:04:05
That is awesome.
Kirill Eremenko: 00:04:07
Yeah, we talked a bit about it before the show. Tell our listeners, I’ve been trying to drive this message across. I even recorded a short episode for the podcast, with a chiropractor on why standing up is good for you. When did you start standing and how has it changed your life?
Art Shectman: 00:04:23
Oh man. So this is an interesting story. So my great aunt, Viera, she’s going to be mad at me, she’s 90 years old, and she’s a professor of piano pedagogy. Now she’s retired, but lifelong piano teacher and she got into ergonomics and repetitive strain injuries to help musicians who had injuries over time, rehabilitate themselves and keep playing their instruments. So she learned a bunch about ergonomics. She was always hounding me like, “You have to have the proper chair, has to have a little back and a seat pan tilt adjustment, and height adjustment, and that’s it, no arms.” And she would beat me if I had arms on my chairs. She was like, it has to be this high, and the monitor has to be this way and whatever.
Art Shectman: 00:05:17
At the company, we had to buy those kinds of chairs, I was having post-traumatic chair stress. Anyway, so we were really sensitive to ergonomics and making sure people had good setups and whatever. And then as you get on, sitting at the computer all the time, your posture suffers and you lose a lot of flexibility in your hips, and your shoulders start to round and whatever, it’s bad news for posture. One of our folks, her husband actually had started standing up and he was just raving about it, and he’s like, “I feel great. My posture has improved. I can’t sit down now.” That’s crazy, and I’d been reading about all the health benefits, so about, I don’t know, now two and a half years ago, I started standing up and it was amazing. To everyone out there, try it, and definitely buy a gel mat. That’s one thing that will make all the difference in the world, but it’s great.
Art Shectman: 00:06:13
I mean, it was fantastic. I like snowboarding, and it made for a fantastic differential diagnosis: pre standing desk, I was terribly sucking wind, my legs were killing me snowboarding, and then post standing up, like a year later, I could do whatever I wanted to do. My posture has improved and I feel more flexible. It’s great for core strength, and it’s just awesome, it’s a no brainer. Really, after the first two, three months, you don’t even notice you’re doing it. And then you get uncomfortable sitting.
Kirill Eremenko: 00:06:41
The first three months are the hardest. I went through that recently. I’ve been doing it for five months, first three months, you just want to sit down all the time, your back hurts and so on. But then after that, it’s just like, it’s natural, right?
Art Shectman: 00:06:53
Yeah. I remember when I was like 16 or 17, one of my first jobs was being an usher at a movie theater, and you’d just be on your feet walking through the movie theater all day or taking tickets. You were just standing the whole day. I just remember the first couple of weeks of that job I was in pain and sore, and I was like, man, my feet are killing me. So now, a lot older than that, it’s definitely like three months, not like three weeks of adjustment, but the gel mat takes the pressure off your knees and gives you kind of a reason to activate your core and move around while you’re standing. That helped me a lot, but totally worth it.
Kirill Eremenko: 00:07:32
Awesome. Awesome. Well, good suggestion for people to check out. Yeah, welcome, excited to talk. We’ve got quite a few things to cover about what you do in the data space, and it’ll be interesting and exciting for me as well, because I am excited to learn about what you do, and I think you’re onto a very relevant topic these days with digital transformations and helping the world in this technology space. For somebody who hasn’t heard about your company before or doesn’t know what you do, how do you describe what your company is? Maybe introduce your company for us please, and also what your role is there.
Art Shectman: 00:08:25
Sure. My role is easy, I’m the chief cook and bottle washer, which means from a servant leadership standpoint, I do everything that no one else has yet been trained or hired to do, which luckily for me I have a fantastic team. So these days it’s the fun stuff and not so much of the super heavy lifting. Elephant Ventures is a digital innovation and transformation company. We were founded 17 years ago and we focus on helping either early stage startups or innovation practices in large corporations to materialize their ideas faster.
Art Shectman: 00:09:08
We have a methodology we developed over the last 17 years that we call dependable innovation, and in fact, dependability is at the top of our corporate values pyramid, and really we help people go from early stage ideation and how they bet and think about ideas to the kind of concept of idea refinement and how you prototype early and then go build prototypes and get to business value. Our whole focus is about, how do you go from an idea to deliver business value in the innovation or adjacency space where you’re trying to transform an existing system or do something new with new tools.
Kirill Eremenko: 00:09:46
Okay. Okay, very interesting. I hope you’re enjoying this episode, we’ll get back to it after this quick break. And Confident Data Skills edition two is out. This is the second edition of the book I published in 2018. Some time has passed since then, a lot of things have changed in the space of artificial intelligence and data science. If you’re not familiar with the book, then it helps develop an understanding of all the main data science algorithms and the data science process on an intuitive level. So no code, no complex mathematics, just intuitive explanations of the algorithms and useful practical examples and case studies. This book will be extremely helpful for you if you’re starting out, or if you’re looking to cement in that intuitive feeling for the algorithms as you progress through your career.
Kirill Eremenko: 00:10:38
Specifically, you will learn about decision trees, random forest, k-nearest neighbors, Naive Bayes, logistic regression, K-means clustering, hierarchical clustering, reinforcement learning, upper-confidence bound and Thompson sampling. And in this second edition, I also added robotic process automation, computer vision, natural language processing, reinforcement learning and deep learning, and neural networks. Plus of course you will learn extremely valuable skills for a career such as ethics and AI, presentation skills, data science interview tips, and much more. So if you want to get a grip and really cement in your intuitive understanding of this field, then this is the book for you and you can get it on Amazon already today. It’s called Confident Data Skills edition two, and it’s a purple book. So enjoy, and let’s get back to the podcast.
Kirill Eremenko: 00:11:29
There’s plenty of ideas out there, but it’s about executing, and on the other hand, we were chatting before the podcast, there’s plenty of companies that have a lot of data scientists executing, but most of those projects don’t see the light of day. So I guess it’s about putting the right ideas with the right execution to get results.
Art Shectman: 00:11:49
Yeah, so we have this framework we call our data decision accelerator and we have a technology pattern that we call our data-driven innovation pattern. And it’s around that core concept, exactly what you mentioned, that there’s a lot of data and there’s a lot of people doing data science, a lot of people doing data engineering, but somewhere, somehow, you have to thread that through with the concept of business value. You have to achieve a business result somehow, you have to be reducing risk, changing capital allocations, capturing markets, developing revenue, creating insights. But even that is not enough. If you just take an idea of a way to get some data somewhere, do some modeling around analytics, get it surfaced to a dashboard and create insights for people, you haven’t captured any business value. You’ve given people the understanding of how they might capture business value, but you didn’t actually get it captured.
Art Shectman: 00:12:42
So what we’re finding in the current environment is that the cycle for digital transformation, the cycle time of patience for the return on innovation, for your data science team to produce something for you, is shorter. People are less patient, they need results in 90 days, not nine months. And until you actually take the insight and get it into the operational fabric of your business and get it consumed and implemented, you don’t actually get any business value out of the investments you’ve made. So we help people at the early stage of the ideas get leading indicators of whether or not the thing is going to create business value, or at least have a theory about how they’ll adopt it and create business value from it. Then we have practices and tools that rapidly move you along that process. So the data-to-decision accelerator process is meant to look at your data, help you apply design thinking and lean product development practices and all of our innovation background towards very focused data product development, or data insight development, that helps you as a company get to a decision and then implement workflows and things that allow you to take action on that decision.
Kirill Eremenko: 00:13:53
Okay. Got you. Can you give us an example? So I conceptually understand the area where you would be applying your skills and what direction you help companies move in, but a concrete example would be very useful to understand, oh, okay, that’s exactly what you guys do.
Art Shectman: 00:14:20
Sure. So let’s say you’re in the retail space right now, and you’re trying to decide where to open and close retail operations, or you’re struggling with the mixture of where you balance inventory or supply chain aspects of how you’re supplying your stores or even a manufacturing operation. Right now you have some pretty big challenges, you have some pretty big disruptions based on the shift in how people are purchasing, where they’re purchasing, where they’re spending, everything else. So as a good example, let’s say that the data that’s coming from the markets is impacting corporates. We work with an industry organization called Global Credit Data that has a centralized database of large corporate loans and all kinds of different asset classes of loans, and banks are, through their regulations, necessarily obligated to have certain capital reserves.
Art Shectman: 00:15:26
So we’re working with them to create kind of lean conduits for information from the banks to report on their non-performing loans, looking at ways with them of how you take that, how you merge it with real-time market data, and then create insights for banks to change the way that they’re configuring their portfolios and maintaining their balance sheets. At times, that could be a shift in the capital reserve that a bank keeps on their balance sheet, and they could free up tens of millions of dollars and put it to work productively elsewhere. Other times it could mean real-time signaling of things to remove from their balance sheet that they should submit to a marketplace to sell off. But that’s maybe a poor choice of a case study; the point, I guess, is that there’s a need to integrate data rapidly, there’s a need to understand how that data relates to itself and create some type of common data model or common data fabric that allows you to understand the data that you have. And then you develop analytics and sort of insight generation on top of that.
Art Shectman: 00:16:29
Then you have to rapidly be able to plug those insights into some operational system. That takes form in all kinds of different places. There’s work we’ve done with public data sets, correlating public health data sets, or public governmental information and procurement data sets, where you want to be able to correlate those things, do some entity resolution between all the different players that you’re seeing in those data sets, and then allow people to make strategic decisions on the integrated data model that you see. Anyone who’s saying, hey, we have a bunch of data, or we’re going to get some new data, or who’s thinking about buying data from a third party, you have to ask the question, why? What are we hoping to achieve? What is the business value? What’s the outcome for our business if we purchase this data? What does it do for us?
Art Shectman: 00:17:22
So we help people think through that process, create a prioritization framework of what they’re trying to achieve, understand clearly what the business value is they’re trying to deliver once they’ve integrated that data and then help them build the engineering pipelines and analytical tools that get them to the ultimate insight and then the final step being, okay, now you have the insight, how do you change the workflow for your organization and do something about it?
Kirill Eremenko: 00:17:46
Okay. How is the work you do different to consulting?
Art Shectman: 00:17:52
Good question. I would say it is consulting. There’s a small dimension of what we do where we develop venture foundry type applications or products, or we form our own ventures, and sometimes we partner with data-interested early stage venture capital firms to help them build some of their platform, to actually move their data engineering chops along, so that when they’re investing in things at the early stage, we can help them rapidly get their MVPs into market. I’d say we look a lot more like a partner and a venture foundry in our consulting footprint, but at the core we are consultants.
Kirill Eremenko: 00:18:35
Okay. Got you. What are the typical industries that you work with?
Art Shectman: 00:18:45
Lots of public sector stuff, tons of healthcare experience, here and there financial institutions as well, like we’re working with an early stage neobank right now to help them inflate their operations from basically idea and pitch deck stage to a fully operational bank, leveraging a bunch of the FinTech platforms that are out there today. That’ll take us probably 120 days to inflate a bank from scratch.
Kirill Eremenko: 00:19:11
Wow. Okay. Got you. So public sector, health care, banks. Do you have any case studies from the healthcare sector? Given the current situation in the world, it would be very interesting to hear what kind of services companies are requiring these days.
Art Shectman: 00:19:28
Yeah. There’s a little bit of a, I have to filter my answer a tiny bit just in deference to the disclosure agreements with some of our customers, but the general need right now is for understanding how current healthcare strains of demand are impacting the footprint of where you’re making investments, and then what regional healthcare operators and systems can do to better serve their population, and looking at some of the social determinants or behavioral segmentation of how people’s health is currently being impacted. I think the dialogue has shifted from tracking and observing where people are disproportionately being affected by the pandemic, to what do we do once the virus is here? What’s going to happen next? Sorry, once the vaccine is here, what’s going to happen?
Art Shectman: 00:20:30
I think, so there’s some amount of that as a current observation, but generally speaking, I think healthcare data is very messy and it’s a struggle for most providers or operators or payers to get it all integrated, to get into a place where they can consult it. Then there’s a ton of domain knowledge that’s needed to understand what you’re looking at to then help people make decisions about it.
Kirill Eremenko: 00:20:59
Okay. Speaking of messy data, is that a common thing you encounter when the data is not high quality and your team has to spend a lot of time preparing it?
Art Shectman: 00:21:12
Sure. That’s a fantastic question. So data is always messy. Anytime data comes from an external source that you don’t control, or touches human beings, or human beings are involved in the process of aggregating or entering or collecting the data, it becomes messy. And I would say our experience has been that standard data scientists are going to spend two, three weeks developing a model, six to seven months developing clean data and dealing with data hygiene so they can actually test or train the model, and then another six or seven months dealing with production issues and DevOps and deployment pipelines and MLOps to get their model deployed. And something like 70 or 80% of the initiatives fail, because they just never make it through that cycle. But the starting point is data hygiene and data quality. It actually brings me to an interesting point about data quality and data quality engineering.
Art Shectman: 00:22:06
Out of our venture foundry about seven years ago, we launched a social impact venture called Ultranauts. We employ folks on the autism spectrum as software testers, and we have this awesome new data quality engineering practice that’s been launched by our new VP of strategy, I forget her exact title, but Nicole Roswell is this amazing woman who has a fantastic long record in understanding machine learning, data science, data quality, quality engineering, general consulting, and transformation services. She’s amazing. But Ultranauts has this burgeoning practice around quality engineering services on the data side, so data quality engineering, and really what we’re seeing everywhere is folks are integrating all kinds of new data. They’re having trouble assessing whether or not the data feeds are worth integrating. They’re having trouble understanding, in their own data, what is the kind of statistical profiling of their data and where do they need to be concerned? What kinds of decisions should they be able to make based on the quality of the data?
Art Shectman: 00:23:10
Then once you operationalize something, there’s always drift. So how are you monitoring for that? How are you checking? How are you making sure that the assumptions you built your models on are still valid? So there’s this whole space that is coupled to this idea of using data to make decisions, where you also have to make sure the data still has quality, or has quality before it gets in the door, or that you’ve cleaned up all the random internal data that you have and created some type of known framework of what’s in there and how to use it. But Ultranauts is a super cool company. Like I said, it’s about seven years old. We say we’re out there demonstrating that cognitive diversity is a competitive advantage. About 75% of the company is on the autism spectrum, and it’s just great. We really have found some magical, what we call inclusive practices, where you embrace neurodiversity, and we have what we call this universal workplace we’re designing, where we’re figuring out how to employ cognitively diverse teams well. Ultimately we release some of our findings to the world and share our practices to inspire other folks to do the same.
Kirill Eremenko: 00:24:18
That’s amazing. That’s a great example as well, this Ultranauts, the neurodiversity company, showing that Elephant Ventures is not just a consulting firm; sometimes, as you said, you partner up with people and you create companies. I want to talk to you a bit more about data quality engineering. So for somebody who’s listening to this who wants to start a startup, or maybe who is already running a company, what is your advice in terms of data quality engineering? What’s the one takeaway that you can give them right now that they can go and do differently in their company, whether it’s a startup or an existing enterprise, that will help them with data quality engineering?
Art Shectman: 00:25:06
Man, the one prized pearl of wisdom. The thing that I think is most important is the precursor to data quality, in terms of understanding what your data catalogs look like, what you’re actually ingesting. Because it’s almost like, if you have a mixed bag of fruit, you have apples, you have oranges, you have a couple of rocks that look red, and you have this tomato, which I guess is a fruit, you don’t really know what you have. So how do you know, do I have quality fruit? Do I have quality apples? Well, or it’s sort of-
Kirill Eremenko: 00:25:57
Do I have quality rocks?
Art Shectman: 00:25:58
Yeah. Do I have quality rocks that are painted red? So there’s a lot of work, I think, that goes into understanding the data as it’s coming in, and almost like the custodial aspects of data hygiene. What’s there? What is populated and what’s not? What is variant? What has the same type of field formatting or the same data type? What’s the breadth of the coverage inside of all the fields that you have? Are they drifting? Is the profiling of your data different than it was before? There are really easy ways on the data custodial side to start the process and just know what you actually have.
Art Shectman: 00:26:40
Then there’s a secondary layer in the same vein of knowing what you have: if you don’t understand the data itself, what it’s useful for, it’s harder to understand what you need to build a program around quality engineering, because there are some things you can do statistically and stochastically that don’t really depend on a core business understanding of what it is you’re dealing with. For other things, it’s more important to then build on top of a solid, clean data framework and an understanding of your catalogs, to then figure out, okay, how am I empowered, or how am I set up, to be able to make the kinds of decisions or use the types of data feeds in models or data streams for my business? So understanding what the data is applying to, or how it’s being consumed, is almost like the second order step of data quality engineering. The first step is knowing what you have and making sure, in the custodial aspects of ingesting it and integrating different systems into a common data model or data fabric, that you have some baseline of quality there.
Art Shectman: 00:27:48
So I would say start at the custodial level, and once you’re comfortable with what you have, then you can move into the second order stuff where it’s really understanding how it’s impacting your business. But I guess maybe the eye on the prize is like, don’t do the custodial stuff just to do the custodial stuff. Also have some semblance or idea of progression that says, I know where I’m going to capture business value ultimately. I know what quality I’m trying to drive. I’m baking apple pies, that’s why I need good apples.
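To make those custodial checks a bit more concrete for listeners who like to see code, here is a minimal sketch in Python with pandas. The file names, column set and drift tolerance are hypothetical, and this is only one way such profiling might look, not the specific tooling Art’s team uses; the idea is simply to record what is populated, what data types and value coverage each field has, and whether a new batch has drifted from a baseline.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic custodial profile: what's populated, what types, what coverage."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),    # same field formatting / data type?
        "null_rate": df.isna().mean(),     # what is populated, what's not?
        "distinct_values": df.nunique(),   # breadth of coverage inside each field
    })

def null_rate_drift(baseline: pd.DataFrame, current: pd.DataFrame,
                    tolerance: float = 0.05) -> dict:
    """Flag columns whose null rate shifted by more than `tolerance` vs. the baseline batch."""
    shift = (current.isna().mean() - baseline.isna().mean()).abs()
    return {col: round(val, 3) for col, val in shift.items() if val > tolerance}

# Hypothetical usage: compare an earlier extract against today's feed.
baseline = pd.read_csv("orders_baseline.csv")  # assumed file names, for illustration only
current = pd.read_csv("orders_today.csv")

print(profile(current))
print("Drifting columns:", null_rate_drift(baseline, current))
```

Even a simple report like this answers the "do I have quality rocks painted red" question before any modeling starts.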
Kirill Eremenko: 00:28:16
Okay. Got you. It’s especially relevant for enterprises, large companies, where they have so many different sets of data and different storage facilities that nobody knows what kind of data they have. So I think what you’re saying would be very valuable, for a company to go in and do an audit of all the data that they have and put a list together. How long, in your experience, does that usually take?
Art Shectman: 00:28:46
It’s funny, I have a friend who is in the data engineering group at a very large pharmacy company. We have this data driven innovation pattern where very rapidly, in the span of 60 or 90 days, we can go from the idea of a product to implementing this pattern as an offshoot from your core corporate data engineering pipelines and data fabric, and it gets you to an answer and surfaces it through a GraphQL API, to whatever kind of application you want, in a fairly fault- and change-tolerant way. It’s this pattern that’s built to plug into any kind of data, in whatever format it’s in and wherever it lives, mangle it together into a common data model, do the hygiene custodial stuff you need, and get it surfaced to an API so you can start making decisions rapidly.
Art Shectman: 00:29:36
We talked him through this pattern and he’s like, this looks awesome and it’s fantastic, except most of my days right now are spent just trying to figure out why these three fields have a 5% deviation sometimes, because we acquired 47 other pharmacies and one of the 15 pharmacy order fulfillment tracking systems we use didn’t get the memo on some data update to our collective schema from six months ago or a year ago, and I spend four months trying to convince them to just comply or update the way that they’re handling data and make a change to whatever internal systems. That’s the time constant of change at a standard giant enterprise, or the way that these programs get started: they are fighting monumental battles to get tiny changes made.
Art Shectman: 00:30:28
So sometimes, when you’re looking in the data driven innovation space or transformation space where you need rapid turnarounds and you don’t have that much time, you need to build an offshoot to the core data fabric or core data engineering or data ops group, hit the measure that you need, prove that you’ve created business value, and then do the monolithic battle of getting that refactored back into your core operational fabric. It’s an interesting pickle. He was like, “These are super exciting tools. I would love to be able to use them, but my day right now is fighting with someone to update a super old system and change three fields to have the new configuration we want, and it’s been a 90 day discussion just to get it mapped into their backlog, and then that might be another eight months until they actually implement it.”
Art Shectman: 00:31:21
So the reality of some of this stuff is, okay, knowing that that’s happening, how do you build a rapid response system that can just do the translation you need there, or the transformations you need and catalog and maintain those kind of hygiene customization transforms or scripts that deal with moving those things around or refactoring data so you can actually take advantage of it until the ultimate initial system change gets made.
Kirill Eremenko: 00:31:49
Okay. Talk us through this pattern that your friend, or your colleague was very impressed by.
Art Shectman: 00:31:55
Oh yeah. So we leveraged an open source tool from the Apache Foundation called Apache NiFi. It basically has a library of a few hundred pre-built connectors and a pretty rich development community, and it’s sort of the get-data-from-anywhere package, a standard ETL tool, but it’s built on the ZooKeeper pattern for elastic scalability. So when you want to deal with data streams, or you want to scale it up massively or distribute the load of jobs, you totally can. But simple things like retrieving files, talking to APIs, talking to databases, getting some data out, it’s really good at that. Then it’s a good compromise between writing your own custom Python scripting to do all this stuff and some clicky, simple-user ETL package that is frustrating and doesn’t do everything you want to do; it has a really good way of storing these kinds of process chains and process groups and processors for how your data gets moved around, and it has some visualizations of that.
Art Shectman: 00:32:59
So you don’t have to be a super in-the-weeds, hands-on software engineer to be able to modify it or change it, or figure out what’s going on. It has really cool things like resumability. It has data provenance built in, so you can tell what’s happening with your data all along the way. So we leverage that to very rapidly grab whatever data feeds we need from wherever. The default conversation when we meet headwinds is like, look, can you export it to a file and just get it somewhere for us? Great, we’ll grab that and we’ll deal with it. Escalating from there in complexity, for programmatic access or spinning up elastic containers that have to run a certain type of client to get to a certain type of secure data source, whatever, it’ll handle it.
Art Shectman: 00:33:41
So we do that, we get it to a rapidly ingested, core hygiene, scripted, transformed common data model and say, okay, here’s the core data set we need, and then we use a tool called PostGraphile, which is an interpreter of the structures of the ultimate common data model we produce, and it generates GraphQL endpoints. So we have this thing we call the instant API pattern: as long as you’ve changed the data model and registered the changes in the data model, and you’ve designed your core data structure as well, it will pick it up and automatically write the API for you. So the days of maintaining API code for us are over, we just use this pattern where files from anywhere go through NiFi into a Postgres database or equivalent, PostGraphile picks it up, and, dah, you have an API. And then we’ve got other patterns around easy React or D3 dashboarding components or custom workflow development that can talk to that API.
Art Shectman: 00:34:42
As soon as the data is in the database, you can consume it in your dashboards or front end applications or workflow applications. So really we’ve orchestrated the whole thing, we can turn it on in the cloud for people very rapidly; we could show up, and 30, 60, 90 days later, you’ve got a functional data mart dealing with data changes, ETL transforming into whatever you need, and then it’s surfaced through a secure API where you can build applications and consume it to get to value. It’s been super helpful for us. We have this methodology around this dependable innovation thing I was talking about before, where it’s 17 years of collected learnings of how to do innovation properly and make sure we’re mitigating risk for our customers and getting products to done very rapidly.
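As a rough illustration of the consumption end of that instant API pattern, here is a minimal sketch of what querying such a generated GraphQL endpoint could look like from Python. The endpoint URL, the table and field names, and the query shape are hypothetical assumptions for illustration, loosely echoing the loans example from earlier in the conversation; PostGraphile derives the real schema from whatever tables exist in the Postgres database, so actual queries would differ.

```python
import requests

# Hypothetical GraphQL endpoint generated over the common data model.
GRAPHQL_URL = "https://data-mart.example.com/graphql"

# The query assumes a hypothetical "loans" table; field names are illustrative only.
query = """
{
  allLoans(first: 20) {
    nodes {
      loanId
      balance
      status
      reportedAt
    }
  }
}
"""

response = requests.post(GRAPHQL_URL, json={"query": query}, timeout=30)
response.raise_for_status()

loans = response.json()["data"]["allLoans"]["nodes"]
print(f"Fetched {len(loans)} loans from the generated API")
```

Any dashboard, workflow application, or analytical notebook can hit the same endpoint, which is what lets the front end evolve without the API layer being rewritten by hand.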
Art Shectman: 00:35:34
So as we looked at the number of projects we were doing that involve data over the last four or five years, we thought, okay, well, we’ve done all this lean product development, we’ve used the Google Ventures design sprint pattern to help people take ideas and move them into actionable things and vet them with customers within a five day process. So we said, okay, well, what does this look like for a data driven product? It’s like design thinking for data, design sprints for data driven products. How do we take our innovation practices, which are very close to the concept of rapid transformation services? I wrote this thing called, like, Digital Transformation on Steroids at Ridiculous Speed, which is essentially what innovation was typically called. But we take this pattern and we ingest all your stuff, we get it into a common, consumable model, we put it in a database, and you can either run analytical packages and deposit more processed outputs into that data store, or you can spin up either workflow or dashboarding type pipelines to consume it through an API, or it can link systems with the API. But it’s fully baked.
Art Shectman: 00:36:41
So we don’t have to make any of those decisions along the way architecturally, we just turn it on and then we start actually working on, okay, where’s the data? Can we get it moving? Let’s run the analytics. Let’s see if there’s actual business outcome there, or business value there for you. So a bunch of plumbing stuff gets compressed down by leveraging this pattern; it might take you three to six months to develop on your own, but we turn it on in a handful of weeks and we get right to the part where we’re actually trying to iterate on the analytical model, or talk to the data providers or provisioning systems that are sending data to us to correct attributes or expand the data sets, to actually get to value. We try to shrink that whole cycle between the consumption or viewing of value and the provision of data.
Kirill Eremenko: 00:37:22
Okay. Wow, my mind is boiling right now. I’m finally realizing what this actually is, and it’s very interesting. So is my understanding correct that what you guys do is an end to end data pipeline with… And another way of looking at it is like it’s auto ML, but with an added backend of getting everything first into the system, ingesting, ETL, all the preparation, and then auto ML built on top of that. Is that about right?
Art Shectman: 00:38:04
I wouldn’t say it’s auto ML, but the idea of all of the plumbing and getting it in there, dealing with data hygiene, getting it transformed, getting it into a consumable common data model, and then having preselected patterns to be able to get the data out of the database to your analytical frameworks or-
Kirill Eremenko: 00:38:22
So it’s like a pipeline.
Art Shectman: 00:38:25
It’s definitely like a pipeline. The hand-in-hand innovation consulting history that we have, helps us launch a project, not just on the plumbing side, but really like, what are we trying to achieve? What’s the business outcome here [crosstalk 00:38:41]?
Kirill Eremenko: 00:38:41
That’s the part that I thought was auto ML, but it’s more like experience-based, drawn from your experience.
Art Shectman: 00:38:49
Yeah. Also, just experience-based consulting. Sometimes when people are doing these kinds of projects, they’re just not connected to what’s the priority for the business. What are they trying to achieve? What’s the outcome for the business? And once everybody agrees to what that prioritization is and what the outcome is they’re trying to achieve, it’s much easier to have very tactical conversations about, “Hey, listen, I need these 23 fields added from your feed tomorrow.” And they’re like, “Well, I’ll get there in six months.” It’s like, “No, no, no, no, there’s like $30 million of net profit on the line here if this thing works, kind of need them next week.”
Art Shectman: 00:39:26
When you have that kind of clarity of purpose and the value you’re trying to capture, the conversations become much clearer. It’s not like, “Hey, I need 300 tables in four different files of data things added, and just give me the kitchen sink of everything you have and I’ll let you know what I need later, or just give it to me all now because I don’t know what I need.” It’s more like, “Here’s the outcome, here’s the data I need, here’s how it will drive the outcome.” And the conversation efficiency and the implementation efficiency goes up by several orders of magnitude when everybody’s aligned to a key business value.
Kirill Eremenko: 00:39:58
Yeah. I love that you repeat that even in our conversation, value, value, value, you got to be focused on the business value and then your conversations will become easier.
Art Shectman: 00:40:08
Yeah. We like to describe the decision accelerator process as helping businesses capture value from data driven decision-making.
Kirill Eremenko: 00:40:21
I saw that on your website.
Art Shectman: 00:40:23
Yeah, that’s it. The differentiator for us is we teach everybody in the company to focus their attention on business value. You’re not just building technology for technology’s sake, you’re not just doing math or data science for data science’s sake; there’s a purpose, there’s a particular business objective, some business value increment you’re trying to deliver. And we have all these theories in our dependable innovation framework. There’s this cool concept that isn’t ours called the Innovation Valley of Death, where there are three critical factors that make an innovation live. You have time: market window time and some amount of time to do the thing. You have money: funding for engineering talent and whatever you need to make it become real.
Art Shectman: 00:41:11
Then you have political air support: someone somewhere is taking a risk on building this new project and they’re supporting you. They’re buying you time, they’re helping you secure funding, whatever, and you have to shepherd those three resources. So we, as a transformation and innovation provider, look at our job as being a good steward of those resources for our client. And I can’t make quality decisions about what to do for you, where to cheat, where I can move faster, where I can under-build on infrastructure or overbuild on features, or build rigid features that might have to get refactored later. If I don’t understand the business value I’m trying to achieve, I don’t know how to make those decisions, and you introduce inefficiency and waste. If you don’t create defensible differentiated intellectual property that’s actually delivered, so delivered, defensible, differentiated, we call it D3IP.
Art Shectman: 00:42:03
If you don’t do that for your client, and you don’t accumulate enough of that, you don’t reach the innovation action potential, so that the innovative project you are trying to get into market lives and breathes and actually succeeds. It’s like you’re playing Mario Kart and you’re in the time trial, you go around the track, and you don’t get to go to the next lap because you ran out of time. If you run out of the time, political air support, or cash to be able to keep working on the thing, then you don’t get to keep doing it, you don’t actually get it in market. You don’t actually deliver it. And so unless you actually deliver business value and get the thing into market, all this investment in people, data scientists, data infrastructure, data feeds, it’s all tremendously inefficient and wasteful, because you don’t actually get anything back from it.
Art Shectman: 00:42:52
As our consulting practices taught us in 17 years of being innovation stewards, you have to look at what we call the cone of reality dispersion: what is going to happen? A friend of mine, this guy Dave, he used to say, I don’t know if this is his quote or not originally, but job one of leadership is making sure everybody’s going in the same direction, and it’s up to us to make sure it’s the right direction. So we have this cone of reality dispersion: this is where we are now, a point in time, and you can imagine this expanding cone out into the future of likely realities. Well, if you have a lot of uncertainty, then the angle of that cone, the amount of space and possible realities you have to deal with, is vast. So what do you do as an engineer? Well, I have to have a system, I have to have data engineering or the data to deal with all these eventualities, so I overbuild infrastructure.
Art Shectman: 00:43:50
You move really slowly at the start because you have to accommodate all these variable features. So if you know the business value you’re trying to create, what does that do? First of all, it gives you a North Star. It makes sure that everyone’s pointing in the right direction. But then if you understand the business value really clearly, it narrows the cone of reality dispersion, and essentially you’re planning for less likely realities. The more you can narrow that cone, and the more clear you can be about the business value you’re creating, if you think about the volume inside of that cone from the point of origin outwards towards the goal you want to achieve, you can move faster. If each increment of delivered innovation is based on the volume of that cone, you wind up moving faster, the narrower that cone is, and where you can cheat, you don’t have to build that. I don’t need that data file. I just need this one. I don’t need that pipeline. I don’t need those hygiene routines. I just need this one.
Art Shectman: 00:44:47
So when you have everyone at every level, every person on your team down to the most junior engineer, locked and loaded on “I get the business value I’m trying to create, I understand the priorities of my business customer,” they know where they can move faster and they know where to over-index and invest for the future. So that makes the efficiency of the spend of time and cost way better. It also makes sure that the effort we’re providing creates more D3IP, so you accumulate more of the things that are ultimately going to unlock that value that lets your project be successful. We really do train everybody to ask, what are you doing? What’s the business value it’s creating? When we do our sprint reviews and we talk about the tasks and the stories that we’re presenting, it always starts with a business value statement. The business value of this delivered story is blah, here’s what happened, and we present our work.
Art Shectman: 00:45:39
That kind of culture makes sure you’re constantly looking towards the why of what you’re doing from a business value lens, and makes sure that ultimately, for the people that are funding you, you do your part to get them unlocked business value so that they can do their part to keep investing in product development and the innovative things you want to do that then unlock that door.
Kirill Eremenko: 00:46:03
Okay. Got you. Very interesting. So I’ve recently read a book, Lean Startup by Eric Ries, and there’s one path I know to understanding what your business should be doing and what your customers want: through constant experimentation and discussions with customers, and getting their feedback on, okay, this is valuable to me, this is not valuable, and things like that. How does that tie in with your philosophy of, okay, we need to know that business value? Is it like you come into the company and the executive team is supposed to, you expect them to tell you what is the business value that they’re searching for? Or do you have a methodology to help them understand this business value in the first place?
Art Shectman: 00:46:59
Yeah. That’s exactly it. It’s lean data product development. We have this whole data product design sprint process. If you’re looking at it, Google Ventures has a fantastic process that they call the design sprint; they have this thing called the Sprint book by Jake Knapp, I think is his name. It’s fantastic. It lays out the methodology. So we’ve taken that methodology and practiced it a bunch in helping folks go from early stage ideation, to polishing an idea, to vetting that the idea has the best thinking and the expansive thinking of many experts or individual stakeholders, and then building a prototype of that idea that’s a reasonable facsimile of the thing you’re trying to do and testing it with users. That’s a five day process.
Art Shectman: 00:47:51
Oftentimes when we meet people, they’re unclear about what they’re trying to achieve, what their long-term goal is. They’re unclear about the core priorities for their business, what they’re trying to achieve, and we help them refine that vision. They may have an idea, but it isn’t super clear. And so we help them formalize the description of that, help them validate what it is they’re trying to achieve, and then they actually are able to close that loop and say, yeah, we put it in front of the end consumer of the idea, the end customer, and they validated it. They can say, “Yeah, we’re not just in an echo chamber of our own, telling ourselves that it really is a great idea.” And that helps launch the roadmap for lean product development of what you’re trying to do.
Art Shectman: 00:48:33
In agile and lean product development, you’re always trying to deliver the next most important increment of business value, get feedback from your customers or the ultimate business impact with data analytics, get the data back and close the loop and make sure you’re driving the outcome you wanted to drive. So we help people construct, what does that roadmap look like and have a strategy for how that might evolve over a span of three, six or nine months out. But really we have a sharp focus on like, what are we going to do for the next three sprints? What are we going to do for the next six weeks of a roadmap? What are we going to achieve? If we do what we set out to do three sprints from now, what will we have achieved? And then repeat it and do it again.
Kirill Eremenko: 00:49:16
Awesome. Love it. Now that you’ve bombarded me with all this knowledge and I’m like, I’m soaking it in, one thing that would really help me is, can you define data engineering? Now I have all this information, how do you define what is data engineering? Because it sounds like we’re talking about data engineering here.
Art Shectman: 00:49:37
Sure. I think it’s a bit of data product development and data engineering, but data engineering to me is the idea that there are suites of tools that you need to be aware of and need to know how to use, and they help you understand where the data lives, how to get it into the places it needs to be, how to transform it and deal with hygiene and the real-time operational necessities of either batch or stream data, and make sure it’s reliable and high quality. So whether it’s ETL packages or stream processors or giant distributed data repositories or data fabric, data virtualization technologies, it’s about the tools of wrangling and presenting data, ultimately, to analytical applications.
Kirill Eremenko: 00:50:32
Would you call yourself a data engineer?
Art Shectman: 00:50:35
No. I would call myself an engineer. We recruit for a very particular set of skills and aptitude alignment, the core engineering problem solving mindset, the hypothesis testing, insane curiosity and passion around solving problems, and so we describe ourselves as engineers. We had a cool project back in the day where we actually built an off the grid vending machine just because I was a mechanical engineer by training, and someone talked to me about an innovation challenge and I was like, “Yeah, we could build that.” So they sent us a vending machine and we rebuilt it in a low energy CO2 dry ice powered way. We basically treated cold like product. But anyway, we’re definitely engineering product builders.
Kirill Eremenko: 00:51:30
Okay. Okay. Got you. So you defined data engineering. What is data product development? How do you define that?
Art Shectman: 00:51:40
I think the product development side of it starts from what we were talking about before, understanding business value that you’re going to provide and then conceiving of the necessary data partnerships, data sources, data engineering tooling, data pipelines, ETL pipelines, and then also the analytical and data science tooling that you’re going to need to do something new or differentiated or to help people make decisions. And if you really think about data driven product development or data product development, you have to go beyond that. It’s not enough just to surface the insight, it’s also about, okay, let me understand the infrastructure and preexisting systems you have in your company. Let me understand the workflow of the people who are interacting with those systems, and what’s possible and what’s not possible, and then let me understand how I’m going to get them to adopt the insights that are coming out of this thing.
Art Shectman: 00:52:33
It’s really the full life cycle of, how do you go from early stage origins and ideas to the underlying data fabric and data engineering substrate that allows you to do data science and analytical decision making, data driven decision making, but then also the workflow, the last mile: okay, we know what decision we should be making, or we know what actions we should be taking, how do we go from that point to an organization that complies with that? And there’s a whole bunch of theories we have around essentially the input impedance of an organization to change and signal, and the output impedance, or the kind of output frequency, of insight generation. And what we’re noticing across the spectrum is that companies are heavily investing in the higher and higher orders of insight and super high frequency output recommendations and insights, but they’ve under-indexed on and don’t understand terrifically well the input impedance and the capable frequency response their organization can have.
Art Shectman: 00:53:45
If I’m someone who has an organization of stores, a retail presence or people or knowledge workers, but I only recompile their model and retrain them once every six months, and the insight generation I’m doing for the outcome I want to have has a frequency of two or three insights a day, adjustments that need to be made per day, those are valueless insights, because the input impedance of the mechanism that needs to turn those things into value, that needs to change behavior, is a mismatch. So if you fail to consider how you ultimately get the insight into practice, the workflow you need to develop to actually consume the insight and get it into action, that’s a huge problem. So when I think about data driven product development and how we take all our lean product development and innovation experience and apply it to these data ecosystems, that’s a core differentiator: being able to go end to end and say, it’s not just, how do I get data from A to B? It’s, what is that data for? How does it create value for my organization? And how will my organization act on the insight generated from that data?
Kirill Eremenko: 00:54:57
Wow, that’s really cool.
Art Shectman: 00:54:59
Yeah. It’s been a lot of fun.
Kirill Eremenko: 00:55:03
Nice. Got you. Thank you. You also say that we’re engineers, we’re not the mechanics. What do you mean by that?
Art Shectman: 00:55:13
If you want a person who knows, I don’t know, the subtle file partitioning schemes in somebody’s cloud for how to super tune their Hadoop cluster, there’s a person for you and that person is awesome and they have a wealth of experience and they’re super skilled in that thing. So in that sense, you need that person with that point solution skill and tons of experience around that thing to get into the super subtle things that only they will know. And that’s not our strong suit, our strong suit is the end-to-end thinking of how you get to value and then the speed with which we can get you to the origins of that value. So from idea to MVP, that’s where we shine.
Kirill Eremenko: 00:56:02
Okay. I want to get a feel for the kind of technical skills involved. You already mentioned innovation and curiosity, more like qualities that a person needs to possess, but I also want to get a feel for the technical skills that are required in this space, whether it’s for somebody who’s trying to do the same analysis that you do but on their own business, or maybe somebody who wants to be employed in a space that is similar to what you do. What are some of the technical skills that you look for in people you hire into either data engineering or the data product development side of things?
Art Shectman: 00:56:47
The first set of skills I think are around, I would say habits and practices of lifelong learning. When we interview for passion and curiosity, that’s what we’re looking for, is, are you ingesting new tech? Are you reading? Are you experimenting? Do you have an active GitHub account? So attribute wise, that’s what we’re seeking. From a pure technical skills standpoint, we look at things like code structures, coding hygiene, do you understand SQL in the backend? Do you have other data storage experiences? Do you understand unit testing? Do you understand DevOps? Do you understand containerization and orchestration? Do you understand how to use source control? Simple computer science 101 concepts.
Art Shectman: 00:57:45
I think, at least on the data driven product development side, we look for a little bit richer understanding of data modeling and the way that you structure questions of the data, and how do you look at a domain, a business domain, when you’re talking about business value, and how do you map that to a data domain? How do you properly do that data modeling so that you set yourself up for subsequent success? We found that at the outset of our projects, we over-index on the data model side of things and really nailing the domain model and the data model upfront. Most of the frustrations you’ll find in a long running software project are that you mistook the domain model upfront and you’ve then learned through painful code missteps that you messed up that domain model. And then you wind up having to write very complicated code or pay a refactor tax to correct for it later.
Art Shectman: 00:58:48
A lot of complexity is born from a misunderstanding of the core domain model upfront, so you don’t lay down the initial core data model you need properly. So we try to really understand that clearly upfront. So I would say just some good practices around how to think about data modeling and storage structures and things like that. Those are some great skills to start with.
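For readers who want a concrete picture of what nailing the domain model upfront can look like, here is a small hypothetical sketch in Python. The entities and fields are invented for illustration, loosely echoing the lending example from earlier in the conversation, and are not Elephant Ventures’ actual models; the point is only that when the domain is made explicit early, the data model and the code on top of it stay simple.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class LoanStatus(Enum):
    PERFORMING = "performing"
    NON_PERFORMING = "non_performing"

@dataclass(frozen=True)
class Borrower:
    borrower_id: str
    legal_name: str
    country: str

@dataclass(frozen=True)
class Loan:
    loan_id: str
    borrower: Borrower
    principal: float
    status: LoanStatus
    originated_on: date

def non_performing_exposure(loans: list[Loan]) -> float:
    """Total principal tied up in non-performing loans; trivial to ask once the domain is explicit."""
    return sum(loan.principal for loan in loans if loan.status is LoanStatus.NON_PERFORMING)

# Hypothetical usage with made-up records.
loans = [
    Loan("L-1", Borrower("B-1", "Acme Corp", "US"), 1_000_000.0, LoanStatus.NON_PERFORMING, date(2019, 5, 1)),
    Loan("L-2", Borrower("B-2", "Globex Ltd", "UK"), 250_000.0, LoanStatus.PERFORMING, date(2020, 2, 14)),
]
print(non_performing_exposure(loans))  # 1000000.0
```

When the entities, relationships and statuses are pinned down this explicitly at the start, the database schema and the analytical queries tend to fall out naturally, instead of being patched later at refactor-tax prices.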
Kirill Eremenko: 00:59:10
Awesome. Thank you very much. I know we’re running out of time. Do you have a few more minutes?
Art Shectman: 00:59:17
Yeah, about five more minutes.
Kirill Eremenko: 00:59:19
Awesome. I just have one final question for you. What is the future bringing for us? If I’m a data scientist or I want to be a data scientist, what is something that I should look out for in this space of data engineering, data product development, what are some of the things that are coming in the next three or five years?
Art Shectman: 00:59:47
I think you’re going to see more work, or more tools, that democratize AI and machine learning in some sense, and that sounds like a soundbite that everybody’s saying. But the idea is that some of the really difficult parts of data science will be subjected to tooling builds and auto generation abilities, where some of the entry-level, simpler data science tasks, the kind of clustering tasks and the sort of simple predictive model training type stuff, get easier. There’s a lot of different ways that you can go about doing those things. And as it becomes a little bit more simple, the tools have improved quite a bit, and so things that used to be in the domain of data scientists only, or applied math people only, will begin to be tenable or accessible to a data engineer or someone who’s DevOps familiar. DevOps will expand into MLOps.
Art Shectman: 01:00:57
As that stuff becomes better supported with intuitive tooling and training, more engineers will be able to take advantage of it with less understanding of the math. So I think those are things that are coming. I would argue that with great power comes the ability to mess things up epically. So in that tooling generation, we’re trying to continue to invest in upskilling our folks on the math side, just because I think you’ll be able to do it, you’ll be able to turn something on, but you won’t understand the why, you won’t understand what it’s actually doing, and I think people will, in some sense, misuse some of the learning models that are out there, or the tooling that is afforded to them, because they won’t understand the inherent deficiencies or carve-outs of like, hey, don’t do this, or don’t use it for this kind of data set, or don’t make this kind of prediction.
Art Shectman: 01:01:57
I think as the tools move people away from the math, then the understanding gaps will grow and the ability to make missteps with more powerful tools will increase. So I would just say like, while the tools make your life easier, still invest in the data science learning paths and really understanding what you’re doing.
Kirill Eremenko: 01:02:19
Fantastic. Thank you very much Art. A great note to end on, really appreciate your time here. Can you please tell us where our audience can find you, follow you, or learn more about Elephant Ventures?
Art Shectman: 01:02:31
Yeah, elephantventures.com, or if you’re fancy, elephant.ventures.
Kirill Eremenko: 01:02:38
Awesome. Got you. And it’s okay to connect with you on LinkedIn?
Art Shectman: 01:02:41
Yeah, of course.
Kirill Eremenko: 01:02:43
Awesome. Fantastic. Thank you very much Art, it’s been a pleasure.
Art Shectman: 01:02:46
Yeah. Thanks Kirill.
Kirill Eremenko: 01:02:52
So there you have it everybody. Thank you so much for being here today and spending this hour with us. For me, I found it interesting how it took me some time to absorb all that information and to understand the importance of what Art was sharing and put it all into a more or less coherent picture in my mind, and then it clicked. So I hope it clicked for you as well, and maybe even sooner than for me. My favorite part, in addition to all the things I learned about what data engineering and data product development are, how they tie in with data pipelines, and why it’s important to have this rapid process of innovation and cut it down from nine months to 90 days, in addition to all those things, my favorite part was how Art kept repeating that clarity of purpose and capturing value through data are important. It’s important to have that clarity of why we’re doing it, to narrow down that cone of reality dispersion.
Kirill Eremenko: 01:04:02
Even on this podcast, he repeated that probably two, three, maybe four times. I can just imagine how, when Art is doing a project for a client, he constantly repeats that too, whether it’s at the executive level or to the people on the ground, the data scientists, everybody in the business, the data custodians, how he constantly repeats this data value message: How are you getting value out of this data? What kind of value are you driving? Because if you have that in mind, it really makes conversations easier. So it’s almost like a subliminal message that just travels across his work, I can imagine it that way. I’m grateful to him for reminding us on this podcast that deriving value out of data is important. If you prioritize that, if you think about, all right, what is the North Star, that will reduce the eventualities, the quantity of eventualities, that you have to be prepared for as a data engineer or data product developer, and that speeds up your process. It’s a very cool way he put that into this equation.
Kirill Eremenko: 01:05:15
That was my favorite part, I’m sure you have lots of great takeaways from here as well. And as always, you can find the show notes at www.superdatascience.com/417. That’s www.superdatascience.com/417, where you’ll find any materials we mentioned on this episode, the transcript, plus the URL to Art’s LinkedIn and Elephant Ventures. Yeah, if you enjoyed this podcast, and you know somebody who’s into data engineering or data product development, or is looking to transition from just product development into data product development, or looking to transition from DevOps into data operations, into data engineering, send them this podcast. It might help them understand what this space is all about and how it is different to what they’re currently doing. Very easy to share, just send them the link, www.superdatascience.com/417. And that’s us for today, I look forward to seeing you back here next time. Until then, happy analyzing.