71 minutes
Data Science, Deep Learning, Machine Learning

SDS 249: Diving Into Data Science Consulting

Michael Segala is again with us to share another three case studies and go in depth about the boom of work SFL Scientific is doing.

About Michael Segala

Michael Segala, Co-Founder and CEO of SFL Scientific, has years of experience leading projects that apply data science and mathematical modeling to solve complex problems. He specializes in working with corporate transformations involving data strategy, growth, automation & cost reduction, performance improvement, and organizational effectiveness through machine learning and predictive analytics. In addition to his client leadership role, Michael leads SFL’s effort to engage industry leading partners such as NVIDIA, Microsoft, and Amazon Web Services. He advises clients in healthcare, life sciences, pharmaceuticals, financial services, consumer products, retail, telecommunications, energy, and transportation.

Overview

Michael Segala is back again! Michael and his team focus on the transferability of data science skills across verticals and even across industries. Michael thinks that rinse-and-repeat work focused purely on algorithms is one of the least interesting applications of data science. If you’re a company with a problem, how do you use data and data science to solve it? That’s what’s interesting.

SFL Scientific has grown to over 30 people at this point. The great news is that all new sales and business hires at SFL Scientific are actually required to take courses over at SDS, which I, of course, love to hear. Michael has them become familiar with data science so they know what they’re selling. If you want to work in data science but not as a technical employee, it’s critical to have the vocabulary and a baseline of knowledge. On the technical side of the business, Michael has data scientists, for whom versatility is important, and engineering folks who deploy solutions at scale. One player throws the ball, and another has to catch it.

So, what’s the ratio of time spent between data science and the implementation of solutions? As a consulting company, they don’t have a ton of time, and a lot of their projects are highly innovative. Michael aims to run a first POC (proof of concept) in 4 to 12 weeks. Half of that time is spent exploring the data, a quarter is spent modelling it, and the last quarter is spent walking the client through the model. Productionizing the solution could then take anywhere from a couple of weeks to a year, depending on the company’s IT infrastructure, data, security, and compliance requirements. There’s a massive hidden complexity to data science projects that executives often fail to consider.

As for his case studies, Michael has been working for a year or so with a client who’s bleeding edge from a medical imaging perspective. Can an algorithm detect a tumor more accurately than a radiologist? Michael enjoys these types of projects because they reduce variability in the medical profession and free up time for doctors to spend with patients. The team uses deep learning and experiments with neural network architectures to analyze medical images for clients. Another piece of work he does is with federal clients on airport security. They also use data science to help alleviate bottlenecks and other logistical problems for companies, whether at production, delivery, pick up, or other parts of the funnel. SFL Scientific also does work in energy, analyzing data to find anomalies so that people can’t fool the system to lower their bills, as well as working with internal devices to understand energy disaggregation.

SFL Scientific’s work spans a wide range of industries, and the best way to get the full breadth of it is to visit their site.

In this episode you will learn:

SFL Scientific & their consulting ideology [7:50]
Roles in SFL Scientific [14:35]
Ratio between data science and implementation [22:44]
Michael’s case studies [26:00]
The future of data science according to Michael [59:00]

Items mentioned in this podcast:

SFL Scientific
SDS 065: How Data Science brings value through Consulting Firms with Michael Segala

Podcast Transcript

Kirill Eremenko: This is episode number 249 with the CEO and Co-Founder at SFL Scientific, Michael Segala.

Kirill Eremenko: Welcome to the SuperDataScience Podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. And each week we bring you inspiring people and ideas to help you build your successful career in Data Science. Thanks for being here today and now let’s make the complex simple.

Kirill Eremenko: This episode is brought to you by our very own Data Science Conference, DataScienceGO 2019. There are plenty of Data Science conferences out there. DataScienceGO is not your ordinary data science event. This is a conference dedicated to career advancement. We have three days of immersive talks, panels and training sessions designed to teach, inspire and guide you. There’s three separate career tracks involved. Whether you’re a beginner, a practitioner or a manager, you can find a career track for you and select the right talks to advance your career. We’re expecting 40 speakers, that’s four zero. 40 speakers to join us for DataScienceGO 2019 and just to give you a taste of what to expect, here are some of the speakers that we had in the previous years. Creator of Makeover Monday, Andy Kriebel. AI thought leader Ben Taylor, Data Science influencer Randy Lao, Data Science mentor Kristen Kehrer, Founder of Visual Cinnamon Nadieh Bremer, Technology Futurist Pablos Holman and many, many more.

Kirill Eremenko: This year we will have over 800 attendees from beginners to data scientists to managers and leaders. There’ll be plenty of networking opportunities with our attendees and speakers and you don’t want to miss out on that. That’s the best way to grow your data science network and grow your career. And as a bonus there will be a track for executives. If you’re an executive listening to this, check this out. Last year at DataScienceGO X, which is our special track for executives, we had key business decision makers from Ellie Mae, Levi Strauss, Dell, Red Bull and more.

Kirill Eremenko: Whether you’re a beginner, practitioner, manager or executive, DataScienceGO is for you. DataScienceGO is happening on the 27th, 28th, 29th of September, 2019 in San Diego. Don’t miss out. You can get your tickets at www.datasciencego.com. I would personally love to see you there, network with you and help inspire your career or progress your business into the space of Data Science. Once again, the website is www.datasciencego.com. And I’ll see you there.

Kirill Eremenko: Welcome back to the SuperDataScience Podcast ladies and gentlemen, I’m super excited to have you back here on the show because we’ve got a returning guest. For the second time round, Michael Segala is joining us. He is the CEO and Co-Founder of an AI, Data Science, Machine Learning consulting firm based out of Boston but operating globally called SFL Scientific.

Kirill Eremenko: Previously we had a super exciting discussion with Mike that was back in the middle of 2017 and it was episode number 65 on the SuperDataScience podcast if you missed it and today Mike is back with even more case studies and more inspiration for you guys in the space of data science. Here are some things that we talked about. Just as last time, Mike shared three case studies and of course they were different this time. This time we talked about healthcare imaging and we delved deep into neural networks and the architecture and design of neural networks.

Kirill Eremenko: Then we talked about logistics and supply chain and the challenges there and we talked about things such as bottlenecks and routes and how machine learning can help in those spaces and what kind of projects they’re doing in that industry. And we talked about energy and in the space of energy, Mike actually gave us two case studies, and some of the things that you’ll learn there are dealing with unbalanced data sets, creating fake data sets, unsupervised learning for anomaly detection and supervised learning with small data sets and in general, this challenge of small data. Those are just a couple of things that you’ll learn, there’s plenty, plenty more that Mike shared, including an overview of the world of Data Science projects and Data Science Consulting in general, which I think you will find extremely valuable and why companies in 2019 and 2020 might actually start defunding artificial intelligence and machine learning and what we can do about it.

Kirill Eremenko: As you can imagine, this is going to be a very, very powerful podcast, can’t wait to jump into it. But before we do, I wanted to give a shout out to our fan of the week. And this one is from Ronnie who says, “If you have an interest in programming automation, big data, machine learning, etc., this is a must listen, focuses on data science, analytics, etc., in the corporate world.” Thank you very much Ronnie. Very, very inspiring to hear that.

Kirill Eremenko: And for those of you out there who are listening to the show and you haven’t yet left a review, then head on over there on your podcast app or just go to iTunes and leave a review for the SuperDataScience podcast, that would be just amazing. I’d really appreciate it because I love reading your reviews. And with that said, I’m super excited about today’s episode. And without further ado, for the second time round I bring to you Mike Segala CEO and Co-Founder of SFL Scientific.

Kirill Eremenko: Welcome back ladies and gentlemen to the SuperDataScience podcast. Super excited to have you on the show because we’ve got a returning amazing guest with us here. The one and only Mike Segala from Boston’s SFL Scientific. Mike, welcome back. How are you doing today?

Michael Segala: I’m doing great. Thanks for having me back. It’s a pleasure to talk again.

Kirill Eremenko: The pleasure’s all mine, the podcast we had last time was an amazing success and totally totally rocked it, so looking forward to having another one today. How’s the weather in Boston these days?

Michael Segala: Well, it’s late February, so we’re cold and windy, but not too bad snow this year, I can’t complain too much, but not nearly as nice as where you’re at in the world.

Kirill Eremenko: Yeah, man, I’m in Tasmania now and like I was mentioning before, it was freezing last night. It’s my first time in Tasmania, and literally the day I got here, they have like the worst wind and weather ever and it’s freezing cold. But it’s a nice [crosstalk 00:07:05]

Michael Segala: That’s how it always is, right? You go on vacation, and they’re like, “Oh, this is the worst weather we’ve ever had.” There’s always something but it keeps it fun, right?

Kirill Eremenko: Yeah, that’s true. A bit of variety. That’s right. That’s exciting. It’s been over one and a half years since we last spoke, the previous episode, by the way for our listeners, if you haven’t heard it, highly recommend checking out. Mike shares amazing case studies. It’s episode number 65, so you can find it at www.superdatascience.com/65. With Mike, it was over one and a half years ago. What’s been happening since then?

Michael Segala: A lot. Just to kind of recap real quick for the audience. I’ve run SFL Scientific, we’re a Data Science consulting company. Unlike a lot of these traditional product companies or vendors, we’re purely focused on really attacking this Data Science market from a purely kind of consultative standpoint. Truly kind of service oriented. What that means for us is we get to have a lot of really smart folks on staff that get to work across a really far ranging kind of sets of clients and topics across the data science and data engineering space.

Michael Segala: For us we’re really just continuing to grow and move with the market. As everything continues to mature and money gets fed into this AI market, SFL is taking a really nice ride along with them and continue to kind of execute on really interesting innovative projects and just grow the business. It’s been a great time and it’s kind of very similar to yours Kirill. We both kind of started the companies a couple of years ago in the beginning of this phase and are doing great stuff so congrats to you as we’ve been taking this ride a little bit together here.

Kirill Eremenko: Thanks for that. Thanks mate. Yeah, I can only say the same. It’s exciting to see the explosive growth you had. I sometimes go on the SFL Scientific website and even if you’re not a business owner, if you’re a data scientist or an aspiring data science manager, I highly recommend checking out sflscientific.com. I just go there for inspiration sometimes, you go to solutions or our work. I like how you have this grid of different industries you’ve worked in, from advertising, marketing, agriculture, insurance. And then like I click on one of them and I’m like, “Oh, that’s really cool.” What have you done in agriculture, satellite imaging, resource management, crop forecasting, livestock monitoring. Those are some really cool things. There’s a ton of industries you guys have worked on. It’s crazy. How do you keep up with all these projects?

Michael Segala: Well, keep up with the project is different than executing. Keep up is a lot of late nights and email exchanges. But everybody on this podcast listening is pretty educated at least from a data science perspective, and as we know, algorithms, data sets, they all kind of boil down to the same fundamental data types and challenges. What do we have fundamentally? We have images, we have time series data, we have text data and a couple other types of fundamental modalities of data. And what you can start doing is thinking about, all right, if I had an image and this image came off of an MRI machine or a satellite image or even a camera in my house, how would I classify that image? Or how would I segment that image?

Michael Segala: And if you’re really good at thinking through the fundamental challenge behind capturing, collecting and storing and then solving the problems of those data types, you can kind of abstract away some of that industry vocabulary and difficulties that very industry specific folks focus on. What we really try to focus on as a company is saying, “Hey, I want to hire the best in class folks at computer vision or time series analysis or NLP analysis.” And arm them with that kind of 95% of the knowledge to solve all problems. And then when we talk to somebody from Ad tech or from Pharma or from finance, being able to slot in and solve an NLP problem or computer vision problem is kind of very, very similar and almost a rinse and repeat because you have that core knowledge. And then you can really apply it across all these verticals very, very easily.

Michael Segala: That’s the way that we attack the market. Now granted that’s not for everybody, but we find that to be extremely successful and we really had no issues with that so far.

Kirill Eremenko: That’s amazing. I love that you mentioned because we talked many times with many guests about the transferability of data science skills. That’s why I personally enjoy Data Science which I think it’s such a cool industry to be in is because you develop those skills as you mentioned, and then you can take, you can separate in your mind the data science side of things and the domain knowledge or the business knowledge and you can take your data science skills and transfer them into different areas and very quickly graft that domain knowledge and consulting is like one of the places of course where that is the most evident.

Michael Segala: Yeah, absolutely. And I mean for us, we don’t think of data science as a point position around algorithms. I actually think that’s the least interesting thing going right now in data science. Because when you think about data science, all these algorithms, take anything off the shelf, your XGBoost models, your TensorFlow models, right? These are all becoming very commodity and it’s almost trivial at this point to take some data, run it through XGBoost and get a prediction. Literally if that takes you more than 20 minutes, if you’re just kind of doing rinse and repeat, you don’t know what you’re doing.

Michael Segala: When we’re thinking about consulting, it’s so much more than this kind of very singular thought around algorithms. We like to take that very holistic approach of saying, “If you’re a real organization who needs to solve a real data problem, how do you do that?”

Michael Segala: And the first way that you do that is as a data scientist to take a big step back and think about the strategic vision here, what’s the real business use case that you’re thinking about? How would you solve this? What’s that ROI look like? What do I actually get at the end of these algorithms? And you really thinking through not just the sciencey algorithm stuff, but also the business stuff. And then also thinking about, well, how would I engineer that solution? How would I do that in a kind of scalable, secure environment where I can now go in productionize this thing.

Michael Segala: And kind of having that, and coding that around these algorithms, is really where that interest lies. And again, the reason that I’m saying that is because if you’re a consultant and you want to get into this space, or if you really want to be a great data scientist, what we find is that these very simple algorithms are going to be commoditized. If you want to stay above that curve, you have to really think about that larger picture. And that’s also very repeatable across industries. All of these themes make you extremely innovative and able to be used across all these different problem statements. It just kind of keeps going and going.

Kirill Eremenko: Yeah, totally agree. And you mentioned just before the podcast that you have grown to over 30 people. What kind of roles do you have on your team? Is like everybody doing data science projects end to end or do you have some people specialized in certain types of industries, certain types of areas or parts of the data science project?

Michael Segala: We have two very different groups of teams, first is more of the sales and the business folks that sit under me, but we’ll put them further aside for the moment. They have their great roles, they do their things, but not really for this podcast, actually let me just stop there for two seconds. I actually make all of my sales and business people take your courses.

Kirill Eremenko: Oh no.

Michael Segala: I swear to god. As their first two weeks or three weeks of their introduction, they have to take your, I think two of your courses as their introduction to data science.

Kirill Eremenko: Wow.

Michael Segala: Everybody [inaudible 00:15:30]

Kirill Eremenko: Thanks, man, that’s so exciting to hear.

Michael Segala: Because it’s a great resource. It’s an absolute great resource and I feel that everybody on my team, no matter if you’re a sales folk or if you’re whoever else, you have to be a data scientist at least at some novice level. You have great resources, so we really appreciate them.

Kirill Eremenko: That’s awesome. It’s so exciting to hear as well, and I think this goes to show the value of data culture, or data driven thinking. On one hand, of course, it’s about knowing your product and what you’re selling. But on the other hand, this way your team as a whole can develop this data driven mindset. If a salesperson is talking to a client, they might be like, “Oh, this might be helpful.” XGBoost or Decision Trees, Random Forests. Really, really cool. Thanks man, you put a huge smile on my face.

Michael Segala: I’ll answer your other question but I want to get back to this as well because I think a lot of folks that listen to your podcast could be from that sales and business side of the world. And at least for me, right, in my team, I run that department myself … I’m a physicist. I’m a scientist. I’m a data scientist but now basically I’m a sales guy and I have a very core belief, exactly parroting what you’re saying, that if you want to sell data science, if you want to be in that role of data science but not a technical employee, it is phenomenally critical that you have that same vocabulary. You understand the real challenges and you can hold at least a five minute conversation where you’re actually conveying real knowledge about the topic.

Michael Segala: Otherwise you just kind of look silly compared to people who know what they’re talking about. It’s extremely crucial to have a real base line in there. But anyways, putting that to the side for the moment, on the technical side of the house, we usually have two types of individuals. One is our data scientists and for our data scientists, we look for people who are generalists but extremely gifted generalists. I need you on one day to be able to solve cutting edge 3D medical imaging projects and then the next day doing NLP work. We tend to not hire folks who only know how to do one thing because we’re a consulting company. That project might be up in six weeks and then you’re off to something else.

Michael Segala: Our goal is to hire really well rounded folks, but we tend to double down a bit in the healthcare market, healthcare, Pharma, biotech. It’s really nice when people have that kind of general backgrounds, physics, biology, chemistry and things of that nature. But really bright individuals, that kind of know the data science space. That’s kind of the one group of team. And then the other side is more of our engineering folks, we call them like AI engineers because they’re not like day-to-day sysadmin folks or SQL people. They’re the ones kind of deploying these solutions at scale, all the way from very large petabyte size image loads to realtime data transfer and kind of model deployments.

Michael Segala: We tend to have those two kind of engineering and data science teams, but there are huge overlaps. Both can kind of pair with each other and do a really nice job. That’s how we set up the teams internally.

Kirill Eremenko: Gotcha. What’s the split approximately between the data scientists and AI engineers?

Michael Segala: I would say 70, 30 maybe 70% data science, 30% engineering.

Kirill Eremenko: Gotcha.

Michael Segala: Give or take, something like that.

Kirill Eremenko: Am I understanding this correctly that you not only deliver the insights and solve the problem for the client using data science, but you also help organizations actually deploy their solutions into production and actually have those models working on an ongoing basis, hook up all the tools and make sure that everything’s working right. Is my understanding correct?

Michael Segala: Absolutely. And I think if you don’t do that, you’re falling very short of what it actually means to do data science. Data science isn’t running a POC on your laptop, with a CSV file, it could be, but for most real organizations, they need something much more robust than that. That can fit into a real process and kind of take in real data and kind of show the results and kind of fold into more of their business process. It’s really critical for us, obviously the first phase in most projects is very simple, take this data, show me that you can predict something, great, show it in a sandbox environment. And then what we really need to transition them into is where most organizations fall short, and why most data science projects fail is not because the data’s no good or because the models are no good.

Michael Segala: It’s actually because the folks don’t know how to integrate these things and productionize the code. That’s a huge problem we see in the industry. We really try to be thoughtful, when we kind of prove out the POC to show them and work with them to deploy it, ’cause unless you deploy it, it’s really a failed project. Absolutely. It’s extremely important.

Kirill Eremenko: It’s kind of like a follow through, like getting things done. I imagine it as American football, imagine one player throws the ball and the other one has to catch it, the data science side of things, that’s throwing the ball. How fast you can throw it, how accurately you can throw it, how you can avoid other players jumping at you when you’re throwing it, all that stuff. But if there’s nobody to catch it, then where’s that ball going to go, is just going to land by itself.

Michael Segala: It’s going to hit somebody in the back of the head. That’s all it’s going to do.

Kirill Eremenko: Exactly.

Michael Segala: But I agree. Any analogy you want to make. Absolutely. The fact that we still don’t have a culture in the Data Science space around deployment and productionization, I think is one of the biggest issues that I see. And one of the biggest risks of folks not investing longer term, kind of in their data strategies, these kind of failed POCs. And a lot of that really just kind of comes down to integration and productionization.

Kirill Eremenko: When you say POC, what do you mean? Just so we’re all in the same boat?

Michael Segala: Yeah, sorry. A POC usually is take any, I don’t know, take a problem, pick a use case, whatever it happens to be. Predicting churn for my customers, pick something … A POC is normally, here’s 10,000 historical customers, here’s the data. Show me that you can predict with some given level of probability that these customers can churn, pretty straight forward. They give it to you on a CSV file, you fire up XGBoost within a couple hours you could probably do something. You need to show the business that, that is validated and you can do it. But now you need to then productionize this by saying, “Okay, now I have real customer data coming in every day, I’m collecting it, I’m adding external information. How do I integrate this code and algorithm into my actual workflow?” That kind of [inaudible 00:22:28] has the POC into more of a real kind of implementation phase. That make sense?
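
The churn POC Michael describes can be sketched in a few lines. This is a minimal illustration, not SFL Scientific's method: the customer data is synthetic, the two features (`months_active`, `support_tickets`) and their effect on churn are invented for the example, and a hand-rolled logistic regression stands in for XGBoost so that the sketch runs with only the Python standard library.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible

N_CUSTOMERS = 10_000  # matches the "10,000 historical customers" in the example


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_customer():
    """Return (features, churned) for one synthetic customer.

    The relationship below is an assumption made up for this sketch:
    short tenure and many support tickets raise the churn probability.
    """
    months_active = random.uniform(1, 60)
    support_tickets = random.randint(0, 10)
    true_logit = 1.0 - 0.08 * months_active + 0.4 * support_tickets
    churned = 1 if random.random() < sigmoid(true_logit) else 0
    # Scale both features to roughly [0, 1] so gradient descent behaves.
    return [months_active / 60.0, support_tickets / 10.0], churned


def fit_logistic(rows, labels, lr=0.5, epochs=50):
    """Fit logistic regression with plain batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def churn_probability(model, months_active, support_tickets):
    """Score one customer with the fitted model."""
    weights, bias = model
    x = [months_active / 60.0, support_tickets / 10.0]
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


customers = [make_customer() for _ in range(N_CUSTOMERS)]
rows = [features for features, _ in customers]
labels = [churned for _, churned in customers]
model = fit_logistic(rows, labels)

# Two hypothetical customers: a new, ticket-heavy one and a long-standing quiet one.
print("at-risk customer:", round(churn_probability(model, 2, 8), 3))
print("loyal customer:  ", round(churn_probability(model, 48, 0), 3))
```

Everything above is the laptop-and-CSV stage. Productionizing it would mean replacing the synthetic list with the daily customer feed and wiring `churn_probability` into the business workflow, which is the step Michael says most projects stumble on.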

Kirill Eremenko: POC is basically proof of concept that you can get it done.

Michael Segala: Proof of concept.

Kirill Eremenko: Okay. Gotcha. All right. Well, this would be interesting to hear from you, who’s in the field. You guys work on tons of these projects. What would you say roughly is the estimated time ratio between the data science side of things, doing the work and preparing the model, and the productization of the model? How would you split the time required by your team between those two components?

Michael Segala: It’s a very open ended question and it depends phenomenally on the project, obviously. You have to realize that for us we tend to work on more innovative type of projects, because a lot of these low hanging fruit problems, internal data science teams are doing, or you can call some API to do it, so you might only need to bring us in for some of the bigger type of stuff. A lot of our projects are more kind of that cutting edge bigger projects. For us, I tend to try to run a first POC in the matter of say 4 to 12 weeks, give or take that timeframe, if it’s fast 4 weeks, if it’s a little longer 12 weeks, in that probably half of that time is spent getting the data, thinking about it, doing some kind of exploratory analysis, cleaning it, playing with it.

Michael Segala: Maybe a quarter is spent modeling it and then the last quarter is spent explaining to the client, walking it through, understanding it, validating it and things of that nature. The first half of the project, maybe only half of the time is spent with the algorithms. And then I would say to productionize that, I mean that could itself take anywhere from a day to a year. It really depends on the business and how complex their IT infrastructure is, how complex the data is. If there’s security issues, if there’s compliance issues. That’s when you get into the whirlwind of just craziness. It really depends.
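
As a rough worked example of that split, the sketch below turns a POC length in weeks into time per phase. The 50/25/25 proportions and the 4 to 12 week range come from the conversation; the function name and phase labels are assumptions made for illustration.

```python
# Illustrative only: the proportions are the rough figures quoted above,
# not a fixed rule, and the phase names are labels chosen for this sketch.
PHASE_SHARES = {
    "data_exploration": 0.50,    # getting, cleaning and exploring the data
    "modeling": 0.25,            # building and tuning models
    "client_walkthrough": 0.25,  # explaining and validating with the client
}


def poc_time_split(total_weeks):
    """Split a POC's duration (in weeks) across the three phases."""
    if total_weeks <= 0:
        raise ValueError("total_weeks must be positive")
    return {phase: total_weeks * share for phase, share in PHASE_SHARES.items()}


for weeks in (4, 12):  # the fast and slow ends of the quoted range
    print(weeks, "weeks ->", poc_time_split(weeks))
```

For a 4-week POC that works out to 2 weeks on the data and 1 week each on modeling and the client walkthrough; for 12 weeks it is 6, 3 and 3. Productionization sits outside this estimate entirely.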

Kirill Eremenko: Wow. Sounds like that part is the more uncertain one from a day to a year. Well it’s lots of uncertainty there.

Michael Segala: Yeah. I mean, I’m being a little heavy handed with the day, call it a couple of weeks. But yeah, I mean it could be very quick to a very arduous task.

Kirill Eremenko: Okay. Well that’s good to know. And that also shows that there’s a massive hidden complexity involved in data science projects that a lot of executives don’t consider. If you have a data science strategy, that’s something you should have as part of your data science strategy. If you’re just developing your data science strategy, not only should you include things like, do you have the data, do you have data silos, how you’re going to break those silos, what kind of team are you going to hire or who are you going to approach about these projects? What kind of tools are you going to be using for these projects, but also you need to include this whole productization of the models.

Michael Segala: 100% yep. Absolutely.

Kirill Eremenko: All right, let’s shift gears a bit. That was an awesome intro and like awesome overview of the world of data science consulting and just in general data science projects. Let’s talk about some case studies. Last time you shared three incredible case studies on the show. In fact they had multiple components. I would say even more than three case studies. Do you have any new exciting things that you’ve been working on for the past one and a half years that you can share with us?

Michael Segala: I can and I should have remembered which ones I shared, but I’ll pick three ones and they will probably be different, and if I repeat myself and you remember, just tell me.

Kirill Eremenko: What you shared, first one was on cleaning unstructured data with NLP pipelines. Then second one was deep learning to detect cancer. And also we talked about growing organs with deep learning. And case study number three was gaining an advantage in sports betting using machine learning.

Michael Segala: Fair enough. All right, let’s actually, let’s do a couple of different ones as well. I like to always go back to medical imaging. I remember that’s one I talked about last time. We’ve been working for about the past year or so and I’ll give you three again, just kind of three or four random ones pretty quickly. We’ve been working for about the past, I’ll call it a year, a year and a half with a client who is kind of bleeding edge from a medical imaging perspective. And medical imaging is extremely important for lots of different reasons. Let’s take a step back and think about why we care about automation of medical imaging. Right now you go and you get an MRI, you get a CT scan, you get a pathology reading and basically what we’re doing, we’re detecting cancer, we’re detecting breaks, we’re detecting whatever it happens to be.

Michael Segala: There’s this kind of coolness factor of can I use an algorithm to predict probabilistically is this a tumor and can I do that at a rate that is more accurate than a radiologist. That’s kind of the cool factor and sure, right? We’re getting to the point where we can do that and we’re getting to the point where FDA clears it. But what’s really interesting and why we really want to do it is for two reasons.

Michael Segala: The first reason is reducing variability within the medical profession, because right now, if I had an MRI and I gave that to a doctor to predict or for them to tell me if I have a cancer they technically will disagree with a group of radiologists and they’ll even disagree with themselves at a pretty large fraction of a clip.

Michael Segala: If we design a system that is unanimous and reduces that variance, we’re now getting to the point where we can give care to a population in a very unbiased way, it’s a pretty significant kind of implication. The second implication is, this actually takes doctors lots of time to do, this could take minutes to hours of their time, that is not spent with patients.
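
One common way to put a number on the reader variability Michael mentions is to compare two sets of reads of the same scans. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance). The ten reads are invented for illustration and are not data from the project, and nothing in the conversation says this is the metric SFL Scientific uses.

```python
def percent_agreement(reads_a, reads_b):
    """Fraction of scans on which two readers gave the same label."""
    if len(reads_a) != len(reads_b) or not reads_a:
        raise ValueError("need two non-empty read lists of equal length")
    matches = sum(1 for a, b in zip(reads_a, reads_b) if a == b)
    return matches / len(reads_a)


def cohens_kappa(reads_a, reads_b):
    """Cohen's kappa for two readers giving binary labels (1 = tumor, 0 = clear)."""
    n = len(reads_a)
    observed = percent_agreement(reads_a, reads_b)
    rate_a = sum(reads_a) / n  # how often reader A calls "tumor"
    rate_b = sum(reads_b) / n  # how often reader B calls "tumor"
    # Agreement expected by chance if the two readers were independent.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1:
        return 1.0  # both readers always give the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical reads of the same ten scans by two radiologists.
radiologist_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
radiologist_b = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

print("agreement:", percent_agreement(radiologist_a, radiologist_b))
print("kappa:", round(cohens_kappa(radiologist_a, radiologist_b), 3))
```

On these made-up reads the two readers agree on 7 of 10 scans, and kappa comes out at about 0.4. The same functions can be applied to one reader's first and second pass to measure how often a reader disagrees with themselves.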

Michael Segala: Now you’re kind of giving them back all of this time where they can go and do what’s really important, which is seeing and talking with patients. That’s really why we want to do medical imaging and why it’s such a popular field within deep learning and data science. And I won’t go on this long with all of them, I really like medical imaging for lots of reasons.

Michael Segala: What we’re doing with this medical imaging project is we have the world’s largest collection of 3D CT and MRI brain scans looking for different cancers within the sinus cavities. I think it’s like 51 different tumor types that can just establish within your seven different cavity regions within the brain or within the face. What we’ve done there is amassed large amounts of data, paid, well, our client has paid lots and lots of money for doctors to label it. And we’ve built extremely sophisticated algorithms to detect very, very small signatures of malignant-like tumor cells within these 3D images.

Michael Segala: That’s the first one, and that’s been going on for a while. Extremely successful. Kind of has shown them to have accuracies, I can’t really say the accuracy numbers, but far exceeding what they would need to be to get real kind of clinical validation. Very very interesting, very profound. If we think about the implications, so that’s kind of the first one.

Kirill Eremenko: Quick question. What do you mean 3D images? Is it like multiple layers of MRI scans?

Michael Segala: Well, an MRI is 3D, it’s not a single 2D plane. You actually have a stack of like 128 2D images that make up one 3D image. You have to look across the X, Y, and Z planes. And obviously within that Z dimension, that’s where a tumor might be embedded within two or three of the actual slices. It’s a very complex problem, because now you’ve taken a data set and for every image you’ve basically multiplied it by a factor of a hundred. Just think of the size of these data sets and the complexity of the algorithms that have to happen.
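
To make the slice-stack idea concrete, here is a minimal Python sketch. It is only an illustration: the 64 x 64 slice size, the helper names, and the toy "lesion" are assumptions made for this example and are not taken from the project discussed. It shows how a 128-slice volume multiplies the amount of data per scan and how a small object can span a few adjacent slices along the Z axis.

```python
# Hypothetical sizes: 128 slices of 64 x 64 pixels (real scans are larger).
DEPTH, HEIGHT, WIDTH = 128, 64, 64


def make_volume(depth, height, width):
    """Build a 3D volume as a list of 2D slices, filled with zeros."""
    return [[[0.0] * width for _ in range(height)] for _ in range(depth)]


def mark_blob(volume, z_range, y_range, x_range, value=1.0):
    """Set a small box of voxels to `value`, spanning several slices."""
    for z in range(*z_range):
        for y in range(*y_range):
            for x in range(*x_range):
                volume[z][y][x] = value


def slices_containing(volume, threshold=0.5):
    """Return the Z indices whose slice has at least one bright voxel."""
    return [
        z
        for z, plane in enumerate(volume)
        if any(v > threshold for row in plane for v in row)
    ]


volume = make_volume(DEPTH, HEIGHT, WIDTH)
# Toy "lesion" occupying slices 60, 61 and 62 only.
mark_blob(volume, (60, 63), (30, 34), (30, 34))

voxels_per_slice = HEIGHT * WIDTH
voxels_per_volume = DEPTH * voxels_per_slice
hit_slices = slices_containing(volume)

print(voxels_per_slice, voxels_per_volume, hit_slices)
```

A model that looks at one slice at a time would see this object in only three of the 128 planes, which is one reason volumetric data tends to need architectures that reason across the Z dimension as well.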

Kirill Eremenko: Yeah. Can imagine. If it’s possible for you to share what kind of algorithms or even branches of machine learning or other areas of AI did you guys use for this?

Michael Segala: This is all deep learning, this is all computer vision and I just want to make a point here because this is a great question. You cannot take an off-the-shelf VGG-16 or VGG-19 or whatever they have out now and do transfer learning and expect to get a medically viable algorithm. The stuff that people play with is great from an education standpoint, if you do it on Kaggle sure, that’s fun. But if you really want to be serious about solving these problems, you’re really starting from scratch and designing, from a research perspective, these algorithms in extremely deep networks, very complex systems, and you’d better have access to lots of really big and powerful GPUs.

Michael Segala: We write all of this from scratch in pure TensorFlow, because [inaudible 00:31:52] is way too restrictive and they just go to town and just really, these take a long time to do. It’s all very custom kind of convolutional networks and stuff like that. And you do lots of cleaning and pre-processing and post-processing that just go on and on to get the accuracies up and up.

Kirill Eremenko: Gotcha. How’d you guys choose TensorFlow over PyTorch?

Michael Segala: I mean the team does for whatever reason. Sometimes the client demands it, sometimes for whatever reasons, our team chooses it. For this client specifically, I don’t remember why the choice was made but for us, I mean, it’s not a one or the other. It’s whatever best fits that very specific situation. For this maybe it was, TensorFlow was better for these 3D images over PyTorch, but I’m literally making that up. I don’t know why that specific choice was made, but for this client it was made, I’m sure for a very specific reason.

Kirill Eremenko: Wow. So many questions.

Michael Segala: Sure.

Kirill Eremenko: With deep learning, very interesting. First one would probably be, one of the main parts of deep learning is architecting the neural network, finding out or experimenting with how many hidden layers you have, how many neurons in those layers and things like that. Do you guys have any approaches that you have developed at SFL Scientific over the years on what’s the most efficient way to experiment with neural network architectures to get to the end result faster, or is it completely dependent on the project and it’s a creative component that you rely on your team to execute?

Michael Segala: I mean it’s a little bit of both. It’s a lot of experience and a little bit of creativity. And now I’m speaking for an area where my team would be much better suited to speak on than I will but I’ll pretend to know a lot more than I really do. You have to realize that we’ve worked in these kinds of medical imaging problems for years, kind of all the way from our graduate background for the past several years, and a lot of our folks have been working on problems like this for 10 or 20 years. We know computer vision and deep learning in the medical space very well. We happen to have a pretty good understanding of how to build architectures around understanding and segmenting and classifying DICOM-like CT or MRI images. And we know kind of the computing power, we know the size of the data. We can calculate the number of neurons to say, “Hey, I need to show incrementally that we’re getting better and better accuracies.”

Michael Segala: Because you don’t start by throwing the kitchen sink at the problem. You start small and you start quick to kind of iteratively show that you can make progress. Design a network that you can do in a couple of hours and then show it works. Now a couple more hours or a couple of days or a couple of weeks. You’re always building on that, intentionally moving in a kind of structured way. It is obviously just knowing some stuff and then being smart around selecting and kind of fine tuning your network and growing that as a function of your accuracy demanding it. Not a great answer, but it’s my answer.

Kirill Eremenko: No, no. I like what you said about starting small, I think that’s important because maybe somebody might be working on a project and they get an accuracy rate with a certain architecture of, I don’t know, like 60% and that really is discouraging to them. And they completely change the approach, they abandon that first idea that they had and they try something completely different. But what I’m getting from what you’re saying is that, okay you got to 60%, see if you can get that to 70%. Can you adjust it rather than completely abandoning it.

Kirill Eremenko: You might’ve had a great idea at the start, see if you can adjust it and increase, increase, increase and get to that end goal. The point is not to hit the bull’s eye right away, but just like keep throwing the darts until you get closer and closer and closer. And you finally hit the bull’s eye.

Michael Segala: There’s two things. Great data scientists are great problem solvers, hands down. Being thoughtful about why things aren’t converging or what can be improved on, and then second, to your kind of number of 60%. I challenge a lot of our folks and a lot of our clients, when we start throwing out numbers like 60%, 70%, 80%, I’ll always say, “Well what’s good, is 80% accurate on detecting cancer good?” And it actually invokes a lot of thought and like what is an actual good accuracy and what would you do if it was 80% or 60% or 99%. When you’re a data scientist and you’re sitting there and you’re building these algorithms and you’re getting your accuracy numbers, you really need to think about, well, what is needed for the business and what do these accuracy values actually correspond to in terms of an outcome and what level do I really need to achieve?

Michael Segala: It’s not this kind of playground science laboratory. You’re doing this for a business, for a real purpose so figure out that purpose then work backwards in terms of what your accuracy needs to get to. I think that’s such a critical point that most folks just ignore.

Kirill Eremenko: Okay. Totally, totally agree. Thank you for that. That was case study number one, medical imaging.

Michael Segala: All right, let’s see. I have another great case study. I hope I don’t get in trouble for this one. We’ll see. I’m going to be very, very light with the details. We do some work within the federal government. One of them happens to be with a client that develops in airports the baggage screen or stuff that you walk through. Stuff that you physically put your baggage through and then stuff that your baggage that you check in goes underneath and goes through. Those are actually just large CT scans. They’re large CT images. And what happens is as your bag is going through, like you know, you go through the airport security, you’re sitting there, it takes a second and then you have a screener, a TSA agent sitting there and they say, “Hey, I see an interesting object.” It could be a knife, it could be a gun or they’re looking for other objects like explosives and things of that nature.

Michael Segala: You could imagine that these machines might have some interesting algorithms built into them. And you can imagine-

Kirill Eremenko: You’d hope so.

Michael Segala: You could imagine even further that nowadays we would probably want to enhance those algorithms by using like a deep learning solution or really innovative solutions. If you imagined all those things, the TSA probably works with consulting companies that design and develop these types of algorithms for folks. We may be one of those companies doing some really interesting work around detection for the TSA.

Kirill Eremenko: Or maybe not.

Michael Segala: Or maybe not, I don’t know. That could be use case two, but we won’t, I don’t know how much I’ll get in trouble for that one so we’ll skip that one for today.

Kirill Eremenko: Sounds good.

Michael Segala: But very similar. It’s object detection, it’s segmentation, it’s classification around really interesting images. And that image could be anything, it could be, go ahead.

Kirill Eremenko: I was just going to say that, it just shows that your existing expertise in the medical space with imaging is very transferable to other industries such as scanning baggage.

Michael Segala: Yup. Absolutely. Other types of use cases, we’re seeing a lot in these very traditional industries like manufacturing, retail, consumer goods, where they have lots of logistical and supply chain problems. This one’s not a real sexy one, but it’s something that we see a tremendous amount of potential lift for. Increasing logistics and supply chain is an area where there’s a lot of hot press happening at the moment.

Michael Segala: If you’re a beverage company, if you’re a company that sells lots of jeans or whatever you happen to do and you’re selling tens of thousands if not hundreds of thousands of these products, the question becomes very simple: how can I use an AI solution, whatever that means, some machine learning or deep learning, to actually allocate merchandise in a much more optimized way? We have a few different clients in a lot of these big industries that ship, talking hundreds or millions of individual items every day, every week, every month, and they want to be able to dynamically understand, how do I ship them? How do I become better about not wasting material? How do I increase my bottom line by just doing that in a more optimal way? We’ve seen very recently a lot of these industries looking out because they’re seeing what machine learning can do over their very traditional kind of rule-based forecasting methods just to enhance these operations.

Michael Segala: A lot of our use cases just literally in the last couple months have been around that supply chain and logistics. If you’re somebody who’s looking at interesting problems, I think that most big companies, most Fortune companies or even smaller mid-market, they all have very similar types of use cases around this space. Forecasting, supply chain, manufacturing, where you can do a lot of interesting stuff. That’s kind of not really a single use case, but lots of use cases baked into one there. A lot of real great value there, very different, very time series like data and things of that nature. And what you can start doing there is coupling from a predictive side, if you’re also doing supply chain, you have things that are failing. Machines are failing, equipment is failing. And the question becomes in that same supply chain when I’m doing my forecast, can I also understand failure events, right? Predictive maintenance and whatever it happens to be on those same machines.

Michael Segala: We’re seeing these companies starting to collect and analyze all this information, wanting to predict when their machines are going to fail. How often do we need to take them offline? How will that affect their shipping? How will that affect their logistics? And kind of solving two problems at once, a better forecast, plus being able to augment and not necessarily need to fix machines after they break, kind of fix them beforehand. It’s kind of two things boiled into one. But you could potentially do it all together. It’s kind of the second use case we’ve been seeing a lot of recently.

Kirill Eremenko: Gotcha. On that, with logistics, I see there’s lots of components where data science can be applied to solve challenges. But two of the challenges that I’m quite familiar with are bottlenecks in logistics, like where is the bottleneck? Is it at the factory? Is it at the pickup location? Is it through the route? Is it at the end? And the other one would be optimizing routes. Things like the traveling salesman problem. How do you get to as many destinations? Like if you’re delivering the milk to different stores, how do you get to them in the most efficient way possible? Could you give us some examples of what kind of algorithms you would use in maybe these two problems, bottlenecks and optimizing routes, or maybe other problems and challenges in logistics that you guys have worked with before?

Michael Segala: Yeah. For instance, to kind of answer your first question first, lots of different algorithms could be used here. Folks are starting to experiment and getting a lot of success with these reinforcement algorithms, whatever you want to call them, deep reinforcement or regular reinforcement learning algorithms, to do these kind of difficult optimization problems, right? Because that’s what it is, it’s an optimization. If you kind of take a step back, a lot of these traditional Markov models or Monte Carlo simulation, very similar. You have a very complex dynamical system. How do you optimize across this entire system? It’s not necessarily the same as just a single kind of prediction variable, but now you’re doing it in a very complex manner.

Michael Segala: We’re seeing a lot of interest and movement, especially in some of our use cases, in that manner of kind of using some of these [inaudible 00:44:17] bleeding edge methodologies. Others, if you can still turn it into a traditional machine learning algorithm, if you can predict something, either binary or categorical or forecasting, you can use whatever traditional ML algorithm you want, or a deep learning algorithm. This is a problem that lends itself to lots of different opportunities. Optimization is a different class of problems, you can even use like genetic algorithms or things of that nature. Lots of interesting stuff there.

Michael Segala: And then your second question was more specifically-

Kirill Eremenko: Sorry, on that one. Do you guys have any approaches like genetic algorithms or reinforcement algorithms, deep reinforcement learning or machine learning, do you have any of those that have shown to be the most useful, the easiest for you guys to deploy or the quickest win for your clients? Any comments on that?

Michael Segala: Oh yeah. It always depends, it depends on the problem statement, it depends on the complexity of the problem. Quickest wins are always going to be the easiest algorithms. If you can map anything into a very simple machine learning algorithm with a prediction variable, you’re going forward with some features. Yeah, I mean you could do that pretty easily. If you need to dynamically optimize a really complex system and you need to go to like a deep reinforcement algorithm, yeah, I mean that’s going to take a lot more time, and the amount of lift you might get there might be extremely incremental.

Michael Segala: Again, right, there’s a trade off in all of these things. I always advocate very heavily for determining a baseline as fast as possible, literally whatever the fastest path to getting a number out to set a baseline, do it, then start experimenting and making it more complex.

Michael Segala: Whatever that means for this problem, start with an ML model, then go to a deep learning model, then go to a genetic algorithm, whatever it happens to be for your problem. Always kind of think of it in a very incremental fashion in complexity. That’s at least my opinion.
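
The "baseline first" advice can be sketched in a few lines of Python. This is a hedged illustration only: the weekly demand numbers are made up, and the function names `last_value`, `mean_of_last_three`, and `backtest` are hypothetical helpers, not anything described in the conversation. The idea is to get a number out of the simplest possible forecast, then require every more complex model to beat it on the same backtest.

```python
def last_value(history):
    """Simplest possible forecast: tomorrow looks like today."""
    return history[-1]


def mean_of_last_three(history):
    """Slightly more complex forecast: average of the last three points."""
    return sum(history[-3:]) / 3


def backtest(forecast, series, start):
    """Mean absolute error of one-step-ahead forecasts from `start` onward.

    At each step t the forecast only sees series[:t], so there is no leakage.
    """
    errors = [abs(series[t] - forecast(series[:t])) for t in range(start, len(series))]
    return sum(errors) / len(errors)


# Made-up weekly demand for a single product (illustrative values only).
demand = [100, 102, 98, 105, 110, 108, 115, 120]

baseline_mae = backtest(last_value, demand, start=3)
candidate_mae = backtest(mean_of_last_three, demand, start=3)

print(f"baseline MAE:  {baseline_mae:.2f}")
print(f"candidate MAE: {candidate_mae:.2f}")
print("keep candidate" if candidate_mae < baseline_mae else "stay with baseline")
```

On this toy series the extra complexity does not pay for itself, which is exactly the kind of thing a fast baseline is meant to reveal before time is spent on heavier models.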

Kirill Eremenko: Love it.

Michael Segala: I think the best way to approach the problem.

Kirill Eremenko: Gotcha. Love it. Love the “establish a baseline as fast as possible.” I think that’s golden advice for data scientists out there.

Michael Segala: And then what was your second question?

Kirill Eremenko: The different types of problems I think, bottlenecks, optimizing routes, maybe if you had some more to add to those.

Michael Segala: Yeah. For instance, one of the problem statements that we’re currently working on, that’s of a lot of interest, is in clinical trials. Clinical trials is actually a very complicated problem because you’re a big Pharma company and you need to run a trial for your medication against a pretty diverse and large population of people. Think of something simple like Tylenol. You’re not running clinical trials on Tylenol, but if you were, you’d have a bunch of Tylenol, you’d find a bunch of people at a bunch of medical sites and you’d ship them some Tylenol or they would take it and you would monitor how they interact with the drug and things of that nature. That’s how clinical trials work.

Kirill Eremenko: What is Tylenol? Sorry, not familiar with the US term.

Michael Segala: Oh geez. Just helps your headache. It’s, what is it, acetaminophen, whatever that word is.

Kirill Eremenko: So it’s for headaches?

Michael Segala: It’s for headaches yeah. It’s been around for a hundred years. But that was just kind of a silly example.

Kirill Eremenko: In Australia we have Panadol.

Michael Segala: Sure. I don’t know what that is but sure. For clinical trials you have this problem of needing to send out things to the medical facilities for them to be able to run their clinical trials, collect data, collect vital information, collect even blood or other types of, whatever you’re collecting directly from the patient and do so in a very complex manner. Because you have patients in different countries, different ages, races, genetic profiles, whatever it happens to be. You’re sending, you’re shipping, you’re receiving these things that are all highly perishable. It has to happen in a very kind of dynamic environment.

Michael Segala: For them, this logistics problem with bottlenecks in things that are highly perishable that are shipping all over the globe is a huge problem. And it’s this very dynamic system that we’re applying these exact type of algorithms that we’ve been talking about. Clinical trials is a good one, but everything’s the same. If you’re shipping a pair of jeans, a bottle of Coke or a lab kit for a clinical trial, the methodology is very similar and the way to attack the problem is very similar. It’s just kind of that end use case. What you call it is a little bit different.

Kirill Eremenko: Okay. Gotcha. All right, well thank you very much. Logistics, amazing case study. Do you have one more for us? I think we have time for one more.

Michael Segala: Jeez, how about this. What do you care about? I’ll give you a use case on something.

Kirill Eremenko: Ooh, good one. Let’s do energy, the energy space. Do you have any of those?

Michael Segala: Of course. I have some on energy. Let’s see. I have a few on energy. What could be interesting? I’ll give you two quick ones in energy.

Kirill Eremenko: Sounds good.

Michael Segala: One quick one is a lot of people, if you have a meter, I don’t know how you guys do it, but I assume you have an energy meter sitting outside of your house, and that energy meter is basically collecting information on how much energy you used. It turns out that two types of people tend to want to screw with that energy meter. One is people from, sometimes, not as affluent communities who don’t necessarily want to pay the bill, or affluent communities who don’t want to pay the bill. It could be here. Or drug dealers who are using an absurd amount of energy, and that, at that peak, sets off some kind of alarm, but they need to hide that.

Michael Segala: What you can do is you actually take a magnet to the outside. I’ve never done it, but I’ve been told, you can take a magnet to the outside and you can actually trick the smart meter or whatever it is from the reading. And it shows much less consumption than what you’re actually having. We did some work with a large energy company out in the UK that was running into a lot of these problems, people were literally putting magnets on their meters outside, fooling the system, and they were people fooling it because they didn’t want to pay the bill or drug dealers. And the question was, can you take all of that time series data, ’cause it’s very temporal time series data, and look for patterns that would be anomalous that they think correspond to somebody kind of adding these fraudulent activities.

Michael Segala: We were given a pretty large set of data, but a very, very small set of labeled data. Literally only like a few 10 or 20 labeled cases of these anomalies. We attacked the problem a couple of different ways, both in a supervised and unsupervised manner. We did a lot of different things, could be really thoughtful about it and we were able actually to show, you can spot these anomalies and you can really see when people are gaming the system from an energy perspective. That’s kind of a one quick use case-

Kirill Eremenko: Just quickly on that, that’s very interesting because you had only 10 to 20 examples, like in a spreadsheet or like in a database with millions of rows of negative results. You had 10 to 20 positives. How’d you deal with situations like that? What is your advice to data scientists out there? How do you attack a problem where you only have under 20 examples of what is a positive outcome that you are actually trying to identify?

Michael Segala: This happens a lot of times. We actually fool ourselves that big data is the challenge. The actual challenge is small data, big data’s not a challenge, it’s “Ah, okay, we have big data.” We get into a lot of these cases where you have very small labeled positive examples. You have to be very thoughtful about it, you could theoretically create fake data sets to encapsulate very similar behavior, right, through that kind of same simulation and modeling. You could do that or you can start attacking the problem because you could treat this as anomaly detection, pretending you didn’t know any labeled data. Can you actually spot anomalies? And you have 10 or 20 in your back pocket to think about, or you turn it into a supervised learning problem with a very, very small holdout set. And find and experiment with it.

Michael Segala: There’s lots of different scenarios, but again it’s really about being a problem solver and thinking about, can you do something that’s convincing enough to yourself from a technology standpoint that it’s working and can you make a business case that it should be implemented? There’s lots of different ways to solve a problem, but you have to do it in a kind of systematic way and be thoughtful about it.
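
The "pretend you have no labels, then check against the few you do have" option can be sketched as follows. Everything here is an assumption made for illustration: the meter IDs, the per-meter feature (recent consumption divided by earlier consumption), the values, and the cutoff of -3.5 on a median-based robust z-score. It is a generic outlier check, not the method used on the project described.

```python
from statistics import median


def robust_z_scores(values):
    """Median/MAD based z-scores, which are less distorted by the outliers
    themselves than mean/standard-deviation z-scores."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [0.6745 * (v - center) / mad for v in values]


# Hypothetical feature per meter: recent average usage / earlier average usage.
# A tampered meter would be expected to show a sharp drop (ratio well below 1).
usage_ratio = {
    "m01": 0.98, "m02": 1.03, "m03": 0.95, "m04": 1.01, "m05": 0.42,
    "m06": 1.05, "m07": 0.99, "m08": 0.38, "m09": 1.02, "m10": 0.97,
}
# The handful of confirmed cases "in your back pocket" (made up here).
known_tampered = {"m05"}

meter_ids = list(usage_ratio)
scores = robust_z_scores([usage_ratio[m] for m in meter_ids])

# Unsupervised step: flag unusually large drops without looking at labels.
flagged = [m for m, z in zip(meter_ids, scores) if z < -3.5]

# Sanity check: did the unsupervised flags recover the known cases?
labeled_caught = known_tampered & set(flagged)

print("flagged:", flagged)
print("known cases caught:", sorted(labeled_caught))
```

With so few labels, they work better as a sanity check on the unsupervised output than as training data; any flagged meter that is not in the labeled set is a candidate for a follow-up inspection.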

Kirill Eremenko: That’s so cool. I love your three examples, just to recap on those, create fake data sets, anomaly detection so pretend you don’t even have those 20 and see what the algorithm will do, completely unsupervised or supervised learning with a very small holdout dataset. I’ll probably just add to that, that it’s also important to talk with the clients. And correct me if I’m wrong here, but I think it’s important to talk to a client and understand how important for them are false positives and false negatives.

Kirill Eremenko: In your case, in this case of an energy company, is it really bad if you have a high rate of false positives, would they prefer high rate of false positives or high rate of false negatives? If you identify more cases where people are allegedly trying to trick the meter, how difficult is it for them to ask the electrician next time they go out outside to check if there’s a magnet on the box or not?

Kirill Eremenko: Based on that conversation with a client, you can fine tune your algorithm to either output more findings in terms of like these anomalies or fewer. And in some cases it might be, I don’t think in the case of energy it would be as bad as, say, in the case of medicine, where a false positive can actually change somebody’s life.

Michael Segala: Absolutely. You’re absolutely right. And what you said, it’s the real critical part, and what the real kind of mindset needs to be is how do you tweak this algorithm to actually fit what we care about capturing and what does that cost. Because the question is really what would that cost for the electrician to go back and report it? And then how would they report that? And where’s that data stored? You get into this kind of cascading effects of what your algorithm actually mandates to the business to actually have to implement. It’s not a trivial problem. That’s actually where the real ingenuity and kind of problem solving comes in and kind of tweaking that outcome to actually be effective. You’re 100% on there.
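
One way to turn that conversation with the client into a concrete setting is to pick the alert threshold that minimizes total cost rather than maximizing accuracy. The sketch below is illustrative only: the anomaly scores, the labels, and both cost figures are invented, and `total_cost` and `best_threshold` are hypothetical helper names.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of alerting on every score >= threshold.

    A false positive is an alert on a normal case (label 0);
    a false negative is a missed true case (label 1).
    """
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_pos * cost_fp + false_neg * cost_fn


def best_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Made-up model scores and ground truth for ten cases.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
candidates = [0.3, 0.5, 0.7, 0.9]

# Scenario A: checking a meter is cheap, missing fraud is expensive.
cheap_check = best_threshold(scores, labels, candidates, cost_fp=50, cost_fn=400)

# Scenario B: a false alarm is expensive, a miss is comparatively cheap.
costly_alarm = best_threshold(scores, labels, candidates, cost_fp=400, cost_fn=50)

print("threshold when checks are cheap:", cheap_check)
print("threshold when false alarms are costly:", costly_alarm)
```

The same model ends up with a different operating point in each scenario, which is the sense in which the business cost, not the raw accuracy number, decides how the algorithm should be tuned.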

Kirill Eremenko: Gotcha. Okay. That was the first one on energy. And second one?

Michael Segala: Second one. Let’s see. We have a few. The other one we were doing, kind of similar, a little bit different. This is energy as related to internal devices in the home. And the question for them is, if I had all of the kind of time series data of the meter coming in, can I understand which appliance that data is coming from? There’s this concept of energy disaggregation. Meaning, if I only gave one overall signal, can I see what came from the refrigerator or the microwave or the TV or whatever else it happens to be.

Michael Segala: Again, it’s a very interesting class of algorithms where you can kind of look at consumption patterns and then kind of detangle them in terms of understanding exactly where your consumption comes from, and why you’d want to do that is because you would be able to show, “Hey, your appliances over here are causing 80% of your bill. Get something more efficient or unplug it or do something of that nature.” It’s this really kind of personalization that is happening, especially within these big energy companies that want to kind of get consumer buy-in and kind of always have consumers coming back and never leaving them, by showing them these kind of innovative solutions towards some of their energy bills and outputs and things of that nature, especially as we become a greener and greener society.

Michael Segala: This was a very interesting one and actually showed extremely promising results as well, which that company is using.

Kirill Eremenko: How do you go about a problem like that? How did you disaggregate components of a signal?

Michael Segala: This was actually a while ago, if I remember correctly, and I can be completely wrong here, I think we had a pretty small training set as well of, they had a couple of houses or dozen houses or a hundred houses, I don’t remember at this point, that actually had smart meters plugged into all of the devices. You were able to see a real training set of, here’s the total consumption and here’s what all the devices were, which was fine, you can show that. Now the question becomes on a very new house, does that algorithm actually transfer over and is it generalized?

Michael Segala: That’s really the big question and I think we used two different approaches. The first being a lot of these Markov models, hidden Markov models I believe had worked really well for this case. This was maybe about two years ago when deep learning was still kind of in its infancy, not really infancy, but really being used, especially for time series. I think we started playing around with some deep learning in the time series space there as well. And that was showing some really nice progress, but we were able to achieve what they wanted to in those kinds of Markov models and they kind of took that and ran with it. If I remember correctly, that’s how we attacked that problem back then.
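
For readers unfamiliar with hidden Markov models, here is a deliberately simplified Python sketch of the idea. It decodes the on/off state of a single hypothetical appliance from a sequence of power readings using the Viterbi algorithm. All of the parameters (start and transition probabilities, the mean and spread of wattage in each state) and the readings are invented for illustration. Real disaggregation has to separate several appliances summed into one signal, so this is a sketch of the building block under stated assumptions, not the approach used on the project.

```python
import math

STATES = ("off", "on")
# Hypothetical model parameters, chosen only for this example.
START = {"off": 0.9, "on": 0.1}
TRANS = {"off": {"off": 0.9, "on": 0.1}, "on": {"off": 0.2, "on": 0.8}}
EMIT = {"off": (50.0, 30.0), "on": (1500.0, 200.0)}  # (mean watts, std watts)


def log_gaussian(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi(observations):
    """Most likely hidden state sequence for the observed power readings."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + log_gaussian(observations[0], *EMIT[s]) for s in STATES}
    back_pointers = []
    for x in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this step.
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + log_gaussian(x, *EMIT[s])
        back_pointers.append(pointers)

    # Walk backwards from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up meter readings in watts.
readings = [40, 60, 1450, 1520, 1480, 55, 45]
decoded = viterbi(readings)
print(decoded)
```

The transition probabilities encode that appliances tend to stay in their current state, which helps the decoder ignore brief noisy readings. Whether parameters learned on instrumented houses carry over to a new house is the generalization question raised in the conversation.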

Kirill Eremenko: Okay. Well, Mike, thank you so much for sharing those case studies. Amazing medical imaging, logistics, energy case studies. If our listeners want to … If you guys want to check out more case studies, as I mentioned at the start, head on over to sflscientific.com and they have a tab there called solutions or the other one is Our work and you can read quite a bit about different use cases in different industries.

Kirill Eremenko: Before we finish up, ’cause we are slowly getting to the end of this super exciting podcast which could probably go on for a few more hours. But before we finish up, I wanted to ask you on a question that I, a more philosophical question I like to ask guests sometimes and that is, from where you stand and from all these projects and clients and industries and approaches and employees, you’ve seen people, you’ve seen the data science. Where do you think the field of data science is going and what do our listeners need to look out for to prepare for the future that’s coming ahead?

Michael Segala: That’s a tough question. Are you asking that as somebody who wants to get into the data science space, as a data scientist? Or are you asking that in terms of where do I think industries are going?

Kirill Eremenko: Ooh, that’s a good question, how about we do both.

Michael Segala: ’Cause those are very different conversations.

Kirill Eremenko: How about we do both, what’s your view on both of those?

Michael Segala: All right. Both of them will go quick because I don’t, we could talk for a long time. Sometimes I get too talkative. Let’s start easy, data scientists. And we see this a lot, I’ll have an open REQ and by the way, we have lots of open REQs if somebody wants a job, come and talk. But we see more and more people wanting to become data scientists transitioning into this space. There’s a lot of great potential, money being invested and people honing their skills with courses like the ones you teach, people going to conferences like the ones that you guys give, a lot of great mind share, knowledge share and things of this nature, which are so much easier than when I started about six years ago. I think that’s gonna continue to happen.

Michael Segala: However, I think algorithms themselves [inaudible 01:00:22] come and already are kind of becoming very commodity. Everybody nowadays can fire up XGBoost and run something, that doesn’t make you a good data scientist, that makes you extremely commodity in your job. I think data science is going to start to become a wider role that is going to be, as we’re talking about here, it’s really a problem solver. How do you take a business problem and solve it with data? That’s really the big question here. And unless you’re capable of thinking about the larger problem and the impact that it has on the business and how you’re actually going to take that algorithm and actually allow your business to generate revenue or cut costs, you’re probably not going to be a very successful data scientist, especially as these tools become more and more efficient and will start to automate some of your job away.

Michael Segala: I really think the trend in our industry will also be to automate out some of our own data scientists who are doing just kind of very routine type of work. But the ones that survive and do a great job I think are going to be probably one of the most critical folks within the company by far. That’s really how I see that transition happening. And I actually don’t think that, that’s far away, I think within the next 12 to 24 months. Maybe the next time we talk on one of these, we’ll start to see that already.

Michael Segala: In terms of, and let me know if that didn’t answer the question-

Kirill Eremenko: That totally makes sense. I just want to add here that from my experience ’cause I … For listeners who don’t know, I worked at Deloitte for two years in the data science division and what I can definitely say and probably you’ve gathered from this conversation we’re having here with Mike that being in consulting really helps with that, becoming a problem solver, understanding how to not just like do a cool project or a cool algorithm but think of the business as a whole.

Kirill Eremenko: If you are looking for a job, I just want to reiterate Mike’s call: give Mike a shout-out and/or contact Mike on LinkedIn or somewhere else and chat to him, because a few years in consulting really puts the whole game of data science into a different perspective. Not to say that you can’t get there on your own without consulting; if you’re in an industry, that’s totally fine as well. Just from experience, I know that consulting is a great way to get to that type of mindset.

Michael Segala: Yeah. I tell my new employees, within 12 months they’ll probably have more project depth and skills than somebody who sits in a single kind of vertical for 10 years. Just the breadth of project and the depth that we get to get into extremely quick. It’s exciting. But it’s hard, it’s not stagnant and you’re always kind of thinking and moving on your feet. It’s not for everybody, but I love it.

Michael Segala: In terms of businesses, I think we’re really at this critical junction in terms of where data science will go. We see industry starting to invest for sure. They invest kind of small pockets of money on a few small initiatives. The big companies that make the media hype, the Apples, the Googles, the Airbnbs, those aren’t even relevant; those are the outliers, the anomalies.

Michael Segala: I’m talking about the other 99% of the market. And we know, we work with so many of them, they see that there’s a lot of interest out there. There’s a lot of innovation happening and there’s a lot of hype and potential. They’re starting to make strategic bets into this space by funding a couple POCs, proof of concepts, hiring a few individuals or a larger team depending on the organization. But we’re really at that critical point where now in the beginning of 2019 over the next kind of nine months, a lot of folks have budgeted Data Science into this 2019 workflow that need to start paying off. They need to see real revenue generated or margins decrease by better automation and cutting costs and things of that nature or margins increase.

Michael Segala: I think if we don’t start delivering past POCs and really start embedding algorithms into deeper kind of production workflows, it’s actually going to take a big hit and a big step back and people will start defunding AI into their 2020 and 2021 plans. And I honestly think that, there’s a lot of folks that, very [inaudible 01:04:50].

Michael Segala: Here’s a fun app on Instagram and I just want to go and repeat that and kind of play off of them and you’re always going to see that in the market, but that’s quickly going to become cannibalized in this AI space. When you have all these big IBM commercials and Microsoft commercials that are really hyping AI and people are investing, they need to see something very quickly pay off or we’re just not going to continue to get funding and this market will start to slow for sure.

Michael Segala: It’s up to you, the listeners, you have so many great listeners on this podcast that are the ones in the trenches. And I say that wholeheartedly, like I think that your audience is by and large some of the greatest audience, especially in the data science space that I’ve interacted with and I still get, literally every week people [inaudible 01:05:35] about your podcast and what you’re doing. And they come to me and say, “Oh, I heard you on Kirill’s podcast. That’s great.”

Michael Segala: You definitely are driving the correct audience. It’s kind of all of our responsibility as data scientists to ensure that these projects are successful and we don’t just kind of cannibalize ourselves in the next year or two and not get any bigger funding. ‘Cause then we’re all going to be out of a job. That’s honestly how I think the market is going to mature.

Kirill Eremenko: Fantastic. That ties back into that productization discussion that we had. For data scientists out there, don’t just leave your project. It feels very satisfying to find the insights and deliver them, but talk to your manager, boss, client, whoever it is you’re talking to, and consult them, advise them on next steps on how they can actually put that into production. Follow up with them, go back in a few weeks and check if your model is performing, if it’s deteriorating or if it needs some maintenance. Be proactive in that [inaudible 01:06:32], it’s kind of like marriage. If you get married, you don’t just stop there. You have to keep dating, your wife I mean. Or husband. You have to keep caring for each other. It’s not like you won the game once you got married. There’s lots more. And now it’s the aftermath and the commitment that comes afterwards.

Michael Segala: I see you’ve been well trained as a husband.

Kirill Eremenko: Not a husband yet my friends.

Michael Segala: Soon to be [crosstalk 01:07:01]

Kirill Eremenko: Yeah one day.

Michael Segala: One day, good for you.

Kirill Eremenko: Mike, wanted to ask you, how many clients have you guys worked with if it’s not a secret, just curious.

Michael Segala: Oh geez. I don’t know the number, but it’s in the hundreds.

Kirill Eremenko: Wow. You guys-

Michael Segala: I don’t know the number …

Kirill Eremenko: Are doing really well. All right, well Mike, thanks so much for coming on the show, been a huge pleasure. Before I let you go, what are some of the best ways for our listeners to contact you, whether they are interested in working with you or whether they’re interested in maybe joining your team.

Michael Segala: Please come to the website, sflscientific.com. There is a place there. I think you could either chat with us or you could inbound an email. That all comes directly to my folks, who tell me right away. If you’re looking for a job, I think we have like an HR jobs page, and that gets looked at. I tell you, we interview almost a person a day at this point. A lot of them are great candidates, but for whatever reason don’t work out. We’re always looking for really great folks. If you inbound to us, I guarantee one of our folks will see it in a few minutes and reply back accordingly. So please be in touch. That’s probably the best way to get in touch, just through the website.

Kirill Eremenko: Okay, great. And is it okay for people to connect with you on LinkedIn as well?

Michael Segala: Of course. It’s my pleasure.

Kirill Eremenko: Awesome. Thanks so much. Of course we’ll share all of those links in the show notes. And on that note, Mike, thanks so much for joining us today and sharing amazing case studies and your view on the world of data science.

Michael Segala: Again, thank you so much. It’s always an honor and pleasure to see your progression as well. Best of luck with you and hope we can talk again soon.

Kirill Eremenko: There you have it. That was Mike Segala from SFL Scientific. I hope you enjoyed today’s episode and got a lot of valuable takeaways from the show. If you’d like to connect with Mike, hit him up on LinkedIn. You can find the URL, as well as all the other materials mentioned on the episode, in the show notes at www.superdatascience.com/249, that’s www.superdatascience.com/249.

Kirill Eremenko: There you can also find the transcript for this episode if you’d like to read it. And my personal favorite part for today was the challenge of small data, dealing with unbalanced datasets. And the three approaches that Mike shared with us ranging from creating fake data sets to unsupervised anomaly detection, to supervised learning with a small holdout dataset. Some very exciting stuff. And of course apart from just the challenges of small data, there are plenty of other valuable gems shared by Mike.

Kirill Eremenko: And I’d like to reiterate again the call to action from Mike and the team at SFL. If you’re looking for a job and you’d like to join consulting, then go, head on over to SFL Scientific and look for the careers page and apply there. If you’re a business owner or an executive director and you have some challenges that you think can be solved with machine learning, or you’d like to explore the space of AI and Data Science, then hit up Mike, don’t hold back, and see how SFL Scientific can help your business grow and become even more competitive.

Kirill Eremenko: And on that note, if you’re enjoying the SuperDataScience show, make sure to head on over to iTunes or to your favorite app for playing podcasts and leave us a review there. I’ll really appreciate it. I love reading your reviews. Thanks so much and I look forward to seeing you back here next time. Until then, happy analyzing.
