SDS 107: Charting a Career in Energy Analytics

Podcast Guest: Gabor Solymosi

November 24, 2017

Welcome to episode #107 of the Super Data Science Podcast. Here we go!

Today’s guest is Data Scientist at Utopus Insights, Gabor Solymosi.
During our European road trip this summer, Hadelin and I got a chance to meet with students of our Udemy courses. One of them was Gabor Solymosi in Budapest, Hungary, who shared with us his passions in data science and how he has worked to accomplish them. In the few months since that trip, Gabor has made even more progress in his career.
On today’s podcast, he gives us new paths to consider in data science and shares how he made his way into energy analytics without prior knowledge in the field. Gabor tells us about using social media analytics to help tech companies improve their business, and how survival analysis can be applied in solving problems for companies in heavy machinery.
Tune in and find out!
In this episode you will learn:
  • The challenges of working remotely (06:50)
  • Trading full-time job security for an exciting opportunity in energy analytics (08:46)
  • Data science skills can be transferred to any industry (16:20)
  • What is survival analysis? (21:30)
  • The key concepts in survival analysis (27:15)
  • Gabor’s university thesis on social media landed him a job in text analytics (32:00)
  • Using Visual Basic, Salesforce, R, and Radian6 as tools for text analytics (35:35)
  • Gabor’s job in banking inspired him to pursue a Master’s in Business Intelligence and Analytics (43:39)
  • Preparing data in the right format to be used as input for machine learning algorithms is one of the biggest challenges for a data scientist (48:04)
  • In data science, there is always something new to do, to research about, to try out, to implement (53:24)
  • Don’t trust human doctors for diagnostics because machines are better (54:34)

Podcast Transcript

Kirill: This is episode number 107 with data scientist at Utopus Insights, Gabor Solymosi.

 Welcome to the SuperDataScience podcast, my name is Kirill Eremenko, data science coach and lifestyle entrepreneur, and each week we bring inspiring people and ideas to help you build your successful career in data science. Thanks for being here today, and now let’s make the complex simple.
[Background music plays]
Kirill: Welcome everybody to the SuperDataScience podcast, super pumped to have you on board and today I’ve got a special guest, a friend whom Hadelin and I met during our European road trip, Gabor.
 Gabor is from Budapest, Hungary, and that was I think our third stop during our road trip. It was very exciting to meet everybody there and Gabor’s story especially resonated with me because of his dreams and passions and how he works to accomplish them, how he works towards them. And I’m very excited to hear that since the road trip, which was a couple of months ago, Gabor has made progress in his career: he’s got a new job and he’s actually working towards his goals, as you’ll see from this podcast.
 Gabor is a very interesting person, very passionate about data science. We’ll talk about three of the roles that he’s had to date in the space of data science, and about things like text analytics, survival analysis, and jumping into an industry completely foreign to him: how you’re able to switch from one industry to another as a data scientist, transferring those data science skills, and what he is going through as he moves to something completely different. As you’ll hear from the podcast, the industry that he’s just jumped into is very, very exciting. He’s working with solar and wind turbine energy. Who would have thought they also need data scientists there?
So, quite a lot of interesting things we talked about here, but probably the main thing I’d like you to focus on is the path, the way that Gabor has intentionally chosen the roles in his data science career, and how he’s working towards his dreams. I can’t wait for you to hear his story, and let’s get started. Without further ado, I bring to you Gabor Solymosi who is a data scientist at Utopus Insights.
[Background music plays]
Kirill: Welcome everybody to the SuperDataScience podcast, today I’ve got a very exciting guest, a friend of mine from Budapest, Hungary, Gabor Solymosi. How are you doing, Gabor?
Gabor: Thank you very much, Kirill. I’m really excited to be here actually.
Kirill: Awesome. Cool to hear you. Can you remind us how we met, where did we meet? So that the listeners can get a bit acquainted with our story.
Gabor: I’d known you before because I took several of your classes on Udemy, but then once I got an email that Kirill is coming to Hungary, Budapest, I took the opportunity and we met there. It was quite a nice dinner and we went on a few bar trips, let’s say. It was quite interesting and we had a lot of good talks that time.
Kirill: Yeah, exactly, it was fun. It was during the road trip, a lot of you listening to this podcast might know that Hadelin and I did a road trip this summer through Europe. One of our stops was Budapest, Hungary and so we met quite a few of our students there.
 It was interesting talking because you or someone else was saying that you were very surprised that we came to Budapest. We started with Italy then we went to Munich and then the next email that came out was, we’re going to Budapest. Were you a bit surprised at that or were you expecting it?
Gabor: Yeah, actually I didn’t expect it. I was checking the mails that you were doing this Europe trip and I didn’t think that you were going to come to Budapest. I thought okay, maybe Prague or some other bigger cities. But I think you had a good time here also.
Kirill: Yeah it was good and thanks a lot for showing us around, a very interesting city. If somebody hasn’t been to Budapest, we liked it quite a bit, Hadelin really fell in love with the city. It’s got this big massive river. What’s the river called again?
Gabor: It’s the Danube.
Kirill: The Danube, the river, it’s a big river. It goes through a lot of countries in Europe, but in Budapest it’s really wide and you’ve got two parts to the city, you’ve got Buda and Pest. I think the story goes that they were two separate cities that were growing on both sides of the river and at some point, they just decided to become one city, is that right?
Gabor: Yeah, it’s kind of like the short summary of how we got together.
Kirill: Then Gabor and some other students showed us around the city. It’s got quite a lot of monuments and we even saw the statue to Gabriel and that’s where you told me that your name Gabor is a derivative of Gabriel, is that right?
Gabor: Yeah, that’s correct.
Kirill: That was very interesting to learn, I never knew that before. Anyway, so we’re here to talk about data science and your journey into the space of data science, so tell us a bit about what you do. You told me just before the podcast, you got a new job, congratulations.
Gabor: Thank you very much. I have recently changed my job from one company to another. I’m working as a contractor data scientist for an exciting new energy analytics company called Utopus Insights. As I told you, it’s actually a spinoff from IBM Research and it’s headquartered in New York but I work from Budapest, Hungary. It’s kind of a remote job.
Kirill: That’s so cool. Let’s just realign a bit. Where did you work before Utopus Insights?
Gabor: I was also working as a data scientist at XAPT. I will go into details with that also because it was really interesting. So, those are my first two data science jobs. Before that, I was a data analyst but it was something a bit different. Now I’m really happy to be here because it’s really cool. Basically, I’m working from home most of the time, which has its benefits of course, and disadvantages as well, since I have a lot of time for dealing with things around the house or waking up a little bit later or doing workouts every morning.
Kirill: Getting distracted.
Gabor: Yeah, that’s true but on the other hand, of course, it can be a bit boring sometimes. I have regular Skype meetings with the others in the States and here we have a team in Budapest with whom I regularly meet. Actually, it’s quite nice.
Kirill: You said it’s a contract. When you were working with XAPT was it also a contract or was it a full-time job?
Gabor: No, that was a full-time job. Actually, it’s also like kind of full-time but I’m working through a major company.
Kirill: You traded in a full-time secure job for a contract, is that right?
Gabor: Yeah. It’s like that. But of course, it’s really interesting and it’s really exciting for me to work with this now because I really wanted to do something with the energy industry. How to help the future, doing something with renewables and these things. It’s really interesting.
Kirill: I can totally imagine. But the whole concept is very interesting because a lot of people wouldn’t do that. They would think, this is a full-time job versus a contract, a contract can expire, versus a full-time job, I’m very secure in what I’m doing. Was it a hard decision to make, to give up that security of your job, of your income, and to go for something more exciting but something that’s a contract, that can end and might not be renewed?
Gabor: When I was thinking about it, I didn’t want to change jobs at the time, but it was quite an opportunity for me because I have a friend who said that they have an opening in the Budapest office and I really wanted to do something with this renewable energy data science stuff. And for this, you know, yeah, I kind of traded my secure things for a contract but it was worth it, I think. Of course, it’s a contract but I wouldn’t change, it’s not for more money and it’s really exciting.
Kirill: Yeah, I know. That’s really cool and very inspiring to hear as well because, after our chat in Budapest … Probably I should mention this to the listeners. This was one of my most inspiring conversations that I had on the road trip because when we were talking, you said that, look, there’s certain dreams that I have and goals, and ideally, I’d love to live … Do you remember that conversation about Spain?
Gabor: Of course, yeah.
Kirill: What did you say? Tell us about your dream. What is your dream in relation to Spain?
Gabor: I lived in Spain because I did my Erasmus semester in Barcelona and I really loved the spirit of the city and it’s just extremely cool and I’ve always wanted to go back there since I was there for this one semester. My dream was to just find a job there with this hotness and everything. Here, I know everything in Hungary, in Budapest, and it’s just not that exciting. I wanted something more, of course I’ve always wanted more. With this one now I think it’s kind of great.
Kirill: Yeah. Do you like the Spanish language?
Gabor: Yeah, of course. I actually learned Spanish so I know a couple of things. I’m not perfect but I know the basics and I really like it. I learned Catalan also because of Barcelona.
Kirill: That’s really cool. All right, what I was just going to say is that Gabor’s dream is to live in Spain and work in Spain and so on, and one of the things that we discussed during our catch up was that, you remember you said you were a bit disappointed that unfortunately the economy in Spain isn’t the best right now and it might be hard to find a job and so on. And I mentioned that you don’t really have to find a job in Spain, you can live in Spain but you can work as a freelancer through Upwork or through other websites, and it’s very exciting to hear that now you have a remote job. You just got a remote job where you are working from home and in my view, it’s like a step towards that goal and it’s very inspiring to hear that you are on that journey already.
Gabor: Thank you very much. Actually, it’s really exciting and I also feel like it’s kind of an improvement since we last talked.
Kirill: Awesome. So, you just moved from XAPT to Utopus Insights, tell us a bit about the work that you do. In what space of data science are you at the moment?
Gabor: I’m kind of a data scientist/analytics engineer. I’m involved in multiple projects that focus on forecasting the performance of renewable energy farms, like solar farms, wind farms, turbines and so on. That’s what I currently do. It involves a lot of statistical learning methods and a lot of mathematics also and a lot of engineering. I actually don’t have an electrical engineering industry background but with these people here, they help and I bring the data science knowledge also, so it’s kind of cool. I really like it.
Kirill: Okay. That’s pretty awesome. What does an analytics engineer do? I’ve never heard of that profession before.
Gabor: It’s kind of data scientist stuff also. It’s just about building the analytics platform, like in the databases and how to extract the data, how to put it into the analytics platforms and these things. Behind also I know the science stuff so that’s what it is. It’s just how they call us.
Kirill: Okay. It’s a mix of a data scientist, a database architect, that type? Like you do …
Gabor: Yeah. Kind of that. We’re working with a lot of software developers who actually do this backend stuff, the deep backend. But of course, I have to be involved in these things.
Kirill: Interesting. So, you work with wind turbines, what other forms of energy? Is it solar?
Gabor: Yeah, it’s solar and wind.
Kirill: Solar and wind. Out of curiosity, which one is the most efficient right now, out of the ones that you work with, not the world standards or the leading world ones. The ones that you work with, what do you find is more efficient, solar or wind?
Gabor: Actually, I’m not quite into that one yet so I don’t have too much insight on which one is better, but probably in a few months I could give some insights on this also.
Kirill: Okay, gotcha. All right, cool. When you say you do analytics for solar and wind, what exactly do you do? Do you calculate how much is consumed or do you calculate how much, the maintenance requirements, what part of that analytics are you involved in?
Gabor: Right now, I’m involved in validating machine learning models like forecasts and choosing the right metrics for evaluation, and in communicating with the other analysts, engineers, and software developers on what to improve and how, this kind of stuff. Of course, it involves a lot of research.
Kirill: What are you forecasting?
Gabor: We are forecasting the performance, the power of the wind turbines, like wind and solar panels, how much power they give.
Kirill: So how much energy we’ll have in the future?
Gabor: Yeah.
Kirill: That’s very interesting because like we all use energy, we all use electricity, and we all hear about solar and wind and so on, but I’ve never actually spoken to someone in this space. It’s good to have an example that even in these industries, you still need data science, you still need data scientists.
I was thinking originally maybe there’s very historical types of roles and types of calculation like scientists or engineers that are performing these estimates and forecasts but nevertheless, you are a data scientist who’s working in this space. And this is something new for you, right? When you were working at XAPT, were you doing the same thing or was your role related to something different?
Gabor: Well, it was a bit different, but there I was working with predictive algorithms, predictive maintenance analytics, which is kind of related because here we are also planning to do something like that. As you mentioned, you thought that it was scientists and these kinds of people who are doing these forecasts and these things. Well, we have a meteorologist on the team also, who is doing the weather forecast. And we have a lot of electrical engineers, and they have a vast scientific background in the field.
Kirill: Okay. In XAPT, were you working with energy as well or something else?
Gabor: No. Actually, there I was working on a project which was called predictive maintenance for heavy machines. It was kind of interesting, we were doing survival analysis there, predictive algorithms through R server. I was creating like these web APIs with R which was really cool, I really liked it.
Kirill: All right. We’ll get to that in a second. I just wanted to, again, stress for those listening that before … When did you start at Utopus, was this a few months ago?
Gabor: Yeah, it was a few months ago.
Kirill: Okay, so literally a few months ago, Gabor … How much knowledge did you have about solar and wind energy and their consumption and stuff like that? Were you an expert in that field?
Gabor: Not too much. You know, if you’re really interested in something, you just do the research, and that’s what I did before applying for these things.
Kirill: And that’s why it’s so exciting because, like two months ago or so you had no knowledge of that industry, or very little knowledge about what solar turbines, how they work, what their energy flow is, efficiency and so on and the same thing for … Sorry not solar turbines, solar farms and solar panels, and then wind turbines, and the same thing. But all you had was like your data science skills, your machine learning skills and so on, and you brought that, and now two months later, you’re in a completely new field, something very interesting. I think it’s a very inspiring example for those listening that if you’re interested in something, even as complex as solar energy, you can just go and become a data scientist there. If you’re interested in wind turbines, you can go and become a data scientist there, regardless of your background.
 What I’m getting to, is that somebody might think that you have to be an expert in solar to even be considered for a role in solar. No, you don’t. Like Gabor here is showing by example, you just have to be a data scientist, be confident in your skills, do some research, and then go there. I think it’s a good testament as well to the transferability of data science skills that you can go from one industry to another very quickly. Like in your case, from heavy machinery, which doesn’t have that much to do with solar in the first place, and you can just move to solar energy or wind turbines or whatever. So, basically, guys, dream big and wherever you want to work, whatever is your passion, you will be able to get in there quite quickly.
Gabor: Yeah, that’s true. You summed it up really well. Actually, the funny thing that I will go into in a bit is that before doing data science for heavy machinery, I was doing text analytics.
Kirill: Text analytics. That’s awesome. There you go, that’s a jump. And that’s when you were data analyst? What was the company called there?
Gabor: Yeah, it was Sykes.
Kirill: Sykes. Before XAPT you were at Sykes and it’s text analytics, that’s so cool. Such a big change from text analytics to working with heavy machinery to now working in the space of solar and stuff like that. You touched on a very interesting topic, I think we should expand on that more because I haven’t heard anybody on the podcast talk about it yet. Survival analysis. I’ve heard a little bit about it, I’ve read a bit about survival analysis. Could you give us an overview, what it’s all about and how does it work?
Gabor: Yeah. Actually, I wanted to talk about it of course because it’s really interesting. Did you know that it’s actually one of the oldest statistical disciplines? It has roots in demography and actuarial science, like economics, and it dates back to the 17th century. I did a lot of research on it because it’s so interesting. In the beginning, it was most importantly used in demographical analysis, like vital statistics, which deals with statistics on births, deaths, marriages, divorces and these kinds of things. Since then of course, it became widely used in other fields as well, like economics and failure analysis in mechanical systems, like what I did for heavy machines. It’s actually about analysing data where your outcome variable is the time until an event. For example, it can be death or marriage, or failure or something.
Kirill: Sorry to interrupt. So, in marriage, survival is how long you can survive until you get married?
[Laughter]
Gabor: Yeah.
Kirill: That’s so funny. Puts marriage in a bad light. But okay, gotcha. It’s just a term. I guess it comes from where it originated. It originated with like how long people live, like before they get sick, or before they die, or something like that.
Gabor: Yeah, that’s it. If you think about it, that the outcome is kind of like a continuous variable, but it’s not continuous because it’s time. It’s actually a generalized form of a high dimensional regression analysis and it’s really interesting. It’s really cool and this is the one thing, a good example that we were talking about a few minutes before. You can just apply it on anything.
Kirill: When you say the outcome is not continuous, what do you mean by that? Like it’s time, right, it can be …
Gabor: Yeah. It’s kind of like continuous. For example, if you want I can talk a bit more about it, of course.
Kirill: However you want to structure this. You’re the expert, just tell us about survival analysis. What do we need to know? What’s the most important fun stuff?
Gabor: Okay. Actually, in what I did, this time that you’re measuring, the time to event or the survival time, can be measured in whatever you want: days, weeks, years. It’s a continuous variable, let’s say.
 For example, if the event of interest is a failure, then the survival time can be the time in days or hours or even years until, for example, a machine develops a failure. It has a lot of interesting terms also, like censored and uncensored observations: there are two kinds of subsets of the data that you can deal with, the censored and the uncensored ones. For the censored ones, the event hasn’t happened yet, so you don’t have any observation of the event for that machine, and then it becomes hard to define the survival time at the end.
Kirill: Okay, and how do you go about it then?
Gabor: Yeah. Actually, there I incorporated some averages and other statistics where you don’t have the exact time. I can’t talk too much about these things.
Kirill: Okay. But let’s say in real life, would you use survival analysis if you’re testing some sort of medicine? You have a population of people and they’re … Or let’s say not even people. You have like this group of mice, you want to see if this medicine helps them live longer. Is that an example when you would use survival analysis?
Gabor: Yes. Actually, if you have a good number of observations, of course. In biostatistics, it’s really commonly used.
Kirill: All right. And so, what makes survival analysis stand out? Is it just the fact that we are counting backwards, we’re looking at how much time until these mice start unfortunately, dying, or is it something else? Is there a certain reason why survival analysis is so interesting, it has its own kind of domain?
Gabor: Unlike in ordinary regression models, the dependent variable in survival analysis is composed of two parts. One is the time to the event of interest and the other is the event status, which records whether the event of interest occurred or not. From this you can define the censored and uncensored observations, of course. Here you can estimate two functions that are dependent on time, the survival function and the hazard function. These two functions are the key concepts in survival analysis, describing the distribution of event times. The survival function gives, for every time, the probability of surviving, that is, of not experiencing the event up to that time. It starts at 1 and it’s a positive-valued, monotone decreasing function. So, when you’re going through time, you will get a score at every time point, let’s say, and as you’re going forward in time, the score, the probability that you will survive, will decrease. That’s why it starts at 1 and it’s a positive-valued, monotone decreasing function.
 On the other hand, the hazard function gives the current potential that the event will occur, per time unit, given that the individual has survived up to that specific time. It’s part of the survival function also, so it can change over time; for example, it increases as components age, so that’s the difference. It’s actually kind of the opposite: the survival function is decreasing and the hazard function is going upwards.
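[Editor’s sketch: the two-part outcome described here (a time plus an event status) can be illustrated with a Kaplan-Meier estimate of the survival function. This is a minimal sketch on made-up engine data, not the model used at XAPT; the function name `kaplan_meier` and all numbers are illustrative assumptions.]

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: time until failure, or until observation stopped
    observed:  True if the failure was seen, False if the record is censored
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # each machine leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0  # S(0) = 1: every machine is still running at time zero
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk machines that did not fail at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Failed and censored machines both drop out of the risk set.
        at_risk -= exits[t]
    return curve


# Hypothetical data: days in service for six engines.
days = [5, 8, 8, 12, 15, 20]
failed = [True, True, False, True, False, True]  # False = censored

curve = kaplan_meier(days, failed)
for t, s in curve:
    print(f"day {t:>2}: S(t) = {s:.3f}")
```

[The censored engines never contribute a failure, but they still count as “at risk” until the day observation stopped, which is how the estimate uses records where the event has not happened yet.]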
Kirill: They’re not like, one is not the complement of the other, it’s just that it’s normal that the older a person is or a machine is, the more likely that something will go wrong.
Gabor: Yeah, it just gives you a hazard score but it’s not the 1-minus this of the other function.
Kirill: Gotcha. So, just to recap, so you got the survival function which gives you a score. Give us an example of the machine, right, how would you apply that score in the case of heavy machinery?
Gabor: For example, if you have an engine that you start the engine and if it’s a brand-new engine, that of course at zero time your survival score will be 1 because you just started the engine and you probably think that it’s going to survive.
Kirill: Yeah, 100%.
Gabor: As you’re going through time, you’re going forward in time, this score will just decrease and of course it could decrease based on the features that you’re using, based on the correlation between those features and how they correlate with the output, of course with the time and the status and everything that’s there.
Kirill: And so, for example if your survival function gives you like a value of 0.6, 30 days after you started using the engine, that means there’s a 60% chance that it survives, is that right?
Gabor: Yeah. Exactly.
Kirill: Okay. That’s like for one engine, you can think of it as probability of survival. But in terms of like let’s say if you have 1,000 flowers, and then if on day 30 they have a 60% score or 0.6 score from the survival function, you can say that out of those 1,000 flowers, only 600 will survive up to day 30.
Gabor: Yeah, you can put it that way also.
Kirill: Okay, cool. And then with the hazard function, how does it work?
Gabor: Actually with that you will get a hazard score which, while that score is including the survival function of course, it’s an increasing function not like the survival function. So, then you will get a hazard score. For example, it’s almost like minus the … 1-minus the survival function, but it just incorporates. The survival function incorporates the hazard function inside of it.
Kirill: Okay makes sense. Like at the start you have an engine, it’s brand new, everything is okay, so the hazard will be like very low. But then if you go forward and go further and like 100 days later you have that same engine, your hazard score will be higher meaning that there’s a higher chance that it will break down.
Gabor: Yes, true. And of course, these scores are assigned based on the historical data that you are using, and the features …
Kirill: Okay, gotcha. I see now. So, the hazard function will take into account that let’s say, the engine won’t survive for another, like, one day or for another two days, is that right?
Gabor: Yes, it’s true.
Kirill: So it depends on the time that you want to evaluate. It’s already lived. It’s survived like 100 days, now you want to see will the engine survive another five days, then you’ll have one hazard score. But if you want to see will the engine survive another 50 days, then you’ll have a worse hazard score, because it’s less likely to survive longer.
Gabor: Yeah.
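[Editor’s sketch: one way to make the relationship between the two functions concrete is a parametric example. The code below assumes a Weibull failure model with made-up shape and scale values; nothing here comes from the project discussed in the episode. In this model the hazard is an instantaneous failure rate, the survival function is S(t) = exp(-H(t)) where H is the accumulated hazard, and the chance of lasting a further 5 or 50 days is the ratio S(age + extra) / S(age).]

```python
import math

SHAPE = 1.5    # assumed: a shape above 1 means the failure rate rises with age
SCALE = 200.0  # assumed characteristic life, in days


def hazard(t):
    """Instantaneous failure rate at age t (per day), Weibull form."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def survival(t):
    """Probability of still running at age t: exp(-cumulative hazard)."""
    return math.exp(-((t / SCALE) ** SHAPE))


def survive_extra(age, extra):
    """P(still running at age + extra | still running at age)."""
    return survival(age + extra) / survival(age)


p5 = survive_extra(100, 5)
p50 = survive_extra(100, 50)
print(f"hazard at day 10:  {hazard(10):.5f} per day")
print(f"hazard at day 100: {hazard(100):.5f} per day")
print(f"P(5 more days | alive at 100):  {p5:.3f}")
print(f"P(50 more days | alive at 100): {p50:.3f}")
```

[Under these assumptions the hazard grows with age while the survival function falls, and the hazard is a rate rather than 1 minus the survival probability.]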
Kirill: Okay, that’s pretty cool. There is a whole mathematical apparatus behind this survival theory, I think that’s why I found it very interesting in the first place. Very well defined and it can be applied to many different problems in life, like you say, machinery and other areas as well.
Gabor: Yeah, that’s true. I’m glad you like it also. It’s really interesting actually and fascinating how it can be applied for anything.
Kirill: Awesome. Well, guys, if you found this quick overview of survival analysis interesting, have a look into it, I think it’s a pretty cool area of data science which is good to at least know about.
 Okay, so you said you did survival analysis at your previous job, now you’re working with solar and wind, but before that you mentioned text analytics. What were you doing in the space of text analytics?
Gabor: Yeah. Well, actually I got that job because of my thesis work. At university, my thesis was based on social media analytics and it counted as one year of laboratory work. There I broadened my knowledge in natural language processing: text mining, and classifying the sentiment and emotions of reviews from social networking sites like Yelp, Foursquare and Twitter. I really enjoyed doing it. I did some parts of my work in Python, SPSS, and RapidMiner, but the heavy part, the computation and running the algorithms, was done in R. I don’t know, I just find it quite handy doing this in R.
 I got that job because of this project. I started working as a social media data analyst, where I had to rely on my knowledge of text analysis, keyword-based information extraction and so on, and I was analysing the sentiment of posts, comments, messages, and forum questions for tech companies: whether they were positive or negative mentions, or whether they were about a specific product, like a hardware or software related issue, etc. It was quite interesting. I still believe that businesses really can make a difference and improve themselves by gaining knowledge on their customers from social media. Because everybody’s posting and texting a lot of things, and if you analyse it well, you can really improve your business.
Of course, you know what, the hardest part was that it got harder and harder when we extended the regions. We were not just analysing the English content, and when you have European languages like Hungarian, Polish, Greek, and so on, that’s the heavy part. I looked into some of the best sentiment analytics tools but I haven’t found one that has great accuracy. So, in this case you will need the help of manual categorization by language professionals or someone who can do it for you. Actually, now that I think back, I really liked that job. I really liked text analytics, because maybe for all data scientists analysing texts is one of the most common things to like. There, I became a senior data analyst very quickly, as they saw that I knew what I was doing, and they trusted my insights, and during those two years of work I helped set up a team of multilingual international data analysts, like 5-10 people. It was really interesting.
Kirill: That’s so cool. What tool did you use for that, was it Python or something else?
Gabor: In some parts I was using R, in some parts Visual Basic and Excel, but most of it was done with … Salesforce has quite a good marketing tool for this sentiment analysis and kind of web scraping, it’s called Radian6. I’m not sure if they still call it that; at that time it was Radian6 by Salesforce. It’s a very powerful tool: you can collect a lot of things from blogs and technical forums, you just give it the URLs and so on, and you can collect any kind of information. It relies on Boolean keyword extraction, where you define what keywords you want to use with “and” or “or”. It was really cool. And you can download a lot of things. It can be connected even to your Facebook business pages or Twitter business pages, Google+, everything, and you can even see and analyse the inbox messages. It’s really cool. We were doing a customer service job. I wasn’t involved in answering the customers, but I was the one analysing and gathering the insights for our clients.
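[Editor’s sketch: Boolean keyword filtering of the kind described here can be illustrated in a few lines of Python. This is not Radian6’s query syntax or API; the function names, the keyword lists and the sample posts are all hypothetical.]

```python
import re


def tokens(text):
    """Lower-case the text and split it into a set of simple word tokens."""
    return set(re.findall(r"[a-z0-9+]+", text.lower()))


def matches(text, all_of=(), any_of=()):
    """True if text contains every keyword in all_of (AND)
    and, when any_of is given, at least one keyword from it (OR)."""
    words = tokens(text)
    has_all = all(k in words for k in all_of)
    has_any = not any_of or any(k in words for k in any_of)
    return has_all and has_any


# Hypothetical posts standing in for collected social media content.
posts = [
    "The camera on my new iPhone 7 is great",
    "Apple pie recipe with cinnamon",
    "iPhone 6 battery drains fast, Apple please fix",
    "My Android camera is better",
]

# Query: iphone AND (camera OR battery)
hits = [p for p in posts if matches(p, all_of=["iphone"], any_of=["camera", "battery"])]
for post in hits:
    print(post)
```

[Matching on whole tokens rather than substrings keeps a query for “apple” from matching unrelated words that merely contain it, though the pie recipe shows that a keyword alone still cannot tell the company from the fruit.]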
Kirill: All right. Can you walk us through this? Once you let’s say use this tool that you mentioned or some other tool and you connect it, what happens? It goes to the webpage, it finds a comment, it downloads it into a file, or into a database and then what? It restructures it, because all comments have different structure? Can you walk us through the process from start to finish, please?
Gabor: You can define what kind of platforms you want to make your search on and then you can define the keywords that you want to look for, for example if you’re looking for Apple, you can just put Apple and plus you can put iPhone and plus if you’re analysing the different models, you can put like, 4, 5, 6, 7, 8 and so on, and of course then you can for example add like the keyword for camera or hardware or something. Then, this tool will find all the posts, all the content on those sites that you were including in the beginning with these keywords and then you can download them via, like, Excel, CSV files, even XML, anything you want. You can probably even connect it to a database and just put all the data there. It actually kind of structures your data, of course the textual data will not be structured in the beginning but it can also, before downloading the data, you can just say that I want to see the positive and negative mentions also. And it has these prebuilt algorithms for finding the positive and the negative mentions based on, probably it’s also based on something like lexicon or something that’s behind it but it doesn’t really work with other languages, you can tweak it manually to work with it but it takes a lot of time. But with English it’s working quite well. Of course, with these things, you will always have the problem of sarcasm in everything, because you might find it’s negative but then it might be positive and your ratio will not be the one that you were looking for.
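The workflow Gabor outlines here, a Boolean keyword filter followed by lexicon-based positive/negative tagging, can be sketched roughly as below. This is only an illustration and not how Radian6 works internally: the word lists, the function names (`matches_query`, `lexicon_sentiment`) and the sample posts are all made up for the example.

```python
import re

# Hypothetical, tiny sentiment lexicons; real tools use far larger,
# language-specific word lists.
POSITIVE = {"great", "love", "good", "excellent"}
NEGATIVE = {"bad", "broken", "terrible", "hate"}


def tokenize(text):
    """Lowercase a post and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_query(text, all_of=(), any_of=()):
    """Boolean keyword filter: every `all_of` term AND at least one `any_of` term."""
    tokens = set(tokenize(text))
    has_all = all(term in tokens for term in all_of)
    has_any = not any_of or any(term in tokens for term in any_of)
    return has_all and has_any


def lexicon_sentiment(text):
    """Label a post by counting positive versus negative lexicon hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for collected content.
posts = [
    "The iPhone 7 camera is great, I love it",
    "My iPhone 6 hardware is broken and support was terrible",
    "Apple pie recipe with good cinnamon",
    "Oh great, my iPhone camera died again",  # sarcastic complaint
]

# Query: "iphone" AND ("camera" OR "hardware")
selected = [
    p for p in posts
    if matches_query(p, all_of=("iphone",), any_of=("camera", "hardware"))
]
labels = [lexicon_sentiment(p) for p in selected]
```

The last post shows the sarcasm problem: a pure word-count approach sees “great” and tags the complaint as positive, which skews the positive/negative ratio.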
Kirill: Okay, gotcha. So, how do you deal with the fact that some comments might be very long, some comments might be very short data. Does it matter or not at all?
Gabor: It doesn’t really matter. I was doing a lot of text mining and analytics with textual data and one of the easiest things to do is like build a term document matrix where it contains the frequency of the words, that’s in one of the documents. For example, one document can be just one line of text or one sentence. You can just structure it really well and it doesn’t matter how long it is. Of course, it will probably take more time to do it but it doesn’t really matter.
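A term-document matrix of the kind Gabor mentions can be built in a few lines. The sketch below is a minimal version using only the Python standard library; the function name and the sample documents are assumptions for illustration, and real text mining packages add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    # One word-frequency counter per document, regardless of document length.
    counts = [Counter(re.findall(r"[a-z0-9']+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Made-up documents: a short one, a longer one, and a single word.
docs = [
    "Great camera",
    "The camera is great, really great camera on this phone",
    "Battery",
]
vocabulary, matrix = term_document_matrix(docs)
camera_row = matrix[vocabulary.index("camera")]
```

Every document becomes one column of the same height, which is why the length of the original text does not matter for the structure, only for the processing time.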
Kirill: Okay, gotcha. How long would you say it would take somebody who has never used text analytics before to get into this field and be able to form their first text analytics?
Gabor: Well, I would probably say that not too much because really, if you’re interested in it just go put like, text mining in Python or R, there are a lot of packages that you can just download. It’s really easy to use. Once when I started getting into it deeper during the university, I was using R because it had a text mining package, and I was using Python also for web scraping, and actually once I created, through a web API of Twitter, I was able to easily download a lot of posts, like an automated job. It downloaded my posts during the night and then I was analysing them during the day. It’s really cool, it’s really easy. If you guys out there are really interested, just take a look at it.
Kirill: Awesome. Thanks a lot for sharing those insights into text analytics. It sounds like a very exciting space for people to get into. You are right, a lot of companies will need to do more and more of that because there is so much unstructured data floating around and now companies are just getting good at using their structured data, the ones that are using it, and the next frontier for competitiveness is unstructured data, and part of that is text analytics.
 All right, you’ve done quite a lot of different stuff. You’ve done text analytics, you’ve done survival analysis, building some forecasting predictive models. What would you say is the most exciting for you and also what are you looking forward to learning the most? What is the next type of data science that you can’t wait to get your hands on?
Gabor: Well, probably as I’m working in the energy industry now, I will go deeper into energy forecasting and how we can use all the other features from weather insights like for example we can use even the precipitation to forecast wind power, on this kind of things. I would like to go into details with, like, what else to use in these algorithms.
Kirill: Okay. That’s fair enough and sounds like a big area of data science that I didn’t even know existed until today. What I wanted to ask you is you said that you were doing a thesis on text analytics, that was part of your research and that’s how you got your job. What exactly did you study at university?
Gabor: Well, that’s a funny job actually. It might take a bit longer to tell you but actually I graduated as a Business Informatics Engineer but it has kind of a longer story how I got there.
Kirill: All right, tell us.
Gabor: My professional career started with an internship at banking, where I worked under the supervision of seniors like business consultants dealing with IT demand management for business departments and I got to handle smaller and larger Excel sheets for administrative purposes. Seriously, I remember my first day at work and my boss gave me an Excel sheet containing the project backlog and I was just so afraid to touch it.
[laughter]
 It was during my bachelor’s but I was like, wow I have to do something with this. But you know, it was that job that pushed me into analytics. At that time, I had a conversation with my boss about staying there full-time after my bachelor’s and she said that, “Yeah, I would like to hire someone with a master’s,” but at that time I didn’t think I would want to have one. And the funny part here is that a few days later, I actually had a dream where I was talking to my supervisor at the university, I was handing in some papers to him when he looked at me and he said, “Hey Gabor, I think you made a mistake here because these papers are for bachelor’s and you need the application for the master’s.” I woke up, I immediately checked the application deadline for master’s and it was a week from that day and it was my birthday.
I was like, okay, that’s a sign. And I got into one of the best universities in Hungary for master’s and my specialization here was like Business Intelligence and Analytics, and it was mainly about data mining, customer analytics, going deeper into machine learning algorithms. I also remember that I was even building and calculating decision trees and neural networks on paper for smaller data sets.
Kirill: That’s so cool. You got into data science because of a dream, that’s so awesome.
Gabor: Yeah, kind of. If nothing then that was a sign for sure.
Kirill: Why did you choose data science for master’s even though you studied something else as a bachelor’s?
Gabor: I was studying bachelor’s as Business Information Systems and there I got to study some decision science, business intelligence and then it sounded really cool to study business intelligence and that’s what the advertisement was saying for this specialization because it was in English and it said, “Business Intelligence and Analytics” and I really wanted to do some analytics jobs also in the future. And it happened and I’m really glad that I did it because I’m very happy now.
Kirill: That’s cool. What I also wanted to ask you is, you said your thesis helped you get your job. How did you use your project, your research at uni, how did you use that to show to employers? Did they accidentally find it themselves, or were you proactively sending it out to people? How did that happen?
Gabor: No. I had a friend in university and her aunt or someone worked at the company, at Sykes, and she was saying that it would be nice if you could go to an interview there because they’re looking for someone with what we are doing at the university and my thesis was about it. Then I went there and I was talking about my project and at that time in Hungary, they didn’t see anyone else before who was doing exactly that what they needed.
Kirill: So it was very easy for you to get it then, if you were the only one in the whole country.
Gabor: Probably not the only one in the country but who they were checking.
Kirill: Who they had seen before. Okay, that’s a very interesting story. All right, so I’ve got a few rapid-fire questions for you which I would like to pose, are you ready for these?
Gabor: Yeah, sure.
Kirill: Okay. What’s the biggest challenge you’ve ever had as a data scientist?
Gabor: It’s a hard one. Actually, we haven’t talked too much about feature engineering and data wrangling until now. But they are kind of one of the most important and most stressful parts of being a data scientist, when you have to transform and map the data from one format to another or to create a long data structure from wide data or wide spatial. Or when you have to gather or spread the data based on some specific key value pairs. To do it well it can be really stressful and time consuming sometimes. That’s the biggest challenge that I have.
I think the biggest challenge, generally, being a data scientist is preparing the data in the right format to be used as an input for the machine learning algorithms, because this is what they don’t teach too deep in the universities. It’s just easy to use some premier tidy data that they give you in the university but in real-life scenarios, it’s really not the case. You have to be careful how to make that input data that you will feed your algorithms with.
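The wide-to-long and long-to-wide reshaping Gabor refers to (“gather” and “spread” in R’s tidyr vocabulary) can be illustrated with a small standard-library sketch. The column names and the failure counts are invented for the example, and in practice a data frame library would normally do this work.

```python
def gather(rows, id_column, key_name="variable", value_name="value"):
    """Wide -> long: emit one output row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {id_column: row[id_column], key_name: column, value_name: value}
                )
    return long_rows


def spread(long_rows, id_column, key_name="variable", value_name="value"):
    """Long -> wide: collect the key/value pairs of each id back into one row."""
    wide = {}
    for row in long_rows:
        record = wide.setdefault(row[id_column], {id_column: row[id_column]})
        record[row[key_name]] = row[value_name]
    return list(wide.values())


# Made-up wide table: one row per machine, one column per month.
wide_rows = [
    {"machine": "A", "jan": 3, "feb": 5},
    {"machine": "B", "jan": 1, "feb": 0},
]

long_rows = gather(wide_rows, "machine", key_name="month", value_name="failures")
round_trip = spread(long_rows, "machine", key_name="month", value_name="failures")
```

Here `long_rows` holds one record per machine and month, the shape many modelling functions expect, and `spread` reverses the operation.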
Kirill: Yeah. What I also find is it’s really hard to find courses or even books that can prepare you for that because it’s hard to come up with a data set that’s dirty intentionally, right? You commonly find them, you come across them during your work and you get caught by surprise, but when you want to find one, or when you want to create one yourself, it’s not an easy task for exercise purposes. I think that’s why a lot of courses miss out on that side of things.
Gabor: Yeah, I think so.
Kirill: Okay. Thanks for your answer to that question and now let’s move on to the next one. What is a recent win you can share with us that you’ve had in your role? Something that you can disclose, I know that there’s certain things that you cannot talk about, classified things maybe, but let’s talk about something that you can share, something you’re proud of that you’ve achieved or accomplished.
Gabor: Back in my previous job when I was doing survival analysis, I actually got a hold of some historical GPS data of the machines, like coming from sensors and IoT devices, and I analysed if there is any correlation between the machine’s location and the failures or survival time, and it showed that actually there was. The survival time in one region differed from the survival time in another.
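One common way to compare survival times between regions is the Kaplan-Meier estimator. The episode does not say which method Gabor used, so the sketch below is only an assumption about what such a comparison could look like; the region names and the records are invented.

```python
from collections import defaultdict


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one failure.

    durations: time until failure, or until observation stopped
    observed:  True if the failure was seen, False if the record is censored
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        time = events[i][0]
        failures = 0
        leaving = 0
        # Group all records that share the same time.
        while i < len(events) and events[i][0] == time:
            failures += events[i][1]
            leaving += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((time, survival))
        # Failed and censored machines both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical records: (region type, days in service, failure observed?)
records = [
    ("rocky", 30, True), ("rocky", 45, True), ("rocky", 60, False), ("rocky", 80, True),
    ("urban", 90, True), ("urban", 120, False), ("urban", 150, True), ("urban", 200, False),
]

by_region = defaultdict(lambda: ([], []))
for region, days, failed in records:
    by_region[region][0].append(days)
    by_region[region][1].append(failed)

curves = {region: kaplan_meier(d, o) for region, (d, o) in by_region.items()}
```

In this invented data the “rocky” curve drops much earlier than the “urban” one, which is the kind of regional difference described above.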
Kirill: Wow, that’s so interesting.
Gabor: Yeah, and I don’t need to say that I got so happy because I got to analyse more and of course I had some discussions with the business side, the industry side, and I started extracting and gathering terrain data, like land cover from raster files and also data from weather stations to get more features to use in the survival analysis. At the end, I got better results with these new features, so I included them in the solutions. I think it was quite a win.
Kirill: That’s really cool. Did you ever get to the reason behind why in a certain region the survival was lower?
Gabor: Yeah, actually if you think about the US like for example if you just think about one state like if there are, like, rocky mountains, of course your machine will fail more likely than when your machine is working on an urban area. If you think about it it’s really simple and it shows in the research also, very cool.
Kirill: That’s cool. And so were these machines constantly stationed in those separate regions or were these machines travelling across different regions?
Gabor: During two failures or services, or between those services, they were kind of in the same type of location.
Kirill: So it was just generally the case that the machines in this area are more likely to fail than machines in this area so we need to service them more often.
Gabor: Yeah, that’s right.
Kirill: Very interesting. And so nobody knew that before you did that analysis?
Gabor: Well, not in that solution.
Kirill: Not in an analytical way. That’s really cool. Awesome. It’s always fun, I find that, to relate back to the reality of things or the business knowledge. That you find something in the data and then you look at the locations and you’re like, oh that does make sense, it is more rocky mountains here and it is an urban area here. It’s good when the real life confirms what you see in the data, isn’t it?
Gabor: Yeah, that’s true.
Kirill: Awesome, that’s a big win. What is your one most favourite thing about being a data scientist, except for cleaning data of course?
Gabor: Yeah, that’s the best part. Well, to learn new approaches, to discuss your results with other professionals. The visualization part is one of my favourites. Currently I’m using R, R ggplot package for visualization, but I’ve also used Tableau and Power BI where you can just import any kind of data, you can just create nice plots. Another favourite thing is that there is always something new to do, to make research about, to try them, to implement them.
Kirill: Yeah, that’s really cool. I like that as well. It’s constantly growing, there’s always things that you’ve got to be up to date with to see what’s happening and you can also contribute to the new areas of data science, so that’s really cool.
 From all the different various experience you’ve seen and the types of work even, like you’ve had for work in the office, you’re now working remotely and so on, where do you think the field of data science is going and what should our listeners look into to prepare for the future?
Gabor: Everything. Actually, this is a good one. I’ve recently been to an AI summit in Vienna and there was this inspiring talk of Sepp Hochreiter if you know him. He’s a professor at Johannes Kepler University.
Kirill: That’s so cool that you got to hear him. He created the LSTMs, long short-term memory for recurrent neural networks in deep learning.
Gabor: Yeah. He asked this question, who of you are still going to human doctors? And everyone raised their hands. Then he asked, who is going to AIs, then of course no one put their hands up, and he just concluded that, “Well, this will change; Don’t trust human doctors for diagnostics because machines are better.” Surely, he was just making a joke but if you think about deep learning and image processing, and of course you have this course on Udemy. You know machines can just better classify for example skin cancer, because it can learn from millions and millions of images, whereas doctors might have just thousands of patients during their practice. So, I think this is really interesting to see how far machine learning and AI and data science will go in the near future.
Kirill: That’s so cool that you bring that up because I totally agree. There’s an app, you can actually download it. It’s called Skin Vision I think. There’s another one, M-I-I Skin, Miiskin. But Skin Vision I think I’ve heard of before, and basically … Anyway, there’s an app you can get on App Store, I’m not sure about the name but you can get it on App Store, for iPhone, maybe for Android, and it does exactly that. You take a photo of something on your skin you’re worried about, and based on thousands and millions of images that its algorithm, its deep learning algorithm has been through, it can help understand. Whatever algorithm’s in the background, can help understand if that’s skin cancer or not and they actually tested it against doctors and it performed as well as several professional skin doctors in the world and so, yeah, exactly what you’re saying. We’re getting into an age where it’s going to be AI mostly.
Gabor: Yeah, I totally agree.
Kirill: Awesome. Thank you so much for coming on this show and sharing your story and insights. If our listeners would like to get in touch or follow you or maybe ask questions or see how your career goes further, where is the best place to get in touch with you?
Gabor: I’m on LinkedIn, so just feel free to send me a message or request if you have some questions or you want to learn more, or I can learn more from you, then definitely send me a request.
Kirill: Okay, awesome. We’ll share your LinkedIn on the show notes. I have one final question for you. What is your one favourite book that you’d like to share with our listeners today?
Gabor: It’s really not easy to say just one book. It’s hard but since I’m still really in love with text mining and I talked a lot about it, and how much I really love it and its applications, I will recommend the Introduction to Information Retrieval by Manning. It has topics on how to build web search engines, how they work, also areas on text classification and clustering, indexing, ranking, and so on. I can only recommend it for those who are interested in text analysis, it’s really a great book. Next to it you can just create your own stuff on R, Python, anything really.
Kirill: Okay. That’s awesome. Well, there you go, Introduction to Information Retrieval by Manning. And if there’s a book you’d like to recommend to people who are not interested in text analytics, does anything come to mind?
Gabor: Yeah. That was the first book that I got a hold of when I was studying data mining during the university, it’s called Introduction to Data Mining. I will send you the link.
Kirill: Okay, sounds good. So, our second book is Introduction to Data Mining.
 Well, there we go. Once again, thank you so much, Gabor, for coming on this show and sharing all the insights. I really appreciate you taking time and good luck with the remote work and can’t wait to see how your career goes from here.
Gabor: Thank you very much.
Kirill: So, there you have it. That is Gabor Solymosi and his story and how he’s moved through the space of data science. Hope you find that inspiring and it’s hopefully going to give you some ideas of how you can structure your path better. Personally, what I found most inspiring was of course how Gabor, since our catch up in Europe, has moved to a remote data science role, he’s working from home now. That is very in line with his passion of being free and being able to do what he loves and at the same time being able to do it from wherever he wants and on his terms. He is a very talented data scientist and I’m sure that by being able to control his own time as he desires, he is bringing even more value to the organisation that he’s working for and I’m super excited that they have created this environment for him. I can’t wait to hear how his story will progress, I’m very excited to hear how it will progress over the next couple of months or next year or so.
 Make sure to connect with Gabor on LinkedIn and follow along his career path and see where it takes him. You can find the URL to his LinkedIn at the show notes at www.superdatascience.com/107. There you will also find the transcript for this episode and any other materials that we mentioned during the show.
 On that note, thank you so much for being here today. Can’t wait to see you back here next time, and until then, happy analysing.
[Background music plays]