SDS 348: History of Data Science – Part 4

Podcast Guest: Kirill Eremenko

March 13, 2020

Welcome back to the FiveMinuteFriday episode of the SuperDataScience Podcast! 

Today we’re back to our saga of the history of data science.  
After 2015, more opportunities for data scientists arose. In fact, data scientists were getting hired too quickly, into jobs with an unrealistically broad range of requirements. That was a consequence of managers not yet understanding what data science is. Thankfully, David Donoho clarified what a data scientist is by rejecting three misleading definitions: it’s not just about big data, it’s not defined by computing skills, and it’s not all statistics and analytics. 
At the start of 2016, the most in-demand skills for a data scientist were SQL and Hadoop, but Python was already on the rise. As of early 2020, it’s the most important language for data science. If the trend continues, it’s set to overtake C and Java to become the most popular programming language in the world by around 2023. Back in 2017, a shift towards AI began to emerge, and with it came many conversations on ethics and the dangers of AI. At the same time, artificial intelligence drove progress in customer service and health care. 
Data science is no longer just about crunching numbers and producing code. It is also about ethics and the moral obligation we have to be part of the conversation around machine learning and artificial intelligence. The need to explain the insights found from this work became clear quickly. Even simple classification and clustering methods need to be free of bias, free of unethical use of data, and free of unethical data collection. Companies are no longer looking for a data science unicorn. Education on data science has progressed enough to let them look for data scientists more intelligently. This means there are more opportunities to specialize, as jobs grow more tailored to specific passions. 
To sum up so far? Well, in 1925, about 46 years after the light bulb was invented, half the homes in the U.S. had electricity. Now we all use it. As Andrew Ng says, “AI is the new electricity.” About 20 years from now, probably 95% of companies will be using AI, because the rest will be wiped out. This is not the peak, it’s the beginning. 
DID YOU ENJOY THE PODCAST? 
  • How can you continue to promote ethical uses of data science to continue the trend that started in 2017?
  • Music Credit: Uplifting [Album Mix] by NCS

Podcast Transcript

This is FiveMinuteFriday, the History of Data Science, episode 4. 

Welcome back to the SuperDataScience podcast everybody, super excited to have you back here on board. In today’s episode, we are continuing our saga into the history of data science. So, in the previous three episodes we covered off all the history of data science up to 2015. 
Today we’re going to do these most recent five years, 2015 to 2020 and let’s quickly recap on what we left off last time.
So, in the period preceding 2015, so in 2010 to 2015, one of the major things that happened was an article published by DJ Patil and Tom Davenport in Harvard Business Review called “Data Scientist: The Sexiest Job of the 21st Century.” 
And it really showed the world where this is all going. More and more companies were realizing that data science is turning from the next big thing to the thing, and more and more organizations started jumping on board with data science. So, that is a great starting point for 2015, what happened after that? 
Well as we know data scientists were getting more and more offers. There were more and more opportunities for data scientists to get jobs and they were getting hired rapidly, in fact, too rapidly. Why is that? Well, because we have all seen or perhaps have seen these job descriptions where you need to know all the possible tools under the sun from R to Python, from Tableau to SQL, to Hadoop, to absolutely everything and also you need to have a PhD and 10 years of work experience in the field. 
How can you have 10 years of work experience in a field that has only been formalized for about eight years or so? And so that is a consequence not of any flaws in data science, but rather of not understanding, from an organizational or managerial standpoint, what data science is and how it can be used, and what a specific organization requires in order to succeed with data science. And there is nobody to blame here. The reason is just that data science hadn’t been around for long enough. It’s not like accounting or a very specific field where you know what you want and therefore you know who you need to hire. Here managers were doing their best and just putting out all the requirements they could possibly get their head around, hoping that they would hire the right people. 
So, that is how we started out in 2015, but thankfully there is a professor, David Donoho, who did some great work, actually in 2015, to clarify what a data scientist is. Just quickly, who is David Donoho? Well, David Donoho is known for his work on the construction of low-dimensional representations for high-dimensional data problems. Among other things, he’s also received the COPSS prize, which is the Committee of Presidents of Statistical Societies award. Only one person per year gets that prize. He received it in, I think it was 1994. By the way, if you’ve been following the podcast, you will know that in 2019 that prize was given to Hadley Wickham, who’s been on the show. How amazing is that? So, now we’re talking about another person who received this prize, much earlier of course, and he’s also received the Shaw Prize, which is considered to be the Nobel Prize of the East. 
So, this is a very highly acclaimed individual. And by the way, do you know who his undergraduate supervisor was? Can you take a guess? We’ve talked about this person before. So, he did his undergraduate at Princeton University and his supervisor was none other than John Tukey. So, exactly the same Tukey who we talked about before, who came up with exploratory data analysis or EDA, box plots, the Tukey Test and many other things. I found it very interesting that a great teacher created a great student, and this is not the first time in history. For example, a classic famous example is that Aristotle was Plato’s student, and both are famous philosophers. So, in a nutshell, you could potentially use this as a shortcut to your dreams. So, why not learn from someone who’s done it before, someone who’s already accomplished or is already on the path or spearheading the path that you want to follow, in the industry or area you want to be in? Why not follow them? Have them as your mentor, take their guidance, and it looks like it works. Looks like great teachers create great students. So, be a student of a great teacher. 
Anyway, back to our episode, David Donoho in 2015 attempted to clarify what a data scientist does through a method of the opposite, by rejecting three things that a data scientist actually doesn’t do. Three simplistic and misleading definitions of data science. So, first of all, first one is that data science is not just about big data. The second one is that it’s not defined by computing skills. And the third one, it’s not all statistics and analytics. So, we’ll include the links on the show notes if you’d like to read about more, but in general, let’s recap. It’s not just about big data, it’s not defined by computing skills and it’s not all statistics and analytics. And that coincided with data science becoming named the best job in America in 2016, this was done in a report by Glassdoor, again you can check out the links in the show notes, but yes. How cool is that? 
So, David Donoho made this first attempt to clarify what data science is by saying what it’s not. How cool is that? And this new understanding of data science fits well with why it was named the best job in 2016. It wasn’t just because of big data, it wasn’t just because of computing skills, and it wasn’t just because of statistics and analytics. Data science is so varied, it’s so welcoming, it’s so inclusive of different backgrounds. 
And I think these two go very hand in hand and it’s great that this clarification of what data science is or actually what data science is not was done around the same time when it was the best job, it helps us understand better what exactly was the best job in America in 2016.
Also in these past five years: somewhere around the start of 2016, the most in-demand skills for data science were SQL and Hadoop, but at the same time Python, which was first released in 1991, was quickly rising in popularity, and in early 2020 it’s easily the most important language for data science, followed by R. So, you can see how things change. Back at the start of 2015, and I remember working in the industry back then, it was all about Hadoop, Hadoop, Hadoop. 
Everybody wanted a data lake. Everybody wanted their own big data system, not even on the cloud, they actually wanted to develop it in-house. One company I was working with was about to invest $20 million to build an in-house cluster for a big data system, and all these consultants were coming in and out of the organization. Right now, if you ask an organization, it’s very debatable: do you want it in-house or do you want it on the cloud? And more importantly, as we’ve come to understand, big data is not really the answer to all the data problems. Machine learning and Python are right now way more popular and in demand in the space of data science than SQL and Hadoop, even though those are still important. 
From my perspective and from what I’m seeing, Python and machine learning and R are way more popular, and that wasn’t the case back in early 2016, but things have changed. And moreover, if we see the current trend continuing for Python, it’s actually set to overtake both C and Java and become the most popular programming language overall in the world. And that should happen around 2023, there’s a prediction for you. Python is becoming the most popular language because it’s so versatile. You can actually do development in Python, not just data science, and lots of other things, of course. 
Next, speaking of big data: big data gave way not just to machine learning and data science, but to artificial intelligence taking center stage in the technology hype around 2017. So 2017 is where that shift happened, and it came along with its own new conversations, such as prospects and fears about job loss, hostile robots, AI ethics, AI interpretability and many other things in that space.
So, those were the new conversations from 2017 onwards, kind of like in that era. At the same time, however, data science and machine learning made great progress in the space of actually helping humans. 
For instance, chat bots right now are reducing waiting times and improving customer service across the board, across virtually all industries, helping both companies be more efficient and consumers get what they want faster. There are even visible advantages in healthcare. For instance, predicting the spread of contagious diseases started making headlines around the Ebola epidemic, which was a few years ago. And as we saw recently, this technology has been developing rapidly: a Canadian AI health monitoring platform actually made the claim that it predicted the outbreak of the coronavirus almost a week before the world’s leading traditional health organizations. So, that is a clear example that not only are there concerns about data science and artificial intelligence, but there are also massive benefits. 
And now we’re trying to better understand what the balance between the two is. And it’s a very important consideration for people in their careers in the space of data science and AI: not only to be able to code AI, and moreover to explain the results and findings of artificial intelligence, but also to be up to date with the ethical considerations of artificial intelligence. It’s no longer just about crunching numbers and being very good at coding. As you become better and better as a data scientist, a machine learning engineer and an AI professional, you need to be more and more involved in the ethical considerations, because companies need that. Companies want that. Companies will ask for that, and you have a moral obligation to be involved in those areas as well. 
So, as you can see, data science, AI and machine learning have become a very interesting field.
It’s not just isolated to coding and getting the insights. You also need to be able to explain the insights, and that was very clear, something that we spoke about previously, I believe, in this series of episodes: that it’s important to be able to communicate those insights. But now it’s becoming even more than that. You need to have ethical considerations in mind. Even if you’re not doing complex artificial intelligence that might destroy the world, but you’re doing simple classification methods or clustering or whatever else, you still need to understand: is there bias in your data? How are you removing that bias? How is your algorithm using other people’s data? Is it ethical, is it not? All those considerations need to be taken into account, so it’s a very, very interesting and diverse field right now. And another exciting development is that finally, over these past couple of years, we’ve seen that companies are no longer looking for the data science unicorn. 
There’s been enough education for data scientists and managers of companies. Oh, by the way, I’m going to do a little plug here: 2015 was the year when SuperDataScience launched its first course. That was around May, June 2015, and the first course was actually on Tableau. How exciting is that? Woo hoo! Speaking of education and data science. But anyway, what I was saying is that there’s been enough education for a while now on this topic of data science, not just for data scientists but also for managers and leaders of organizations, and people have had enough time to make mistakes and learn from them, so companies have realized that there’s no point in looking for a data science unicorn. Companies are now understanding in a more structured way what exactly they need and what people can supply. 
Basically, what kind of professionals they need to solve certain problems in their organizations. What kind of business intelligence analysts, visualization experts, machine learning engineers, data scientists, AI experts, who exactly they need, with what kind of skills, what kind of background, what problems they’re solving, and that is great. That means that there are more opportunities to specialize in what you want. So, whatever you want to specialize in, whether you want to be a machine learning engineer, a visualization expert, a business intelligence analyst or an AI expert, you can always find a job. There are jobs out there that are tailored to your passion, and that is great because now we don’t have to have those 10 years of work experience and PhDs and everything else. Companies know better what they need and we know better what we’re good at. 
And to wrap up, so what can we say about data science?
Well, Andrew Ng has a very cool quote, “AI is the new electricity.” So, I did a bit of digging and the light bulb was invented in 1879, Thomas Edison or Nikola Tesla, that’s up to you to decide who made the progress on that. But anyway, light bulb invented 1879. I’m skipping ahead about 50 years, so in 1925, this is an important date. In 1925, so a hundred years ago, less than a hundred years ago, half of the homes in the U.S. had electric power. This is just in the U.S. about half of the homes had electric power. Right? 
So, now let’s think about today. How many homes do we know, what percentage do you think don’t have electric power? Probably like 0.0001% or something along those lines. And how many businesses don’t use electricity? So, back then, 1925, half the homes had electric power. Probably not many businesses, definitely not all businesses were using electricity.
Can you think of many businesses these days that don’t use electricity? Pretty much every single business, even if it’s like a farm somewhere in the middle of nowhere with no industrial component to it, still uses electricity for lighting, for heat and things like that. So, now pretty much every single business, probably 99.9999% of businesses, is using electricity, and it’s only been a hundred years. 
So, same thing with AI, but it’s going to be much faster. So, you can see that AI, data science and machine learning, every business has data. We all generate data. Humans generate data, businesses generate data, and using that data to optimize efficiency and cut costs, improve customer service is a huge competitive advantage. So, more and more companies are going to jump on board with data science, AI, machine learning and it’s not going to take a hundred years. It’s going to take 10, 20 years. 
By 20 years from now, probably 95% of businesses are going to be using machine learning, AI, data science. Why? Because all the other ones will be wiped out. If you’re not going to be keeping up to speed with these technologies, you are going to fail as a business because your competitors are just going to have much more competitive offerings. So, businesses are going to get up to speed with AI very fast. It’s not going to take a hundred years. And the interesting thing is, so we can already see a lot of businesses getting on board and we can see that it’s very clear where the paths are going, how data science can be used in marketing and medicine, and operations, optimization, and things like that. 
However, the interesting thing is that this is not the peak, this is actually just the beginning.
It’s tempting to look at everything we know, everything we have achieved, and claim that it’s some kind of peak, and I’ve heard these comments about data science that it’s actually going to drop from now, the hype is over, it’s going to die down and so on. Maybe the hype is over, in the sense that the hype actually created those situations where people were hiring data scientists with 10 years of experience and PhDs. That hype is not something that we needed, but the point is data science is going to continue.
People have fallen into this trap before of saying that something has reached its peak. A great example is Lord Kelvin, who was a genius physicist of his age. He claimed that heavier-than-air flying machines are impossible, and this is the same Kelvin who helped formulate the first and second laws of thermodynamics. He estimated quite accurately the absolute zero of temperature, minus 273.15 degrees Celsius or minus 459.67 degrees Fahrenheit, and the Kelvin scale is even named in his honor. The same person said that heavier-than-air flying machines are impossible. And less than 10 years after that, the Wright brothers proved him wrong and built their first airplane. 
So, something to keep in mind: when people tell you that data science is over, it’s not over. It’s here to stay. It’s actually here to grow and create more amazing things for the world. 
So, in a nutshell, my advice to you is this, look to the past not for aspiration, but for inspiration. Look to the present not for what can be done, but for what could be done. Never stop questioning, never stop learning, and never stop developing. The future is just around the corner. 
So, I’ll leave you with that thought to ponder and invite you to join me here next week for the final episode in this series where we will take a peek into the future of data science and artificial intelligence. And until then, happy analyzing. 