SDS 340: History of Data Science – Part 1

Podcast Guest: Kirill Eremenko

February 14, 2020

Welcome back to the FiveMinuteFriday episode of the SuperDataScience Podcast!

Today is a super special episode that begins a series on a specific topic: the history of data science.
Data science is an exciting and growing field in 2020, so we thought it would be good to take some time to appreciate the history of data science. This first episode will start in the 1950s and head to 2000.
We’re going to start even earlier to kick things off, all the way back to the 9th century. Why did humans get into statistics and archiving data? The first records we have of this practice are from Arab mathematicians in the 800s A.D. This practice came in when civilizations grew to a certain size and needed ways to record data and taxes of citizens. The next big massive development in mathematics was the 18th century when calculus was invented by Isaac Newton. Analytics started being used in business as early as the 19th century. But the biggest breakthrough came when computers were introduced as a way to advance analytics.
So, the 1950s. We can say modern data science began in 1962 with John Tukey. This is when the idea of data analysis as a science in its own right was first argued. In 1977 he published Exploratory Data Analysis, which is where the term EDA actually comes from. It also argues for more emphasis on using data to suggest and test hypotheses. Around the same time, in 1974, Peter Naur published a survey of data processing methods, showing how far the field was branching out.
What happens next? The International Association for Statistical Computing is founded to help turn data into information and knowledge. Then in 1989, Gregory Piatetsky-Shapiro organizes the first Knowledge Discovery in Databases workshop. The 90s were a boom period for data science: articles began to be published about businesses using data to predict customer habits and desires. We also saw some of the first written criticism of bad data practices. Online courses did not appear until later, but Professor Jeff Wu, who received the COPSS Presidents’ Award in 1987, called for statistics to be formally renamed data science.
We can see the seeds of today’s debates. We’ve gone from dangerous data science to malicious data science with deep fakes. We still debate about the proper use of “data is” or “data are” as well as the role statisticians play in the world of data science.
DID YOU ENJOY THE PODCAST?
  • How can you see the history of data science still playing in our topics and debates today?
  • Music Credit: Wonderland by JJD & Zyphen [AirwaveMusic Release]

Podcast Transcript

This is FiveMinuteFriday, The History of Data Science, part 1.

Hello, and welcome back to SuperDataScience Podcast. Super excited to have you back on the show because we’ve got something super special prepared for you. Don’t you love these little surprises we come up with occasionally?
So, have you ever listened to a series of episodes about some specific topic? Personally, I’m listening or I started listening to a series of episodes about the history of Rome. I started about a few months ago. I really need to get back into it. It’s very interesting. And it’s not a short series. It’s quite long actually. It’s got, I think, maybe 20, 30 hours of content in total, but the episodes themselves are short. And every single time I learn something new. It’s very insightful.
And so, we decided to do something similar here, where in the next five episodes, so this is going to be a five-episode series, we’re going to be talking about, not the history of Rome, but the history of… Wait for it. Data science. Right? You probably guessed it from the name as well. So, five episodes on the history of data science. And why is that?
Well, we’re going into 2020. Data science is here to stay. It’s exploding. It’s a very exciting field to be in. And as it often is in history or in life, it’s useful to know the history of something to understand it even better going into the future. And that’s what we’re going to focus on. Let’s take some time to spend a few episodes and discuss or understand, or even just appreciate what was happening in data science in the past couple of decades.
And so, the way these episodes are going to be structured, episode number one, that’s today, we’re going to be talking about the 1950s to 2000s. Episode two, 2000 to 2010. Episode three, 2010 to 2015. Episode four, 2015 to 2019, and episode five, 2020 and beyond. Thus, you can see that data science is exploding and we need a logarithmic scale to keep up.
So, that’s what we’re going to be doing. I hope you’re excited. I am super pumped about this. We’ve got some very interesting content prepared for today’s episode. And without further ado, let’s dive straight into it. Let’s kick it off.
So, today, data science in the 1950s to 2000s. 50 years. Actually, we’re going to start off even earlier. We’re going to start off way back in the ninth century. The question here is, why did humans, why did we humans even get into data, not data science, but just the study of data, recording data, statistics and things like that? Or why did we even get into this in the first place?
Well, this all dates back to the ninth… First written, it was probably done way before, but the first time it was written, the first records we have is from about the ninth century by Arab mathematicians. So, it’s quite known that mathematics came from the Arabs.
And here there’s a couple of cool books you could look into. Sapiens by Yuval Noah Harari. Amazing book that will explain exactly why mathematics was developed. And that’s when empires grew to a certain size, when you could no longer keep track of taxes very easily, and governments had to come up with ways to keep track of these things, like who owed whom or, more importantly, who owed the government or the empire money. That’s when they came up with ways to record data. So, that was back in the ninth century or thereabouts.
And another great book to look into is Silk Roads. And this book is by Peter Frankopan, recommended to me by a great friend of my father. I’ve started listening to it. I’ve probably listened to maybe 20% of the book. Amazing book as well, and that one focuses specifically on what was happening in the Middle East and Asia during the development of the world. So, another great book there, Silk Roads.
Anyway, so that’s where it all started back in the ninth century or so. And then, what happened next? Well, the next big massive development in the space of mathematics was in the 18th century when calculus was invented, right? Calculus gave the field immense depth. And who is credited with calculus? Well, I guess you may have already heard about this, but in case you haven’t, Newton, right?
We often think of Newton as gravity, apple falling on his head, coming up with the third… What is it? The law of gravity or the three rules of motion. Is it? Yeah, I think it’s three rules of motion, if I’m not mistaken. But nevertheless, the main thing that he… One of the biggest contributions was actually calculus. And so, that was the 18th century.
And analytics were actually used in business as early as the 19th century. That was quite a while ago, but the main breakthrough happened when, of course, computers were introduced and they allowed us to use analytics to even a higher extent, higher degree. And so, that’s the pre-history, way before the 1950s, just to paint the picture of where it all came from.
So, now, let’s dive into what happened from the 1950s to the 2000s in the world of data science. So, we could actually say that the modern data science started in 1962. So, the date when you could say data science was actually… or the way we analyze data now was started, is 1962.
And who started this? Well, John Tukey. I have no idea how to properly pronounce the name. It’s been a while since I took a statistics class, but you probably know the Tukey test. His name is spelled T-U-K-E-Y. So, you know the Tukey test from statistics.
So, in 1962 John Tukey published a paper called The Future of Data Analysis. And in that paper he argued that data analysis should not be thought of as a subfield of mathematics, but rather it must take on characteristics of a science. And we were actually talking on this podcast with Bradley Voytek, who was the first data scientist at Uber. So, if you want to check out that episode, it’s episode number 253.
And Bradley has a very cool position about data science being a science on its own, rather than a part of another discipline. So, very similar thoughts, but these were expressed back in 1962 in John Tukey’s paper. So, he argued that data analysis must take on characteristics of a science.
And then, later on in 1977, he published another paper which was called, Exploratory Data Analysis, and that’s actually where the term comes from, EDA, which we hear more and more frequently now, or EDA stands for exploratory data analysis. That was introduced by John Tukey in 1977. It’s actually not that well known that John Tukey actually introduced box plots as well. You know box plots and how powerful they are?
I was actually finished recording or… I was recording some videos for the Tableau Certified Associate course, and with box plots, there’s quite a few tutorials and cool things you can do. A really powerful method in analytics. Well, that was introduced by John Tukey in 1977.
And also in that paper, Exploratory Data Analysis, he argued that more emphasis needed to be placed on using data to suggest and test hypotheses, bringing together exploratory and confirmatory data analysis. That is something that still holds true today.
And by the way, it’s interesting. I was reading up about John Tukey. I was just checking the Wikipedia page about him, and he was actually… He’s really cool. He lived all the way up to 2000. So, he saw the start of this millennium, which is very exciting. And he’s best known not for any of these things. He’s best known for the Fast Fourier Transform in mathematics, Fast Fourier Transform Algorithm.
And so, how crazy is that? He’s created the Tukey test. He’s created the exploratory data analysis, coined the term, wrote a paper about it. Argued for data analytics or data analysis to be a separate field. He created box plots, and on top of all that, he created the Fast Fourier Transform Algorithm. So, there we go. That’s a very strong start, right? From such a great scientist, a start for the field of data analysis.
Then later, or actually somewhere around the same period, specifically in 1974, Peter Naur. So, Peter Naur is known in programming. If you are very close to the field of programming, you might know the BNF notation or Backus-Naur form notation in programming. Well, that comes… That second letter in BNF, the N, is from his surname, Peter Naur. So, he’s known for that.
Well, in 1974 he published a survey of data processing methods used at the time. And he showed that the field was already branching out. He also gave possibly the first formalized definition of data science, “the science of dealing with data once they have been established, while the relation of data to what they represent is delegated to other fields and sciences”.
So, it’s a very interesting approach or policy here: we extract the data, and then we deal with the data, and we don’t really mind what it’s related to. It makes data science very transferable. You can go from one field to another. But in its raw form, this quote does not take into account the concept of domain knowledge, right? We don’t care what the data is connected to, and that is delegated to other fields and sciences.
Well, that’s the raw core of data analytics, data science and that’s where all these… That’s where we could attribute all the algorithms that we work with, all the models, all the methods that we create. And then, on top of that, we could add the domain knowledge to be specialized in certain industry.
And by the way, here you’ll notice that Peter Naur is saying, once they have been established, in referring to data, right? So, the science of dealing with data once they have been established. And this is an interesting thing. That’s actually something I mentioned in my book, which was published back at the start of 2018 and edition number two is coming out very soon in mid this year, in 2020.
And so, there we discussed the concept of, should you say data is or data are? So, often you’ll read in books or in quotes that data are, but actually the reality is that the term data was first coined back in 1645. So, way back. Again, in the 17th century in 1645, there it was first used in singular form by Thomas Urquhart, and it wasn’t changed. So, it was always used in singular form. So, data is, for about 60 years.
So, first time in 1702, it was used as a mass noun. It was used in plural. There’s something to think about if somebody ever pulls you up on saying data is and says data should be are, and the singular of data is datum. Well, that’s a very recent development actually. Originally it was coined as data is, so singular noun. And that’s how… I find it easier to say it that way.
So, that was Peter Naur in 1974, and you could say that’s the first time when the term data science was actually used or formalized. Then, what happened next? Well, a few years later, we saw the establishment of the International Association for Statistical Computing, which aims to link traditional statistical methods and modern computer technology to convert data into information and knowledge.
And then, in 1989, Gregory Piatetsky-Shapiro, who was on the podcast actually, he was here on episode 175. So, if you haven’t heard that one yet, check it out. One of the leading or very important people in the development of data science. He organized the first Knowledge Discovery in Databases workshop. So, that was a step towards popularizing data science and discovering knowledge from data.
The nineties… We’re already up to the nineties. So, the nineties were a big boom period for data science as it slowly entered public awareness. Business Week published a cover story on database marketing in 1994, discussing how companies were collecting and working with data to predict customer behavior. That was already happening back in 1994.
However, there were clear limits in the space. The sheer volume of data couldn’t be handled by the computers of the time. For example, in Mining Data for Nuggets of Knowledge, which was published in 1999, Jacob Zahavi, who is a professor at the Tel Aviv University… Get this. So, Jacob Zahavi has been working at the Tel Aviv University for 43 years and he’s been a professor for 37 of those years. And at some point in his career he switched to data science. He’s a very incredible person in the space.
He was actually quoted in 1999 saying that today’s data sets can involve millions of rows and scores of columns of data. That was already being recognized back then. And at the time, data scientists were facing issues of scalability, developing models for non-linear analysis, and they were speculating about developing data mining tools to analyze online behavior already back then. And now, that’s…
We have the luxury of having all those things available to us and happening real time, happening all the time and new tools coming out constantly. Back then, those were just things scientists and data scientists were dreaming of.
Then, in the nineties, we also saw some of the first written criticisms of bad data science. A 1996 book called From Data Mining to Knowledge Discovery in Databases clearly showed that the best minds of the time were aware of the dangers. As it quotes, “Blind application of data mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.”
So, with the abundance of data and growth of data science, comes the growth of bad data science. And we are seeing that. Now we’re actually even seeing more. We’re running a bit ahead here, but we’re seeing not just bad data science, we’re seeing malicious data science with things like deepfakes and so on.
And what next? Well, online courses weren’t really a thing at the end of the 20th century. However, Professor Jeff Wu… Wu is another incredible person. Why is that? Well, because he proved the convergence of the expectation maximization algorithm, improved it drastically.
He also got the COPSS Presidents’ Award. So, what’s the COPSS Presidents’ Award? COPSS stands for the Committee of Presidents of Statistical Societies. So, it’s a massive award. It’s like the Nobel Prize of statistics. Only one person per year gets it. He got the award in 1987.
By the way, guess who got the award in 2019? If you’ve been up-to-date with the podcast, you will know that it was Hadley Wickham, the creator of ggplot2 and other amazing libraries in R. And Hadley was on the podcast just a few episodes ago. So, if you haven’t heard that episode, check it out. Episode number 337, Hadley Wickham.
But yeah, going back to Professor Jeff Wu, so Professor Jeff Wu also got the COPSS Award in 1987. He has contributed massively to the field of statistics. And so, while online courses weren’t really a thing at the end of the 20th century, Professor Jeff Wu called for statistics to be renamed data science, and statisticians to be renamed data scientists. And he did this at his inaugural lecture at the University of Michigan in 1999, so at the very end of the 20th century.
Very interesting, very radical approach, I guess. Statistics to be renamed data science, and the statisticians to be renamed data scientists. I don’t think that’s ever going to happen and I have huge respect for statisticians and their work and for statistics as a field, but you can already see that these thoughts were in the air already at the end of the 20th century.
And we’re going to wrap up on that. I hope there were interesting insights here, but it also clearly shows that everything… It’s not like data science just appeared out of nowhere 10 years ago. Actually, everything was going in this direction. Everything was boiling in one kettle and people were already feeling this happening. There were lots of talks about this. So, it’s not just a hype that appeared recently. It’s been in the making for decades now.
And so, we’re going to end here, at the end of the 20th century, where we already feel that the world is ready to stride into the next millennium, and data scientists were poised to lead the charge, and data science was about to explode. So, I look forward to seeing you back here on the next episode for the continuation of our little saga into the history of data science. And until next time, happy analyzing.