SDS 468: The History of Data

Podcast Guest: Jon Krohn

May 6, 2021

Welcome back to the FiveMinuteFriday episode of the SuperDataScience Podcast! 

Today I’m returning to the topic of history to discuss the history of data.

 

Previously, I provided a history of algebra. It turns out you wanted more historical episodes, so I decided to go further with the history of data. What is data? Units of information. The word is derived from the Latin word “datum”, which means “something given”: a piece of information you can give. Data has been around on the planet for about 4 billion years, stretching back to when RNA started storing information. RNA is still used as a genetic store today, for example by coronaviruses. The more stable DNA is used by most living things today.
So, when did we start consciously interacting with data? In Mesopotamia around 7500 BCE, clay counting tokens came into use to represent commodities in trade. Around 3000 BCE in Sumer, the counting system evolved from tokens to etched clay tablets, the earliest written records, which later evolved into the earliest known written language. Jumping ahead to the 18th century, we started using punch cards and tapes to encode data. In the early 19th century, Charles Babbage proposed using this technology for processing numbers, though there is no evidence that his machine was ever built.
From the 1950s through the 1970s, punch cards were commonly used in computing by militaries, IBM, and NASA. The term “data processing” first appeared in 1954. Today, we no longer use punch cards; we use magnetic tape, hard disk drives, and optical disks. We’re pretty good at storing large amounts of data cheaply. We also have common data structures for storing and interacting with data.

Podcast Transcript

(00:06):
This is FiveMinuteFriday on The History of Data.

(00:19):
Last month for episode 460, I provided a brief history of algebra. I wasn’t sure how this would be received, as I thought it might be rare for people to be interested in the history of data science-related topics like I am. In the end, it turned out to be a slam dunk. I received a record amount of positive feedback for a FiveMinuteFriday episode, with people reposting the episode a fair bit on social media and commenting that they loved the historical context of this mathematical field. Well, when you provide feedback, positive or negative, on the show, I listen. And so, for today’s episode, we’re doing a history of data. If you thought the history of algebra stretched back far with its 4,000 years of development, well, that doesn’t come anywhere close to the 4-billion-year history of data. All right.
(01:12):
So first, what does the term data mean? Well, data are units of information. The word itself is derived from the Latin “datum”. It was around the 18th century that we started using that Latin word datum in English to mean units of information. So, datum itself means something given. And the idea there is that a piece of data is something that you can share, something that you can give. It’s a unit that you can give. So, I already mentioned that data have been around on this planet for about 4 billion years. So, 4 billion years ago, ribonucleic acid began to store information. So, RNA, ribonucleic acid, is made up of these strands of genetic information. And so, this RNA system of encoding biological information has been around almost as long as the Earth has. So, the Earth is four and a half billion years old, and RNA is 4 billion years old. Some organisms on the planet still use RNA, but it’s relatively rare. Coronaviruses, which aren’t even really living things, do encode their genetic information in RNA.
(02:31):
But far more common today is deoxyribonucleic acid or DNA, which is a derivative of RNA that is a lot more stable. And so, you have fewer errors in that genetic type of data. So, DNA is a data system that humans make use of, obviously, but not consciously. So, when did we start consciously manipulating data? Well, for that, we can look back to around 7500 BCE in Mesopotamia. So, almost 10,000 years ago, humans started working with clay accounting tokens. So, you could have a piece of clay that represented a cow and another piece of clay that represented a goat. And then you could trade these pieces of clay as opposed to having to physically bring a cow or goat with you to trade it with somebody else.
(03:20):
So, these clay tokens represented commodities. A few millennia later, around 3000 BCE in Sumer, which is in southern Mesopotamia, that accounting system evolved from just clay tokens to bigger tablets of clay, which we would etch with a stylus. So, while the clay was soft, you could etch it with a stylus and then it would harden, and you would have this record of commodities. And these written clay records later evolved into the earliest known written language.
(03:59):
So, jumping ahead several more millennia to the 18th century, we started using punch cards to encode data. So not just writing systems, but we had punch cards, which are closer to the kind of data that we talk about in computing today. So, punch cards were used in mechanical looms in the 18th century, specifically from around 1725. So, you could have punched cards or these lengthy punched tapes that would be fed through a mechanical loom and allow that loom to create a particular textile, a particular fabric in a specific way, including with perhaps a specific pattern on it. So, the punch cards or the punch tapes were these data storage methods that allowed the mechanical loom to reliably create a textile in the same way.
(05:01):
The British engineer Charles Babbage proposed using punch cards, like they had in looms, for processing numbers. He suggested that in the early 19th century, but there’s no evidence that these punch cards for processing numbers, or the Analytical Engine that he proposed for handling those punch cards, were ever built. So, it wasn’t until the 1950s to the 1970s that punch cards became common in computing. So around that time in the mid-20th century, it was militaries, particularly the US military, the space administration, NASA, in the US, and IBM, the company, that started using punch cards for storing data in computing. Indeed, it was in 1946 that the word data was first used to mean transmissible and storable computer information. By 1954, the expression data processing was first used.
(06:08):
So today, we don’t typically use punch cards for storing computer data. We have hard disk drives, optical disks and magnetic tape. And so we have way more methods of storing data than ever before, and we get better and better at doing it cheaply. So, every couple of years, the cost of storing one unit of information roughly halves. And we also today have more and more data structures, so, software concepts for manipulating the data that we have stored on hardware devices. So today common data structures include lists, dictionaries and objects. In the data science world, we have particularly common data structures like NumPy arrays, TensorFlow or PyTorch tensors, and pandas DataFrames. So, lots of hardware options proliferating in recent years, and also tons and tons of software options for working with data.
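
As a rough illustration of the general-purpose data structures named above (lists, dictionaries and objects), here is a minimal Python sketch. The variable names, the `PunchCard` class, and every value in it are hypothetical, made up for illustration and not taken from the episode. NumPy arrays, tensors, and pandas DataFrames are left out so that the sketch runs with the standard library alone.

```python
from dataclasses import dataclass

# A list: an ordered, mutable sequence.
# Hypothetical storage costs that halve at each step.
costs = [100.0, 50.0, 25.0]

# A dictionary: values looked up by key.
# Hypothetical commodity counts, in the spirit of the clay tokens.
tokens = {"cow": 3, "goat": 5}


# An object: named fields bundled together with behavior.
@dataclass
class PunchCard:
    columns: int   # total number of columns on the card
    punched: set   # indices of the columns that have a hole

    def is_punched(self, col):
        """Return True if the given column has a hole."""
        return col in self.punched


card = PunchCard(columns=80, punched={0, 7, 42})

print(costs[-1])            # last element of the list
print(tokens["goat"])       # lookup by key
print(card.is_punched(7))   # method call on the object
```

Each structure answers a different question: a list keeps things in order, a dictionary finds a value by its name, and an object keeps related fields and the operations on them in one place.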
(07:19):
Amazing to reflect on how much we’ve progressed as a species over time and how that progress has been accelerating in recent centuries, particularly in recent decades. What a time to be alive and be interested in data science. Well, that’s it for this week’s FiveMinuteFriday. Keep on rocking it out there and I’ll catch you on another episode soon.