SDS 367: Building Data Pipelines for COVID-19 Modeling

77 minutes
Data Science, Python

We had a lot of fun in this episode as we talked about Sam’s role as the lead data analyst for the COVID-19 Critical Care Consortium, real world challenges of data, data modeling, DSGO Virtual, and astrophysics.

About Samuel Hinton

Samuel Hinton is an astrophysicist, software engineer and online data science instructor. His interests lie in investigating cosmological problems with a variety of methods, from supernova and large scale structure to gravitational waves, as well as utilising numerical simulations, optimisation and algorithms to solve complex problems. He is a strong advocate for proper Bayesian statistics and rigorous analysis as well as a passionate science communicator and presenter.

Overview

On top of his day job studying dark matter, Sam is working directly with data on COVID-19. Almost 50 countries have signed on to provide hospital data to the COVID-19 Critical Care Consortium, which Sam believes is the largest international study of its type in the world. The University of Queensland, where Sam works, has an agreement with Oxford, where the data is hosted, and that connection led to Sam being offered a role designed for someone who could sift through the data and help lead the team of analysts. Because of the nature of medical data, Sam is one of the few people in the world allowed to access the raw data.

Some of the challenges of this data pipeline include the lack of uniformity in how the data comes in, thanks to different languages, different processes, different units, and older systems that were designed long before the pandemic. Hospitals are a battlefield right now: they are focused on treating patients and don’t have the time or resources to record their data efficiently and uniformly. Groups like Amazon have volunteered to help gather data, on the condition that they get no access to the data itself. At the moment there are only hundreds of patient records, and fewer than 100 with a known outcome. This allows for some inferences (for example, about patients at admission but not about outcomes), but it can be tricky. Other initiatives with this data include clustering to find evidence of mutations and constructing causal nets to discover what drives the trends.

Beyond this work, Sam has released his second online course, on data manipulation in Python, which happens to match well with the work he is currently doing in the fight against COVID-19. He also boasts one of the highest-rated courses on Udemy. His method of lecturing is based on what he disliked about the lectures he attended as a student. His teaching is practical, and he aims to answer every question he receives from his more than 40,000 students.

One great announcement for those who are interested in learning more in-depth and live from Sam: he’ll be running a workshop on data science pipelines at our DSGO Virtual event in June. The idea of the workshop is simple: data scientists don’t want to be doing data cleaning. So how can you design your data pipeline to get your machine learning products or business intelligence products at the end? The less time you spend playing with the data, the more time you can spend actually getting insight from the data. So, how do you teach this? A lot of code. Sam imagines a collaborative coding workshop.

Beyond this, Sam professionally studies dark matter and dark energy, and he has given a talk on his hunt for what exactly the two are. His postdoc work is to investigate dark energy and dark matter using two separate probes: Type Ia supernovae and the large-scale structure of the universe. Type Ia supernovae act as standard candles that can be used to map out the structure and history of the universe by tracing it back. The current thought is that dark energy is simply energy in empty space, but even that is complicated. The other side of this is looking at the size of the universe and the history of its contents and looking at the acoustics it leaves behind. That is a short way of explaining something very complex.

Right now, Sam is reading novels rather than textbooks as a way to relax. He estimates he’s read around 45 books this year so far. His favorites this year were Sufficiently Advanced Magic and the Cradle series. He hopes to get a weekend off soon to take time to read textbooks again and learn.

In this episode you will learn:

Sam’s current work and the COVID-19 Critical Care Consortium [4:22]
The COVID data science pipeline and workflow [12:50]
Sam’s second online course [36:22]
Bayesian inference [43:06]
Sam at DSGO Virtual [53:30]
Sam’s work on dark matter [1:01:25]
What is Sam reading right now? [1:09:14]

Items mentioned in this podcast:

SDS 303: Proper Hypothesis Testing For Every Field
Sam’s Udemy
DataScienceGO
The Dark Side of the Universe
Sufficiently Advanced Magic by Andrew Rowe
The Cradle Series by Will Wight

Podcast Transcript

Kirill Eremenko: This is episode number 367 with Astrophysicist and Online Data Science Instructor, Sam Hinton.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name is Kirill Eremenko, Data Science Coach and Lifestyle Entrepreneur. Each week, we bring inspiring people and ideas to help you build your successful career in data science. Thanks for being here today and now, let’s make the complex simple.

Kirill Eremenko: Welcome back to the SuperDataScience podcast everybody. Super pumped to have you back here on the show. Today, we’re hosting Sam Hinton, who’s returning for the second time round. The first time he was on this podcast was in episode number 303 in October 2019 where we talked about hypothesis testing and what it means for the world of data science along with other topics. That episode was hilarious. I highly recommend checking it out. That’s SDS 303. This one is going to be super fun as well.

Kirill Eremenko: Sam Hinton is always fun to talk to. He’s got a great personality and very outgoing and loves to share things. I had a lot of laughs and I’m sure you’re going to have a lot of laughs along the way with us. What is happening in Sam’s life? What did we talk about? Number one, very important, I think you’re going to be very interested in this is that Sam is the lead data analyst for the COVID Critical Care Consortium, which is one of the largest studies in the world right now looking into COVID-19 and what is happening to people who end up in critical care, things like ventilation and other factors.

Kirill Eremenko: You will get a lot of interesting thoughts from a data scientist who’s actually working, like spearheading this direction working with other scientists in over 100 or approximately 100 different countries around the world. You’ll find out what they’re looking into. In addition, Sam will talk about some of the challenges of data like what are the real world challenges that data scientists face?

Kirill Eremenko: Right now, he’s facing all of this data that’s coming in, that is inaccurate or maybe incomplete. In many cases, it’s something that has to be cleaned up, has to be normalized. Lots of pre-work on the data has to happen and you’ll find out how he’s building this data pipeline and what it means. We’ll be talking quite a bit about data pipelines. Very, very interesting. I’m sure everybody can get a lot of value out of this.

Kirill Eremenko: We’ll also talk about data modeling, Bayesian statistics and DataScienceGO Virtual and how Sam will be joining us to run a workshop there. Make sure to listen in on that, that’s going to be very cool and maybe that workshop will be right for you. At the end, we’ll talk about astrophysics. You’ll find out some cool things about dark energy and dark matter. Super exciting, super fun podcast. I can’t wait for you to check it out.

Kirill Eremenko: Without further ado, let’s dive straight into it. I bring to you, Sam Hinton, astrophysicist and online data science instructor.

Kirill Eremenko: Welcome back to the SuperDataScience podcast everybody. It’s super fun to have you back here on the show. Today, we got a very special guest, Mr. Sam Hinton. Sam, welcome back. How are you doing?

Samuel Hinton: Thanks for having me again. It’s always a pleasure.

Kirill Eremenko: Fantastic, man. Second time, how long was it the last time? The last time was like, what was it, eight months ago or so we chatted?

Samuel Hinton: I have no idea. More than a week which means I barely remember it.

Kirill Eremenko: How have you been since then? Things going well?

Samuel Hinton: Things have been hectic. I’m sure that there’s a lot of people where they’ve lost their jobs and things aren’t hectic. A lot of other people in recent days who now have 10 times the workload and don’t know when to sleep. I guess I’m lucky to be one of the second group.

Kirill Eremenko: Yeah. No, that’s good. Why did you not know when to sleep? What’s happening in your world?

Samuel Hinton: I’ve got my normal job. I’m a postdoc at the University of Queensland. I’m trying to lead the Dark Energy Survey supernova cosmology analysis. Lots of fun astrophysics science that I’m supposed to be doing. I am now also the lead data analyst for the COVID Critical Care Consortium, which is, as of writing right now, I think, the largest international study on COVID-19 in the world, specifically looking at things like ventilation and all this stuff that we know is quite difficult with COVID.

Kirill Eremenko: Wow, that’s pretty cool. First of all, what countries are in that consortium?

Samuel Hinton: We’ve got almost 50 countries now. I know the US signed on weeks ago. Now that all the legal agreements are in place, their hospitals are coming online. We’ve got data from Estonia, from Kuwait, from almost, well, a whole ton of European countries apart from France. France has their own study and they’re not joining ours. We don’t have any from Russia either. Almost every other country is signing up. We’ve hit almost 50 countries, hundreds upon hundreds of different hospitals sites. Soon, the data should start pouring in, we hope.

Kirill Eremenko: Okay, but tell me like how did you get this job? Out of all data scientists in Australia or in the world, I don’t know even, how did you get this?

Samuel Hinton: Mostly, luck. A lot of things in life are luck or just being in the right place at the right time. It turns out that in this giant collaboration, the data is hosted at Oxford. Oxford and the parent company overseeing this study are sort of like the big players. The University of Queensland has an agreement with Oxford. People were looking around at UQ for someone who could do it. They had all these issues with the data and they needed essentially someone that could help out on the machine learning side of things, on the visualization side of things, on the data pipeline side of things.

Samuel Hinton: People are just going around saying, “Who has experience?” One of the project, the head machine learning investigator, talked to my supervisor and my supervisor, my postdoc supervisor, said, “Oh, well, why don’t you talk to Sam. He’s got previous experience in all these areas.” She talked to me and then an hour later, she said, “Okay, I want you to lead this.” I said, “Oh boy, this is a lot. Are you sure?” She said, “Yes.” That’s been my life for the past couple of months.

Kirill Eremenko: Yeah, that’s really cool. How does this all work? How big is this whole team? Are you the only lead data scientist? Are there other lead data scientists, like other data scientists in different countries, and you’re responsible for Australia? How does this all work?

Samuel Hinton: Yeah. It’s a bit of a complicated, I’m not going to say mess, but it’s a very spaghetti-like situation here, simply because we’re dealing with medical data. As soon as you deal with medical data, there’s a whole bunch of confidentiality and privacy agreement, things that you have to take into account. Because I was essentially the first person that got access to the data, they’ve now come down and said, “Look, we don’t want anyone else accessing the data.”

Samuel Hinton: I’m one of the very few people in the world now that can get the raw data. One of my jobs is to take this raw data, run it through a data cleaning, standardization and de-identification pipeline that I built. Then, distribute those data products to specifically UQ researchers. We are getting other universities in. We have, for example, in Brisbane researchers from QUT. They get added to sort of UQ system and then they can access the data.

Samuel Hinton: On top of that, we have companies that have reached out and said, “Hey, we want to help with the machine learning. We want to help with this or that.” We’ve had Amazon and IBM and we’re working with both of them right now and then fast.ai, a whole bunch of companies. The big issue is that it’s sensitive data. It’s not something that you can just upload onto Kaggle and have an open source Kaggle competition. You can’t do that.

Samuel Hinton: There’s a lot of people that have offered to help and they’re simply unable to because we haven’t got data sharing agreements with them. We would be in very hot water if we gave them the data. There are data science …

Kirill Eremenko: You de-identify the data?

Samuel Hinton: Yes. The issue is, even if it’s been de-identified, what you normally have is then a security team comes in, takes your de-identified data and they want to see if they can break it, if they can re-identify patients. The issue is, that step takes a little while to do. We’ve had some preliminary groups at UQ say like, “These are the variables that are quasi-identifiers and if you combine it with social media data, we may be able to re-identify a person doing this or that.”

Samuel Hinton: Until everything’s like proven 100% good enough, it’s hard to even share de-identified data. We’re moving towards that section. Obviously, as soon as these things come in and as soon as there are legal agreements and everything in the way, it’s no longer just like a one or two-day task. It’s back and forth between legal teams and things slow down.
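
The re-identification risk Sam describes is often summarized by asking how many records share the same combination of quasi-identifiers. The following is a minimal sketch of that check, not the consortium's actual procedure; the field names (`age_band`, `sex`, `country`) and all records are hypothetical and invented for illustration.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    return min(Counter(keys).values())

# Hypothetical de-identified records; every value here is invented.
records = [
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "60-69", "sex": "F", "country": "AU"},
    {"age_band": "30-39", "sex": "F", "country": "EE"},
]

# A smallest group of 1 means at least one patient is unique on these fields,
# which is the situation a re-identification attack would exploit.
k_all = smallest_group(records, ["age_band", "sex", "country"])
k_sex_only = smallest_group(records, ["sex"])
```

Coarsening or dropping fields raises the smallest group size at the cost of analytical detail, which is one reason the vetting step takes time.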

Kirill Eremenko: Have you heard of the Netflix Prize on Kaggle?

Samuel Hinton: There’s a Netflix Prize? I haven’t.

Kirill Eremenko: There was like years ago, years ago, when Netflix, which was I think it’s like 2000. Oh, my God, I don’t even remember like I don’t remember that. Maybe, years ago basically. Netflix went on Kaggle. They posted like their de-identified data for people to have this competition. The prize was a million dollars to build a recommender system. Or the prize pool was a million dollars to build a recommender system that predicts the best way possible what movie you want to watch next, what show. It was successful. They’re going to launch it.

Kirill Eremenko: I think in 2015, they were going to launch the Netflix Prize number two, but then somebody wrote a research paper in the US showing that he could identify the people from Netflix Prize data by combining parameters in certain ways. A lot of people, I think, wanted or either did launch a class action lawsuit against Netflix for that. How crazy is that? Yeah.

Samuel Hinton: Yeah. It’s definitely a place you don’t want to take the risk.

Kirill Eremenko: Yeah. Hope you’re enjoying this amazing episode. I’ve got a cool announcement for you and we’ll get straight back to it. Virtual Data Science Conference. Curious? You’ve probably heard of DataScienceGO, the conference that we’ve been running for the past three years in Southern California. Maybe you’ve attended, if so, it was super cool to have you there. Maybe you weren’t able to attend for the reason of being in a completely different country, or the flights were too long, or the timing wasn’t perfect. There could be plenty of reasons why you weren’t able to attend. Now, we’re bringing DataScienceGO to you.

Kirill Eremenko: This June, we’re hosting DataScienceGO virtually and you can attend and get an amazing experience there. Guess what, the best part is that it’s absolutely free. Just head on over to datasciencego.com and get your tickets today. This will be our very first time running a virtual event. Nevertheless, we’re still going to combine the three key pillars of fun, amazing talks and networking into this event. You’ll hear from speakers like John Krohn, Sam Hinton, Hadelin de Ponteves, Stephen Welch, and many others.

Kirill Eremenko: Plus, you’ll be able to network with your peers. This event is going to be epic on all fronts and we’d love to see you there. Head on over to datasciencego.com/virtual and get your ticket today. The number of seats is limited. We’d love to have everybody there. For our very first event, we’re limiting the number of seats to make it more manageable. Make sure to get your tickets today, if you want to be part of this. On that note, I look forward to seeing you there. Now, let’s get back to this amazing episode.

Kirill Eremenko: Yeah, okay. All right. You get all this data. What is this data science pipeline? Tell us about it. Of course, by the way, for everybody listening, none of this is medical advice. We’re going to as much as possible avoid … we are going to avoid sharing sensitive information that Sam cannot share on this podcast. Most importantly, none of this is medical advice. If you hear anything like the coronavirus, it’s opinions only. With that caveat, what’s this data science pipeline? What goes into this process of building one? Why do you need one in this specific case?

Samuel Hinton: Right. There are a few things to keep in account when we’re talking about this specific study. The first is that the database in the system where the data is gathered was written by a very smart and very talented UQ researcher. I won’t give you the name because I’m sure … Respect. I want to respect his privacy and people will end up emailing everyone over everything. He …

Kirill Eremenko: I’m sorry, just to add. UQ is University of Queensland in Australia …

Samuel Hinton: Yes.

Kirill Eremenko: One of the top universities in Australia.

Samuel Hinton: Yes, good clarification. I tend not to define my acronyms straight, that comes from my astrophysics roots. He’s made this great database and it’s used for a whole bunch of different medical studies, including the one that I’m working with which is the COVID Critical Care Consortium. It was originally named ECMOCARD, if that rings a bell to anyone listening.

Samuel Hinton: What it means is, it was a very generic way and the doctors go and they get CRFs. Essentially, they print out a whole bunch of sheets of paper where they write down the details, and then someone goes through it and uploads it into this database. The issue is that there’s very little checks done on the database. It was written to be general by a person by himself essentially and a long time ago too.

Samuel Hinton: It wasn’t written this year with all the modern frameworks. It’s a fairly old system. That means that when the data comes back, we have very little guarantee on what the data should look like. Dates don’t have to be dates. Numeric values can come through filled with strings or letters. That’s the easy part to identify because at least we know things should be numbers. If it comes through, we have, for example, 107, but the 0 in the 107 is the degree symbol from degrees centigrade.

Samuel Hinton: There’s a lot of weird issues like that simply because we’re mixing European keyboards and non-European keyboards. Then, even when you get that down, there are things that haven’t been validated. Like, we want to take patient records every day so that we can track the evolution. Sometimes, we have two or three records for the same day. It’s like, I’ve entered the date wrong, but there’s no validation on that.
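
The two checks described here, numeric fields that are not really numbers and several records for the same patient on the same day, can be sketched as follows. This is an illustrative sketch only, assuming records arrive as plain dictionaries; the field names `patient_id` and `date` are hypothetical, and entries that fail to parse are flagged rather than guessed at.

```python
import re
from collections import Counter

NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")

def parse_number(raw):
    """Return a float, or None if the entry is not a plain decimal number."""
    text = str(raw).strip()
    return float(text) if NUMBER.fullmatch(text) else None

def duplicate_days(rows):
    """Return (patient_id, date) pairs that occur more than once."""
    counts = Counter((row["patient_id"], row["date"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical daily records; "1°7" mimics a degree symbol typed in place of a digit.
rows = [
    {"patient_id": "p1", "date": "2020-04-01", "heart_rate": "98"},
    {"patient_id": "p1", "date": "2020-04-02", "heart_rate": "1°7"},
    {"patient_id": "p1", "date": "2020-04-02", "heart_rate": "abc"},
]

parsed = [parse_number(row["heart_rate"]) for row in rows]
repeats = duplicate_days(rows)
```

Returning `None` instead of silently repairing a value keeps the decision about what "1°7" was meant to be with the site that entered it.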

Samuel Hinton: Then, even if you do all of that, you now have hundreds of hospitals from dozens of countries and they use different units for everything. You get a whole range of numbers come in. For a lot of the cases, you know what the units are and you just do basic unit conversion. Some of the fields don’t have a unit as input. You have to try and infer from the ranges what the unit should be. That’s tricky to do because in medical data, things like lymphocyte counts can span four orders of magnitude in a living patient.
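
Inferring a unit from the range of values works best when the plausible ranges do not overlap, as with body temperature in Celsius and Fahrenheit. The sketch below assumes exactly that easy case; the ranges in `PLAUSIBLE_RANGES` are illustrative placeholders, not clinical reference values, and the function deliberately returns `None` when the values fit no unit or more than one.

```python
# Illustrative ranges only; real limits would come from clinicians.
PLAUSIBLE_RANGES = {"celsius": (25.0, 45.0), "fahrenheit": (77.0, 113.0)}

def infer_unit(values, ranges=PLAUSIBLE_RANGES):
    """Return the single unit whose range contains every value, else None."""
    candidates = [
        unit for unit, (low, high) in ranges.items()
        if all(low <= value <= high for value in values)
    ]
    return candidates[0] if len(candidates) == 1 else None

def to_celsius(value, unit):
    """Convert a temperature to Celsius."""
    return value if unit == "celsius" else (value - 32.0) * 5.0 / 9.0

site_a = infer_unit([36.6, 38.2, 39.0])   # fits only the Celsius range
site_b = infer_unit([98.6, 101.3])        # fits only the Fahrenheit range
mixed = infer_unit([36.6, 98.6])          # fits neither range as a whole
```

For a quantity like the lymphocyte count Sam mentions, the plausible ranges for different units overlap heavily, so the same function would return `None` far more often and the field would need another source of information.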

Samuel Hinton: How are you supposed to deal with that? Then, on top of all of that, because the data is filled in from someone writing down off a piece of paper, it’s highly incomplete, around 80%. If you just turn this into a 2D data frame, around 80% is missing. That’s a huge problem for imputation especially because in this current, like right now, when I’m talking, we don’t have that many records.

Samuel Hinton: We have hundreds of patients and less than 100 if we count just those where we know whether they survived and were discharged alive, or whether they didn’t survive and succumbed to coronavirus. With such little data, how are you supposed to do effective imputation? We have to have multiple strategies that we then need to try and vet. All of that needs to happen every day. We download the data, clean the data, standardize it, and try out a bunch of things every single day so that we can go back to the ICUs, back to the hospitals when we need and say, “Hey, I think you’ve put this in wrong here or maybe this is a really cool, really interesting novel result.”
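
To make the missing-data problem concrete, here is a small sketch that measures the fraction of empty cells in a table and applies one simple imputation strategy. Median imputation is used purely as an example of a strategy that would then need vetting against others; it is not stated in the conversation which strategies the team uses, and the column names and values are invented.

```python
from statistics import median

def missing_fraction(rows, columns):
    """Fraction of cells that are None across the given columns."""
    cells = [row.get(column) for row in rows for column in columns]
    return sum(value is None for value in cells) / len(cells)

def impute_median(rows, column):
    """Return copies of the rows with None in `column` replaced by the observed median."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    fill = median(observed)
    return [
        dict(row, **{column: fill if row.get(column) is None else row[column]})
        for row in rows
    ]

# Hypothetical patient table; None marks a value nobody wrote down.
rows = [
    {"ph": 7.35, "lymphocytes": None},
    {"ph": None, "lymphocytes": 1.2},
    {"ph": 7.41, "lymphocytes": None},
    {"ph": None, "lymphocytes": None},
]

fraction = missing_fraction(rows, ["ph", "lymphocytes"])  # 5 of 8 cells are missing
imputed = impute_median(rows, "ph")
```

With only a handful of observed values per column, the imputed value is driven by very few patients, which is the concern Sam raises about doing imputation on so little data.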

Samuel Hinton: All of this needs to happen in a very quick, a very automated fashion to make sure that we can get the results back as quickly as possible.

Kirill Eremenko: Wow. I can just imagine the doctors, it’s like a battlefield for them. They’re running around trying to save people’s lives. The last thing they care about or last thing that’s on their mind is to sit down and properly, carefully input all the data for Mr. Sam Hinton in Australia.

Samuel Hinton: Yeah. It’s a difficult sell, isn’t it? Because especially writing things down on a piece of paper that someone then has to copy in. It’s just not an efficient way of doing it. It’s one of the things that Amazon reached out and we said, “Hey, you’re well-suited to this, gathering data and using it.” Obviously, there’s a whole bunch of privacy concerns when you decide to bring in a large corporation.

Samuel Hinton: There’s all the legal issues there where we have to be very careful about whether or not they can actually have the data at the end. The answer is, no, right? They will help us gather the data and then the idea is they don’t get access to it. Then, it’s okay, do we develop an app? Do we try and set up Alexa so that the doctors can simply read out the values into their phone and it will populate the form for them using natural language processing.

Samuel Hinton: There’s a whole bunch of concerns there but even then, the doctors don’t have time to go through and even just read out what the values are. Countries like Germany, those that haven’t been massively afflicted yet as in those that haven’t broken through their capacity in the hospital system, are doing things like getting student doctors and med students to go out to the hospitals. They just drive from hospital to hospital. They take the paper and they enter it into the databases.

Samuel Hinton: They go around and pick out the values and record them and enter it. There’s simply, in many countries, absolutely no chance you’re going to get the people that are on the frontlines trying to keep people alive to take a little break from that to do some data entry. It just doesn’t make sense.

Kirill Eremenko: Yeah. Yeah. Okay. All right. Once you have all this data processed and cleaned, what you’re saying is, you have to do this every day and there’s no way for you to automate it completely that all these checks happen automatically?

Samuel Hinton: Yeah. At the moment, every day we regenerate our data products. Every day, we regenerate a new list of issues that we then go back to the clinicians with and say, “Hey, by the way, this value from this day looks a bit funky, can you please double check that or is that a legit value?” Obviously, that gets sent back, not every day. We don’t want to overwhelm the sites but the more egregious errors, the things that we can’t fix ourselves, get sent back. Essentially, the only time we do send them back is when we’re losing data.

Samuel Hinton: There are some fields that we simply need. For example, when was the patient admitted to ICU? We want to track their evolution over their stay in the ICU, which means, if we don’t have when they were admitted, we don’t know at what point they fall in the timeline, and we can’t use that data. It seems a shame, like all we need is this one variable for you to fill out and then, we can use the 400 other variables that you put in for this patient. Then, we go back to them. Then, the other thing that they want is every day, they don’t just want issues. No one wants just bad news every day.
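
A daily issue list of the kind described can be sketched as a pass over the records that reports any row missing a field the analysis cannot do without. The field name `icu_admission_date`, the ISO date format, and the record layout are all assumptions made for this sketch.

```python
from datetime import date

REQUIRED_FIELDS = ["icu_admission_date"]  # hypothetical field name

def find_issues(rows, required=REQUIRED_FIELDS):
    """List (site, patient_id, message) for every required field that is empty."""
    issues = []
    for row in rows:
        for field in required:
            if not row.get(field):
                issues.append((row["site"], row["patient_id"], f"missing {field}"))
    return issues

def days_in_icu(row, record_date):
    """Days between ICU admission and a daily record; dates assumed to be ISO strings."""
    admitted = date.fromisoformat(row["icu_admission_date"])
    return (record_date - admitted).days

# Hypothetical records from two sites.
rows = [
    {"site": "A", "patient_id": "p1", "icu_admission_date": "2020-04-01"},
    {"site": "B", "patient_id": "p2", "icu_admission_date": ""},
]

issues = find_issues(rows)
day_index = days_in_icu(rows[0], date(2020, 4, 8))
```

Without the admission date there is no way to compute `day_index`, so none of that patient's daily records can be placed on the timeline.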

Samuel Hinton: We generate daily reports for them. This simply comes down to some Jupyter Notebooks essentially that are automatically generated and converted to HTML documents and they will have interactive plots, where we can show them basic statistics and demographics of their patients. We know that COVID-19 affects more men than women and it affects older people worse than it affects younger people. We want to keep up-to-date the statistics that they have, saying, okay, what are the risk factors? Is arterial hypertension, like high blood pressure, a large risk factor or not? How about smoking? How about obesity, diabetes, all of these different things?

Samuel Hinton: Then obviously, treatment. One of the big questions is, what is the difference in outcome when you look at different treatments? Who are on antibiotics? Who are on antivirals? Which antivirals are they on? Which antivirals have different ratios of success versus failure? All of that is data that we try and generate every day into an HTML document that we can then send back to the clinicians.

Kirill Eremenko: Interesting. You said you don’t have that many, like you only have hundreds of patient records that are complete and you know the full story, wouldn’t those insights be statistically not significant? Like, if you’re inferring that?

Samuel Hinton: Yeah. It’s a big problem, which is that in some cases, we have the numbers. If you’re not looking at the outcome, if you just want to say, “Hey, what are the demographics of people that are being admitted?” You don’t know the outcome. We can say some things there but then, we can’t say a lot. This is one of the reasons why we have to be very careful about what we give to clinicians and medical doctors because we don’t want to mislead anyone. We don’t want to cause any unnecessary harm.

Samuel Hinton: We have, for example, some clear trends in some, let’s say, pH. Your blood pH levels were broken down into those that survived and those that didn’t. At the moment, the trends look very different but we can’t give that to the clinicians because if you look under the hood, at the number of patients that are being used to generate those trends, it’s a tiny, tiny number. If we give that away, we aren’t confident that we’ve accounted for the difference in country and ethnicity and all these factors that differ across the patients because we don’t have a representative sample.

Samuel Hinton: It’s something that we always have to keep in mind. There are currently a whole bunch of data products and things that we are simply hiding so that we can see them when we’re developing products like the dashboard, like the daily report. We can’t make them public because the chance that they would mislead people is simply too high because without the knowledge of statistics that would help inform the validity and the confidence of those trends, it’s very easy to make a mistake.

Kirill Eremenko: Got you, man. I’m so glad that you’re doing this. Out of all the people in the world because I remember in our first podcast, you stressed very strongly that given the 95% rule for frequentist statistics, the P value of 0.05 is not sufficient. That means, like … this was your quote and I’ve used it many times, 1 out of 20 research papers out there is wrong. Every 20th research paper is incorrect simply because we agree that 5% confidence is sufficient.

Kirill Eremenko: I can just imagine how rigorous you are about not misleading people, and misleading doctors here would cost people’s lives. You have to be very careful.

Samuel Hinton: Yeah. It’s so easy to happen because I think we have around 450 variables. Imagine if 1 in 20 of them are wrong, and we’ve drawn conclusions, that’s almost 20 different hypotheses that we could incorrectly give if we just decided, “Hey, P value 0.05, good enough, ship it out.” You’ll notice that if you look through all the papers that are currently being published on COVID-19 and especially some of the early ones, they’re done on cohort studies of three, four or five people.
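
The arithmetic behind this concern can be written out directly. The sketch assumes, for simplicity, that each of the roughly 450 variables is tested once, that no variable has a real effect, and that the tests are independent; under those assumptions the expected number of false positives at a 0.05 threshold is 450 × 0.05, which is a little over 20. The Bonferroni correction shown at the end is one standard remedy and is included as an illustration, not as the method used by the consortium.

```python
n_tests = 450   # approximate number of variables mentioned in the conversation
alpha = 0.05    # conventional significance threshold

# Expected number of "significant" results if every null hypothesis is true.
expected_false_positives = n_tests * alpha  # 22.5

# Probability of at least one false positive across independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni-corrected per-test threshold that keeps the overall rate near alpha.
bonferroni_alpha = alpha / n_tests
```

With this many tests a false positive somewhere is close to certain, which is why a single p-value below 0.05 is weak evidence on its own.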

Kirill Eremenko: Five people.

Samuel Hinton: Yeah. There was a study in The Lancet with a patient count of five. It’s like, okay, well, it’s good. You got to get these things out. There’s no time to sort of dilly-dally on it but at the same point like can we trust it? You don’t know.

Kirill Eremenko: Yeah. Yeah. Interesting. Okay. What happens next? You do these reports back and forth, how’s the workflow? Do you guys like have meetings? Do you … I don’t know. Do you have like a vision? Is there some leadership?

Samuel Hinton: I have had around six hours of meetings today. Yes, there are meetings. There are meetings between the different companies we’re trying to help out. There are weekly meetings with the PIs and myself to try and determine directions.

Kirill Eremenko: What’s a PI?

Samuel Hinton: The project investigator.

Kirill Eremenko: Okay.

Samuel Hinton: One of the people leading the project. There are meetings every Thursday with the clinicians. There are meetings every Friday with the UQ researchers, who are trying to apply models onto it. In terms of where to go in the future, hopefully less meetings. I can’t see that actually happening. There will always be too many meetings. What we want to do is, once we get more data, once we can actually be a little bit more confident in the results that we’re getting, hopefully, we can do some interesting things with it.

Samuel Hinton: At the moment, we’ve been doing things like generalized linear models and Cox regression and a bunch of other little statistical tests to try and answer some of the queries that the clinicians have. The other thing we want to do is use unsupervised learning to see if we can cluster the patients because one of the current questions with COVID-19 is, are there separate phenotypes? Are there multiple variants of the virus going around? Do they present differently?

Kirill Eremenko: Like mutations?

Samuel Hinton: Yeah, essentially. Yeah. There’s been some marginal evidence so far published in papers that yeah, it looks like there might be multiple phenotypes. We want to see, can we cluster our results? Do our results indicate that there might be subgroups? Again, hard to do with only a few hundred records. Then, we also want to figure out things like causal modeling. This is obviously a big issue especially in the medical field which is, let’s say, you notice a trend in some sort of variable. We don’t know is that trend because of COVID? Is that trend because of the medication? Is that trend because of any one of 400 billion different things? You’re not quite sure.

Samuel Hinton: You want to see if you can construct causal nets to determine exactly how the conditional probabilities in your model lie to see what is actually driving these trends. Of course, it’s extremely complicated especially in medicine where each patient gets treated individually. They get treated based on how they’re presenting. It’s not like you have a control group that just gets run through with the same treatments in the same way. If someone presents differently, they get treated differently and it’s so hard to standardize the results.

Kirill Eremenko: How are you going to do that? That’s a very important question not just in this application, but in other areas of life whether it’s business or marketing or product supply chains, like whether even, there’s always going to be these external factors. As we know, correlation doesn’t imply causation. Do you have any tricks you can share that you think might work?

Samuel Hinton: The trick that we’re trying to rely on right now is one that isn’t applicable to anyone else, which is we’re going back to the clinicians. There are obviously hundreds of years of medical advice and medical studies out there that we can try and make use of to say in other different … if you don’t take COVID, if you take flu or SARS, like viruses, how do they normally present? What are the non-causal features more so in those different pathogens or viruses?

Samuel Hinton: On top of that, I can’t think of a good way to explain it. It mostly just comes down to being a very thorny issue that we haven’t fully solved. There is no ideal solution. Obviously, you can use clinicians to help inform that but you can also just use a bunch of different models. One of the things is, we want consistency in our model outputs across models. You don’t just want to run some stupid random forest to get a result and just ship it out. You want a whole bunch of different models to agree so that you have confidence in it.

Samuel Hinton: Then, you want to use explainability and interpretability techniques so that for every model that you’ve done, you can actually identify why that model is saying the things that it does. This is something like Shapley values or just looking at the weights of each decision tree in a decision tree. What is contributing to the final answers? So that you can try and hopefully get a consistent idea of the causal effects in all of your models. You hope that they agree.

Kirill Eremenko: Yeah. For that purpose, do you think a neural network could work?

Samuel Hinton: It could, it could, and we will have neural networks, especially with the patient evolution. Our time series data is well suited to a recurrent neural network, something like a long short-term memory network. The main issue is, we can’t train any at the moment because we only have a few hundred data points especially…

Kirill Eremenko: I mean … Sorry.

Samuel Hinton: … we’re going to like…

Kirill Eremenko: I was going to say …

Samuel Hinton: I was going to say …

Kirill Eremenko: You go.

Samuel Hinton: I was just going to say, for an LSTM network or a deep network, you need a lot of data points and it’s very difficult in our case to create new data. Data augmentation techniques are very difficult to do on data that is mostly incomplete. It’s difficult for us to do something like a nearest neighbor imputation because the dimensionality of our models is in the hundreds and we only have hundreds of data points. Your nearest neighbor may be a very great distance in hyperspace from you. It makes it difficult because you need to impute. Then, you need to try and augment your data without biasing your models.
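The point about nearest neighbors being far away in high dimensions can be illustrated with a small sketch. This is not the consortium's code: the uniform random points, the function name, and the choice of 300 points are assumptions made purely for illustration. It compares the distance to the nearest neighbor with the average distance to all points, for a low and a high number of dimensions.

```python
import math
import random


def nearest_to_mean_distance_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest-neighbor distance to mean distance from one query point.

    Points are drawn uniformly from the unit hypercube (an illustrative
    assumption, not real patient data).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return min(distances) / (sum(distances) / len(distances))


low = nearest_to_mean_distance_ratio(n_points=300, n_dims=2)
high = nearest_to_mean_distance_ratio(n_points=300, n_dims=300)
print(f"2 dimensions:   nearest / mean distance = {low:.2f}")
print(f"300 dimensions: nearest / mean distance = {high:.2f}")
```

With a few hundred points in a few hundred dimensions, the ratio moves close to 1, meaning the "nearest" neighbor is barely nearer than a typical point, which is why imputing from it is hard to justify.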

Kirill Eremenko: Okay.

Samuel Hinton: How to do that with only a few hundred samples for a novel disease? That’s tough.

Kirill Eremenko: Tough. For neural networks, in terms of interpretability, like, even if you got a neural network that predicts everything well, assuming you somehow solve the problem of the small data set, it’s really hard to interpret why exactly these neurons behave in a certain way. Wouldn’t that be a roadblock to using neural networks for this problem?

Samuel Hinton: Yes. No, for sure. That’s why we’re trying to get as much expertise as possible to come in, people that have done similar things before. I know you’ve seen things like with convolutional neural networks. There are ways of breaking down the features so that you can try and visualize them. Essentially, we need techniques like that that apply in general. It’s very hard to deal with a neural network, especially as the depth starts to increase. Even if you try and visualize what neurons are lighting up, like how do you put that into something that a human can understand?

Samuel Hinton: It’s just a massively complicated linear algebra function, which we have essentially no intuition over and it’s difficult. While some potential partial solutions exist for some specific variants of neural networks like CNNs, I’m not sure, I don’t know of a generalized solution. If someone out there listening to this knows a generalized solution to neural network interpretability and explainability, please let me know.

Kirill Eremenko: That’s all. Okay, got you, for sure. If anybody listening has any ideas, I think at this stage that will be very useful. We’ll share Sam’s contact details in the show notes as if he’s not getting enough meetings already.

Samuel Hinton: You should have seen, I did an interview with ABC Radio National last week and it went live two days ago. I have been flooded by well-meaning people offering support. I had one lady that came in and said, “Look, I’m retired. I’m just isolating in my home in the Blue Mountains. I have nothing to do. I’m an ex-researcher in agricultural science, have a statistics background as well. Do you need a personal assistant to help you manage all this?”

Kirill Eremenko: Wow.

Samuel Hinton: I was completely floored by her response and all the other positive responses we have received. I said no to her, of course, because the university actually listened as well when we said we’re drowning and we now have a new project manager, her assistant, a new administrative assistant on the UQ side of things. Luckily, we are getting the support that we need. Still, the response is large.

Kirill Eremenko: That’s awesome, man. That’s awesome. You are doing a fantastic job like this work can potentially help stop this or slow it down. Hats off to you, it’s really cool, really cool. The university might be listening to this one as well, is there anything else you need? Let’s do a wish list.

Samuel Hinton: A wish list. I wish I could get into America and start the job that I accepted many, many months ago. I got offered a very nice fellowship at Lawrence Berkeley National Lab. I’m supposed to have had my visa interview and everything planned to fly over there with the wife and all canceled indefinitely. Who knows, I’ll be at UQ for the foreseeable future and we’ll see how long COVID takes to be consigned to the pages of history.

Kirill Eremenko: Man, you got married. When did you get married? I completely, sorry, I missed that.

Samuel Hinton: April 1st.

Kirill Eremenko: Wow. Congrats.

Samuel Hinton: That is our anniversary and we decided it’s the best date to get married because half my friends on Facebook, especially because of COVID-19, the ceremony is limited to the celebrant, two witnesses and me and my wife.

Kirill Eremenko: Yeah.

Samuel Hinton: Five people. It’s not like a big affair. When I posted pictures saying, “Hey, by the way,” a lot of people thought it was a very elaborate joke, which I encouraged for a solid week until I was like, “Yeah, no, it’s actually real.” That was the best I think. I saved so much money on the ceremony. So much money on the reception. The honeymoon was a bit lackluster. We sat down, opened Google Maps and just went through Street View in a few countries. We’re like, “Yeah, those look nice. We’ll visit them one day.”

Kirill Eremenko: Wow. Okay. How long have you been together?

Samuel Hinton: A while. It’s a very short marriage. I think we met 2018 at the end of it, maybe. I’m not quite sure. My memory is horrible. If you ask her, she will know like the exact date, the exact time and exactly. Me, I’m just like a couple years ago is fine.

Kirill Eremenko: Yeah. Awesome, man. Congrats. That’s really fun. Very cool. Awesome. Okay. Hopefully, once this whole thing settles down, you’ll do your work there. You’ve got your PhD, right? This is your postdoc?

Samuel Hinton: Yes. I’ve got the PhD, but with all the COVID-19 stuff it still hasn’t been awarded. It was like submitted and sent off for review so long ago, but everyone has far better things to do. I’m currently sitting here, not as a doctor [crosstalk 00:35:44].

Kirill Eremenko: PhD-less.

Samuel Hinton: … yeah, PhD-less, working like two different postdoc jobs, pulling my hair out, making courses on the side, just waiting for my reviewers to eventually come back and say, “Yeah, it’s all good.”

Kirill Eremenko: Got you, man. Wow.

Samuel Hinton: They just need to get off their asses. Get me back my thesis. I [crosstalk 00:36:03] that.

Kirill Eremenko: Got you. Okay. Thank you for the run down on COVID. Hopefully, things go well there and we all support you. I’m sure our listeners, please show Sam some support, send him some nice emails if you are supporting him. Even if you can’t do anything to help, it’s good to know you’re listening.

Kirill Eremenko: Speaking of courses, congratulations on launching your second course, man, like number two. First one was Python for Statistical Analysis about six or so months ago. Second one now is, Python for Data Manipulation. The irony is that’s exactly what you’re doing for COVID.

Samuel Hinton: Yes. I’m lucky that everything just fits in so well together. Yeah, I thought after seeing all the comments on the stats course, that the biggest skill that the people taking my course were lacking was the ability to use libraries like pandas to streamline all of the pre-processing stuff in their analysis because no one wants to spend 13 hours crunching numbers to do half an hour of machine learning or statistical analysis. I was like, “You know what, okay, pandas it is. I’ll make a crash course for that. Show everyone the easiest ways and the most efficient ways of doing all the common tasks.”

Samuel Hinton: I hope it’s been useful for those that have signed up. I’ve got some good reviews so far. Some people have reached out and said that they really liked it, which is always pleasing to hear because you don’t want to make it and then get people to come back and say, “That was terrible.”

Kirill Eremenko: Yeah.

Samuel Hinton: I’m happy. They’re happy. I think we’re all happy in isolation.

Kirill Eremenko: That’s good, man. Yeah. Just speaking of reviews, you have some of the highest reviews we’ve seen across all the instructors we work with. Both courses have 4.6 stars out of 5, stable, which is really hard to maintain on a massive platform like Udemy. What’s your method? How do you do it? Maybe, there’s people out there looking to create a course these days, like maybe you can share some insights. How do you get such great feedback all the time?

Samuel Hinton: Honestly, I’m not too sure. I think one of the things is, I keep in mind what I wanted as a student, which was a few years ago now. I always remember listening to my lectures, the online recordings, and just wanting to un-enroll. Some of them would go on and on about stuff I really didn’t care about, pages upon pages of just talking at me before getting down to anything useful.

Samuel Hinton: I decided if I ever made a course, it wouldn’t just be talk. I mean, I would talk about the code that I’m writing in front of you and try and keep it practical so there’s something for you to do whether it’s run the code in parallel with me or just read over what I’m reading or listening. Just not droning on. I try and get to the good stuff quickly, but lectures always end up taking far longer than I thought.

Samuel Hinton: I remember I recorded a lecture just about histograms and the first recording, it took about three minutes. Histograms are pretty simple. There’s not much to talk about. Then, I got a ton of questions from people. They were saying, “Hey, what about this use case? What about this here? My code isn’t working here.” I realized, even with such a simple concept, there are a whole bunch of little caveats or things that people don’t quite understand intuitively.

Samuel Hinton: I went back and re-recorded it. It became like a 15-minute video, but people seem to like it. Those that already knew were able to watch it at double speed and sort of skip to the parts that they needed. Those that had never seen it before managed to get all the relevant information such that they didn’t try out the code to get an error and then have to hit up Stack Overflow for half an hour afterwards, trying to figure out what on earth this keyword that they needed meant. I’m not sure. I just try and keep that in mind but beats me.

Kirill Eremenko: Yeah, man. That’s a good approach, to overdeliver, because I actually once, at one of our live events, met a student and she told me like, “Oh, it’s so weird to hear your voice in real life because I’ve been listening to you online all the time and you sound so different.” I’m like, “How do I sound so different?” She said, “I listen to you on double speed. I’ve never.” Yeah, so a lot of people do that. I encourage people to do that. I would rather put more into a lecture than less because you can just listen on double. If there’s less, then people who are not as familiar with the topic, they will fall behind and we don’t want that.

Samuel Hinton: Yeah. There’s a personal side of it too, which I’m not sure if this is of too much interest for those listening. If you’re making a course, so we released the stats course, the Python for Statistical Analysis course, for free for a little while during COVID until we had to stop making it free, because in that one week we did free, I got 42,000 new students. That’s more than have been in the course the entire time it’s been up. A huge amount. The issue was okay, well, I have two jobs at the moment, plus the newly released course and now I have 42,000 students who ask questions.

Samuel Hinton: Even though they got the course for free, I’m not going to ignore their questions. I’m going to go in there and answer them to the best of my ability, but it takes time. If you have these short lectures that you think are like, “This is really efficient,” 30 seconds and this topic is done. You haven’t been comprehensive, well, people will just ask you about the things you haven’t covered. Suddenly, instead of spending 10 minutes recording an additional one minute in your lecture, you’re spending 10 hours responding to the same question 400 times and it’s just not efficient.

Kirill Eremenko: Yeah. I agree. Tell me this, so in this recent course that you recorded, Python for Data Manipulation, you said pandas. Are you using pandas for the COVID analysis or are you using some other tool?

Samuel Hinton: Yes, no, we’re exclusively using pandas essentially for the data pre-processing step, to the point where I can’t think of a single function in amongst the pipeline that I threw together that doesn’t make use of pandas in some way. Date times are best handled in pandas. Pandas has categorical features, which is amazing. Everything is pandas. That’s not going to change. It’s just such a convenient tool.
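As a rough illustration of the two pandas features mentioned here, date handling and categorical columns, a minimal sketch follows. The column names and values are invented for the example and are not taken from the consortium's data.

```python
import pandas as pd

# Hypothetical raw records: column names and values are invented for illustration.
raw = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "admission_date": ["2020-03-01", "2020-03-04", "not recorded"],
        "outcome": ["discharged", "in ICU", "discharged"],
    }
)

clean = raw.assign(
    # Values that cannot be parsed as dates become NaT instead of raising.
    admission_date=pd.to_datetime(raw["admission_date"], errors="coerce"),
    # The categorical dtype stores each distinct label once.
    outcome=raw["outcome"].astype("category"),
)

print(clean.dtypes)
print(clean["admission_date"].dt.day_name())
```

Once a column is a real datetime, the `.dt` accessor gives day names, differences between dates, and so on, without any string handling.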

Kirill Eremenko: That’s so cool. It’s a very vivid example of practicing what you preach. I love that it happened in this order that you first recorded the course about pandas and data operation. Now, you’re actually using those same tools. It’s a great testament to that these are applicable tools in industry, in medicine, in whatever, like emergency situations like this, you know these tools, you go and use them right away. Really cool. Really cool.

Kirill Eremenko: What else that I wanted? What I wanted to ask you about is, I’ve had this question, so since our last chat I’ve been killing myself over like breaking my head because I didn’t ask you. I was so like tempted afterwards like I should have asked this question. We’re talking about Bayesian inference. We’re comparing Bayesian inference to frequentist statistics, Fisher and his thing. By the way, I don’t know if the listeners listening to this, I actually read that Bayes was the 19th century and Fisher, as far as I understand, Fisher was like 1920s, like early 20th century.

Kirill Eremenko: Fisher didn’t like Bayes, right? He created his own approach to statistics and so on, which we all use now and which is taught at school, the P values and things like that. Fisher actually, interestingly enough, from what I read, he tried to prove not that smoking causes cancer, but that cancer causes smoking. Speaking of the correlation causation, like because it’s a P value, right? The chart is there. You can run all the tests. You don’t know, you have to have some additional knowledge to know which way it works. That’s like a side story.

Kirill Eremenko: The question I wanted to ask you, you were talking about Bayesian inference and you were talking about prior probabilities, posterior probabilities, if I’m getting the names right, and how things are interpreted. You give this lovely example, by the way, I highly recommend to listeners to check out the previous podcast, I’ll dig up the episode number and we’ll mention in the show notes. Fantastic episode and you gave us a great example of like the sun exploding.

Kirill Eremenko: You said, “Okay, so if we have this device on earth that is looking at the sun and predicts all of a sudden that the sun’s going to explode in the next hour, we can take all of the prior probabilities. We’ve seen that the sun hasn’t exploded like prior knowledge that we had. Sun hadn’t exploded in billions of years. Most likely, if we account for that, then the probability actually goes down quite a lot.” Right? Do you mind repeating or something like that, the example?

Samuel Hinton: Yeah. The example was there’s a box on earth, where if you push it, like you push a red button on the top, it will tell you whether or not the sun is going to explode or not in the next 10 seconds. When you press the button, what it does is it tosses two dice. If you get one on both the dice, it just tells you that the sun is going to explode. Otherwise, it returns the truth. The frequentist walks up, presses the button, gets unlucky, it rolls two ones.

Samuel Hinton: They know about the dice, but they say, “Two ones, that’s a 1 in 36 chance. That’s less than a P value of 0.05. We have a significant result that the sun is about to explode,” and they run off to publish. In the background, the Bayesian statistician is just sitting there shaking his head, trying to bet that it won’t.
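The box example can be worked through numerically. The sketch below follows the box as described: on double ones (1 in 36) it says the sun will explode regardless, otherwise it tells the truth. The prior of 1 in 10^12 is a made-up number standing in for "the sun has not exploded in billions of years"; the function name is also just for illustration.

```python
from fractions import Fraction


def posterior_sun_explodes(prior):
    """P(sun explodes | box says it will), by Bayes' theorem."""
    # If the sun really is about to explode, the box says so either way.
    p_yes_given_explodes = Fraction(1)
    # If the sun is fine, the box only says "explode" on double ones.
    p_yes_given_fine = Fraction(1, 36)
    evidence = p_yes_given_explodes * prior + p_yes_given_fine * (1 - prior)
    return p_yes_given_explodes * prior / evidence


# Hypothetical prior: a 1 in 10^12 chance of explosion in the next 10 seconds.
tiny_prior = Fraction(1, 10**12)
print(float(posterior_sun_explodes(tiny_prior)))

# With a 50/50 prior the same reading would be fairly convincing.
print(posterior_sun_explodes(Fraction(1, 2)))
```

The same "significant" reading gives very different conclusions depending on the prior, which is why the Bayesian is happy to bet against the box.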

Kirill Eremenko: Yeah, because he is using or she is using the Bayesian inference, right, which takes into …

Samuel Hinton: Yes.

Kirill Eremenko: Can you tell us a bit about this prior probability?

Samuel Hinton: Right. Bayes’ theorem is that the posterior, which is the likelihood, the whole likelihood.

Kirill Eremenko: The end result.

Samuel Hinton: Yeah, so the end result is a combination of likelihood. Yeah, you’re right. I shouldn’t say likelihood when I’m talking about the posterior. It’s a combination of the likelihood and the prior. The likelihood is, what is the chance of getting the data, given our model. Then, the prior is what’s just the flat chance with our current information of that model? If you combine all of those, it’s proportional to the probability of your model given the data, which is different to the likelihood.

Samuel Hinton: If I speak the math very quickly, it’s saying that the probability of theta given D is proportional to the probability of D given theta [inaudible 00:46:49] times by the probability of theta, theta being the model. The idea is that in frequentist statistics, we work with the likelihood and that’s fine. That’s good. That’s what you need to do. Then, when you look at Bayes factor, you also add in the prior, which is your prior, past, existing information.

Samuel Hinton: There is another part, the whole thing is a fraction. On the bottom is what we call the evidence. Let’s not get into that, because that’s a much more conceptually difficult thing to talk about without actually having diagrams or being able to write any math. Posterior is proportional to the likelihood multiplied by the prior.
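Written out, with theta as the model and D as the data as in the spoken version, the relationship described above is the standard form of Bayes' theorem. The denominator P(D) is the evidence that sits "on the bottom" of the fraction:

```latex
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\;\propto\; P(D \mid \theta)\, P(\theta)
```

Here P(θ | D) is the posterior, P(D | θ) is the likelihood, and P(θ) is the prior.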

Kirill Eremenko: Got you. What I was wondering since then, like literally, I think we hung up and this question popped to my mind or it was like towards the end of the podcast, we’re running out of time. Anyway, so do you know the turkey paradox?

Samuel Hinton: No, no, you’re going to have to tell me.

Kirill Eremenko: It’s simple. It’s super simple, but it’s not mathematical. It’s got nothing to do with Tukey, the mathematician with the, what’s it called, T test, I think. It’s just about a turkey like an animal. The turkey is born. It’s a bit scared of everything at the start. Then, the farmer comes along or whatever. The butcher comes along and feeds it some corn and it says, “Wow, I got some corn from this butcher. That’s amazing. Okay, well, maybe we can be friends. What’s the likelihood of somebody giving me corn for free?”

Kirill Eremenko: All right? Then, day passes, two days, three and every day he’s getting corn. Then, maybe a month passes, six months, a year, I don’t know how long turkeys are raised for. Every single day, he’s getting this corn. The prior that’s part of the probability, yeah, the prior, right, it’s growing. It’s like, “There’s evidence that he’s my friend, he’s my friend.” If you apply Bayesian inference, the probability of the butcher butchering the turkey in the turkey’s mind is going down all the time because all the evidence it’s seeing is like with the sun, it’s not blown up. I haven’t been hurt by this butcher.

Kirill Eremenko: I apologize to the vegans out there, but at some point, the butcher comes and slaughters the turkey for Thanksgiving or for some other thing. This really mess with my head, just like you have the sun example on one hand, but with the turkey example, the whole Bayesian inference goes down the drain, because the result is inevitable, like it’s going to happen. I wanted to get your thoughts on that. How do you apply Bayesian inference? Or what does that say about Bayesian inference?

Samuel Hinton: Nothing. I mean, in this case, the Bayesian inference is perfectly fine. On any given day, the turkey is very likely not going to die until the day that the butcher decides he’s had enough with that thing gobbling up all his bread. That means that Bayesian inference has served the turkey correctly for every day but one, which is a lot of being served correctly.

Samuel Hinton: It’s only an issue in our heads because we know that the butcher is coming. We have access to hidden information. Our information is different to the turkey’s. We see this and we go, “Oh man, the probability of the turkey being butchered is so low from the turkey’s point of view.” It’s like, “It is.” From our point of view, we know it’s coming. That’s just because our priors are different. It serves the turkey well, like as everything does, until it suddenly stops working. If the turkey lives for a few years, it served it very well for a very long time.

Kirill Eremenko: Got you. Our priors are different. I think that’s a key that we’ve seen hundreds of millions of turkeys prior to that and we know that their result 99.999% is this.

Samuel Hinton: Yeah, precisely. Our conditional probability is conditioned on the knowledge that we know the butcher is going to butcher.
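One way to make the turkey's side of this concrete is a toy model, which is an assumption of this sketch rather than anything stated in the conversation: the turkey treats each day as an independent trial with an unknown chance of being butchered, starts from a uniform Beta(1, 1) prior, and updates after every safe day. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def p_butchered_tomorrow(safe_days, prior_yes=1, prior_no=1):
    """Posterior mean chance of being butchered tomorrow.

    Starts from a Beta(prior_yes, prior_no) prior on the daily chance and
    updates on `safe_days` days with no butchering (a toy model).
    """
    return Fraction(prior_yes, prior_yes + prior_no + safe_days)


for days in (0, 98, 363):
    print(days, "safe days ->", p_butchered_tomorrow(days))
```

The turkey's estimate shrinks every day, and it is right on every day but the last. An observer who knows what butchers do is not updating this model at all; they are conditioning on different information.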

Kirill Eremenko: Yeah. Interesting. Yeah. That’s a good feature of Bayesian inference. Not a feature, the more knowledge you have, the more accurate your prediction will be. On that note, do any industries or businesses or I don’t know, applications actually use Bayesian inference these days? I’ve heard of a few, but what’s your knowledge in this space? Because it looks like everybody is using frequentist statistics, whereas Bayesian has a place as well.

Samuel Hinton: I think it’s difficult to say because it’s easy to mix things up. One of the giveaways of frequentist statistics is when someone starts talking about P values. We generally don’t do that in Bayesian statistics. If I use Bayesian statistics and I calculate some variable, I would say that X has been detected at 3.8 sigma confidence or similar. I’ll say, you would use a P value in frequentist statistics, but that doesn’t mean that someone using frequentist statistics can’t incorporate prior knowledge.

Samuel Hinton: They just do it under a different formalism. If someone says, “Hey, are we doing Bayesian technique?” I need to sit down and say, “Okay, well, how have you formulated your model? What prior information do you have? How is that being incorporated?” It’s very easy to try and sneak that information into the likelihood. That’s fine to do in some context, but it does give you certain different mathematical properties of your outputs. It’s hard to say.

Samuel Hinton: In a lot of cases, you do use Bayesian-like techniques or use prior information almost everywhere you go. Every time that you’ve done something with deep learning, you may not have run with a Bayesian neural network, which are things and they’re wonderful and you should check them out. The fact that you’ve trained on 10 million images means that you’re incorporating prior information already. It’s just not under the formalized Bayesian statistics headline. It’s sort of a blurred line that’s difficult to actually draw in the sand.

Kirill Eremenko: Okay, okay. Interesting. Thanks. Thanks for the rundown. You should create a course on Bayesian inference. I’ll happily take it to learn …

Samuel Hinton: Yeah. I thought about it, modeling, how to fit models, whether you’re using different MCMC, so Markov chain Monte Carlo processes or similar. Because everyone has a model and it’s a lot easier to write a model than it is to correctly fit it to the data and draw the right inferences from it. That’s what I do a lot. Maybe one day when there’s enough interest, I’ll write up and record a course on that. It all depends on what people want. Everyone get keen for model fitting and let me know.
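For readers who have not met MCMC, a minimal Metropolis sampler gives a feel for what "fitting a model" means in this setting. This is a textbook sketch, not the method used in any course or project mentioned here: the model (Gaussian data with known width and a flat prior on the mean), the synthetic data, and all names are assumptions for illustration.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0):
    """Gaussian log-likelihood with known sigma plus a flat prior on mu."""
    return -sum((x - mu) ** 2 for x in data) / (2 * sigma**2)


def metropolis(data, n_steps=5000, step_size=0.5, seed=1):
    """Draw samples of mu with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


# Synthetic data with a true mean of 3.0 (invented for the example).
data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 1.0) for _ in range(50)]

chain = metropolis(data)[1000:]  # discard the first 1000 steps as burn-in
estimate = sum(chain) / len(chain)
sample_mean = sum(data) / len(data)
print(f"posterior mean of mu: {estimate:.2f} (sample mean {sample_mean:.2f})")
```

For this simple model the posterior mean should land close to the sample mean; real problems need more care with step sizes, convergence checks, and priors.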

Kirill Eremenko: Speaking of what people want, we have a very cool surprise announcement. Sam is joining us as a speaker on the Advanced Day at DataScienceGO Virtual. DataScienceGO Virtual is happening at the end of June or start of July. We’re still deciding on the date. By the time this goes live, you can find out the date for sure it’s available at datasciencego.com. Get your tickets there.

Kirill Eremenko: Sam will be joining us as a speaker. What will you be talking about? Actually, a workshop. What workshop is it going to be?

Samuel Hinton: Probably something on data science pipelines. Given that I’ve spent the past few months writing a few of these. Years before that, I’ve been doing them for an astrophysics context. It seems smart to finally formalize that and write up a workshop. I’ve given workshops in the past on different topics so it’ll be good to add another notch to the belt and finally write up everything that I’ve been doing on this.

Samuel Hinton: The idea is quite simple, I think, which is, every data scientist doesn’t want to be doing data cleaning. No one particularly likes cleaning and standardizing data. How can you write a pipeline in the easiest, most flexible and most extensible way possible to streamline all of that, to get you either your data products as quickly as possible or to go through and generate not just the data products, but then do the machine learning and validation on them to get you not data products, but now machine learning products or business intelligence products at the end?

Samuel Hinton: The less time people spend sort of screwing around playing with the data, the more time people can spend actually getting insights from the data.

Kirill Eremenko: Absolutely. The way this topic came about is that we ran a survey and over 1,700 people interested in attending DataScienceGO Virtual completed the survey. Among other advanced practitioners, specifically, the most popular topic was data science pipelines by far. The next topics were still popular, but there was a huge, huge difference in the number one and number two topics.

Kirill Eremenko: Why do you think data science pipelines is so in demand right now among specifically advanced practitioners?

Samuel Hinton: Exactly like what I said, no one wants to spend their time doing it. Data scientists spend most of their time not doing data science.

Kirill Eremenko: Cleaning data, right?

Samuel Hinton: Yeah. It’s an awful waste of time. It’s not a fun job. It’s not a rewarding job. You get a data product at the end and now you can start your real job, just getting insight and crunching the models down to actually extract useful information and being able to do something. Like we’ve done with the COVID study, where every day we get a data refresh and that happens automatically at 6:00 a.m. It’s kicked off, I don’t do anything.

Samuel Hinton: Then, at the end of that pipeline and it takes about five minutes to run, we have data products that have been uploaded to secure sites. We have reports available for people. We have an interactive dashboard. This isn’t currently in but I have the data, we use to take it out. We will have machine learning products that have been refreshed each day because you don’t want to go through and say, “Hey, we’ve got a new data set. We’ve got a few extra records.

Samuel Hinton: I’ll just manually run these 30 models that are thrown together and re-compare them.” It’s like, no, you want to press a button, go off, have a little nap, come back and have your results there.

Kirill Eremenko: Yeah, yeah, absolutely. Maybe handle some exceptions at the most that [crosstalk 00:57:19].

Samuel Hinton: There’s always exceptions, aren’t there?

Kirill Eremenko: Yeah. Okay. How do you teach data science pipelines? Give us a teaser. What does a workshop on data science pipelines look like?

Samuel Hinton: Probably a lot of code. There’s no way around it. Maybe one or two slides to try and illustrate the topics. I guess, the way that I have it in my head at the moment is it’s essentially collaborative coding. Everyone’s coding their own thing. I will have my pre-done version and then people can deviate from that as they will. Probably something like Google Colab or JupyterLab in some instance, just to give everyone the basics.

Samuel Hinton: How you can throw these things together? How you can chain all these methods in a robust way? Then, how you can tie that into your machine learning product? Hopefully you start with here’s a bunch of raw data files. Then, at the end, you press a button and out come all your products that you want. Obviously, the way that this has to be done, the workshop, may be different to how people do it in industry.
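A small sketch of what chaining steps "in a robust way" might look like with pandas, using `DataFrame.pipe` so each step is a plain, separately testable function. This is an illustration only, not the workshop material or the COVID pipeline; the step functions, column names, and sample records are all hypothetical.

```python
import pandas as pd


def standardize_columns(df):
    """Lower-case column names and replace spaces with underscores."""
    return df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))


def parse_dates(df, column):
    """Convert a text column to datetimes; unparseable values become NaT."""
    return df.assign(**{column: pd.to_datetime(df[column], errors="coerce")})


def drop_incomplete(df, required):
    """Keep only rows that have every required field."""
    return df.dropna(subset=required)


def run_pipeline(raw):
    """Chain the cleaning steps; each one takes and returns a DataFrame."""
    return (
        raw.pipe(standardize_columns)
        .pipe(parse_dates, column="admitted")
        .pipe(drop_incomplete, required=["admitted", "age"])
        .reset_index(drop=True)
    )


# Hypothetical raw input with messy headers and missing values.
raw = pd.DataFrame(
    {
        "Patient ID": [1, 2, 3],
        "Admitted": ["2020-04-01", "unknown", "2020-04-03"],
        " Age": [61, 47, None],
    }
)
products = run_pipeline(raw)
print(products)
```

Because every step has the same shape, adding a model-fitting or reporting stage at the end is a matter of appending another `.pipe(...)` call, and the whole thing can be run from a scheduler with one function call.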

Samuel Hinton: If you have a very large data set, if your data set is either billions of records or thousands of features, you may not be able to run this on a laptop. You may need to ship it out to high performance computers, submit it to some sort of batching job on a cluster like SLURM or SGE, a whole bunch of them. That’s very difficult to do in a workshop. No one wants to spend two hours setting up and trying to apply for accounts. We’ll have to cover a representative but basic example. Then, give people the skills or the pointers as to how they can scale that up.

Samuel Hinton: Whether they’re using things like Dask to try and scale out to clusters or whether they just need to know this is how you submit jobs to a supercomputer. Obviously, you can’t do all of that in a single workshop. We’ll have to cover the basics as much as possible. Then say, for your use cases, this is where you want to go. For your use cases, you’re going to go look over at this. That’s actually something that I think we’re going to run a survey with the people that responded to the first survey saying, what are your use cases? What, in your mind, is a good data science pipeline? What do you want out at the end? What are the products that you’re talking about? What are the inputs that you’re dealing with? Are we talking about megabytes, gigabytes or terabytes of data?

Samuel Hinton: The pipeline changes depending on all of these questions. That’s something that we really need from the people going to the conference is what are their use cases? Because only with that knowledge, can we create an effective workshop that actually benefits them at the end of the day.

Kirill Eremenko: Absolutely. Yeah. That’s one of the reasons why we ran the first workshop to know exactly what people want. Sorry, so we ran the first survey. Now, we’re going to run the second one. If you’re listening to this, and for some reason you weren’t part of the second survey, maybe we missed your response, in terms of identifying you as a key participant for the second survey or these more in depth interviews that we’re conducting, please send either our team or Sam directly, preferably Sam directly, an email.

Samuel Hinton: No, no, no preferably the team.

Kirill Eremenko: You can send us an email. We’ll include it in the show notes for this episode, so you can find it there. Basically, send us an email and explain exactly what you would like to see in this data science pipelines workshop. When this goes live, we will still have time to incorporate your feedback.

Kirill Eremenko: Yeah, that’ll be cool. I’m looking forward to it. It’ll be virtual so there’ll be people from all over the world. Yeah. You have so much experience with this now especially with this COVID stuff.

Samuel Hinton: Yeah, sounds like fun. No pressure though. Yeah?

Kirill Eremenko: Hope you find time to not go crazy with all this stuff around …

Samuel Hinton: Fingers crossed.

Kirill Eremenko: Another thing that I had in mind, once you’re talking cosmology with my girlfriend? What’s your website again? What’s that wonderful website?

Samuel Hinton: cosmiccoding.com.au.

Kirill Eremenko: Yeah. Everybody cosmiccoding.com.au. Don’t forget the .au, we’re very special in Australia. Yeah. Amazing. I love the talk. If anybody’s looking, it’s called the Dark Side Of The Universe by Sam Hinton in the BrisScience lecture. I’ve watched about three quarters of it so far, or maybe two thirds. Amazing. I loved it. I knew about dark matter. I didn’t know there was even more dark energy in the universe. That’s crazy, man.

Samuel Hinton: Yeah. I wish I knew what they were.

Kirill Eremenko: Yeah. I’ll due to that. Some cool things. I like that you provide the evidence of how the charts fall in place and that this is not just like voodoo stuff. It’s actual …

Samuel Hinton: Yeah. It’s probably the most common question I get is like, well, what if dark energy and dark matter are just a mistake? What if Einstein was wrong? I was like, “Okay, he could be wrong but there are dozens of independent, different avenues of investigating that all come to the same conclusion.” If it’s a mistake, someone needs to come in and say how, because we have very, very substantial evidence that it’s a real thing. We just don’t know exactly how it’s supposed to function or where it came from.

Kirill Eremenko: Yeah.

Samuel Hinton: One day, we’ll have the answer.

Kirill Eremenko: One day, maybe. A million years from now.

Samuel Hinton: Pretty much.

Kirill Eremenko: Is that what your job in America is going to be about? What’s this postdoc?

Samuel Hinton: Yeah, so the postdoc is to investigate dark energy and dark matter primarily using two different probes of the universe. The first being Type Ia supernovae and the second being the large scale structure of the universe. To give a very, very tortured and brief intro to both of them: Type Ia supernovae are a sort of exploding star that explodes at around the same brightness every time. What you can do is you can use them to map out the history of the universe, because remember, light takes time to travel. A galaxy that has a supernova that we see that’s a billion light years away, well, that supernova exploded a billion years ago.

Samuel Hinton: Because they’re all the same brightness, it means we can figure out how far away that galaxy is by how dim the supernova is. If you have a light and you start walking away, if you walk twice as far away, the brightness of the light is now a quarter, because the light is spreading out to cover the area of a sphere, 4πr². The idea is with this standard candle, we can map out the evolution of the universe.
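The inverse-square reasoning above can be written out in a few lines. This is a minimal sketch in arbitrary units, assuming a source that radiates equally in all directions and ignoring cosmological effects such as redshift. The function names are made up for illustration.

```python
import math


def flux(luminosity, distance):
    """Observed brightness: the light is spread over a sphere of area 4*pi*r^2."""
    return luminosity / (4 * math.pi * distance ** 2)


def distance_from_flux(luminosity, observed_flux):
    """Invert the relation: known luminosity plus measured flux gives distance."""
    return math.sqrt(luminosity / (4 * math.pi * observed_flux))


# A "standard candle": the same luminosity seen from two distances (arbitrary units).
L = 1.0
near = flux(L, 1.0)
far = flux(L, 2.0)
ratio = far / near  # twice as far away, a quarter as bright

recovered = distance_from_flux(L, far)
print(ratio, recovered)
```

The second function is the standard-candle step: if the true luminosity is known, the measured dimness alone pins down how far away the source is.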

Samuel Hinton: The evolution of the universe changes depending on the properties of dark energy and dark matter. The better we can constrain the evolution, the better we can determine the properties of those mysterious components. Then hopefully, a theoretician will come along and say, “I propose dark energy is this with these properties.” We say, “That works or that doesn’t work.” At the moment, the leading one is Einstein, who just said, “Dark energy is probably just what you get if space itself, empty space, had energy.” Turns out that fits with everything.

Samuel Hinton: We just don’t know why it should have energy. You take quantum mechanics and you calculate how much energy the empty vacuum should have, it’s not zero, right? Quantum mechanics says there should be energy, but it says there should be so much more energy, 100 orders of magnitude more energy than we observe, which is catastrophically wrong.

Samuel Hinton: The second thing, the second probe is large scale structure. Let’s see, what’s the easy way to explain that? The universe is big now. In space, no one can hear you scream. That’s true. Remember, the universe is expanding. If we go back in time, the universe gets smaller and smaller but the amount of stuff doesn’t decrease. The amount of stuff stays the same, but it’s now in a much smaller volume. Space goes from being empty to being filled with stuff like Earth’s atmosphere. It goes to be thick and dense and it becomes a fluid. Because [crosstalk 01:05:23].

Samuel Hinton: Yeah. If you go all the way back to right after the Big Bang, space looks like a fluid. It’s got so much stuff in it and space is smaller, it acts like a fluid. What that means is, well, quantum mechanics says that right after the Big Bang, some parts of the universe have just a little bit more energy than other parts. Energy, mass, light, they’re all the same thing at this point in time, so it has a bit more stuff.

Samuel Hinton: Imagine you blow up a balloon in the atmosphere. That balloon, the area inside the balloon, has more air than outside. You pop it, you get rid of that elastic shell and what happens? You hear the pop, the air spreads out. It’s a little shock wave. It’s generally not a shock wave because it’s just air pressure moving. It’s a sound wave. It’s an acoustic wave. You have these in the early universe. You have these acoustic waves from these over dense regions spreading out.

Samuel Hinton: Imagine, it’s like you’ve got a still lake and it starts to rain, you can see all the ripples from the raindrops spreading out. That’s what the early universe looks like. I’m taking a little bit of time here, but space was a fluid back then and it’s not a fluid now, which means at some point, it went from a fluid to not a fluid. This actually happens incredibly quickly in astronomical terms. We’re not talking billions of years or millions of years. We’re talking thousands of years, very quickly.

Samuel Hinton: Imagine you’ve got this lake that’s been rained on with all the ripples spreading out. Then, suddenly, instantly, the lake snap freezes for some reason. All the ripples will now be imprinted in the ice because it’s frozen straight away. That’s what we see in the universe except we can’t hear it. We can’t hear the ripples. If we take a telescope and we measure 100 million galaxies, we can reconstruct the ripples because the ripples are patches of over density. Over density means more mass; more mass means more gravity, more stars, more galaxies, more stuff.

Samuel Hinton: If we simply figure out where there’s more stuff in the universe than other places, that’s a ripple. We go out, we find all these ripples, and we use them in a very similar way to how we use the supernovae. Instead of it being a standard candle, it’s a standard ruler. We know how big the ripples are. We know that they had 300,000 years to expand. We know at what speed they expanded because it’s tied to the speed of light.

Samuel Hinton: If we find a ripple that’s X big in some area of the universe, we know that it started off as Y big. It’s thereby increased in size X over Y. We can figure out how much any patch of the universe has expanded. Again, we use that to map the expansion history. We try and infer the properties of dark energy and dark matter. That was about a five minute explanation. I’m going to call it there. Anyone wants any more detail, there’s a whole bunch of videos on dark energy and large scale structure. That thing is called the baryon acoustic oscillation if you’re curious.
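The standard-ruler arithmetic described above is just a ratio, which a short sketch makes concrete. The numbers below are purely illustrative, in arbitrary units, and are not real measurements. The function and field names are hypothetical.

```python
def expansion_factor(size_now, size_then):
    """How much a ripple has grown: observed size X over original size Y."""
    if size_then <= 0:
        raise ValueError("original size must be positive")
    return size_now / size_then


# Purely illustrative ripples: "x" is the observed size, "y" the known starting size.
ripples = [
    {"patch": "A", "x": 150.0, "y": 100.0},
    {"patch": "B", "x": 120.0, "y": 100.0},
]

# One expansion factor per patch of the universe.
history = {r["patch"]: expansion_factor(r["x"], r["y"]) for r in ripples}
print(history)
```

Repeating this for ripples seen at different distances, and therefore at different times in the past, is what builds up the expansion history that the dark energy and dark matter models are then compared against.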

Kirill Eremenko: Wonderful, wonderful. What do you predict some side effects of your research? Will we have new MRIs or anything like that?

Samuel Hinton: It’s so hard to tell. The main benefit of current astro research is in the breakthroughs that we have in deep learning and machine learning. We obviously have images of the night sky. We’ve tried to identify things in those images. That’s obviously very closely related to things like identifying tumors or medical abnormalities in MRI images or similar.

Samuel Hinton: At the moment, it’s looking less like a tech breakthrough. Astrophysics gave the world digital cameras a couple of decades ago. I think we’re still coasting on that and we’ll coast on that for as long as we can. For now it’s just about sharing techniques.

Kirill Eremenko: Okay. Got you. All right. Looking forward to that, it sounds very exciting. Hopefully, this job goes ahead very soon. It’s going to be fun.

Samuel Hinton: Fingers crossed.

Kirill Eremenko: Okay. One last thing I wanted to chat to you about before we finish up is what books are you reading? A person of your mind, of your breadth of applications and knowledge in data science and other fields, surely there are some interesting things you’re looking into where you get all this information from?

Samuel Hinton: Yeah. There are a few. There’s a few books that I was recommending recently on causal modeling that I have on my to-read list. However, in the past few months, I will admit that I have not touched a single textbook. That’s not that abnormal for me. With so much work with having both of these jobs and working nonstop, when I get a bit of downtime, I pick up my novels.

Samuel Hinton: I just need a break from all of this data science, all of these data pipelines, I just need to turn off. I’m a huge reader of fantasy. Brandon Sanderson, all of his books and similar authors. I’ve read around, I think, 45 books this year.

Kirill Eremenko: Forty-five books this year. It is April.

Samuel Hinton: Yeah, it’s …

Kirill Eremenko: Ten per month.

Samuel Hinton: I normally average like one every day or two. If I have a weekend off, I can read an entire book in a day. I generally feel bad at the end of it. I feel like oh, wow, you really should have done something else. You could have been at least a bit productive.

Kirill Eremenko: Yeah. Okay.

Samuel Hinton: I try not to be, you know.

Kirill Eremenko: It’s crazy, man. It takes me a month to read a book sometimes. What’s the most memorable book even if it’s fiction that you read this year?

Samuel Hinton: Geez. I don’t even know the name of them. That’s the issue, I don’t keep track. I was like, that’s a good book. I will download it, read it and go on to the next. I don’t remember the authors. I don’t remember the names. Let’s see. Hang on, give me one second. I have the online repository in front of me. Let me just open books. Yes. Okay. I think one of the ones that I liked the most by Andrew Rowe was Sufficiently Advanced Magic, which is just a light hearted fantasy progression style thing. That was nice.

Samuel Hinton: Then, there was the Cradle series by Will Wight, it was seven books or so. I read those in about three days.

Kirill Eremenko: How do you spell that?

Samuel Hinton: Small books.

Kirill Eremenko: Yeah. Small books. Do you do like some speed reading or something like that?

Samuel Hinton: I used to. I try not to anymore. I did speed reading years ago and I got up to like several thousand words a minute. I just realized that there’s no fun. If you read a book in an hour, A, my retention is horrible. I read to relax. What am I supposed to do if I read everything I have on my Kindle or my phone in a single day? No, I probably still read exceptionally quickly but I no longer try and speed read.

Kirill Eremenko: Okay.

Samuel Hinton: It’s probably still abnormally fast but I just have to live with that.

Kirill Eremenko: It is. It is. Very impressive, that’s 45 books this year. I’m put to shame. I’ve read like three, two or three. Yeah, okay. All right, cool. Thanks for the recommendation, Andrew Rowe and Will Wight, some good fantasy books if anybody’s looking for any.

Kirill Eremenko: Yeah, I think we’ve covered off everything. Is there anything you wanted to touch on before we wrap up?

Samuel Hinton: Not particularly. I’m just hoping that I get a weekend off in the next couple of months and can sit down and just chill out and perhaps go and actually read a textbook for once. That would be nice, a change of pace when things slow down enough that I can just breathe and learn. Because learning is, I think, one of the things that I like most. It’s so hard to find the time.

Samuel Hinton: If anyone out there is currently stuck in iso and is dedicating themselves to going through courses and upskilling themselves, I think, that’s an absolutely fantastic use of time. I wish everyone doing that the absolute best of luck. I know that there’s a bunch of people that have lost their jobs recently, and at least trying to make a positive use of the time that we all have to spend at home is as much as we can ask.

Kirill Eremenko: Fantastic. Thanks, Sam. A huge thank you on behalf of all our listeners for doing what you’re doing. You’re helping the world. Even though you don’t have any free time and it’s hectic, somebody’s got to do the job and you’re the best fit for this from the people I know for sure. Thanks so much for what you’re doing. Awesome.

Samuel Hinton: My pleasure.

Kirill Eremenko: Before we wrap up, where can our listeners find you? What’s the best places to contact you? Follow your work? Get in touch?

Samuel Hinton: Let’s see. I mean, LinkedIn is an easy one. You can send me a message there. I don’t check it often but I do check it eventually. That’s probably the best way because most emails or whatnot that I get in the data science route are just as easily done on LinkedIn. No one really wants Instagram. I try not to do work on that. Apart from that, yeah, hit me up on LinkedIn, probably.

Samuel Hinton: If there are any urgent queries, feel free to send me an email. Just know that I am incredibly swamped with emails at the moment. I don’t know if I’ll have time to respond in the next couple of weeks.

Kirill Eremenko: Got you. Sam’s website if anybody is interested in watching his lecture is cosmiccoding.com.au. Very, very cool. Thanks again, Sam. Great, great pleasure chatting on the show. Awesome as always.

Samuel Hinton: Thanks for having me, mate.

Kirill Eremenko: There we have it. Thank you so much everybody for spending your time, investing your time into this episode and learning alongside with us. I hope you got a lot of valuable takeaways. Yeah, so much cool stuff, so many cool things. Without a doubt, my favorite part of this episode was all the things that Sam is describing about the COVID Critical Care Consortium where he is the lead data analyst and all the takeaways he’s getting.

Kirill Eremenko: Also, the insights into what it’s like to work with real world data: how messy it is, what challenges come up. I think that was a great refresher. Some projects, especially if they are course projects or projects prepared for you by somebody else, can be too clean. They might not have any messiness in the data, and anybody can be led to believe that data science is like that. It’s not. It’s actually very, very complex. There’s a lot of missing data. There’s a lot of normalization that needs to happen. A lot of pre-work on the data, building the data pipeline, all of that is super valuable.

Kirill Eremenko: Speaking of data pipelines, make sure to check out Sam’s workshop at DataScienceGO Virtual. DataScienceGO Virtual is happening at the end of June, this year, 2020. You can get your ticket absolutely free if you go to datasciencego.com/virtual. Just be careful, number of seats is limited. This is our first time doing an online virtual event. We’ve done this event many times in real life in California for many years. This is our first virtual event. The number of seats is limited. Make sure if you want to get in, apply for your seat today at datasciencego.com/virtual.

Kirill Eremenko: You’ll see Sam running a workshop on data science pipelines. You’ll actually be able to code along with him and create your very own data science pipeline. Make sure not to miss that. As usual, you can get the show notes for this episode at www.superdatascience.com/367. That’s www.superdatascience.com/367. There, you’ll find the transcript for this episode, any materials we mentioned, including books and the URLs to Sam’s LinkedIn, website, his presentations online, and any other fun things that might help your learning growth in data science.

Kirill Eremenko: There we go. Make sure to check that out as well. On that note, thanks so much for being here. Sam and I are looking forward to seeing you at DataScienceGO Virtual in a couple of weeks. Apply for a ticket today if you haven’t yet. Until next time, happy analyzing.
