SDS 581: Bayesian, Frequentist, and Fiducial Statistics in Data Science

Podcast Guest: Xiao-Li Meng

June 7, 2022

Listen in as the founding Editor-in-Chief of the Harvard Data Science Review and Professor of Statistics at Harvard University, Prof. Xiao-Li Meng, and Jon Krohn dive into the many data trade-offs that abound, and explore the paradoxical downside of having lots of data.

About Xiao-Li Meng
Xiao-Li Meng, the Whipple V. N. Jones Professor of Statistics at Harvard University, and the Founding Editor-in-Chief of Harvard Data Science Review, is well known for his depth and breadth in research, his innovation and passion in pedagogy, his vision and effectiveness in administration, as well as for his engaging and entertaining style as a speaker and writer. Meng was named the best statistician under the age of 40 by COPSS (Committee of Presidents of Statistical Societies) in 2001, and he is the recipient of numerous awards and honors for his more than 150 publications in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development. In 2020, he was elected to the American Academy of Arts and Sciences. He has delivered more than 400 research presentations and public speeches on these topics, and he is the author of “The XL-Files,” a thought-provoking and entertaining column in the IMS (Institute of Mathematical Statistics) Bulletin.
Overview
Calling in from Massachusetts, Prof. Xiao-Li Meng joins us for an entertaining episode that explores the goals of the Harvard Data Science Review, and several other interesting topics that have rarely been covered on the show before.
Coming up on the fifth anniversary of founding the Harvard Data Science Review, Xiao-Li began the open-access publication to further our beloved data science community and to shape what data science could be in the coming years. “What I saw at the time was a need for people from all walks of data science for a forum to debate with each other, to understand what each other was doing,” he says.
After tackling the differences between data science and statistics, Xiao-Li explains the principles behind data minding, a concept which he coined himself. According to Xiao-Li, data minding is a quality inspection process that scrutinizes data conceptualization, data pre-processing, data curation and data provenance.
Jon and Xiao-Li then dive into why there is no “free lunch” with data and discuss the many trade-offs that abound in the field. While there are too many to explore in just one episode, they focus on why the very concept of data privacy is at odds with the inherently shareable nature of data; how data quality and quantity are often inversely related; and why cleaner data can be less valid to the real world.
To cap off this lively episode, Xiao-Li deep-dives into the three schools of statistics: Bayesian, Frequentist, and Fiducial. Jumping off from his article in the Institute of Mathematical Statistics Bulletin, the professor gives listeners a primer on the three schools–or the BFFs, as he calls them. He then walks through the pros and cons of each, and why they’re all valuable for a data scientist to know.
Tune in to hear Xiao-Li’s passionate take on these topics and so many more. 

To see a SuperDataScience episode filmed live and in person with Jon Krohn and superstar Hilary Mason on Friday, June 10th at the New York R Conference, you can get tickets at 30% off with the code SDS30.

In this episode you will learn:
  • What the Harvard Data Science Review is and why Xiao-Li founded it [5:31]
  • The difference between data science and statistics [17:56]
  • The concept of ‘data minding’ [22:27]
  • The concept of ‘data confession’ [30:31]
  • Why there’s no “free lunch” with data, and the tricky trade-offs that abound [35:20]
  • The surprising paradoxical downside of having lots of data [43:23]
  • What the Bayesian, Frequentist, and Fiducial schools of statistics are and when each of them is most useful in data science [55:47] 
Episode Transcript

Jon Krohn: 00:00:00

This is episode number 581 with Professor Xiao-Li Meng, founding Editor in Chief of the Harvard Data Science Review and Professor of Statistics at Harvard University. 
Jon Krohn: 00:00:14
Welcome to the SuperDataScience Podcast. The most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex, simple. 
Jon Krohn: 00:00:45
Welcome back to the SuperDataScience Podcast. Today, we are graced by the presence of the prominent and visionary academic leader, Professor Xiao-Li Meng. Xiao-Li is the founding Editor in Chief of the Harvard Data Science Review, a publication in the vein of the renowned Harvard Business Review, but designed to further our beloved data science community. He’s been a full professor in Harvard’s Department of Statistics for more than two decades, chairing the Stats Department for seven of those years, and serving as Dean of Harvard’s Graduate School of Arts and Sciences for a further five years. And he’s on our show today. He’s published over 200 journal articles on stats, machine learning and data science, and he’s been cited over 25,000 times. He holds a PhD in statistics from yes, you guessed it, Harvard. Most of today’s episode will be of great interest to anyone who’s keen to better understand the biggest challenges and most fascinating applications of data science today. 
Jon Krohn: 00:01:40
There are moments here and there, however, particularly near the end of the episode, that do get technical and so will appeal most to practicing data scientists. In the episode, Xiao-Li details what the Harvard Data Science Review is, why he founded it, and the most popular topics covered by the review so far. He talks about the concept of data minding, a term that he invented. He talks about why there’s no free lunch with data and the tricky trade-offs that abound no matter what, about the surprising paradoxical downside of having lots of data, and about what the Bayesian, Frequentist and Fiducial schools of statistics are, and when each of them is most useful in data science. All right. You ready for this amazing episode? Let’s go. 
Jon Krohn: 00:02:32
Xiao-Li, I’m so excited to have you on the SuperDataScience Podcast. Where are you calling in from?
Xiao-Li Meng: 00:02:37
I’m calling from Brookline, where I live, and thanks for having me here. 
Jon Krohn: 00:02:42
That’s in Massachusetts? 
Xiao-Li Meng: 00:02:45
I hope so. Yes, in Massachusetts. 
Jon Krohn: 00:02:50
Nice. Easy to commute into Harvard. Do you have to commute a lot into Harvard these days? Or- 
Xiao-Li Meng: 00:02:55
I do. But actually, if you don’t mind, I have to share a story, because it reminds me of when I initially moved here, trying to figure out how long it would take me to go from my house in Brookline to my office. At that time, there was Yahoo Map, so I just put in where my location is, where my office is, and Yahoo came back and said it’s seven miles away. But guess what was the estimated travel time from my house to the… provided by Yahoo Map then? 
Jon Krohn: 00:03:36
In seven miles? I mean- 
Xiao-Li Meng: 00:03:38
Seven miles, yes. 
Jon Krohn: 00:03:40
20 minutes? 
Xiao-Li Meng: 00:03:41
Right, okay. Now you are obviously smarter than Yahoo Map. At the time, it provided a number that made me laugh so hard, and now I know exactly what they did. It was eight minutes. There’s no way you can get anywhere in eight minutes; you’ve barely gotten out of the garage. But I figured they must have assumed something like 60 miles per hour, so seven miles, eight minutes. So it’s a great example of how wrong things were then. But anyway, so the real travel- 
Jon Krohn: 00:04:10
Yeah, things are getting better. 
Xiao-Li Meng: 00:04:11
I hope so. Real travel is usually about, as people say, anywhere in the city, you go 40, 45, minutes-ish. Although I have done 20 minutes, to be precise 23 minutes, 3:00 AM in the morning. That I was able to do in 23 minutes, so. 
Jon Krohn: 00:04:29
Gotcha. So, we were introduced by Dr. Amy Brand in… so she was in Episode Number 567 of the SuperDataScience Podcast, a really great episode about open source publishing, and Amy is the lead at the MIT Press. She runs the show over there and she said that I had to talk to you. She said that Xiao-Li… she was like, “He would be such a perfect guest for the show,” and I instantly looked you up and I’ve been excited about having you on since. So it’s been a few months that I’ve been waiting for this interview. 
Xiao-Li Meng: 00:05:09
Thank you. 
Jon Krohn: 00:05:09
It’s so exciting to have you here. You really are an exceptional individual. So you are a Professor of Statistics at Harvard, which would mean that we would have a ton to talk about in its own right, and we are going to talk about some of your fascinating research in this show. But another thing that you do, that our audience would love, is the Harvard Data Science Review. So you’re the founding Editor in Chief of the Harvard Data Science Review, you’re coming up on your five-year anniversary? Is that right? 
Xiao-Li Meng: 00:05:40
Next year, yes. 
Jon Krohn: 00:05:41
Next year. 
Xiao-Li Meng: 00:05:41
Exciting. Next year. 
Jon Krohn: 00:05:44
And probably close to Amy’s heart, is that it is an open source publication, open access, and so that means that any of our listeners can check out the episode. So tell us about the Data Science Review, why did you found it? What is the niche that it fills? 
Xiao-Li Meng: 00:06:01
Right. Well, thank you very much. First I want to say that it is because of the founding of the Harvard Data Science Review that I know Amy, and she has been tremendously helpful, because Harvard Data Science Review is published by MIT Press. Now, I always have to tell people that that does not mean that MIT and Harvard have merged, although historically some have tried. It’s simply because Harvard University Press, where obviously I went first, doesn’t publish journals; they only publish books. That gave me a chance to work with MIT Press. And you mentioned Amy being a leader in open source; that’s something I want to talk a little about later as well. But let me first answer your question about how we started to think about Harvard Data Science Review. Harvard started to have a data science initiative, just like many other universities — every major, or even not so major, university these days feels the need or the necessity to start something on data science. 
Xiao-Li Meng: 00:07:07
So at the time I was just… I had just finished my five-year term as the Dean of the Graduate School here. And one of the co-directors of the Harvard Data Science Initiative came to me — her name is Francesca Dominici — and she basically said, “Xiao-Li, we know you are very creative, so give us some ideas here, in terms of what the Harvard Data Science Initiative should do, particularly going beyond Harvard.” Because most of these initiatives tend to be internal, trying to help the data science enterprise on campus. So I thought, “Well…” I said, “I do have an idea, because Harvard has brand names like the Harvard Law Review and the Harvard Business Review,” and what I saw at the time was a need for people from all walks of data science — and we’re going to talk about exactly what I mean by that later. They need a forum to debate with each other, to know each other, to understand what each other is doing, to understand why we have so many different definitions or understandings of what data science is. 
Xiao-Li Meng: 00:08:22
So I thought that would be something. I knew it would be very ambitious, but they were asking for opinions, and I always have opinions, suggestions. Of course, you can probably guess what happened next. After a few months they came back to me and said, “Xiao-Li, we love your idea, but you need to do it.” That’s usually what happens. And I thought pretty hard about it, and I did feel like this would be a very good enterprise to get into, and so that’s how I started. But I want to tell you, from the very beginning, the mission of Harvard Data Science Review has been very clear, very ambitious. I said, “If we can do this, we may as well do a big one.” So the vision is really to help to define and to shape what data science is, or should be. 
Xiao-Li Meng: 00:09:25
I can tell you tons about how the Data Science Review works. Our board is very broad — our slogan is, “Everything data science and data science for everyone.” And by now I think we have multiple boards. We have the editorial board, we have an advisory board, we have an early career board, in which we engage the young people. I think we have over 150 people, basically from philosophers — and I’m going to talk quite a bit about why philosophers are relevant — all the way to quantum physicists, to entrepreneurs, to government data science leaders. Basically, we’re trying to cover everything data science. And it has been enormously rewarding, as well as challenging, just because people come from very different walks, and even the way they talk about data science is very different. But that’s exactly what we initially planned to do. 
Jon Krohn: 00:10:37
Cool. That sounds like a smashing success. It’s amazing that only four years in, you already have so many different boards to work with and 150 people in really interesting, broad areas from philosophy to quantum physics. That’s cool. And you’ve had some pretty popular issues already. So one big one with over a hundred thousand views was on the 2020 election. You’ve had a COVID special issue, you’ve had issues on reproducibility in science and differential privacy. I don’t know if some of those topics, if you want to particularly dig into those, why was the election one so popular? 
Xiao-Li Meng: 00:11:16
Well, for obvious reasons, because the 2016 election predictions were… let’s be modest, pretty disastrous. And part of my own work really emphasizes data quality — I can get into details later — but at the time, what we decided to do was make sure that we published the special theme before the election actually happened. So we basically had a variety of political scientists and others who were brave enough to literally put their predictions online, so we could verify afterwards. So we managed… and also, time-wise it was very, very pressed, because we didn’t want people to predict too early, because that’s hard. So we published literally about a week before the election actually happened, and you probably know that at the time everybody was so nervous about it. So we got lots of hits. And then we did some podcasts later, actually, to follow up with the people who made the predictions and ask them, “Why did you get it wrong? Why did you get it right?” So on, so forth. 
Xiao-Li Meng: 00:12:30
So when I look at the tracking of the readers, the clicks, you can see clearly there’s a peak. We’re talking about data science — that’s what these signals are. Just blips. Huge blips. And so yeah, that’s what we did. And the one on reproducibility and replicability in science was actually a joint venture with the National Academy of Sciences, because they had a special report, I think back in 2019, and so we teamed up with them. We published that. There was a lot of interest. And the special issue on COVID — obviously, by the time I got locked down, that’s what, one or two years ago, I thought, “What do we do now?” And then obviously, given data is so important for all kinds of reasons, we started to put together an issue. I think we had a rolling-basis publication. We had people submit and we tried to do it as fast as we could. There were a lot of interesting challenges there as well. And so we started publishing in May, and the whole issue, I think, has lasted more than a year; we could just keep adding articles. By the way, for any audience that… this is completely free, completely open, so please check it out, including now, if you want to check that, so. 
Jon Krohn: 00:13:52
Nice. And so in addition to the written publication of the Harvard Data Science Review, I understand that there’s also a podcast, but that it’s actually mostly just about marijuana? No, I know that you have a- 
Xiao-Li Meng: 00:14:05
Well, yes. 
Jon Krohn: 00:14:07
… a recent episode happens to be on legalizing marijuana, but you’ve also had episodes about whether we’re alone in the universe, it sounds like lots of fascinating topics and probably digging into the data and the scientific literature behind particular topics.
 
Xiao-Li Meng: 00:14:25
Yes. Thank you, and we do have an episode about marijuana coming, I think in just a couple of weeks, probably even sooner than that. Well, the reason we do the podcast is probably the same reason that you do yours: we just want to engage audiences at all levels of interest and curiosity. And so far… this will be the 16th episode, so we’re very young. We do one a month, not like yours that’s twice a week — that seems like an enormous amount of work. As for the topics we choose, obviously we want topics where data science is relevant, but that’s actually a very low bar, because almost anything is data science relevant these days, so we want to choose topics that are of broad interest. And the way we do it, most of the topics we choose are related to a particular column we have in Harvard Data Science Review called Recreations in Randomness. That’s a column on data science for leisure activities. And so we had an episode on wine — how to pick a wine — we had episodes on sports, and the first episode was on matchmaking. And so- 
Jon Krohn: 00:15:50
Wow. 
Xiao-Li Meng: 00:15:52
Right. And so we typically have topics that are related to an article or future articles in Harvard Data Science Review, but talked about in a way that’s more plain language. We typically have two guests, usually with complementary views, and lots of these topics easily get people with different views. But our emphasis has always been that we want people to really talk about data science and be evidence-based, because as you know well, it’s easy to be very passionate about something and then choose the theory, or choose the data, choose the evidence, to support your theory or ideology. And we try to get people to really talk about it as much as possible, to be as rigorous as possible. Yes. And we certainly would love to have your audience tune in, to listen to some of these episodes, and we’ll see- 
Jon Krohn: 00:16:54
That sounds fascinating. 
Xiao-Li Meng: 00:16:55
… we’ll see whose episode is better. 
Jon Krohn: 00:16:59
See who does a better episode on weed. 
Xiao-Li Meng: 00:17:03
Ah, right. Right. 
Jon Krohn: 00:17:04
That sounds like a really fascinating topic area, and so yeah, I definitely encourage listeners to check out the podcast. That sounds like fun. This episode is brought to you by SuperDataScience, our online membership platform for learning data science at any level. Yes, the platform is called SuperDataScience. It’s the namesake of this very podcast. In the platform, you’ll discover all of our 50-plus courses, which together provide over 300 hours of content, with new courses being added, on average, once per month. All that and more you get as part of your membership at SuperDataScience. So don’t hold off, sign up today at www.superdatascience.com. Secure your membership and take your data science skills to the next level. 
Jon Krohn: 00:17:54
Cool. So, in terms of some specific articles that you’ve written that have made a big splash, you recently wrote an article that spells out the differences between statistics and data science. So what is the difference? What is data science? And how is it different from stats?
 
Xiao-Li Meng: 00:18:15
Right. Well, I think I’ve been asked that question many, many times, and I have given it a lot of thought, like many of my colleagues, but eventually I realized there is actually a very simple answer. It’s very surprising that there’s a simple answer. The simple answer I came up with is simply to tell people that the difference between statistics and data science is like the difference between physics and the physical sciences, or the difference between sociology and the social sciences. Data science by now has evolved. It’s not a single discipline. Data science is a collection of disciplines, certainly including statistics, computer science, engineering, and operations research, and I would go as far as saying there are lots of philosophers working very hard on thinking about what makes it possible to predict the future. All these kinds of [inaudible 00:19:13] questions.
Xiao-Li Meng: 00:19:15
And so thinking of data science as a collection of disciplines is not only important for understanding its scope, it actually has real consequences for universities, for building various data science related enterprises. So in my first editorial for Harvard Data Science Review — again, it’s entirely free, I encourage people to read it — I wrote that data science is an ecosystem. I call it an artificial ecosystem. Artificial both in the sense of artificial intelligence, because it’s computer based, but also in the sense of how it’s related to the whole of artificial intelligence. And the real consequence here, as I wrote there — and I think many of my colleagues and increasingly more people agree with me — is that we should not encourage universities to put together a data science department, because that’s too small a unit. It actually makes it very hard to have a reasonable curriculum that covers all these topics. You’d just be giving students way too many things. 
Xiao-Li Meng: 00:20:21
Imagine you have… you have probably never heard of — unless you are from some very small college that doesn’t have enough resources — barely any university has a department of science, because you just wouldn’t know what to do with it. Science has biological science, physical science, all sorts of stuff. So what the university should build, if they want to build something, is more like schools. For example, [inaudible 00:20:45] is building a college, so MIT built a college of computing, with various different names, so you really should think about these. These are high level. So that’s essentially how we, at least at HDSR, the Harvard Data Science Review, think about data science. And it really gave me the- 
Jon Krohn: 00:21:02
So, in your view then, instead of creating a department of data science at a university, there could be a faculty of data science. 
Xiao-Li Meng: 00:21:10
Absolutely. Absolutely. What’s interesting is that — and this is not just my view — Michael Jordan from Berkeley, as I’m sure you know, who is a true leading data scientist, both a computer scientist and a statistician, has been saying very similar things. He talks about how, essentially, when in the last 50, 100 years has there been such a good opportunity for a university to build a new kind of school? We have schools of science, schools of social science, public health, business schools — where is the opportunity? What topic has been so universal and important, the [inaudible 00:21:54] ever, a new school? And so data science is one of these very, very few opportunities. There are lots of challenges, including what exactly should be in a data science school or faculty, but I think the general scope is quite clear. It goes beyond a single discipline. 
Jon Krohn: 00:22:14
Right. That makes a lot of sense. Cool. All right. So yeah, so data science is a big umbrella that includes statistics, computer science, philosophy, lots of other disciplines. Now in that same article, you also talk about two concepts that I hadn’t heard of before. So maybe you can elaborate on them for us: not data mining, but data minding. What’s data minding, Xiao-Li? 
Xiao-Li Meng: 00:22:44
Well, you haven’t heard of it? I don’t blame you, because I coined the term, and I hope a lot more people will think about it. Well, let me say what I have in mind, and what I wrote that article about: I want to really emphasize how important the issue of data quality is, particularly these days when we rely on massive amounts of data and a lot of algorithms, including deep learning — or maybe particularly deep learning — that we don’t quite understand how they work. I know we have been working very hard; there are researchers who have been making big progress in understanding those things. But we rely on those massive amounts of data and these really interesting algorithms to come up with solutions, and as you probably can tell, people used to talk about what they call garbage in and garbage out. Because if you have very low data quality, particularly when you use these algorithms to learn the pattern from the data, you may get something that is really quite bad. 
Xiao-Li Meng: 00:23:57
And now in this particular article, I said, “I’m not worried about garbage in, garbage out, because if we recognize there is garbage, we know how to deal with it. I’m more worried about garbage in and package out.” Because those things get packaged very nicely, and people just take them as they are. And the example I used is what I meant by data minding. Data minding basically says, “We need to be very serious… before you do the data mining, you should really think through the data quality. Understand where the data came from, who collected the data, for what purpose, how the data were measured, what question was asked, and how they were processed.” Really mind the data. Think through all these questions and do a serious introspection of the data quality. 
Xiao-Li Meng: 00:24:52
And I always use the example which I think most people understand: the 2016 election predictions, where in the end we all got a surprise, regardless of your ideology. It’s just because we had so much data, so many predictions, and they all went one direction, most of them. And there was clearly a data quality issue there, because clearly there was… I did some studies in 2018 to show how a seemingly tiny amount of what they call non-response bias — the idea that people did not want to tell you what they really think. Not necessarily that they’re lying; they just say, “Oh, I have not made up my mind,” even when they have. And that kind of issue, if you don’t understand it directly and you analyze the data using traditional methods, you get tons of these predictions all going one direction, and then collectively… it gets very frustrating. Well, let me give you a specific example of the kind of data minding that I did in that particular article, which I think was a very telling example, and it shows the importance of doing it. I wrote that article because I was asked to by a journal — this is a journal of the Royal Statistical Society in England. They had a collection of articles on data science. They asked me to write an editorial, and they- 
Jon Krohn: 00:26:20
Yeah. Do you know Martin Goodson? I think he leads that section. Martin Goodson is the one who leads the data science section at the Royal Society. 
Xiao-Li Meng: 00:26:31
Oh, you’re talking about the Royal Society. I’m talking about the Royal Statistical Society. Yes, there is another… okay. Yeah, but the Royal Society has also been doing a lot of great stuff there. But I was asked to write… to write some commentary on the collection of articles. And they wanted me to write something like a Harvard Data Science Review kind of editorial. I told them, “Be careful what you wish for,” because I said, “I’m going to be very critical.” So I went through every article, keeping the data quality issue in mind. The article, again, is online, so people who want to are welcome to read it. I basically looked through every article to see how they talked about the data quality issue. Every article talks a lot about data analysis, all kinds of fancy stuff, but I said, “Well, wait. I want to see how you talk about the data quality issue before you do all those things, because I know how those things will affect your results.” 
Xiao-Li Meng: 00:27:33
So here’s one very simple example. One article wanted to study heat warning systems — like when summer’s coming and temperatures keep getting high, or you get a forecast that there’s going to be a heatwave for a few days, should a city issue a warning, particularly for the elderly, for the more vulnerable population? And this group of people wanted to study that; in the past we didn’t know how these decisions were made. They said, “Let’s look at the data. Let’s look at the data to see if these warning systems actually make a difference.” So they picked a city — I think it was Montreal, in Canada — that they wanted to study. They looked at 20, 30 years of data, trying to see, whenever there’s a heatwave warning, how that affects the deaths that happen in the city. 
Xiao-Li Meng: 00:28:27
So it was actually really quite a serious study, but there was one problem. The problem is they used the city’s data, and about the data… there was just one sentence. The data used were whatever the city recorded as the number of deaths on the day. So they used that data. Now, I happen to have studied a little bit the issue of reporting delay in how a city records this data. So I had a very simple question: “When they say these deaths are, say, today’s, did these deaths happen today? Or were they recorded today?” That makes a big difference. For example, the CDC has data on vaccination, and they have on average about a five-day delay. Right? 
Jon Krohn: 00:29:16
Right. 
Xiao-Li Meng: 00:29:17
Now, five days is enough to dislocate the heatwave and the real deaths. So it’s just a simple question. Because with a five-day difference you will see a lot less correlation, compared to if the deaths had happened on the day. But apparently that question was never asked, because I did not find anywhere in the article where they talk about that. I don’t know if the reviewers reminded them, but for me, these are the kinds of questions that anybody serious about this data analysis should ask. It could be a single question, to check with the city. You may find out there was an actual delay. Then the problem becomes more complicated: how do you adjust for the, say, five-day delay, whatever it is — it probably has a whole distribution. And that’s the kind of question that I think a data minder should ask. Otherwise you’ll be analyzing the data and missing the real signal. So that’s just a simple [inaudible 00:30:06]. 
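
For readers who want to see the mechanics Xiao-Li describes, here is a small illustrative simulation — not taken from the Montreal study, and every number in it is invented — of how a roughly five-day reporting delay can largely wash out the correlation between heatwave days and recorded deaths:

# Illustrative sketch only: how a ~5-day reporting delay can hide a heatwave signal.
# All numbers are made up for the example, not taken from the study discussed above.
import numpy as np

rng = np.random.default_rng(0)
n_days = 3650                                   # ~10 years of daily data
heatwave = rng.random(n_days) < 0.02            # ~2% of days are heatwave days
baseline = rng.poisson(20, n_days)              # ordinary daily deaths
excess = rng.poisson(8, n_days) * heatwave      # extra deaths on heatwave days
deaths_by_occurrence = baseline + excess        # deaths counted on the day they happen

# Re-tabulate the same deaths by the day they are *recorded*, assuming each
# death is logged with a random delay that averages about five days.
deaths_by_record = np.zeros(n_days, dtype=int)
for day, count in enumerate(deaths_by_occurrence):
    delays = rng.poisson(5, count)              # per-death reporting delay
    record_days = np.clip(day + delays, 0, n_days - 1)
    np.add.at(deaths_by_record, record_days, 1)

def corr(x, y):
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

print("correlation, deaths dated by occurrence:", round(corr(heatwave, deaths_by_occurrence), 3))
print("correlation, deaths dated by recording: ", round(corr(heatwave, deaths_by_record), 3))
# The second correlation is far weaker: the same deaths, dated by when the city
# recorded them, are dislocated from the heatwave days that caused them.

Asking the single question Xiao-Li suggests — on which day were these deaths actually recorded? — is what tells you whether you are analyzing the first, real signal or the second, attenuated one.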
Jon Krohn: 00:30:05
Right. Data minding. Yeah, that makes a lot of sense, and is a general concept to understand your data before you start data mining or modeling, definitely critical. It is way too easy today to take whatever data you can find, throw it into a model, and then as you describe, have garbage in and then supposedly, something well packaged coming out the other end. All right. So data minding was one of the terms that you introduced to me in that article. Another one was a data confession. What is that? 
Xiao-Li Meng: 00:30:40
Yes. Well, that is basically… I’m encouraging anyone, particularly those of us who are scholars publishing data science articles, to tell as much as possible about the data defects, especially if we collected the data. We tend not to confess that much. It’s not necessarily that people want to hide things, but the journals probably have not required that much disclosure. Anytime any of us has done a real data science project, you know how messy data are. From conceiving the questions to how to collect the data, there are lots of judgments being made, and the process itself sometimes gets completely lost, even if we are the ones collecting the data. So by data confession, I really want people to think through, and also to record during the process: how did you come up with this idea? Where did you get the data? What corners have you cut? We have all cut corners. And it’s a massive process. 
Xiao-Li Meng: 00:31:44
But whenever we send papers out to publish, typically — and this has been a tradition in the field — we spend way more time talking about the algorithms, the models we fit, everything we do. We say a lot less about the data itself, particularly when the data came from somewhere else. And it’s even harder when we got it from somewhere else; we don’t even know what other people have done to it. So I’m encouraging a general culture, which is not really new. I was inspired by talking to people from library science on my board at Harvard Data Science Review, because they are the ones worried about data curation, just like libraries curate books. They are the ones thinking about how the data… which versions? How do you record the whole thing? How do you think about reproducibility and replicability? How do you make sure that others in the future can verify what you have done, if you didn’t record all those things? 
Xiao-Li Meng: 00:32:46
So these are hard. These things are much more tedious and often not much gets rewarded; in fact, the more you disclose, the more the review comes back saying, “Hey, your study really isn’t worth that much. You should do better.” So you can see the incentive system is not quite right for people to confess. And so that’s why I’m using this phrase, data confession. I think if we really take data science seriously — think about replicability of science, reproducibility of science seriously — we would [inaudible 00:33:15] really do more about disclosing all the defects, the warnings, for others. I’m not saying we cannot proceed, but it’s much better for us to tell people what the limitations are, what the things are that… we all do the best that we can. That’s how science progresses. But we should try to do as much as possible. I know that’s sometimes very hard, particularly where there’s data confidentiality, all kinds of issues — that’s another big topic by itself. But looking at all these articles in the statistical literature, I think we disproportionately spend way too much time on the later part, which is absolutely important, but there is this much more fundamental early part, which we tend to say much less about — sweep it under the rug, so to speak. 
Jon Krohn: 00:34:06
Nice. Okay, understood. Yeah, data confessions do sound like another key part of helping data science with reproducibility. And again, it’s related in a way to the data minding, because we’re concerned about the data quality going into whatever models we develop. And as you say, we put outsized attention on model development and not enough on data quality. So in a separate article… actually no, this one’s a YouTube video. So we’ll be sure to include links to all these. So the article where you talked about differences between statistics and data science, as well as data minding, data confession, we’ll include a link in the show notes to that article, as well as to this video on another concept related to a data science philosophy, just like differences between stats and data science and data minding and so on. So you can see that there is not only a lot of philosophers working for the Harvard Data Science Review, but also a lot of philosophy going on in your head, Xiao-Li. So this one that you mentioned in this video with that… you claim that there’s no free lunch in data science and that data come with trade-offs. So what does that mean? What are the trade-offs that data scientists must be aware of when they work with data? 
Xiao-Li Meng: 00:35:36
Sure, thank you. Well, first I want to say this no free lunch is not just a principle for data science, it’s a principle for life. We all like to get things [inaudible 00:35:51] and cheap and great — you know how in life you usually have to make a compromise? We all like all the best stuff, but things come with a price. And there are multiple particular examples of trade-offs in the data science space. One of them, which these days gets a lot of attention, is the trade-off between data privacy and data utility. A particular example, for which Harvard Data Science Review will publish a special issue, is differential privacy for the 2020 census. 
Xiao-Li Meng: 00:36:32
Here is a trade-off, or rather a dilemma, that the Census Bureau was given. The constitution mandates that every 10 years the Census Bureau should collect the data as accurately as possible. Then you also have Title 13, which says that you need to protect people’s privacy as much as possible. Now, it would be wonderful if there were a way to have incredibly accurate data that is also private. But you know that’s not possible, and I always tell people, “If you think about the term data privacy itself, it’s, bluntly, an oxymoron.” Because data are born to reveal, and privacy means to conceal. So we are just forever having this problem, and I also say that we humans are very good at creating dilemmas for ourselves. We all want lots more information — we like having information about others, we like having all the information to help us build AI systems. But then: “No, no, no. You collect information on others, but don’t collect it on me.” 
Xiao-Li Meng: 00:37:47
And now, the problem, I always say, is that other people are us. There is just no way to get around it. So you can see that is one big trade-off. The other trade-off is one we already touched on: the data quality issue. It’s a trade-off between data quality and data quantity. Now, in an ideal world you say, “Well, why should there be a trade-off? Surely I can have lots of data with high quality.” Well, great — and sometimes that happens. When that happens, congratulations, you have a much easier job. But often it’s the case that you can get a massive amount of data, for example from social media — Facebook, other places — but those data come with really questionable quality. I very recently published an article in Nature looking into Facebook’s survey and other surveys on vaccination uptake. 
Xiao-Li Meng: 00:38:42
And it turned out they were quite biased. We all understand why: people on Facebook tend to have a higher education level, for example, compared to people who are not on there. So when you do a survey there, you end up seeing a vaccination rate that is much higher — at some point it was 17 percentage points higher than the CDC benchmark. And then, you can have very good quality data if you have very well controlled studies, so on, so forth. But those tend to be much smaller, much harder to conduct, much more expensive. So you have this trade-off between data quality and data quantity. 
Xiao-Li Meng: 00:39:25
Another trade-off is between what I call data cleanness and data validity. Any data scientist, yourself included — I’m sure you know that we spend most of our time trying to clean up the data whenever we do anything. The data come in messy: things are missing, things do not match, I don’t know how this thing was coded, and we all wish the data were very clean. Now, occasionally we get very clean data. Then I’ll be very worried, because I know that a lot of cleanup has been done to the data, and I need to know how those things were done. And often you will find out something very simplistic has been done to it. 
Xiao-Li Meng: 00:40:03
So that’s another trade-off: you can have very clean data that seems very easy to analyze, but you may get answers that are really not that relevant, because someone else has tried to clean up the data. One very common approach, which unfortunately still happens, is called, for example, complete case analysis. People throw away any cases where there’s missing data and only study the cases where everything is completely recorded. Well, that is very easy to analyze, but the people for whom everything gets recorded are probably a very special group, particularly, for example, when you study the [inaudible 00:40:37] issues, whose records are completely recorded. So you get a verified [inaudible 00:40:41]. 
Jon Krohn: 00:40:40
It’s not a random sample. 
Xiao-Li Meng: 00:40:43
It’s not a random, representative sample at all. So there are tons of these trade-offs, and all of this is what I mean by there being no free lunch. Now, the upshot of that — the good news — is that there are always jobs, tons of jobs, for good data scientists and statisticians, because of these issues. 
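
Here is a tiny sketch — the population, the missingness mechanism, and all the numbers are invented for illustration — of how complete case analysis can quietly produce a biased answer when the chance of a value being missing is related to the value itself:

# Illustrative sketch only: why dropping incomplete cases ("complete case
# analysis") can bias an estimate. Everything here is invented for the example.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.6, size=100_000)   # hypothetical population

# Suppose higher earners are more likely to skip the income question.
p_missing = np.where(income > np.median(income), 0.7, 0.2)
observed = income.copy()
observed[rng.random(income.size) < p_missing] = np.nan

complete_cases = observed[~np.isnan(observed)]
print(f"true population mean: {income.mean():,.0f}")
print(f"complete-case mean:   {complete_cases.mean():,.0f}")
# The complete cases are wonderfully clean and easy to analyze, but they are
# not a representative sample, so the estimate is systematically too low.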
Jon Krohn: 00:40:59
Right. Yeah, these are not issues that it will be easy to automate away. So people sometimes- 
Xiao-Li Meng: 00:41:06
It’s got no point. 
Jon Krohn: 00:41:06
So sometimes people are worried about, “Oh, AutoML is going to take data scientists’ jobs.” And it’s precisely because of these issues. This AutoML approach, where we figure out what the optimal algorithm is and its hyperparameters — okay, aspects of that in some circumstances can be automated- 
Xiao-Li Meng: 00:41:29
Yes, definitely. 
Jon Krohn: 00:41:29
… but it’s not useful if you have a high quantity of data with low quality, to use one of your comparisons, or you have so much pre-processing happening, the data are so cleaned up in some automated way, that the data are no longer valid in the real world. They become irrelevant. So- 
Xiao-Li Meng: 00:41:54
And it’s not an answer to the relevant question that you really care about. I’m all for automating as much as possible, but we have to be really reasonable in terms of understanding — this is what I call the no free lunch principle here. The example I just gave on the heatwave study: it would be so hard to teach any machine to recognize, “Hey, did you check this data? Did the deaths really happen on that day? Or were they reported on that day?” How would any algorithm do that? The only reason I could do it is because I have done studies in some other cases before; I have this way of thinking about those things. Now, I’m hoping that someday we may be able to train AIs to think like a human — and I’m talking about real AI — to accumulate past experience, but we are far from there yet. And I’m not quite sure we can ever get there, because it’s a kind of thinking, connecting thoughts, that takes human intelligence, and most of the time we don’t even understand how human intelligence works itself, so it’s hard to model it. That’s the kind of experience that comes with humans, and it’s very hard to automate. 
Jon Krohn: 00:43:10
So cool Xiao-Li, to hear so many data science philosophy thoughts from you, I’ve really enjoyed this episode so far and we’ve got more to come, so another one here from another article of yours. This is from an article in the Harvard Gazette. And you were talking about this paradox related to having large amounts of data. So we’ve talked already a bit about quality versus quantity, so maybe this is going to relate again here, but in this Harvard Gazette article, there were elements related to COVID-19 in the paradox of having lots of data. And then you also have a paper where this can happen with electoral surveys. So can you elaborate for us on how bias is amplified, the larger the dataset we have? 
Xiao-Li Meng: 00:44:02
Yes. Well, thank you very much for asking that question. In fact, that Gazette article was really about the one I already mentioned, the publication in Nature on COVID vaccinations. But let me backtrack a little, because that’s where I started, and I surprised myself after I had done the calculation. The other one, the electoral one, is the 2018 article. What I did was try to quantify data quality. Now, it’s much easier to quantify data quantity, obviously, but it’s much harder to quantify data quality. So I started this work back in 2012, 2013. The department at the time had a visitor coming from the US Census Bureau by the name of Jeremy Wu, and Dr. Jeremy Wu at that time made a presentation to my department, and he asked the following question, which I think is a really fun question for the audience to think about — and if I have the chance, I usually ask the audience: how would you answer this question? 
Xiao-Li Meng: 00:45:16
So at the time — and particularly now there are a lot more of these cases — the US Census Bureau was starting to think about, “How do we utilize all this administrative data?” There are lots of data already out there. And the question is, how do we use these data that we know cover a large percentage of the population, but that were not collected for statistical inference purposes, so they are not randomized, not a [inaudible 00:45:42] sample. So Jeremy asked me and my fellow statisticians the question: “If you are given data on 5% of a population, which we tell you is good quality, statistically valid, a random sample, versus data on 80% of the population, where I just tell you it covers 80% but I have no idea of the quality of the data — which one would you trust? Do you trust the 5% of the data? Or do you trust the 80% of the data?” 
Xiao-Li Meng: 00:46:16
Of course a statistician’s first response would be — or anybody’s response would be — “Trust for what purpose?” Jeremy responded, “Well, let’s say we want to estimate the population average. For example, we want to estimate, in the end, how many people will vote for a particular candidate.” So most statisticians’ answer would be, “I’ll trust the 5% of the data, because I know how to assess its uncertainty; I have all the formulas, because it’s a random sample.” So Jeremy said, “Okay, that’s fine. But now I’m going to change the question. You still have the 5% probability sample, but if the other one covers 90%, do you change your mind?” Now it’s 90% versus 5%. 90% means, “I know it covers 90% of the population, but I don’t know its quality.” 
Xiao-Li Meng: 00:47:03
Well, some statisticians, myself included, started to think, “Well, at 90%, I probably should switch.” Some people said, “Don’t switch.” And Jeremy said, “What about 95%?” At some point you have to say, “Wait, wait, wait. I have to switch, because by the time it gets to 100% — assuming people answer everything — 100% should be the right answer.” But that got me thinking about Jeremy’s question: how do you measure the quality of your data? Where do you make that switch? What’s the calculation? And that’s a fascinating question. So that’s how I started to do this work. And eventually… it took years, actually. Eventually this article was published in 2018, and I did the following calculation, which was mind-boggling, and which really started my journey of pushing people to think more about data quality issues. 
Xiao-Li Meng: 00:47:51
So I did the following calculation, using 2016 data. From the 2016 data, I can estimate how correlated people’s response — say, whether they voted for Trump or not — is with their willingness to participate in a survey. The idea is that that correlation should be zero if people just answer honestly; but if, let’s say, people think, “Because I’m going to vote for Trump, I don’t want to tell you,” you know that creates bias. So we estimated — because after 2016, we know the answer — what was that correlation? The correlation turned out to be minus half a percent. It seems very tiny. Minus half a percent. Minus because the people who tended not to respond were the ones who voted for Trump, so it’s a negative [inaudible 00:48:35]. It turned out that that minus half a percent essentially caused a tremendous loss of effective sample size, by which I mean the following. 
Xiao-Li Meng: 00:48:45
Well, I did a rough calculation. Before the election there were so many surveys out there; it amounts to about 1% of the eligible voting population providing their opinion. Let’s say that’s 2.3 million people — that’s about 2,300 surveys, each with 1,000 people. But if I do the calculation, what’s the effective sample size? Meaning, the statistical information in those 2.3 million answers, which sounds like a lot — what’s the equivalent number of answers if there weren’t any defect, as if it were coming from a true simple random sample? The answer is a surprising one. The answer, as published in the article, is about 400 people. So basically, that half-percent negative correlation caused damage which essentially lost about 99% of the data. The statistical accuracy- 
Jon Krohn: 00:49:39
99%? 
Xiao-Li Meng: 00:49:40
Right. The statistical accuracy, we can show mathematically, is equivalent to getting the same answer from 400 people without that bias. So that was the calculation. And I carried out the same calculation for that COVID study done by Facebook. What we calculated is, they have 250,000 people, but in the end, because of the correlation they have — comparing to the CDC benchmark — their effective sample size is essentially anywhere from 10 to 250, compared to 250,000. So that’s what I call the big data paradox, because it comes with two parts. Not only do you have a tremendous reduction of the sample size; because you have so much data, if you do not realize how small the effective data size is and you just do the traditional confidence interval calculation — because you want to give error bars — 
Xiao-Li Meng: 00:50:44
Your error bar will be way too small, because you use this erroneous sample size. Your error bar is so small that — actually, the more data you have, the more surely you’ll be centered at the wrong place. You will make sure you never get to the right place. This is what I meant by the big data paradox: the larger the data size, the more sure your answer is, but you’re centered at the wrong place. Your [inaudible 00:51:10] literally shrinks as the sample size goes up. You’re going to have a higher and higher chance of preventing yourself from getting to the right place at all. That’s the big data paradox. 
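
A rough sketch of the arithmetic behind this, using the approximate figures quoted in the conversation and the error identity from Xiao-Li’s 2018 paper; the population and survey counts below are round guesses, not the paper’s exact numbers:

# Back-of-the-envelope version of the effective-sample-size arithmetic above,
# based on the identity in Xiao-Li's 2018 paper ("Statistical Paradises and
# Paradoxes in Big Data"): estimation error = rho * sqrt((1 - f) / f) * sigma,
# where rho is the tiny correlation between responding and the answer itself,
# f = n / N is the fraction of the population actually heard from, and sigma
# is the population standard deviation. The counts below are rough round figures.
import math

N = 230_000_000     # eligible voting population (rough)
n = 2_300_000       # survey responses, roughly 1% of the population
rho = -0.005        # "minus half a percent" data defect correlation

f = n / N
# Size of a true simple random sample with the same mean squared error:
n_eff = f / ((1 - f) * rho ** 2)
print(f"nominal sample size:   {n:,}")
print(f"effective sample size: {n_eff:,.0f}")              # about 400
print(f"data effectively lost: {100 * (1 - n_eff / n):.2f}%")   # over 99%

# The paradox: a naive error bar still shrinks like 1/sqrt(n), so with 2.3
# million answers it looks tiny even though the estimate sits in the wrong place.
sigma = 0.5         # rough standard deviation of a yes/no response
naive_se = sigma / math.sqrt(n)
actual_error = abs(rho) * math.sqrt((1 - f) / f) * sigma
print(f"naive standard error:    {naive_se:.5f}")       # a few hundredths of a point
print(f"expected systematic err: {actual_error:.5f}")   # a couple of percentage points

With these inputs the effective sample size comes out around 400 and the loss above 99%, matching the figures Xiao-Li quotes.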
Jon Krohn: 00:51:22
This is something I think about a lot, as we have lots of data. And so this is something that impacts me all the time. If you try to use statistical approaches like Frequentist statistical approaches that were developed over a hundred years ago. So the common way to get those error bars would be to say, divide by the square root of the number of samples you have. 
Xiao-Li Meng: 00:51:45
Exactly. 
Jon Krohn: 00:51:46
And when you’re dealing with millions of data points, that means that your error bars are… they don’t exist. You’re saying, “I know exactly where we are.” 
Xiao-Li Meng: 00:51:58
Exactly. Right. 
Jon Krohn: 00:51:58
Yeah, but- 
Xiao-Li Meng: 00:52:00
But yes, you know exactly where you are, except you are at the wrong place. So the whole paper I wrote is really to show these calculations. And you are speaking of these Frequentist methods — they are based on [inaudible 00:52:12] assumption, which is that the data are well mixed. It’s a probabilistic sample. And so that’s where- 
Jon Krohn: 00:52:22
So we’re taking- 
Xiao-Li Meng: 00:52:23
So the- 
Jon Krohn: 00:52:24
… we’re taking techniques where we explicitly say, and even if you’re in an introductory statistics class as an undergraduate student, you’re saying, “Okay, this statistic, it only applies if you have a representative random sample.” But then we’re constantly using that today in situations where we don’t have a representative random sample, and pretending [inaudible 00:52:46]. 
Xiao-Li Meng: 00:52:46
Right. We knew we didn’t have it. Right, because… and the thing is, I guess, in my 2018 article — it’s also available online — I think the surprising part, which even some of my statistics colleagues are still debating with me, is that what my calculation shows is that what matters is no longer just the sample size, it’s the relative sample size: the sample size divided by the population size. And the larger the population, the worse you always end up. To a lay person, that’s not surprising — a large population is harder to analyze. But statistically, I use the following analogy to help the general population understand why statisticians at the very beginning said, “We can ignore the population size.” It’s very much like tasting soup. If you’re tasting soup, and you have a well mixed soup, then regardless of the size of the container, all you need is just a couple of spoonfuls, because you can taste how salty it is, how delicious it is. As long as the soup is well mixed, you don’t really need that much. 
Xiao-Li Meng: 00:53:58
But if the soup is not well mixed, then you can see the size of the container really matters. Because the larger the container, the more chances that some salty chunks are somewhere else. And so, once you’ve lost this well-mixed assumption — that’s essentially what it is, the representativeness, the homogeneity — then the problem becomes a lot, lot harder. And the problem these days, with social media data particularly, is that the data are self-selected. You cannot really randomize people and say, “You’ve got all these feeders, you go on this Facebook.” That doesn’t happen. You have lost control, in terms of statistical control. And so then you really have to be careful. We all knew these things, but we all hoped these problems were not that big. What I did was just to show how serious the problem is, because it really kills 99% or even more of the amount of data. And you can just do the mathematics; it’s just out there. 
Jon Krohn: 00:55:01
Fascinating. That was a really great explanation of this paradox of having more data. Really enjoyed that discussion. And it leads perfectly into our next topic, because we started talking about Frequentists. And so, yet another article you have, it’s an amusing article in a generally amusing column that you have in the Institute of Mathematical Statistics Bulletin. So you have a column called the XL Files, which I can only assume is a play on the X-Files TV show name and your name, Xiao-Li. The XL Files- 
Xiao-Li Meng: 00:55:41
Correct inference. 
Jon Krohn: 00:55:46
And so amongst many fun articles in there, there’s one of them titled BFF Forever. And so BFFs for those of us who aren’t high school aged girls stands for best friends forever. BFFs. And you use that to summarize the three big schools of statistical inference. So Frequentist is one that we just talked about. Another one is Bayesian, and the third one is actually one that I am not very familiar with, Fiducial. So for our listeners’ sake, could we go over what these three different schools are, Frequentist, Bayesian and Fiducial, and in the last case there, you’re going to be explaining it to me for the first time too. 
Xiao-Li Meng: 00:56:40
Yeah. Well, I don’t blame you. Fiducial is one of these perspectives that has existed for a long time, essentially coming from R.A. Fisher. Fisher contributed a lot to statistics, as you know, but this has always been regarded as Fisher’s biggest blunder, because it’s the most controversial one. So, let me give it a try. This by itself is worth — we could probably do a two-hour episode to explain all three of these different perspectives, and why we have this community called BFF.
 
Xiao-Li Meng: 00:57:15
First, I think the Frequentist one is probably the one most of us start from when we learn statistics, because most textbooks teach that, and I sincerely think I went through the whole process myself without even knowing there were these different names. So I think the best way I can explain this — let me really give it a try — is to start from thinking about data. Any data have two pieces. One part is what we call signal: that’s something we want to understand, to know. The other part is noise. It’s there, it’s annoying, we want to get rid of it, but data basically come with both signal and noise. And so for the entire data science community, in fact, all we do is try to separate what is signal and what is noise. The complication is, first, how do you do that? Second, signal in one study is noise in another study, and vice versa. The notion itself is a relative one. 
Xiao-Li Meng: 00:58:23
So you can see why the philosophy is there all the time, because of these issues... For example, a particular one I have also studied quite a bit is individualized medicine. There, what's the signal? What's the noise? What's evidence? It's very complicated. But let me go back to explaining these three schools. The Frequentist essentially focuses on the data itself. Whatever study you do, you need to think about what a replication is, because in order to convince anybody, particularly scientifically, you need to be able to say, "If I do this study again, my method will work, not just for this one case." It's not even clear what it means to say it works for this one case. You have to think about the replications, about what the replications are. And it is in thinking about these different replications that the different philosophies come in. 
Xiao-Li Meng: 00:59:19
Frequentist thinking focuses on the data. You hypothetically imagine that you can repeat the process again and again, seeing different datasets, and ask how your procedure is going to perform. For example, in prediction with the whole randomization setup, you only have one sample; let's say the 400 people we talked about. But the idea is: if I could replicate that sample of 400 people again and again, many times, how would my procedure perform? That's called Frequentist. So you see, it basically requires you to think about the data you actually don't have. That's the Frequentist. 
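As a concrete sketch of that replication idea (my own toy example, not from the episode; the true support level of 0.52 and the 10,000 hypothetical replications are invented, and the 400 respondents echo the example above), a frequentist evaluates a procedure, here a 95% confidence interval for a proportion, by asking how often it would succeed across repeated datasets drawn from the same process:

```python
import numpy as np

rng = np.random.default_rng(1)

true_p = 0.52   # the fixed, unknown "signal" (made up for illustration)
n = 400         # one poll of 400 people
reps = 10_000   # hypothetical replications of the same study

covered = 0
for _ in range(reps):
    responses = rng.random(n) < true_p                 # one hypothetical dataset
    p_hat = responses.mean()
    se = np.sqrt(p_hat * (1.0 - p_hat) / n)
    covered += (p_hat - 1.96 * se) <= true_p <= (p_hat + 1.96 * se)

# The frequentist guarantee is about this long-run rate over datasets you never
# actually see, not about the one poll you happen to have in hand.
print(f"Coverage over {reps} replications: {covered / reps:.3f}")  # roughly 0.95
```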
Xiao-Li Meng: 00:59:57
The Bayesian says, "Well, I don't care about the data I don't have; I care about the data I have. Can you tell me, based on the data I have, what's going on? How do I make an inference [inaudible 01:00:11]? Because all I have is my data." Now the problem is that if you think like a Frequentist there, you get stuck, because there is no replication: the data are fixed, and whatever scenario I want to know about is a fixed scenario, it is just unknown. So the Bayesian thinks, "Okay, let's hypothetically consider all the different scenarios that could have generated the data I have seen. Among all these scenarios, which one is the most likely?" For example, if you have some COVID-like symptoms, I have to think about the many possible reasons why you are coughing, like what I'm having now, and ask which is most likely. 
Xiao-Li Meng: 01:00:54
Now, I think for most people, that is the question they really want answered: I don't care about other people's symptoms. But in order to answer that question, you need to come up with a replication: think about all these different scenarios. That's what the Bayesian calls the prior, the prior knowledge you need to think about. And that itself is a controversial notion. You need to think probabilistically about all the different scenarios you could be in. A Frequentist can reject that notion and say, "No, no. I only have one disease. I don't want to assume that I have multiple diseases." But you see how the direction is different? The replication is now to imagine all the different scenarios that could have generated exactly the data I have. So these are the two directions. One fixes a scenario and thinks about all the different datasets it could generate; the other is the reverse: you fix your data and think about all the different scenarios that could generate the data you see. 
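Here is a toy Bayes' rule calculation of that "which scenario most likely generated my data" logic. The candidate causes of a cough, their prior probabilities, and their likelihoods are all invented, not medical estimates; the sketch only illustrates the mechanics of fixing the observed data and weighing scenarios by prior times likelihood.

```python
# Observed data: a cough. Candidate "signals" (scenarios) that could explain it.
# All numbers below are purely illustrative.
priors = {"common cold": 0.70, "allergy": 0.25, "covid": 0.05}
likelihood_of_cough = {"common cold": 0.60, "allergy": 0.30, "covid": 0.80}

# Posterior is proportional to prior times likelihood, normalized over scenarios.
unnormalized = {d: priors[d] * likelihood_of_cough[d] for d in priors}
total = sum(unnormalized.values())
posterior = {d: round(v / total, 3) for d, v in unnormalized.items()}

print(posterior)
# {'common cold': 0.785, 'allergy': 0.14, 'covid': 0.075}
```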
Xiao-Li Meng: 01:01:49
The Fiducial is the hardest one, because Fiducial does not fix either of them. Think about the pair: all the datasets, and all the scenarios you could have. Imagine every person in the population: you have your symptoms, you could have your diseases, and you have all these different pairs. The Frequentist works on the data; the Bayesian works on the signal, putting the distribution on the signal. The Fiducial works with the noise part: it looks at the difference between the signal and the data, thinking about what possible noises are consistent with the symptoms I'm having. It operates on this joint space, and that has, philosophically and even operationally, quite a bit of trouble. That's why you probably have not heard about it; people don't really teach it that much. 
Xiao-Li Meng: 01:02:53
But it offers a way... What the Fiducialists are trying to achieve is to answer the Bayesian question, which is: given my data, what is the disease I have? But they don't want to assume that kind of prior knowledge, which you may or may not have. That's what makes it much harder. They're trying to use the distribution on the noise to infer a distribution on the disease, without assuming a prior distribution. And in the end [inaudible 01:03:29], that's not really possible in general, but there are certain scenarios under which you can get pretty far. That's why this school is much more philosophical, much harder to operate, much harder to teach, and much harder for me to explain. I'm sure my Fiducial friends will say, "Xiao-Li, you didn't give the right interpretation in that explanation." But I tried. I would invite anyone to try explaining Fiducial without formulas. 
Jon Krohn: 01:03:54
There's no way for me to know if it was the right explanation, but it did make sense to me, so I liked it. So Fiducial statistics makes use of the data that you have, it doesn't worry about some unknown distribution that you don't have, and it's trying to make inferences without priors, the priors that Bayesian statistics relies on. And- 
Xiao-Li Meng: 01:04:15
Right. Right. It's trying to give a distribution as an answer without assuming a prior. And it turns out you can do that in some sense, because you do have a distribution on the noise. The way they do it, essentially, is to say, "Just imagine, very simply, that your data equals signal plus noise." Once you see the data, if I have a distribution on the noise, it somehow implies there is a distribution on the signal, because the data equal the signal plus the noise: I can solve that equation. Except that solving for these distributions doesn't work out exactly right, and that's where the complications come in. But it is an attempt to get the best of both schools, and that's why the thing is still alive. In fact, what's interesting is that there are increasingly more people in the machine learning community doing things without realizing they're actually doing Fiducial. They're plugging things in, solving equations, assuming this is known and that is known. If you look at it, they are not coherent; they don't properly follow either school's rules, they follow some equation-solving rules. That actually is what Fiducial does, except that most people don't know that what they're doing is actually a Fiducial answer. 
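A minimal sketch of that equation-solving step, using the standard textbook special case rather than anything from the episode: suppose a single observation satisfies data = signal + noise, with the noise known to be standard normal. Once you observe the data (the value 3.2 below is made up), pushing the noise distribution back through signal = data minus noise induces a distribution on the signal with no prior in sight.

```python
import numpy as np

rng = np.random.default_rng(2)

# Model: y = theta + eps, with eps ~ N(0, 1) known and theta unknown.
y_observed = 3.2  # a single observed data point (value invented)

# Fiducial step: invert the equation, theta = y - eps, and let the known
# noise distribution induce a distribution on the unknown signal theta.
eps_draws = rng.normal(loc=0.0, scale=1.0, size=100_000)
theta_fiducial = y_observed - eps_draws

# In this simple location model the induced distribution is N(y, 1); it happens
# to agree with the flat-prior Bayesian posterior and with the usual frequentist
# confidence interval, which makes this special case look deceptively easy.
print(np.percentile(theta_fiducial, [2.5, 97.5]))  # roughly [1.24, 5.16]
```

The trouble Xiao-Li alludes to starts once the model is richer than this one-line inversion: with several parameters or non-invertible structure, the "solve the equation" step no longer yields a single coherent distribution, which is where the philosophical and operational difficulties come in.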
Jon Krohn: 01:05:30
Yeah, interesting. You learn something new every day, and that's a big one for me. I love that I've learned a whole other school of statistical inference, Fiducial. We've obviously never had an episode on Fiducial statistics on the SuperDataScience show, or I would know what it is. But if listeners are interested in hearing more about Bayesian statistics, we have an epic two-hour-long episode, Episode Number 507 with Rob Trangucci from 2021, that is all about Bayesian statistics. I thought we might be taking a risk with that episode, that it would be too technical, but it's one of the most popular episodes we've ever had. So you can check that out if you're interested in learning more about Bayesian stats. But my question for you, Xiao-Li, is this: now that we're aware of these three schools of statistical inference, Bayesian, Fiducial, Frequentist, a lot of people debate that one is better than the others, but you think that these three, B, F and F, Bayesian, Fiducial and Frequentist, should be BFFs forever. They should be best friends forever. So why do you say that? 
Xiao-Li Meng: 01:06:51
Well, it's a great question. And the reason I say that, and so do my BFF community colleagues, is because we all went through these different perspectives. We tried to convince ourselves which one is the best, and we ended up realizing we really need them all, because each of them has something to offer, and each of them has its own weaknesses. In fact, I've come to a conclusion, and this is part of the more foundational thinking I'm trying to do, that I don't think we will ever settle which one is better. Because fundamentally, any time you do anything in data science, there is a leap of faith involved. Let me explain what that means. This is part of the reason I think philosophy is so important for thinking about data science itself. 
Xiao-Li Meng: 01:07:43
But you have to ask yourself: any time I have data and I'm asked to use the data to predict something that nobody knows for sure, because if they did, we wouldn't need the prediction, what makes it possible to go from the data to this thing I want to know? Logically, why would that even be possible? What part of nature, or if you believe in God, what part of God, allows us to do that? So every school, and this is why there is a lot of philosophy involved, is thinking about the links between what we know and what we don't know. And in this process, the different perspectives I just talked about are all trying to use mathematics to help us make that link, make that connection. 
Xiao-Li Meng: 01:08:30
And I can tell you what the problem is with each of them, and why in the end I, and I see an increasing number of people, realize we should use them all. First, we start as Frequentists, like myself. But eventually the Bayesians convinced me that the Frequentist approach has one fundamental flaw: yes, you can derive all these beautiful answers that tell me something works on average, but in the end, even if you tell me that the medication has a 95% chance of working, there's still a 5% chance that it doesn't work for me. And 5% of the population is a huge number of people. How do I know I'm not in the 5%? How do we know this average answer is relevant to me? So the Frequentist answer forever suffers from this: it could be completely irrelevant, just for me. 
Xiao-Li Meng: 01:09:24
So then we go to the Bayesians. The Bayesian answer will obviously be relevant for you, because that's how it is set up. But the problem is that you need to make a lot of assumptions in order to get an answer, and the question is: how do I know these assumptions themselves are reliable? Now, there is a school called Objective Bayes. The idea is: can I use a prior that people call a non-informative prior, one which basically says I'm very ignorant, and work within that school? If that could be done, it would be wonderful. But it turns out, and this is a fatal thing about Bayesian statistics, there is no way to put down a prior that is truly ignorant. Probability distributions do not allow you to model ignorance. Any time you put down a distribution, people say, "Well, let's assume everybody has an equal chance." 
Xiao-Li Meng: 01:10:23
No. Equal chance is not being ignorant; equal chance is a huge amount of information. Any time you put down a distribution, it asks you to specify the relative frequency of two different states. You have to do that. So it turns out the problem with Bayes, and I like to use this phrase, is that it's like doing mathematics without the concept of zero, because in Bayes there is no zero, no zero information. You always have to assume something. That's where I started moving toward the Fiducial, because the Fiducialists said, "We want to do that inference, but without making any assumption, without a prior," essentially with zero information. And, to some extent, they can do that. 
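One way to see the "equal chance is a huge amount of information" point (my illustration, not Xiao-Li's own example): a flat prior on a proportion p is not flat on a simple transformation of p, so declaring "uniform" quietly commits you to a particular scale rather than to ignorance. The transformation p squared below is just one convenient choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Attempted "ignorance" prior: p uniform on [0, 1].
p = rng.uniform(0.0, 1.0, size=1_000_000)

# The same "ignorance" does not carry over to p squared (say, the chance that
# two independent trials both succeed): its implied distribution is far from flat.
print(f"P(p   < 0.5) = {np.mean(p < 0.5):.3f}")      # about 0.500
print(f"P(p^2 < 0.5) = {np.mean(p**2 < 0.5):.3f}")   # about 0.707, not 0.5
```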
Xiao-Li Meng: 01:11:04
But because of that zero requirement, the question becomes: how do you operate? Within the Fiducial framework, or more generally, there is a whole area called imprecise probability, and imprecise probability is exactly trying to address this issue: probability itself is too precise, because it forces you to specify the relative frequencies. Sometimes I just know the answer is between three and five, but don't ask me how likely four is compared to, say, three and a half. I don't know. I just know the answer is between three and five. How do you model that? Ordinary Bayesian probability currently cannot handle that. 
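A small sketch of why "between three and five" is not the same as any single distribution (my own illustration, with three arbitrary candidate beliefs): each distribution below is consistent with that interval statement, yet they assign very different probabilities to the same event. Imprecise probability keeps lower and upper bounds for such events instead of forcing one precise number.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Three different beliefs, all consistent with "the answer is between 3 and 5".
candidates = {
    "uniform on [3, 5]": rng.uniform(3.0, 5.0, n),
    "piled up near 3":   3.0 + 2.0 * rng.beta(0.5, 5.0, n),
    "piled up near 5":   3.0 + 2.0 * rng.beta(5.0, 0.5, n),
}

# They disagree sharply about the same question, so the interval statement alone
# pins down only bounds on P(answer <= 4), not a single precise probability.
for name, draws in candidates.items():
    print(f"P(answer <= 4) under {name}: {np.mean(draws <= 4):.3f}")
```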
Xiao-Li Meng: 01:11:44
So imprecise probability is trying to handle that kind of thing. But it turns out, and I just wrote another article about this recently, that when you do that, you still have to make a judgment about how you want to update the information. In the Bayesian framework, you have Bayes' rule to tell you how to update the information; there is a clear rule. But once you move to imprecise probability, there is a variety of different rules. It depends on whether you are an optimistic person, a pessimistic person, or an opportunist: these different perspectives will give you different rules, and different answers. 
Xiao-Li Meng: 01:12:18
So then I realized: every school has problems. With every school, either the answers you get may not be relevant to you, or you need to make lots of assumptions, or you still need to make some judgment. So in the end you realize that all the schools are trying to do the right thing, but they come up with different answers, and not because they have defects just for the sake of having defects. It's because, if you think fundamentally about what makes it possible to make predictions about the future, there is no simple answer. And in the end, just like data science as a whole, we basically have to use whatever methodologies can help. But we need to be aware of the limitations of each of them, and understand where you're going to suffer if you use one in the wrong way. That's why I say we should use all of them. 
Jon Krohn: 01:13:09
Beautifully said, Xiao-Li. That was an extraordinary section, not just of this podcast but of any of the episodes we've had: that explanation of the differences between the Frequentist, Bayesian and Fiducial approaches, and the timeline that I think a lot of us go through. So I'm maybe early on in my Bayesian phase, and I look forward to exploring it deeply enough that I see that Fiducial is the right thing for some circumstances, though not all. 
Xiao-Li Meng: 01:13:43
It has holes. It has holes itself, yeah. 
Jon Krohn: 01:13:46
Very cool. All right Xiao-Li, so this has been an exceptional episode, and I’m sad to say that we’re starting to get near the end, which means that it’s time for me to ask you for a book recommendation. 
Xiao-Li Meng: 01:14:01
Absolutely. And I’m going to show you which book I’m reading, and it’s called Beyond Flavor. 
Jon Krohn: 01:14:09
Oh yeah? 
Xiao-Li Meng: 01:14:10
It's a book by Nick Jackson, who is a Master of Wine, and this book teaches you how to do blind tasting. Now, you may ask why a statistician is reading a book on blind tasting of wine. Well, first, I love wine. 
Jon Krohn: 01:14:26
I know you like wine, yeah. 
Xiao-Li Meng: 01:14:27
You know I love wine. And second, I think that if you truly understand how to judge the quality of a wine, and how nuanced that process is, you will understand data science far better than people who haven't gone through that process. The reason I'm saying that is that whenever [inaudible 01:14:53] we have any experience with a wine, wine just comes with so many different features. And predicting a wine is like... it's also very personal. A wine that tastes fantastic to you, somebody else will say, "Ah, that's terrible." 
Xiao-Li Meng: 01:15:10
And this particular book was recommended to me by a former student of mine who was working with me on data privacy issues. Her name is Ruobin Gong, a faculty member at Rutgers University. She recommended this book to me because she knows I love wine, but also because we talk about this whole issue of how to predict wine quality. Let me be very specific here about why this is a terrific book, and why it's called Beyond Flavor. Most people tasting wine, trying to guess which wine it is, use the flavor. They ask, "Well, how does it taste?" And Nick Jackson basically says, "It turns out that flavor is not a good predictor," because particularly in blind tastings, particularly if you're trying to pass these exams, they give you the much trickier ones, where the flavors are all kind of similar. 
Xiao-Li Meng: 01:16:09
So you need to look at the structure of the wine, for example the acidity of the wine. And basically, for those who do machine learning, you're doing feature engineering: you want to see which features are good predictors. He particularly talks about how, when we think about acidity, we usually just say, "Well, what's the level of the acidity?" And he says, "No, that's not enough. You need to think about the level of acidity, the type of acidity, but also the shape of the acidity." And I thought, I've never heard of that. What is the shape of acidity? 
Xiao-Li Meng: 01:16:53
And he talks about it like this: "When you sip the wine, where does the acidity start to grow? Does it stay flat all the way? Does it start strong and then go down? Or start at a lower level and then go up?" And he says, "Different wines have different shapes. If you understand the shape, you can do much better at predicting a wine blind." So this is essentially machine learning, right? It's a human way of trying to recognize these patterns. So I would certainly highly recommend the book, even if you don't care about blind tasting, because I think there are lots of thoughts there about a very practical way of engineering features for prediction. And I want to say that if we can ever teach a machine to really do blind wine tasting, now we're talking real artificial intelligence. I always think maybe someone should start something called Deep Wine. If you can do that, then we're really thinking about how a machine can think like a human and do this incredibly hard prediction. 
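Since the "shape of acidity" is exactly a feature-engineering idea, here is a toy sketch of what such features might look like in code. Everything in it is invented for illustration: the acidity_shape_features function, the idea of sampling perceived acidity through a sip, and the two example "sips" are mine, not from the book or the episode.

```python
import numpy as np

def acidity_shape_features(acidity_curve):
    """Toy 'shape of acidity' features from perceived acidity sampled through a sip."""
    t = np.linspace(0.0, 1.0, len(acidity_curve))
    level = acidity_curve.mean()                 # overall level of acidity
    peak_position = t[np.argmax(acidity_curve)]  # does it peak early or late?
    trend = np.polyfit(t, acidity_curve, 1)[0]   # building toward the finish, or fading?
    return {"level": round(float(level), 2),
            "peak_position": round(float(peak_position), 2),
            "trend": round(float(trend), 2)}

# Two invented sips with the same average acidity but opposite shapes.
starts_strong = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.5])
builds_late   = np.array([3.5, 4.0, 5.0, 6.0, 7.0, 8.0])

print(acidity_shape_features(starts_strong))  # negative trend: starts strong, fades
print(acidity_shape_features(builds_late))    # positive trend: builds to the finish
```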
Jon Krohn: 01:18:03
Nice. All right, that is a really fun recommendation. I also didn't know about acidity shape. That sounds like quite an interesting feature for the Deep Wine algorithm that, no doubt, our listeners are now going out and building to realize AGI. Very simply. That's all it took. 
Xiao-Li Meng: 01:18:23
That’s all it took, yes. 
Jon Krohn: 01:18:27
Fantastic. Xiao-Li, thank you so much for being on the show. No doubt our listeners learned a ton from you. How can they keep up with your latest? Do you post on social media or anything like that? I guess obviously the Harvard Data Science Review- 
Xiao-Li Meng: 01:18:42
Yes. 
Jon Krohn: 01:18:42
… is the place to- 
Xiao-Li Meng: 01:18:44
That's the place to... yes, I'm on Twitter. I actually refused to go on Twitter for a long time, until I became the Editor-in-Chief of the Harvard Data Science Review and people said, "Well, you really have to, because that's a good way to promote HDSR." So yes, I'm on Twitter. And my email is open, so if anyone is interested, feel free to shoot me an email. I may not always be able to answer, just because of the volume, but I think HDSR is a great place to check out what's the latest in data science. We have these different columns, six columns, from the kind of recreational... actually, I should share this story with you. 
Xiao-Li Meng: 01:19:36
How did I come up with the term data minding instead of data mining? We have a column on the history of data science called Mining the Past. And I was talking with some other co-editors about starting a column on the future of data science, meaning training the pipeline, talking about K-12 students, and we wanted a title. So I said something like, "Since we have Mining the Past, why don't we have Mining the Future?" And one of my board members misheard. She said, "Oh, I like Minding the Future." 
Xiao-Li Meng: 01:20:16
Then I said, "Oh, that's better than Mining the Future." That's how we came up with the term. So we have a column called Minding the Future, and that's where I got data minding itself. We have articles ranging from perspective and philosophical pieces all the way to very technical ones, education ones, and applications, and it's all free. And if you want a hard copy, we have hard copies of the inaugural volume for sale online, which I think is worth checking out as well. Thank you. 
Jon Krohn: 01:20:56
Very cool. And of course the Harvard Data Science Review podcast as well, for your monthly fix of recreational topics, such as recreational marijuana, wondering about the existence of aliens, and so on. Xiao-Li, thank you so much for being on the show, and hopefully we can have you on again in the future someday. 
Xiao-Li Meng: 01:21:20
Thank you very much for having me. This has been a really great conversation, and I hope your audience will enjoy it. And of course I hope they will listen to our podcast as well, and check out HDSR. Thanks again. 
Jon Krohn: 01:21:40
What an honor to be able to speak with the extraordinary Professor Meng today. Despite his world-leading accomplishments, he was remarkably humble and an easygoing character. I had a terrifically enjoyable experience conversing with him, and I hope you enjoyed it as much as I did. In today's episode, Xiao-Li filled us in on how he founded the Harvard Data Science Review to shape what data science could be in the years to come. He talked about how data science is a collection of disciplines, including statistics, computer science, philosophy, and many more; how data minding means understanding your data well before mining or modeling it; and how there's no free lunch with data, where trade-offs abound. For example, the very concept of data privacy is at odds with the inherently shareable nature of data, data quality and quantity are often inversely related, and cleaner data can be less valid to the real world. 
Jon Krohn: 01:22:32
He also talked about the paradox that having more data often means greater confidence around the wrong estimate of the population mean, and how the Bayesian, Frequentist and Fiducial schools of statistics each have their own pros and cons, so they're all valuable to know as a data scientist. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Xiao-Li's Twitter profile, as well as my own social media profiles, at superdatascience.com/581. That's www.superdatascience.com/581. 
Jon Krohn: 01:23:07
Finally, if you live in the New York area and would like to experience a SuperDataScience episode filmed live, then come to the New York R Conference, which will be held June 8th through 10th. That's the New York R Conference, June 8th through 10th. Huge names in data science will be presenting there, such as Andrew Gelman and Wes McKinney, and to close out the conference on the afternoon of Friday, June 10th, I'll be interviewing Hilary Mason, one of the world's absolute best-known data scientists, live on stage, so you can react or ask questions in real time. It should be tons of fun, and I hope to meet you there, or if not at this conference, then somewhere else soon. If you want tickets to the New York R Conference, you can get them 30% off with the code SDS30. That's SDS30. 
Jon Krohn: 01:23:57
Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience podcast for you. And thanks of course, to Ivana Zibert, Mario Pombo, Serg Masis, Sylvia Ogweng and Kirill Eremenko on the SuperDataScience team for managing, editing, researching, summarizing, and producing another extraordinary episode for us today. Keep on rocking it out there folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 