SDS 519: A.I. for Good

Podcast Guest: James Hodson

November 2, 2021

We detail globally impactful case studies, how you can get involved in applying AI for social benefit, the hard and soft skills James looks for in hires, and more!

About James Hodson
James is the AI for Good Foundation’s Co-founder and CEO, who has previously spearheaded Artificial Intelligence initiatives at a number of global firms and has built successful (and sustainable) for-profit ventures in a variety of industries.
Overview
James and I were introduced by Claudia Perlich, a mutual friend and colleague. James is the CEO of A.I. For Good, which was born out of a collaboration between industry, academia, and the policy community to make progress towards the UN’s Sustainable Development Goals, which were introduced as an update to the Millennium Goals. Rather than defining its own set of challenges, the foundation aims its work at these existing goals. James noticed early on that the networks for tackling social issues didn’t exist the way they did for other kinds of problems. As a result of several conferences James was involved with, it was decided that multiple organizations were needed to tackle these issues.
A.I. For Good tackles the social impact side of the work. Diversity and inclusion in A.I. is handled by another organization, AI4ALL. Both organizations came out of the same philosophical roots in 2013 and 2014. Over time, the foundation has more precisely defined what it gets involved with. It organizes its work around three pillars: education, research, and policy. These pillars cover how universities and schools can be more resilient and open, building the core infrastructure that supports more A.I. work on the goals, and working directly with policymakers and governments to best utilize A.I. to reach the sustainability goals.
From there we dove into a few different case studies. One area A.I. for Good is working on is diversity, equity, and inclusion. This is an important area whose underlying issues have been greatly exacerbated and put into the spotlight by the COVID pandemic. James has noticed this pandemic has had a real impact on relationships between different groups. The gender gaps are widening, for example, and while we’re acknowledging it and talking about it, concrete solutions have not been reached. Transparency and accountability are the goals A.I. For Good looks at when it comes to building out its scorecards for impact. The second area of interest for A.I. For Good is the lack of access to data, lack of network, and lack of understanding of previous work. This area focuses on getting information to difference-makers who need information on previous attempts at things like tracking carbon emissions, for example. James’s work is on forcing information out of silos. The third area of study is on global public health, which is obviously a timely project. While James wishes they could have achieved this many years ago, the focus now is to better prepare the world for the coming pandemics and public health crises that are on the horizon. In all of these case studies, James went into the technical data work required to achieve AI For Good’s goals.
From there we pivoted into how you can get involved in the same work as A.I. For Good. The foundation runs a volunteer program which you can sign up for on their website. They provide a monthly newsletter with opportunities in each issue. It’s often assumed that A.I. organizations are well funded, which A.I. For Good is not. They want to be independent of agendas, which results in rejecting funding from huge institutional sources of funds. They have a good mesh of individual contributions and funding from corporate stakeholders that have skin in the game.
When it comes to hiring, James approaches management from a standpoint of handing off leadership and ownership to those in charge of the projects. He likes folks who are interested in building and owning things without micromanagement. What he sees is that graduates are educated in tools rather than in ways of thinking towards solutions. The need to generate more data scientists has to be met without diluting the depth of the learning. He suggests you strive to be both a systems analyst and a data scientist. You need to be able to think about the world through the data available.
In this episode you will learn: 
  • AI for Good [5:17]
  • Founding of AI for Good [8:50]
  • Case studies [14:58]
  • How you can get involved [46:29]
  • Skills James looks for in hires [50:39]
 


Podcast Transcript

Jon Krohn: 00:00:00
This is episode number 519 with James Hodson, CEO of the AI for Good Foundation.
Jon Krohn: 00:00:12
Welcome to the SuperDataScience Podcast. My name is Jon Krohn, a chief data scientist and bestselling author on Deep Learning. Each week we bring you inspiring people and ideas to help you build a successful career in data science. Thanks for being here today and now let’s make the complex simple.
Jon Krohn: 00:00:42
Welcome back to the SuperDataScience Podcast. Today’s guest is the eloquent and inspiring James Hodson. James is founder and CEO of the AI for Good organization, which leverages data and machine learning to tackle the United Nations sustainable development goals. He’s also an academic research fellow at the Jozef Stefan Institute where he’s focused on natural language processing research and he serves as chief science officer at Cognism, a British tech startup that uses data and machine learning to accelerate sales. Prior to becoming an investor and social entrepreneur, James spearheaded AI initiatives at a number of global firms, including Bloomberg, where he served as AI research manager and he completed a degree at Princeton University in which he focused on machine translation.
Jon Krohn: 00:01:33
In today’s episode, James details globally impactful case studies from his AI for Good organization across public health, diversity, equity and inclusion and a practical database of AI progress on social issues. He also talks about how no matter whether you’re a technical expert or not, how you yourself can get involved in helping apply artificial intelligence for wide-reaching social benefit. And he talks about the hard and soft skills that he looks for in the data scientists that he hires. Today’s episode should appeal broadly to data scientists as well as to anyone who’s interested in learning how technology can be a massive force for social good in the coming years and the coming decades. All right, you ready? Let’s do it. 
Jon Krohn: 00:02:27
James, welcome to the SuperDataScience Podcast. I’m delighted to have you here. Where in the world are you calling in from? 
James Hodson: 00:02:34
Jon, it’s a pleasure to join you as well. I’m actually based in the Berkeley area in California. 
Jon Krohn: 00:02:41
Nice. And, so I am very excited to have you on the show today. We have so many interesting topics to talk about because you have so many interesting strands of your life that weave together beautifully across machine learning and doing social good with machine learning. So I was introduced to you by Claudia Perlich, who was in episode number 437. So I don’t actually know this, how do you know Claudia? So I know you both… She lives in New York, I live in New York and you used to live in New York, so I suppose it has something to do with that. 
James Hodson: 00:03:15
It’s a small world, right? And especially in New York, it’s so easy to meet everybody. Claudia and I actually worked together in 2014 on building the KDD Conference which at the time was in New York, and the theme for that year, which is very interesting I guess for today’s conversation was actually, “Data science for social good.” And at the time I was heading up the AI research lab at Bloomberg and it made sense for us to open up our offices back then and have all of the workshops be hosted at Bloomberg. So Claudia and I spent a lot of time planning this, making sure it was going to run perfectly, and as a result got to know each other, became great friends and we’ve done a lot of work together since then as well. 
Jon Krohn: 00:04:04
Oh, cool. And in case listeners aren’t aware, KDD, that conference is a big deal. So I think a lot of people are probably aware of the KDDnuggets website because inevitably if you’ve been in data science for a while you have some kind of question you’ve typed into a Google query that lands you on KDDnuggets. And so what does it stand for, KDD?
James Hodson: 00:04:28
So originally it was Knowledge Discovery in Databases which obviously is not cool anymore, right? Saying databases as part of a conference name is not going to get you the kudos that you need. So it’s informally Knowledge Discovery and Data mining, but what I would say is pretty much nobody calls it anything but KDD. 
Jon Krohn: 00:04:48
Right. Well it’s become a brand. It hasn’t even occurred to me til now that it could stand for something. It’s like BP, they’re like, “We don’t do petroleum anymore. That isn’t really just our thing. We’re not just British Petroleum, we’ve got tons of green stuff happening here.” 
James Hodson: 00:05:05
Yeah, or British American Tobacco rebranding itself as BAT. 
Jon Krohn: 00:05:10
Right, exactly. It’s another good point. All right, so we already alluded to this with the name of that conference, but you are the CEO and a founder of AI for Good, which is an organization that’s been around since 2015. It kind of sounds like I can tell what the organization does from its name but you can probably fill in a lot more detail on it. 
James Hodson: 00:05:31
Absolutely. Yeah, so what I would say is the two things are inextricably linked, the KDD for Social Good conference that we put together in 2014 while I was at Bloomberg and a few months later the fact that the AI for Good Foundation was born in New York as a collaboration between industry, the academic community and the policy community in order to try to make significantly more progress with advanced technologies, AI being the most important one at the time that we wanted to focus on, towards the United Nations sustainable development goals. And the sustainable development goals or SDGs as they’re formally known were really introduced as an update to the millennium goals, which came before them around 2014, 2015. 
James Hodson: 00:06:26
So at the time, rather than saying, “Let’s come up with the top socially important challenges of our generation ourselves, as machine learning researchers and AI practitioners,” we thought maybe it’s better to take an existing framework and try to solve problems that everybody agrees on, right? So you’ve got 196 nations who have signed up to these as being the most important things to focus on. Doesn’t stop these nations from going to war with each other but they do broadly agree that there are social prerogatives that we want to get to and a certain world that we would like to build, one that’s not burning, like California is every summer and one where maybe our children don’t have to wear masks because of the oxygen level not being high enough all the time, or things like that. 
Jon Krohn: 00:07:18
Right. You may already have heard of DataScienceGO, which is the conference run in California by SuperDataScience. And you may also have heard of DataScienceGO virtual, the online conference we run several times per year. In order to help the SuperDataScience community stay connected throughout the year from wherever you happen to be on this whacky giant rock called planet Earth, we’ve now started running these virtual events every single month. You can find them at datasciencego.com/connect. They’re absolutely free, you can sign up at any time and then once a month we run an event where you will get to hear from a speaker, engage in a panel discussion, or an industry expert Q&A session, and critically there are also speed networking sessions where you can meet like-minded data scientists from around the globe. This is a great way to stay up to date with industry trends, hear the latest from amazing speakers, meet peers, exchange details and stay in touch with the community. So once again, these events run monthly. You can sign up at datasciencego.com/connect. I’d love to connect with you there.
Jon Krohn: 00:08:34
Perfect, so I understand. So you use the UN sustainable development goals as the targets that you’d like to accomplish and you focus on using technology, particularly AI, to do that. Makes a lot of sense. So what spurred you in particular to co-founding this organization?
James Hodson: 00:08:57
So I’d say that we spent quite a while within the research community back in 2013, 2014, having the same discussions over and over again, and these discussions were basically, looks like the research community, the academics who are meant to be at the forefront of solving really deep challenges, right? Problems that do not have solutions, seemed to be mostly funded by corporations to do things that do have solutions a little bit better. And there was a clear need within the academic community to provide a framework by which people could focus their energy on things that were more globally important, right? More socially impactful. And it wasn’t for lack of people wanting it, it wasn’t for greed, it wasn’t because people found working on problems around search and recommendation more interesting or fascinating. It was a lot of the time simply because the networks didn’t exist in order to provide people with the knowledge and people connections and resources that they would need in order to get stuck in on these challenges. 
James Hodson: 00:10:16
So primarily the Stanford Workshop on AI & Knowledge that took place in 2014, the KDD Data Science for Social Good conference and a few other high level meetings among kind of some of the bigger names in machine learning research led to us deciding that we had to create a couple of different organizations that would tackle challenges that we were facing in the AI and machine learning community. The social impact side, that’s AI for Good. We also thought a lot about diversity and inclusion, right? Who was working in AI? How do we make sure the next generation of people working on computationally complex problems is representing the people who need to be benefiting from this work. And that’s where AI4ALL comes in, which is another organization based nearby here in Oakland. So kind of these two organizations came out of the same philosophical grounding and roots, if you will, back in 2013, ’14. Just a little tidbit from the discussions [crosstalk 00:11:30] back then. 
Jon Krohn: 00:11:31
That’s exactly what I was looking for. Cool. 
James Hodson: 00:11:35
Now obviously over time we’ve expanded a lot what we’ve done and we’ve actually, I’d say, more precisely defined what it is that we get involved with and how we are solving problems related to the SDGs. We now formalize this around three pillars, education, policy and research. On the education side it’s all about how we can make our universities, schools and other education networks more resilient and more open to everybody. So how do we build just more open education systems, that can include getting more people into data science and machine learning, right, and technical STEM subjects in particular which might happen more at the high school and university level, but it can also involve really working with local school districts and figuring out how do they get their education system to work better for everybody in their communities, right? So that can actually [inaudible 00:12:37] how schools operate as well. There is also public outreach that we do. So we get involved in a lot of conferences, right? We have events. We had more events when events could be in person but you adapt and you change with the times.
James Hodson: 00:12:53
On the research side, it’s all about building the core infrastructure that supports more work happening on the SDGs from the AI perspective, and I suspect that’s where we’re going to put more of the emphasis of this conversation because it’s more technically grounded and technically interesting in terms of the types of things that we’re doing and how that plays into the world of the SDGs. And the last pillar is policy, right? And you cannot have an impact on the world with technology today without having the policy side really sorted out, and especially when you’re talking about the sustainable development goals because the SDGs are a policy instrument in some sense, and so you need people on board who are going to push these solutions to the field, right, and measure them in the right way, and fund them in the right way. So on the policy side we work with governments to create national AI strategies, we work with governments to figure out how to spend their capital budget that’s assigned for the sustainable development goals each year, we work with regional municipalities, with cities. So we have an Intelligent Cities Program which is really helping urban areas to figure out how AI fits into their infrastructure and into kind of the fabric of their communities. And then on the highest level I’d say we work with the UN, OECD, World Economic Forum, kind of these large international organizations that are trying to have an impact on how we coordinate change in the world, right?
Jon Krohn: 00:14:30
Super cool. 
James Hodson: 00:14:31
These verticals drive all the decisions that we make as an organization and all the work that we end up doing, all the partnerships that we entertain and yeah, that’s kind of how we’ve eventually come around to structuring the organization.
Jon Krohn: 00:14:45
Yeah, yeah. So three pillars, education, policy, research. As you mentioned, probably for our audience, the research one might be the most interesting because we can dig into some case studies. So maybe now’s the time. Do you have some case studies you’d love to share with us from the AI for Good organization? 
James Hodson: 00:15:02
There’s a lot that I would like to talk about. I think there are a couple of things that are maybe most salient also in terms of people can connect with the topics more. So I’m going to put three ideas out there that we’re working on. All right, the first one is diversity, equity and inclusion, which is one area which has become just ever more important over the last couple of years, and these are areas that have been… the underlying issues have been exacerbated by the COVID pandemic and by government responses to the COVID pandemic. So decisions that we have made about how to cope with a health issue which have repercussions on our social structure and how people work, live and connect to each other within that. And I think the COVID pandemic has caused a loosening of relationships in society, especially between groups that had frictions between them and among them already. And we’ve seen in the US in particular that this can take shape as real social unrest in one way or another. Now if you look at gender and racial inequality in the US over the last few years, from a metrics perspective, the gaps are widening, right? They’re widening along a variety of different dimensions but despite the fact that we’re talking more about these issues in society, we’re not seeing the positive outcomes. We’re not seeing things go more towards where we’d want to end up. 
Jon Krohn: 00:16:44
Right. [inaudible 00:16:45] too easy to just talk about these things. As an interesting little example of that, that I noticed over the weekend. So I love, from my time… You probably don’t know this about me, James, but I lived in the UK for five years, and I really got into Premier League Football when I was there and I’m still kind of into it. Recently, Cristiano Ronaldo, one of the greatest footballers of all time, is back at Manchester United and it’s just completely engrossed me in the game again. So I’m watching Premier League highlights over the weekend and I noticed that before games now, following on from what Colin Kaepernick started at the San Francisco 49ers, so playing American football in the NFL, where he was kneeling during the anthem to protest racial issues. So now right before the game starts, the Premier League game starts, all of the teams take a knee except there are critical exceptions, where a few black players don’t take a knee. So I was immediately like, “Whoa, that’s super interesting that he’s standing.” And so I looked up, I can’t remember now the name of the player, but I looked it up and this player experienced racism online. He had lots of details about these people, he brought it to the police, he brought it to his club and they did nothing. And so from that point on he was like, “Why are we taking a knee, where it’s this meaningless action if you guys aren’t going to change how we act and actually do something [inaudible 00:18:28]?”
James Hodson: 00:18:30
I think people are making that point more and more now, right? That really we need to find ways that are honest and actually have a measurable impact on building the society that we want to build. It’s going to take time and maybe you and I are not the best placed to be making the rules up about how this game is going to play out but we all have a role to play in ensuring that the society that we build is going to be the society that supports people and creates opportunities for everybody to flourish and be happy and have high quality of life and great life outcomes. And definitely at the AI for Good Foundation we feel like technology has a role to play in that transformation. Now we view that transformation from the DE&I side as mostly transparency and accountability. How can technology bring transparency and accountability to this area? And the flagship part of that, which is kind of what I’m going to dive into first today is we are building out a global score card of diversity, equity and inclusion which will cover 1.3 million companies within the US, and it will cover government entities including police forces, it will cover schools, so our education system, and eventually as part of our Intelligent Cities Program it will actually score cities as well on their ability to integrate these measures and the progress that they’re making over time. 
James Hodson: 00:20:16
Now the reason I’m mentioning this first is also it’s a very interesting data set that we have for the US and it goes back to some of the other relationships I have. I’m an angel investor. One of the companies that I invested in very early on, a few years ago, is called Cognism. Cognism curates a global data set of companies and people and they aren’t the only ones in the world to do something like this, but they focus very much on the accuracy of this data and it being current and maintaining time series that go into the past, which are verifiably following distributional properties that are very similar to the underlying data. So basically when you’re looking at these data sets, what you can say is that distributionally speaking, right, looking at the features of interest on the person and company levels, looking at the census data that comes out for the US, they look very similar. So you can trust the data more that it’s somewhat representative of the types of things that you might want to be measuring in society.
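As an illustrative aside, the kind of distributional check James describes could be sketched roughly as follows. This is a minimal sketch and not the foundation’s or Cognism’s actual method: the category labels, the reference shares and the choice of total variation distance as the comparison are all assumptions made for the example.

```python
from collections import Counter


def category_shares(labels):
    """Return the share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two share dictionaries.

    0.0 means the two distributions are identical; 1.0 means they do not overlap.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical data: education levels in a people data set vs. a census reference.
sample_labels = ["bachelor"] * 45 + ["high_school"] * 35 + ["graduate"] * 20
census_shares = {"bachelor": 0.40, "high_school": 0.40, "graduate": 0.20}

sample_shares = category_shares(sample_labels)
distance = total_variation_distance(sample_shares, census_shares)
print(f"Total variation distance: {distance:.3f}")
```

A small distance on each feature of interest would be one piece of evidence that the sample is roughly representative on that feature; it says nothing about features that were not compared.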
James Hodson: 00:21:20
Then we’re applying on top of that a variety of network based machine learning algorithms in order to understand the relationship between people and the firms that they work at, right? How careers develop over time, how different types of education backgrounds, ethnic backgrounds, kind of gender issues and how people join, finish their education, make their way through life and through careers, and which types of company structures are able to support that better, which ones have better outcomes, which ones end up with more diverse and more productive work forces, and how do they do that, right?
James Hodson: 00:22:00
So we’ve got a set of metrics, it’s on the order of dozens of metrics that we are able to calculate over each company and over each other entity that we’re interested in and we can essentially give a dynamic score on a monthly basis, rolling score, with our aim being that we get these organizations, these entities. It’s not strong-arming them into this. We’d hope that everybody wants better transparency into what’s going on in these areas, but what we’re trying to get them to do is to take ownership of their score. Take ownership of it and then leverage the variety of resources that we offer, right, so basically internal workshops to help them deal with diversity issues, ethics issues arising from this, ways of contributing to their score and kind of checking bias in any recruiting systems that they run and so a whole plethora of tools that will help them to become more conscious of how they’re operating as a business. On the other side, we’re also building tools so that you as a consumer, when you’re making a decision about what to buy, you can have information about the DE&I commitments that those companies are making, and you can decide whether you prefer buying from Walmart or Amazon based on the underlying commitments that they are really making towards their workforce and how they operate. And on like the last leg of that is investment.
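To make the idea of a rolling monthly score concrete, here is a minimal sketch. It assumes each metric has already been normalized to the range 0 to 1 and that the composite is a simple weighted average smoothed over a trailing window; the metric names, weights and window length are hypothetical, and the episode does not say how the real scorecard is calculated.

```python
from statistics import mean


def monthly_composite(metrics, weights):
    """Weighted average of normalized metrics (each in [0, 1]) for one month."""
    total_weight = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


def rolling_scores(monthly_values, window=3):
    """Trailing mean over up to `window` months, one value per month."""
    scores = []
    for i in range(len(monthly_values)):
        start = max(0, i - window + 1)
        scores.append(mean(monthly_values[start:i + 1]))
    return scores


# Hypothetical metrics for one company over four months.
weights = {"pay_parity": 1.0, "promotion_parity": 1.0}
months = [
    {"pay_parity": 0.6, "promotion_parity": 0.8},
    {"pay_parity": 0.7, "promotion_parity": 0.8},
    {"pay_parity": 0.8, "promotion_parity": 0.9},
    {"pay_parity": 0.9, "promotion_parity": 0.9},
]

composites = [monthly_composite(m, weights) for m in months]
rolling = rolling_scores(composites, window=3)
for month_index, score in enumerate(rolling, start=1):
    print(f"Month {month_index}: rolling score {score:.3f}")
```

The trailing window is one simple way to keep a score from jumping on a single month of noisy data while still letting it respond to sustained change.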
Jon Krohn: 00:23:33
I was just going to say that. I was like, “Oh man, investors would love this too.” But yeah, you’re one step ahead. 
James Hodson: 00:23:39
So obviously capital drives the market ultimately, right? And the supply and demand curves are really what’s going to make the biggest difference in the end to the company’s behavior and to the behavior of the employees inside who make the culture, right? So those pressures are definitely extraordinarily important. But zooming back into the machine learning side, there are a variety of just super interesting problems that are on the edges of what’s possible with machine learning research today, right? And the nice thing about this is you can pose them as potentially some types of reinforcement learning problems, you can pose them as network problems. So how do we propagate information within a complex network that’s operating on many different dimensions, but you can also look at these data sets and you can look at the models and you can understand them intuitively, right?
James Hodson: 00:24:36
So the idea is we’re watching people move through a structure, right? An organization is a structure, it’s a ladder, it’s a pyramid, it’s a hierarchy, however you want to describe it. Some people who otherwise look the same have differential careers. Why do they have different careers? Why is it that when you look at data science teams, men and women that look like they were originally drawn from the same underlying pool, with the same education, with the same skills, right, having lived what looked like very similar experiences in data science before them, why is it that we consistently see some of them move faster through their careers, right? Is it by chance, right? Or is it that skin pigment really helps you to compute things faster? Doesn’t seem very likely. You can have lots of assumptions about what’s going on but I doubt that that’s going to be one that we’re all going to be on board with. So when you look at it, there have to be mechanisms at play that are not explained just by the pure skill and ability of the people who are in these organizations, and that’s when it comes down to the organizations to actually push, right, to bring transparency to why that’s happening, and to make change, right? Sometimes you need to apply force in order to make things the way that you want, like building an IKEA table, right? Sometimes you need to apply force to get those holes to line up. And-
Jon Krohn: 00:26:23
It didn’t say [inaudible 00:26:24] in the instructions but here we are. 
James Hodson: 00:26:26
Right, right. 
Jon Krohn: 00:26:29
That is super fascinating. It reminds me in a way of the World Bank ease of doing business rankings where you just put them out there. You come up with metrics, you publish them and some countries see how they’re rated relative to others and they say, “I want to jump ahead, what red tape can we cut?” It can also lead to then gaming the stats, and recently there’s been a scandal with those World Bank data of some numbers kind of being messed with to help some countries out, but I won’t dig into that. But yeah, so this reminds me kind of of that in a way but I think in the sense that we’re… In a good way, in the sense that we’re publishing information, objective metrics that allow organizations to be ranked relative to each other, and yeah, I imagine that it becomes a race, that people say, “Okay, look, people who are buying our products or people who are interested in investing in us, they see that we are lagging greatly behind our competitors in DEI and we’ve got to catch up to them.”
James Hodson: 00:27:40
Exactly, yeah. A lot of our programs end up being, how do we leverage data and intelligent processing of that data, right? To do something that is correct as reviewed by peers and experts in this area, right? So my background is not as a specialist in diversity, equity and inclusion, right, so I’m not the right person to be defining what these metrics are from the ground up. There are people who have done decades of research on this, much better placed than I am, so we tap into those people and we ask everybody to go in and say, “Is this fair what we’re saying about companies? Is it the right way to measure? Are there better ways to do it?” Right? But on the other hand, we’re designing our actual deployments, right, and our interventions very much as behavioral interventions, right? What is the right mechanism in order to incentivize these entities to take the right path, right? So call it behavioral machine learning interventions, if you will, but most of our programs are not just about the artificial intelligence side of it, which is very interesting, but it’s actually how do we trick people into doing the right thing, right, because everybody is now incentive aligned on getting to the right end result. So there are two aspects of it, and you can measure all of these things, again, in the ways that we measure machine learning results on our models, right? We can set up the problem in a way that it can be evaluated effectively.
Jon Krohn: 00:29:23
Cool. All right, so that is one incredibly rich case study, DE&I. It sounded like you had three for us, what’s number two? 
James Hodson: 00:29:32
So a second thing I’d like to kind of jump into which I think would be interesting to people and maybe it’s also something that the audience would want to take a further look at and maybe even get involved with in some way. Early on we realized that one of the main issues with data scientists, whether it’s people working in industry and academia, right, or people working in government as well, because there are lots of data scientists now in government that are, whether municipal level or higher, right, working on specific things. It was lack of access to data, lack of understanding of the work that’s already happened in a particular area and lack of network, right? Who is working on this area already? What is the benchmark? How do I even get started on understanding which approaches to carbon recapture are going to work best for my city or for a particular regional ecosystem. 
Jon Krohn: 00:30:38
Yeah. 
James Hodson: 00:30:39
And that’s just one random example [crosstalk 00:30:44]. 
Jon Krohn: 00:30:43
Yeah, yeah, yeah, yeah. I’m saying cool to the whole idea. I’m like, “Yeah, this makes a lot of sense.” Keep going, yeah, this is awesome. 
James Hodson: 00:30:52
And we were kind of surprised to find that there wasn’t really any attempt to diffuse knowledge from what I’d say is the main, the primary academic communities for each of these problems, to the computational and nearby academic communities and stakeholder groups, like industry data scientists who want to volunteer and who want to put in their time and maybe they have resources at their company that can make a big difference to some of these things, right? And you can imagine some companies would want to really get involved if their employees are excited about a particular problem or if it’s relevant to their business model. But we were finding these frictions where, for example, if you look at the latest climate change report, right, from the IPCC, you’ll notice that pretty much all of the citations in that report come from one academic culture, despite the fact that we’ve been talking for so long about how AI could have a positive impact on metrics for climate change and how it might help us to better choose interventions and so on, and better model climate impacts by region, hyperlocally and so on. There are maybe two out of hundreds and hundreds, thousands of citations in this report that are machine learning community citations about climate change. And they’re not even the most relevant ones.
Jon Krohn: 00:32:20
Right. 
James Hodson: 00:32:21
So these are highly siloed communities. How do we force that knowledge to come out of these silos? So the first thing that we did, now we’re working also closely with Microsoft teams on this problem. So we essentially start from the Microsoft Academic Graph, which is this large repository, a little bit like Google Scholar, of all academic research papers that are openly available on the web and indexed, together with the Open Academic Graph, which is a partner project of the MAG. Now if anybody hasn’t looked at these data sets yet, they’re a very rich resource of academic work and academic networks, so highly recommend taking a look at that if you’re into understanding how academic knowledge and research knowledge moves over time and is related. 
James Hodson: 00:33:17
Now of course we are particularly interested in building an understanding of all of the open data sets that have been used for research over the years. So for us the abstracts and the citations of these papers were not sufficient. We actually needed to get our hands on the raw text, right, and the results. So after having gone through a lot of loops with our internal legal counsel and having checked out [inaudible 00:33:45] copyright, we wouldn’t be infringing on people’s copyright and that we’re doing everything in the right way, we started going and taking the openly available PDF resources out on the web and processing them so that we could get text out of them, so basic pre-processing pipeline. In fact we’ve been able to get millions and millions and millions of academic papers this way. We build some quite interesting models that are sequence learning models at their core, they’re looking at the text, they’re trying to understand what entities are being discussed, and the objective is to find mentions of data sets.
James Hodson: 00:34:29
So somebody is talking about the ImageNet database, right. So they might start by saying, “Well ImageNet has x00,000 images, covers y set of classes. It was developed at Stanford by x, y, z authors and reference a paper previously about the ImageNet database.” So we want to capture all of that rich information essentially as a frame, right? A frame that tells you what is the data set, who owns it, what are all the places where it’s been used, what are the primary attributes of that data set, how many samples does it contain? And we’re pushing all of this into a unified database that shows you the relational structure of data and then divides it by the topical structure of the papers in which the data are found. So you imagine ImageNet is used everywhere, right, in almost all computer vision applications. There is some model that’s either been pre-trained with ImageNet or is doing something else, fine-tuning with ImageNet, and so for that kind of data set you’re going to see the topical resonance be quite broad. And when we say topics here, we’re usually talking about the sustainable development goals since those are the broad areas that we’re interested in understanding better and they’re the areas that we want to give people access into. There are 17 major topics in that case but each one of the sustainable development goals, like climate action, might have a dozen sub goals, right, that are individual pieces of the puzzle. So we go down into this more granular level. There are interesting technical problems to be solved both at the side of identification of these data sets and then also at the side of how do you understand them in the context of the academic work that’s happened, right? 
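[Editor's note: the "frame" James describes can be pictured as a small record per data set, indexed by the SDG topics of the papers that mention it. The sketch below is only an illustration under assumptions; the class name `DatasetFrame`, its fields, the helper `index_by_topic`, and the paper identifiers are all hypothetical and are not the AI for Good Foundation's actual schema.]

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DatasetFrame:
    """One extracted 'frame' describing a data set (hypothetical field names)."""
    name: str
    owners: List[str] = field(default_factory=list)
    num_samples: Optional[int] = None  # None when no size is stated in the text
    # Maps a paper identifier to the SDG numbers (1-17) assigned to that paper.
    used_in: Dict[str, Set[int]] = field(default_factory=dict)

    def add_usage(self, paper_id: str, sdg_topics: Set[int]) -> None:
        """Record that a paper with the given SDG topics mentions this data set."""
        self.used_in.setdefault(paper_id, set()).update(sdg_topics)

    def topical_breadth(self) -> int:
        """Count distinct SDG topics across all papers that mention this data set."""
        if not self.used_in:
            return 0
        return len(set().union(*self.used_in.values()))


def index_by_topic(frames: List[DatasetFrame]) -> Dict[int, List[str]]:
    """Group data set names by SDG topic number, sorted alphabetically."""
    index = defaultdict(set)
    for frame in frames:
        for topics in frame.used_in.values():
            for topic in topics:
                index[topic].add(frame.name)
    return {topic: sorted(names) for topic, names in index.items()}


if __name__ == "__main__":
    # Placeholder papers and topic assignments, invented for illustration only.
    imagenet = DatasetFrame(name="ImageNet")
    imagenet.add_usage("paper-001", {3, 9})
    imagenet.add_usage("paper-002", {13})
    print(imagenet.topical_breadth())
    print(index_by_topic([imagenet]))
```

[A widely reused data set accumulates many topics in `used_in`, which is one simple way to express the broad "topical resonance" mentioned above.]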
James Hodson: 00:36:37
One issue that we’ve run into which is obviously an issue that most machine learning models run into at scale, is sparsity, right? So there is enormous sparsity in the fact that we’re looking at text and trying to find mentions of rather ethereal things, right? Not all data sets have names. A lot of biology data sets are really part of the experimentation process that’s being described within the paper, for instance. Some physics data sets might have five samples, right, and it might be talked about in a very different way from how we would think about data sets being talked about in the more directly computational data sciences [inaudible 00:37:21]. So out of millions and millions of data sets, you’re talking about tens of thousands of… Sorry, out of millions and millions of papers, you’re talking about potentially tens of thousands of data sets being talked about in different ways across papers but the ultimate goal is you’re a researcher who is interested in issues of gender equality and you want to see whether you can apply your machine learning techniques, right, the research that you’ve done at your core, to this particular topical area. You don’t know anybody who’s ever worked in this area, right? You don’t know what data sets are useful and where they’ve been used in the past and you can’t build on the shoulders of giants, so to speak, right, unless you have all of this contextual information.
James Hodson: 00:38:11
So the point of our sustainable data catalog and kind of open data catalog is to say, “You want to focus on a particular problem area that you find interesting, here’s where you dive in,” right? And then when you can then submit back your own work into it, this becomes a research hub in some sense. It builds on a project that we worked on a few years ago originally at Bloomberg as well which was a unified framework for doing machine learning research which would actually help with organizing results and with sharing results and reproducibility of those results, which was discontinued for a variety of reasons, but what we’re trying to do is make this as light as possible, right? You want to figure out what’s going on in this area, you come here, you get all the information you need, you go away and you do your work, you reach out to the people who are the key people, key stakeholders in this area and it helps you to get involved on problems that are more important maybe, or maybe more impactful too as an individual than you would otherwise get involved with. 
Jon Krohn: 00:39:35
Very cool. All right, that is awesome. As I said from the outset as you started explaining, as soon as you started alluding to what you’d built, I was hooked. It’s such a brilliant idea. So we’ve got, your first study was the DEI case study, the second one is this database of social problems that people could take advantage of, and so I think you’ve got one more for us up your sleeve. 
James Hodson: 00:40:00
Yeah, and this is one thing that we’re actually doing a lot of preliminary work around now. So I’m going to kind of tell you something that’s going to get fairly big over the next year or so. We’ve got now partners who are starting to push on their ends of this as well but it’s around global public health. So obviously very timely. It would’ve been nice if we could’ve done this two years ago, right, so [inaudible 00:40:26] ready, but what we’re hoping is that this will be something that can work as a benchmark for what different public health authorities can do around the world that will make us more responsive and better able to think about the kinds of interventions that would be useful, right, for the next health outbreak. And there will be, as we’re all familiar now I think just the probability of not going through similar events like the last couple of years is quite low, right? We would expect honestly to see this happen more and more frequently over the coming decades and centuries. So it’s time for us to build the social structures so that we can be prepared with the data collection and with the analytics that we need so that we can intervene at the right time with minimal effort, right, and minimal disruption to handle potential outbreaks in the future. 
James Hodson: 00:41:32
So we’re working together, I’d say a merger between an IoT project and a traditional machine learning project. What we found in our early research is that you don’t need a very large number of people in a population to be instrumented and collecting data about their behaviors and their health, in order to be able to make very accurate predictions about the current public health in hyperlocal areas. So essentially this comes down to providing a set of selected volunteers from throughout a population, let’s take the city of New York. If we can find 50,000 individuals in the city of New York with the right attributes, living in the right areas, with the right behavioral signatures in terms of how they go about their daily lives, we can actually give them essentially a type of smart watch which we’re co-developing which will provide us with and them with a lot of granular information about their current health and how this is varying from day to day and how it departs or does not depart from their historical baselines. It also marries that information with all kinds of other things like weather and path tracking and exercise and other things that we’ve become used to expecting now from our wearables, but what it does very well is it can detect micro changes within small post code areas, if you will, like few-block areas, so that we can essentially predict when we’re seeing a change, right, from transmission of disease. 
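[Editor's note: one simple way to picture "departs from their historical baselines" at the level of a small area is to score each participant against their own history and then look at the share of participants in an area who deviate. The sketch below assumes that approach; it is not the project's actual model, and the function names, thresholds, and sample readings are hypothetical.]

```python
from collections import defaultdict
from statistics import mean, stdev


def personal_z_score(history, today):
    """Standard score of today's reading against this person's own history.

    history needs at least two values, since stdev() requires them.
    """
    spread = stdev(history)
    if spread == 0:
        return 0.0  # a perfectly flat history gives no scale to measure against
    return (today - mean(history)) / spread


def flag_areas(readings, z_threshold=2.0, min_fraction=0.5):
    """Return sorted area codes where enough participants deviate upward.

    readings: iterable of (area, history, today) tuples.
    z_threshold and min_fraction are illustrative values, not tuned ones.
    """
    deviating = defaultdict(int)
    total = defaultdict(int)
    for area, history, today in readings:
        total[area] += 1
        if personal_z_score(history, today) > z_threshold:
            deviating[area] += 1
    return sorted(a for a in total if deviating[a] / total[a] >= min_fraction)


if __name__ == "__main__":
    # Invented resting-heart-rate style readings for two hypothetical areas.
    sample = [
        ("area-A", [60, 62, 61, 59, 60], 70),
        ("area-A", [70, 72, 71, 69, 70], 80),
        ("area-B", [60, 62, 61, 59, 60], 61),
        ("area-B", [65, 64, 66, 65, 67], 66),
    ]
    print(flag_areas(sample))
```

[Scoring each person against their own baseline, rather than a population average, is what lets a small instrumented sample surface a local shift.]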
Jon Krohn: 00:43:25
Ah, cool. 
James Hodson: 00:43:27
Now if we can do that at the earliest point, right, when this becomes a trackable item at the individual level and over time we’ll be able to get very good at distinguishing that from fluctuations in your micro climatic issues or things that are maybe not so serious or if we see certain types of symptoms related to colds for instance where we don’t really want to raise an alarm every time there’s a small cold outbreak. But the idea there is really to minimally intervene at the population level while having maximal understanding of the granular dynamics of public health. And the nice thing about this is, we don’t want to have to deal with data privacy issues so everybody who’s doing this will be fully consented in, everybody’s fully aware that we’re collecting this information. In fact, if we’re lucky with how this works out, we’re hoping to change the cost dynamics of local public health authorities considerably, which allows them to actually incentivize these programs by paying individuals to participate, giving them the free equipment and giving them a lot of additional health resources that they can tap into as a result of being part of this program. 
James Hodson: 00:44:49
So we might actually end up seeing people coming and actually wanting to be part of this program or even purchasing the equipment themselves, right, and contributing data back. I know, for example, as somebody who lives in California and who’s subject to wildfires on an increasingly regular basis. One of the first things I did was to go out and buy a scientific air quality sensor for my backyard and start contributing the data back to the local government, because they only had one sensor in the entire East Bay. 
Jon Krohn: 00:45:23
Wow. 
James Hodson: 00:45:24
Now there’re about five or six sensors in East Bay and it makes a difference because the micro climates around here are insanely different from block to block. Same with public health, right. Just because you live nearby somebody doesn’t mean you’re going to spread disease to them. A lot of it depends on your behavioral signatures. So again, it’s one, it’s physical environment with data and machine learning in order to try to minimize essentially the impact of pandemics in the future. Minimizing the cost to local public health authorities and also hopefully give individuals a boost from a health and behavioral perspective with their wearables. 
Jon Krohn: 00:46:08
Amazing James, that’s so cool. James, thank you for running us through those case studies that the AI for Good Foundation does, you’ve got the DEI initiatives, the database with social problems that people can use and take action with and we’ve got the global public health case study that you just went over, so I’m sure we have a lot of listeners that are wondering how they can get involved with the organization. So there might be technical experts out there in data science or machine learning. And there might be people who are experts in these particular kinds of issues like DE&I that might want to contribute. So how can they contribute? And I also know that in particular you could use a hand with fundraising, so I’d like to ask people who can get involved with that as well. 
James Hodson: 00:46:56
Great, no that’s a very good question and there are a few different areas where we’ve tried to make the organization accessible to the data science community more broadly as well as the policy community and the academic community in particular. We do run a volunteer program, so anybody can go to our website, ai4good.org and sign up to that, also please we have a monthly newsletter which has a lot of different opportunities for getting involved with our organization, so if you sign up to our newsletter you’ll see things on a regular basis, which could be interesting. The newsletters follow SDG themes, so each newsletter is on a particular theme. The newsletter for October is about climate action, so you can go into our archives and see that and all of the future newsletters are also around SDGs, so if you’re interested in a particular topic, definitely that’s one way to get involved. Our website obviously contains a lot of programmatic information, videos, blogs and other things that you can go and check. 
James Hodson: 00:48:09
Now as an AI focused organization it’s often assumed that we’re extraordinarily well funded, but we’re not extraordinarily well funded and the reason for that is that we’ve tried to build an organization that is independent of sort of particular agendas that might exist in society. We didn’t want to be funded by one or two kind of big institutional or individual sources because we felt that that would take away from the ability to really solve underlying problems and be able to actually get involved in many areas that are sensitive to the stakeholders and where the money is coming from. So from the very beginning, our intention was to be individually supported by members. Now over time what we’ve realized is, there’s a good mesh of individual contributions, people actually providing us with small donations, and working closely with corporate stakeholders on areas that are especially important to those corporations. Where there is actually an impetus to solve a problem which is beneficial for a for-profit entity as well as for the SDGs and for us as a non-profit entity. So we also look for synergistic relationships with corporations where doing something that is for good from our perspective, right, can actually be a product for them and can help them achieve their for-profit objective. 
Jon Krohn: 00:49:59
Right. 
James Hodson: 00:49:59
So, if any of your listeners work at an organization and they think, “Well, you know actually it’s really synergistic with SDG 12,” right? Getting us in touch and starting a conversation about how we can work together would be incredibly valuable as well. 
Jon Krohn: 00:50:20
Lovely. So that’s brilliant. Lots of ways to get involved with the AI for Good organization and make a social impact. Now, James something that has come up a couple of times in this episode is that AI for Good isn’t the only thing that you do. So you mentioned that you’re an angel investor for example in several startups. Cognism is one that came up in particular where you serve as chief science officer and you have a history as a research manager for AI at Bloomberg, so you’ve been involved with a lot of hiring and so I’d love to ask a really popular question with our listeners is, what do you look for in terms of hard and soft skills in the data scientist that you hire? 
James Hodson: 00:51:14
So maybe the best way for me to answer this question is to give a little bit of background about my management style so that you have some context for where I’m coming from with what I look for when we’re hiring. So I’m the kind of manager that prefers to see leadership and ownership of projects be taken on by the people who are actually working on them. So I don’t micro manage people and as a result, I need to find people who are very much interested in building things and in owning those things and making sure that they are never going to fail, right? And that they work as well as they could possibly work. That means thinking creatively. Right, identifying alternative solutions, experimenting, finding the data sets that they need and if it doesn’t exist, building the data sets that they need. Forging the relationships that they need in order to make something successful. What we see happening more and more frequently in the education institutions that we work with is that data scientists and even kind of computer scientists and statistics graduates are being taught more and more how to use specific tools rather than how to think about the world through the data that’s created by it, in terms of how to go about solving problems. 
Jon Krohn: 00:52:44
Right. 
James Hodson: 00:52:44
So what I’ve noticed is that for instance pandas, JupyterLab, some Apache Spark and other cloud tools are basically replacing a lot of the knowledge that people used to have about computational complexity, about algorithmic design, about data structures. So basic computer science concepts that are key to being able to make, I’d say reasonable decisions about how to build the underlying infrastructure that supports your machine learning model are often missing blocks from the knowledge that we encounter when we’re hiring. Now this is worrisome for a few reasons. One is that it makes the code that eventually comes out of the community much, much harder to maintain, much harder to co-develop or kind of extend and work with. It also ends up meaning that we use an enormous amount of computation, right? And larger machines than we need and more time than we need and often solve problems in a way that’s more complex than we would if we had thought about it through kind of a more traditional computer science lens. 
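[Editor's note: a small, generic illustration of the kind of data structure choice James is pointing at. Both functions below return the same result, but the first scans a list for every lookup while the second builds a hash set once. This example is the editor's own and was not discussed on the show; the function names are hypothetical.]

```python
def shared_ids_quadratic(left, right):
    """O(len(left) * len(right)): each 'in' check scans the whole right list."""
    return [x for x in left if x in right]


def shared_ids_linear(left, right):
    """O(len(left) + len(right)) on average: one set build, then hashed lookups."""
    right_set = set(right)
    return [x for x in left if x in right_set]


if __name__ == "__main__":
    left = [1, 2, 3, 4]
    right = [3, 2, 9]
    # Same answer either way; only the amount of work differs as inputs grow.
    print(shared_ids_quadratic(left, right))
    print(shared_ids_linear(left, right))
```

[On inputs with millions of records, the difference between these two versions is the difference between seconds and hours, with no change in tooling.]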
James Hodson: 00:54:10
So as a result of that, often we end up testing for these core computer science concepts as a prerequisite for even the data science roles that we hire for and I mean that both from an AI for Good perspective, I mean it from a Cognism perspective and I also mean it from a UC Berkeley perspective where I spend a lot of time mentoring research assistants and working with my co-authors at UC Berkeley. Now the other side that I hinted at just before is obviously the social side of being a data scientist, right? And that social side is about the leadership skills required with work. Now being an individual contributor doesn’t mean that you cannot be an owner of the project. And owning a project means marshaling resources, building connections, building relationships, finding ways to develop novel pipelines. Not necessarily novel machine learning methods, but combining knowledge and combining things that are out there in ways that are not out of a cookie cutter system, a templated system. And we often run into people that just think in templates. They’ve been taught three ways to solve data science problems and they need to apply those in a fairly blind manner in order to feel that they’ve done something. 
Jon Krohn: 00:55:38
A hammer looking for a nail. 
James Hodson: 00:55:40
Yeah. And the worrying thing is that I see this more and more frequently. I was actually just having a conversation a few days ago with another friend who does a lot of hiring in this area and he and I both deal with the top universities in the US and in Europe and it’s just interesting to see how the focus on quality of understanding of the underlying concepts has basically disappeared in the last few years because I would conjecture that the demand for data science talent is so strong, right, across all industries that it’s difficult to keep up generating data scientists, right, without diluting the content and the kind of required level of understanding and depth. So what I would recommend is to really try to get the fundamentals down, right? And to become somebody who is both a systems analyst and a data scientist. Right, able to look at the world and look at the problem and think about it in terms of the real dynamics of that problem, right, and what is available to solve it even outside of machine learning, right? Because you don’t have to use machine learning for every single data science task. 
Jon Krohn: 00:57:08
Mm-hmm (affirmative). 
James Hodson: 00:57:09
But you do need to be able to think about the world through the data that’s available, right? And how it’s represented. 
Jon Krohn: 00:57:16
Very well said, and I have detected in the market that same gap in what you described as the fundamentals, so I’ve created something called a machine learning foundations curriculum that covers linear algebra, calculus, probability theory, information theory, algorithms, data structures and optimization in a way that is tailored specifically to machine learning applications, so yeah. So I couldn’t agree with you more, bit of a shameless plug there, but it does seem to really have value and so it’s nice to kind of hear you say that you see this gap because I’ve noticed a lot of uptake of this curriculum because I think a lot of people notice that they are in this circumstance where, okay, I’ve been taught like you mentioned, tools like pandas and actually we just landed Wes McKinney as a guest on the show, so he’s going to be coming up soon, the creator of pandas, but tools like that, like pandas, you mentioned Spark, they can allow us to obfuscate considerations about what kind of data structures we’re choosing, how we can optimize those, how the data flows through our model and how that can impact the output and if we stay solely at this really obfuscated level it might mean that we’re missing pieces of what we’re doing. 
Jon Krohn: 00:58:40
It certainly loses the opportunity to be more creative with solutions which I think ties in a little bit to the point you were making there at the end, so yeah. So, yeah I guess it’s in an effort, the kind of the broader market in an effort to deal with this huge gap between data science job openings and available talent, it’s kind of people have come up with curriculums that allow people to quickly develop a lot of these skills but sometimes it means that quick development is a superficial development. 
James Hodson: 00:59:16
Well the other side of that equation is something that we found in research at kind of at the edge of machine learning and social sciences which is an area where I do a lot of academic research is that the number of computer science graduates in the US has had a static rate of change over the past 30, 40 years whereas the demand for computer science graduates in the US has basically become exponential. 
Jon Krohn: 00:59:47
Right. 
James Hodson: 00:59:48
So the computer science programs are nowhere near being able to keep up, which means the vast majority of data scientists come from other disciplines. They were never exposed to the underlying fundamentals of a computer science curriculum. And they wouldn’t have been in the computer science curriculum because there wasn’t enough space. 
Jon Krohn: 01:00:09
Right. 
James Hodson: 01:00:10
So it’s a capacity issue almost at that level. I also want to mention very quickly that I think pandas is an amazing tool. 
Jon Krohn: 01:00:19
Yeah, we’re not suggesting anything otherwise. 
James Hodson: 01:00:23
I think [inaudible 01:00:23] did a fantastic job getting it to be open and available to everybody and for econometric analysis it makes the world a much nicer place to live in than Stata for instance, but it would be nice that from a data science education perspective we don’t forget that people need to have an understanding of what it’s built on, right? Great to hear about your curriculum, it sounds like we need to talk more because we’re also developing these types of resources and it would be good to do something together on that front. 
Jon Krohn: 01:01:00
Nice, yeah for sure I will share those with you. And for listeners I’ll also share a link in the show notes. Brilliant. All right, so James I’ve learned so much in this episode, I am blown away by the breadth and the impact of the work that the AI for Good Foundation is doing, so thank you so much for sharing those case studies in detail with us. My last question for you is, do you have a book recommendation for us? 
James Hodson: 01:01:33
Can I give you two book recommendations? 
Jon Krohn: 01:01:35
Absolutely. 
James Hodson: 01:01:35
So I guess the first one, I did a kind of philosophy and computer science undergraduate degree and I focused a lot on linguistics while I was doing this, but the idea of scientific method. So something that we often don’t think about when we’re doing data science is how to be more scientific about how we do data science. And one of the books that changed my perspective on science when I was fairly young, like 13 or 14, was a book by Bryan Magee, it’s called Karl Popper and it’s about Karl Popper and his work on establishing scientific method and what it means to create workable hypotheses and how we go about actually creating new knowledge in society and it’s only maybe 80 pages long, it’s a tiny book. Well it’s not illustrated and I know Jon that maybe if we could make a comic book version of it it would have higher uptake, but… And the other book that I’ve recommended to every new joiner that’s ever worked for me and it’s one of the best books for data scientists, computer scientists and AI people and it’s extraordinarily difficult to find and I’m sure that I’m about to say this and it’s probably going to triple in price on Amazon as a result because it’s no longer in print. I believe the last time I paid $150 for it. It’s called Heuristics and it’s by Judea Pearl. 
Jon Krohn: 01:03:20
Oh. 
James Hodson: 01:03:21
[crosstalk 01:03:21] big big name in the area of AI and logic and inference. One of my favorite people in this discipline and somebody that you should follow before you follow me, but his book on heuristics was basically the reason I got involved in AI in the first place. It was my inspiration and it’s still a great desk reference for a variety of problems, so if you get a chance to steal it from somebody, do it, right, ethically I think it’s okay. 
Jon Krohn: 01:03:56
Super cool. Well the publisher should get the message and create a new edition. It sounds like a super valuable book, Judea Pearl is brilliant and I wasn’t even aware of that book’s existence by him. All right James so as you mentioned, so you’ve also got that recommendation to follow Judea Pearl wherever he is on Twitter or whatever. I’m sure we can find that and put it in the show notes as well, but James how can people follow you? 
James Hodson: 01:04:25
So you can definitely, obviously connect to me on LinkedIn, right? You can get connected to us at the AI for Good Foundation and for anybody that wants to reach out to me directly, be happy to include email in the show notes and we’re always happy to have people reach out directly and have a chat. 
Jon Krohn: 01:04:47
Brilliant, really appreciate that James. I don’t think anyone’s made that offer before. And so that’s great, thank you, we will include your email address in the show notes, thank you so much and yeah thank you so much for being on the show. Hopefully we’ll have you on again sometime soon and we can see how these socially beneficial projects that you’re leading are coming along. 
James Hodson: 01:05:07
Wonderful, no it’s been a really great conversation and really appreciate the invite and look forward to talking more. 
Jon Krohn: 01:05:19
The work James is doing is so inspiring and he communicates this work so clearly and effectively. I hope you enjoyed this episode as much as I did. During it James described how his AI for Good organization is tackling the UN’s sustainable development goals, including by creating quantitative DEI scorecards across over a million US companies and public entities. He talked about collating a database of social problems and who’s tackling them so that anyone can get up to speed and get involved, and about developing sensors that can detect population wide changes in human health, thereby predicting and hopefully getting ahead of the onset of pandemics. He also talked about the soft skill he looks for in data scientists he hires, namely being able to recombine ideas creatively, and he talked about the hard skill he looks for in data scientists, specifically around the computer science subject of data structures and algorithms. 
Jon Krohn: 01:06:17
On that note, if you happen to have a subscription to the O’Reilly learning platform, my Data Structures, Algorithms, and Machine Learning Optimization course was published there over the summer. This course enables you to shore up your computer science specific skills with particular application to data science and machine learning, that is, the particular hard skill gap that James identified as commonplace. Eventually I will make this data structures and algorithms content freely available via my personal Jon Krohn YouTube channel as well as via my usually pretty darn cheap math for machine learning course on Udemy, but it could be a year or more before I have the opportunity to do that filming because my hands are currently much more than full while I focus on writing my math for ML book alongside my day job and hosting this very podcast. All right, admittedly shameless plug that you may have found valuable over. 
Jon Krohn: 01:07:11
As always you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for James’ social media profiles as well as my own social media profiles at www.superdatascience.com/519. That’s www.superdatascience.com/519. If you enjoyed this episode I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel. I also encourage you to let me know your thoughts on this episode directly by adding me on LinkedIn or Twitter and then tagging me in a post about it, your feedback is invaluable for helping us shape future episodes of the program. 
Jon Krohn: 01:07:49
Thanks to Ivana and Mario, JP, Jaime and Kirill on the SuperDataScience team for managing and producing another inspiring episode for us today. Keep on rocking it out there folks and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 