SDS 709: Big A.I. R&D Risks Reap Big Societal Rewards, with Meta’s Dr. Laurens van der Maaten

Podcast Guest: Laurens van der Maaten

August 29, 2023

Get ready for an extraordinary episode of the Super Data Science Podcast! In this captivating installment, we’re joined by Dr. Laurens van der Maaten, a Senior Research Director at Meta, who takes us on a journey through the fascinating world of AI. From pioneering dimensionality reduction techniques to unlocking the potential of privacy-preserving ML and tackling monumental challenges like climate change, he shares expertise and insights that will leave both seasoned data science practitioners and curious minds inspired.

Thanks to our Sponsors:
Interested in sponsoring a Super Data Science Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Laurens van der Maaten
Laurens van der Maaten is a Senior Research Director at Meta AI (FAIR team). He supports a team of world-leading researchers, engineers, and designers who are developing the AI technologies of the future. Together with Geoffrey Hinton, he invented the t-SNE algorithm for dimensionality reduction, which has since become a widely used tool for data visualization. Laurens was also the lead developer of the CrypTen framework for privacy-preserving machine learning. His work received Best Paper Awards at the CVPR 2017 and UAI 2021 conferences.
Overview
To kick things off, Laurens takes us down memory lane to one of his earliest endeavors with the team – a colossal project involving large-scale learning of image recognition models from web data. Armed with a staggering number of weakly-labeled images, he rewrote the rulebook for machine vision systems. He offers a glimpse into the infrastructure that powered this project (back when TensorFlow and PyTorch did not exist) and that resulted in significant advancements in image recognition accuracy.
Next, Laurens unveils his involvement in de novo protein design, another challenging project aimed at creating proteins that don’t exist in nature. His team’s approach employed language modeling on extensive protein datasets, with potential applications spanning from drug discovery to designing enzymes for specific purposes.
Laurens then invites listeners to explore his CrypTen framework, an innovative concept resembling PyTorch in functionality but designed to ensure secure computations within the realm of machine learning. He also sheds light on the role of AI in climate change mitigation and the simulation of wearable materials for augmented-reality applications. By applying AI to such pressing global challenges and merging it with reality augmentation, Laurens emphasizes the transformative capabilities of AI.
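To give a flavor of what “secure computations” means here: CrypTen builds on secure multi-party computation, one building block of which is additive secret sharing. The sketch below is a minimal pure-Python illustration of that idea, not CrypTen’s actual implementation (which operates on encrypted PyTorch tensors); the modulus and party count are arbitrary choices.

```python
import random

P = 2**61 - 1  # large prime modulus (an illustrative choice, not CrypTen's)

def share(value, n_parties=3):
    """Split an integer into additive shares that sum to value mod P.
    Any subset of fewer than n_parties shares reveals nothing about it."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares mod P."""
    return sum(shares) % P

# Each party holds one share of each secret value.
a_shares = share(42)
b_shares = share(100)

# Parties add their shares locally to obtain shares of a + b,
# without anyone ever seeing 42 or 100 in the clear.
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 142
```

Multiplication and comparisons require extra protocol machinery; hiding that behind a familiar tensor API is exactly what frameworks like CrypTen are for.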
Transitioning to the technical aspects, Laurens takes a closer look at the t-SNE dimensionality reduction technique. This technique reduces the dimensionality of high-dimensional vectors, which can be used for visualization of natural-language token similarity and is also widely used with biological data.
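For readers who want a concrete handle on the technique: the first step of t-SNE converts pairwise distances in the high-dimensional space into conditional probabilities that describe each point’s neighborhood. A minimal sketch of that step, assuming a single fixed bandwidth sigma (the real algorithm tunes sigma per point to match a target perplexity):

```python
import math

def pairwise_affinities(points, sigma=1.0):
    """Convert squared Euclidean distances into the conditional
    probabilities p(j|i) that t-SNE uses to describe neighborhoods.
    A fixed sigma is a simplification for illustration; t-SNE itself
    searches for a per-point sigma matching a target perplexity."""
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sq_dists = [sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                    for j in range(n)]
        # Gaussian kernel over distances; a point is never its own neighbor.
        weights = [math.exp(-d / (2 * sigma ** 2)) if j != i else 0.0
                   for j, d in enumerate(sq_dists)]
        total = sum(weights)
        for j in range(n):
            P[i][j] = weights[j] / total
    return P

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
P = pairwise_affinities(points)
# Nearby points receive most of each other's probability mass:
print(P[0][1] > P[0][2])  # True
```

The low-dimensional embedding is then optimized so that an analogous (heavy-tailed) distribution over the 2D points matches these probabilities.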
Concluding the episode, Laurens shares his forward-looking insights on AI’s trajectory in shaping the future and shares his career advice to those looking to make a similar impact in the world. Whether you’re well-versed in AI or just embarking on your learning journey, this episode offers a window into the potential that AI holds. Tune in to gain insights and expand your understanding of AI’s evolving capabilities.  
In this episode you will learn:
  • Large-scale learning of image recognition models on web data [05:05]
  • Evolutionary Scale Modeling protein models [16:45]
  • Fighting climate change by building an A.I. model [29:49]
  • The CrypTen privacy-preserving ML framework [38:36]
  • Concerns about adversarial examples [53:25]
  • Laurens’ t-SNE algorithm [58:56]
  • How to make a big impact [1:07:25] 
Items mentioned in this podcast:
Follow Laurens:

Follow Jon:

Episode Transcript: 

Podcast Transcript

Jon: 00:00:00

This is episode number 709 with Dr. Laurens van der Maaten, Senior Research Director at Meta. Today’s episode is brought to you by AWS Cloud Computing Services, by Modelbit for deploying models in seconds, and by Grafbase, the unified data layer.
00:00:20
Welcome to the Super Data Science podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:51
Welcome back to the Super Data Science podcast. Today we’re joined by the visionary, world-class, repeatedly game-changing AI researcher Dr. Laurens van der Maaten. Think that sounds bombastic? Well, just you wait. Laurens is a Senior Research Director at Meta overseeing swaths of their high-risk, high-reward AI projects with applications as diverse as augmented reality, biological protein synthesis, and tackling climate change. He developed the “CrypTen” privacy-preserving ML framework. He pioneered “web-scale” weakly supervised training of image recognition models. Along with Geoff Hinton, he created the t-SNE dimensionality reduction technique. Their paper on this alone has been cited over 36,000 times. Laurens’ works have been cited nearly 100,000 times in aggregate. He holds a Ph.D. in machine learning from Tilburg University in the Netherlands. Today’s episode will probably appeal primarily to hands-on data science practitioners, but there is tons of content in this episode for anyone who’d like to appreciate the state of the art in AI across a broad range of socially impactful, super cool applications.
00:01:56
In this episode, Laurens details how he pioneered learning across billions of weakly labeled images to create a state-of-the-art machine vision model. He fills us in on how AI can be applied to the synthesis of new biological proteins with implications for both medicine and agriculture. He provides specific ways AI is being used to tackle climate change, as well as to simulate wearable materials for enhancing augmented reality interactivity. He introduces a library just like PyTorch, but where all of the computations are encrypted. He talks about the wide range of applications of his ubiquitous dimensionality reduction approach, and he fills us in on his vision for the impact of AI on society in the coming decades. All right, you ready for this exceptional episode? Let’s go. 
00:02:45
All right. I’m here with Laurens van der Maaten. Thank you for coming in person to record this Super Data Science episode. It’s a joy to have you on.
 Laurens: 00:02:52
Thanks so much for having me, Jon. 
Jon: 00:02:53
So, we know each other through Alex Miller directly, who I guess reports into you at Meta AI.
Laurens: 00:03:01
He used to. Yeah. 
Jon: 00:03:02
He used to, yeah. So, he was in episode number 663, which was focused on CICERO, this amazing algorithm. I think it was the biggest innovation in AI in 2022, which makes it one of the biggest AI innovations ever, up to this point. And so we’ll talk about that a bit more in the episode. We also had Noam Brown, who was in episode number 569. And that episode kind of laid the groundwork for the kind of stuff that we talked about related to CICERO. So, these kinds of projects all relate to machines being able to do incredible things, not only natural language capabilities like we’re now seeing with generative AI models, but being able to negotiate and being able to anticipate what humans would want to do in a certain situation, even if the machine just has limited information. 
00:03:57
So, Noam primarily talked about poker, and then with Alex, we primarily talked about this board game Diplomacy. And yeah, just really, really, really fascinating, groundbreaking AI. And I don’t know if you have anything particular you want to add on CICERO.
Laurens: 00:04:15
Yeah, I mean, the exciting thing about CICERO, right, is that I think it’s pretty much the first time that we were able to combine, you know, really advanced reasoning and planning, the kind of stuff that people have been doing in game AI for a long time, with natural language communication, right? And so it’s really that combination that I think makes CICERO a really unique achievement. 
Jon: 00:04:40
Yep. And it worked really well. It was a top competitor, at the 90th percentile. 
Laurens: 00:04:48
Yeah, yeah, yeah. Like, the CICERO robot is basically competitive with the top players in the world at the game of Diplomacy.
Jon: 00:04:57
Yeah. And so listeners, you can refer back to episode 663 if you want lots on that project. So, Laurens, you are a Senior Research Director at Meta AI. What are some of the most impactful projects that you’ve worked on in your time at Meta AI? 
Laurens: 00:05:14
I’ve been at Meta AI for eight and a half years, so I’ve worked on many things. So, I don’t know what the most impactful one is. Maybe one of the projects I’m most proud of is some of the work that we did on large-scale learning of image recognition models on web data. The story behind this effort goes back to when I joined Meta back in 2015. And at that time, the common approach to building image recognition systems was: you would collect a set of images and you would manually annotate what is in those images. Like, here’s a coffee cup, here’s a cat, here’s a dog, et cetera. 
00:06:01
And then you would train a model to predict those classes, right. And this was the common approach in the community at the time. And to me that always seemed a little bit silly, or at least, I was putting on my computer vision hat and looking at the internet as like, well, you have all these images and then there’s all this metadata around the images, right? And that metadata tells you something about, you know, what is in that image, right? And you can get that from, you know, text around it or hashtags or whatever information is attached to those images. And so the question that I set out to answer is basically: what can we learn from those images and that metadata? 
00:06:46
And so it was immediately clear to me that in order for that to be successful, we would have to do it at scale, right? The typical metadata that you have is not going to say exactly what is in the image, because it’s just people talking about images or things like that. But it does give you information. Right. And so what I worked on for a couple of years is de-risking various parts of this project. So, like figuring out what type of metadata to use. And what we realized is that hashtags are the most informative type of information, because people often assign them to images in order to increase the searchability of those images, right? So, by definition they capture something about the contents of the images, and they’re available in abundance, right?
00:07:40
But then there are lots of other things that are different compared to your standard image recognition training pipeline. So, for example, hashtags tend to be distributed according to a Zipfian law, right? So, like, some hashtags occur really, really often, like #throwbackthursday, you know. If every other image has hashtag #tbt, it’s probably not very informative, right? But then you have a really long tail of hashtags that don’t find frequent use, but they’re often really valuable, right? And so in your training, you need to correct for that. And then we also needed to build out the infrastructure needed to train at scale, which today is quite commonplace, but in 2015 this didn’t exist yet. There was no TensorFlow, there was no PyTorch. We were working in Torch, but it was in Lua, and, you know, on GPUs that are probably slower than your phone right now. Right. 
00:08:43
And so, like, that took time as well, to figure out how we could scale this up in a way that’s effective. And around 2017, 2018, all of this came together in an effort where we trained these models on about 3.5 billion web images with associated hashtags, and used this data to pre-train the models on hundreds of GPUs, already at the time. And then we would fine-tune these models on image recognition tasks that we cared about. So, an example would be the ImageNet academic benchmark. And what we showed is that this approach led to huge improvements in recognition accuracy, right? 
00:09:34
And that was kind of the first time that we were able to convincingly show that, and it’s an approach that over time has become much more popular. Right. And if you look at all the recent excitement about large language models and so on, it’s basically that same approach, except, you know, applied to language, at much larger scale, with more modern tools and so on. But this was really, I think, a project that I was really proud of, because at the time when we started it, there was a lot of skepticism, like, well, why would you do this? You know, will this work, and so on, right? And so we took the risk, we took the bet, and the bet paid off, right? And that’s kind of, like, you know, the dream as a researcher. 
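The Zipfian correction described above can be sketched as a per-image resampling weight derived from hashtag frequencies. This is a hypothetical scheme (square-root inverse-frequency weighting, with made-up hashtag data), not necessarily what the actual project used:

```python
from collections import Counter

def resampling_weights(hashtag_lists, power=0.5):
    """Give each image a sampling weight that down-weights images whose
    hashtags are extremely frequent (e.g. #tbt) and up-weights images
    carrying rare, more informative hashtags. The square-root
    (power=0.5) inverse-frequency correction is one common choice."""
    freq = Counter(tag for tags in hashtag_lists for tag in tags)
    total = sum(freq.values())
    weights = []
    for tags in hashtag_lists:
        # Weight each image by the rarest hashtag attached to it.
        rarest = min(freq[t] / total for t in tags)
        weights.append((1.0 / rarest) ** power)
    return weights

# Hypothetical data: three #tbt images and one rare-hashtag image.
images = [["tbt"], ["tbt"], ["tbt"], ["redpanda"]]
w = resampling_weights(images)
print(w[3] > w[0])  # the rare-hashtag image gets a larger weight: True
```

During pre-training, images would then be drawn with probability proportional to these weights, so the long tail of informative hashtags is not drowned out by #throwbackthursday.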
Jon: 00:10:21
For sure. And at that time, it would’ve been a hugely expensive project. Convincing people to give you hundreds of GPUs at that time would not have been commonplace. 
Laurens: 00:10:27
Yeah, I mean that as well, right? Like, you need to convince people that this is worth investing in, including the compute investments necessary. But I think often it’s even harder to convince people to work on it, right? Because, you know, people can be reluctant to take risks, right? Like, they want to make sure that the work that they do pans out. And in research you don’t always have that guarantee, right? And in order to deliver the biggest research results, you sometimes have to take a leap of faith. 
Jon: 00:11:07
Nice. Yeah. Let’s get back to that kind of risk-reward trade-off in data science R&D in a second. But just to recap this project, and why I think it’s really fascinating, to make sure I understand it. So, at the time you had the Facebook platform, and I guess it would’ve been Instagram to some extent in 2015 as well. So, in the platform you have huge amounts of images. They’re not labeled in the sense of having had a person come around and definitively say, this is a cat, this is a dog, this is a car. But you could use hashtags, and you investigated different kinds of metadata. But it turned out that the hashtags were a pretty good indicator in cases where it wasn’t a super common one like #TBT; if it says like #beach, there’s probably a pretty good chance it’s a beach scene. And so you used 3.5 billion images and these associated weakly supervised labels to have a rough idea and to be able to train, probably at that time, a convolutional neural network. 
Laurens: 00:12:17
That’s right. 
Jon: 00:12:17
And then once you have your model weights kind of generally primed to this huge amount of possible variability from these 3.5 billion images, you could then subsequently fine-tune on something like the ImageNet dataset, where you do have all these labels, millions of labels, and you get better results than if you just trained on ImageNet alone. 
Laurens: 00:12:41
Exactly. Yeah. Exactly. And what’s particularly cool about it is that in this way of training, because you’ve seen so many images, the visual variety of stuff that you’ve seen is just much bigger than you can ever get from a manually annotated dataset, right? So, like, we are here in downtown Manhattan, and I’m pretty sure every corner of this neighborhood is present in your training dataset, right? And so your model has captured some knowledge, some understanding about that. And so you can use that in the tasks that you finally care about, which is kind of the same thing that is happening now in language models, right? Like, because these models have read the entire internet, they’ve somehow incorporated all that knowledge, and in the fine-tuning, the only thing that you’re doing is surfacing the relevant knowledge, the stuff that you care about. 
Jon: 00:14:18
Are you stuck between optimizing latency and lowering your inference costs as you build your generative AI applications? Find out why more ML developers are moving toward AWS Trainium and Inferentia to build and serve their Large Language Models. You can save up to 50% on training costs with AWS Trainium chips and up to 40% on inference costs with AWS Inferentia chips. Trainium and Inferentia will help you achieve higher performance, lower costs, and be more sustainable. Check out the links in the show notes to learn more. All right, now back to our show.
00:14:21
Yeah, exactly like you’re saying, these huge large language models today relate to this perfectly, because it’s the same kind of idea of training, often completely unsupervised in the natural language case, on like all of the internet. And so that gives some sense, in the same way you’re describing with the image dataset, the 3.5 billion images, every street corner around us in downtown Manhattan is captured. In the same kind of way, the large language models today are capturing every use of language, every kind of context, every kind of meaning of a word. But then we can fine-tune it subsequently as well. Llama 2, which recently came out from Meta AI’s … is it Meta AI? Is it ok to say [crosstalk 00:00:00] 
Laurens: 00:15:13
It’s Meta AI, yeah, yeah. 
Jon: 00:15:14
Yeah, yeah, yeah. And actually, I think we’ve just lined up, as of today, the time of recording, one of the key authors from the Llama 2 paper to come and speak on the show. So, yeah, that’ll be a good one. But we won’t get dragged into that too much today. 
Laurens: 00:15:32
It’s a great project. 
Jon: 00:15:34
Yeah. It’s a really cool project. But quickly, yeah, it has the same kind of idea of taking a huge training corpus. And I think one of the great things that Meta’s done with the original Llama as well as Llama 2 is the cleaning of the data, which is so important to have a great model downstream. And so yeah, with this unsupervised training on a huge corpus with Llama 2, and then, unlike the original Llama, fine-tuning it on these chat use cases with millions, or over a million, the technical paper said, human annotations of conversation. 
00:16:14
And so, yeah, it’s a similar kind of idea. Yeah, I like that analogy that you introduced, from this large-scale image recognition work from years ago now, that same kind of idea popping up today in LLMs. 
Laurens: 00:16:24
Yeah. And of course, I mean, I didn’t foresee that, like, you know, similar approaches would take off in language modeling, right? Like, I don’t think anyone saw that coming at the time. But it’s really interesting that, if you squint at these projects, they’re kind of the same, the same high-level approach and the same high-level motivation.
Jon: 00:16:44
Very cool. Another really exciting project that you’ve been involved with at Meta AI is Evolutionary Scale Modeling, ESM protein models. And maybe something to acknowledge right away: over the course of this episode, we’re gonna cover so many different topic areas. So, is there some way that you can kind of orient us to the body of research that you’ve contributed to? Is there a common thread? I mean, is it just machine learning and you’re just like, what is the coolest thing I can do with machine learning right now, and you end up having all these different kinds of disparate projects? 
Laurens: 00:17:24
Yeah, I mean, I do always like projects where you try to build something and where you try to bring a step change in the capabilities that we have, right? And so, you know, in terms of risk, like, I always want to be on the high-risk part of the spectrum. Like, I’d rather fail miserably than not try at all. Which is a choice, right? Like, other people are gonna be somewhere else on the spectrum. And so I’d say that is the main thing. Like, you know, let’s really build things. I think that is what motivates me. And, you know, thinking about, at a high level, what is wrong with, or what is missing, in current approaches that we could change. And so the ESM project, I think, you know, is a really cool example. You should get someone from the team on your show at some point. Someone like Alex Rives would be a really, really exciting person, I think, to have on the podcast.
00:18:31
The basic idea of this project is trying to do de novo protein design. So, trying to design proteins that are unlike anything that exists in nature. And the way that the team is going about that project is by taking a somewhat similar approach, by doing essentially language modeling on really large datasets of proteins. So, proteins are really just sequences of amino acids. There are about 20 amino acids. And so it’s really a sequence of tokens, and each token can have 20 values. And it’s gotten really easy to get those sequences, right? Like, sequencing proteins has become really cheap. And so there are publicly available databases with hundreds of millions of proteins that were sequenced. And so what you can do is you can train language models on that, and those language models develop a representation of proteins that captures something about the structure of those proteins. And you can use that representation to condition folding algorithms on. And that’s exactly what the team did.
00:19:53
So, what is folding about? Folding is basically, you know, like, you have this sequence of amino acids, but really what you care about is: what is the shape of this protein? The protein starts folding in all sorts of funky ways, and the way the 3D structure of the protein is lined up determines what function it has. So, like, what does it actually do? And so what you want to predict is: what is the 3D structure of this protein, given this amino acid sequence? Because the 3D structure is really expensive to get; it’s only the sequence that you can get cheaply. And so what the team did is they conditioned folding algorithms on these language modeling representations. And what they were able to do is deliver folding prediction results of the same quality as AlphaFold 2, but without homologs. 
00:20:52
So, AlphaFold 2 requires you to specify, well, I want to fold this protein, and here are some other proteins that are kind of similar to the one I want to fold. Those are called homologs. And the folding algorithm that the ESM team developed doesn’t require these homologs. And that is important if you want to do de novo design, right? Like, if you want to design completely new things, then there is nothing, right? There are no comparable proteins that you could use as homologs to condition your folding on. And so it’s really a stepping stone towards doing that type of protein design. 
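The “sequence of tokens, each with 20 values” framing maps directly onto ordinary language-model tokenization. A minimal sketch, assuming a bare 20-symbol vocabulary; real protein models such as ESM also add special tokens for masking and padding, omitted here:

```python
# The 20 standard amino acids, each written as its one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    """Map a protein sequence to integer token ids, the way a language
    model over proteins treats each residue as a token drawn from a
    20-symbol vocabulary."""
    return [TOKEN_IDS[aa] for aa in sequence]

# A short illustrative fragment (M = methionine, A = alanine, ...):
print(tokenize("MALW"))  # [10, 0, 9, 18]
```

From here, masked-language-model training proceeds exactly as in NLP: hide some residue tokens and train the model to predict them from the surrounding sequence.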
Jon: 00:21:38
That’s really fascinating. And it is something that we’ve talked about on the episode a few times this year. So, we had an episode number 643 at the beginning of the year. We had Professor Charlotte Deane, who’s an Oxford University professor. Do you know her? You kind of-
Laurens: 00:21:51
I don’t. 
Jon: 00:21:52
Yeah, you kind of, you nodded so much that I was like, oh, it’s gonna turn out that he knows her too. But her research is focused on this protein folding kind of problem as well. And at the time of recording, she’d been having dinner with Demis Hassabis the night before. And so she was also talking about how, while AlphaFold 2 has the biggest name recognition in this space in terms of protein folding, it is specialized, as you say, to this specific problem, I guess when you have homologs. I wasn’t aware of that detail, but, you know, for dealing with this particular folding challenge.
00:22:27
Whereas Charlotte Deane was describing that for some kinds of proteins, there are complexities that that kind of AlphaFold 2 modeling doesn’t work for. So, in her case, I think it was hemoglobin binding, or maybe it was binding structures in general. And there’s just so much variability in how proteins need to be configured to be able to bind to all different kinds of objects in the body, antigens in the body, I guess. And because, as you’re saying, creating these datasets to actually know what the 3D structure is is very expensive, there are still a lot of unsolved problems there. And so it’s interesting that you at Meta AI with this ESM protein project have been able to do these predictions without the homologs and be able to predict protein structures that we’ve never seen in nature before. When you create these de novo proteins, is there some kind of purpose deliberately in mind?
Laurens: 00:23:30
No, nothing specific. Right now, the use cases of this could be very diverse, from drug discovery to design of enzymes that compost plastics faster. So, it can be all over the place. Mostly, I guess, in medical applications, right? Like, that’s where you’d have the most potential use cases of this work. But we’re just using it as a test case to, you know, build up our experience in things like language modeling and so on, right. And to have different use cases for the same technology, so that the underlying technology that we’re developing becomes more robust. 
Jon: 00:24:25
Yeah. Very cool. The medical application thing, that’s actually something we talked about very recently in episode 707 with Joey Gonzalez, a Berkeley professor. And he was talking about how he’s really excited about the potential for AI in general to be doing exactly this kind of work you were talking about. You know, his vision for how AI could transform the world in the coming decades, a question that I’ll have for you later on in the episode. And yeah, that was one of the things that he’s most excited about. And then beyond just medical, in episode 705 we had folks from Syngenta on, and they were talking about how they use this kind of, they call it generative chemistry, where in their case it’s not necessarily just proteins, but it could be any kind of agricultural compound. And so they’re trying to figure out ways of making crops more fertile, you know, maybe taking some abundant material and being able to convert it into something that could be useful as a fertilizer or a pesticide. Yeah. So, cool applications in agriculture as well. 
Laurens: 00:25:37
Yeah, I think in general, in the chemistry space, there are a lot of exciting opportunities for AI to have impact. And it’s an area that isn’t as heavily studied yet as things like computer vision or natural language understanding. And so that, I think, creates a lot of opportunities. And it’s often what I’m looking for: where’s the white space in the current research environment? Because that’s often where the biggest opportunities are.
Jon: 00:26:08
It’s so cool that Meta gives you the wide berth to explore those kinds of things where, you know, I’m not aware of a de novo protein simulation company being incubated by Meta, you know, but you get to do this kind of research, you get to publish on it, and there could be these really great downstream social impact applications. But then I’m sure, and this might not be something that you can really talk about on air, but I’m sure that doing these kinds of projects, there’s internal things that you discover, like tackling maybe this kind of scale of problem where you’re using amino acids as the tokens, but maybe there’s things that translate. There’s packages that get built that can be used internally for just natural language modeling. And so, I mean, I’m completely making up that example, but yeah, there ends up being-
Laurens: 00:27:01
There are often, like, surprising use cases of things, right? Like, since you bring up chemistry, one of the projects I had the privilege of supporting is the Open Catalyst project, which was started by Larry Zitnick. And the premise of this project is that one of the problems in, you know, the clean energy revolution is the storage of that energy, right? Like, you have solar and wind and so on, but where do you store all that energy until you actually need it? And one of the solutions there is hydrogen storage, and one of the limiting factors there is the efficiency of these hydrogen reactions, where you, like, you know, convert electricity into hydrogen and vice versa.
00:27:52
And so he started the project to basically do AI-guided design of those catalysts, right? So, instead of designing proteins, we’re designing, you know, small molecules that catalyze these reactions. And that work, you know, turned out to be super useful for all sorts of other problems in the material design space, right? At Meta we build, you know, headsets. We’ve talked a lot about the metaverse and so on. And so, if you develop those kinds of wearables, there are a lot of material design problems that you have. And what we found is that, you know, there are those kinds of surprising use cases that can pop up, right? Like, that you don’t realize exist on the day you start the work, right? 
00:28:49
And so in research, you sometimes just have to take a bet: “Hey, this seems like a useful direction to go into.” And you have to, you know, have conviction that over time this will become useful for the company as well. And it happens more often than not, I would say. 
Jon: 00:29:06
Very cool. 
00:29:44
Deploying machine learning models into production doesn’t need to require hours of engineering effort or complex home-grown solutions. In fact, data scientists may now not need engineering help at all! With Modelbit, you deploy ML models into production with one line of code. Simply call modelbit.deploy() in your notebook and Modelbit will deploy your model, with all its dependencies, to production in as little as 10 seconds. Models can then be called as a REST endpoint in your product, or from your warehouse as a SQL function. Very cool. Try it for free today at modelbit.com, that’s M-O-D-E-L-B-I-T.com
00:29:47
Yeah. And I don’t know if you want to give more into these kinds of research efforts at the intersection of AI and climate technology or AI and augmented reality, maybe the AI and climate technology first. 
Laurens: 00:29:58
Yeah. So, I already mentioned this Open Catalyst project a little bit. I’m mentioning that project because it’s been quite visible externally. So, we’ve released a lot of datasets that are used by the academic community, not just in AI, but also in computational chemistry. And those have been really helpful there in terms of understanding how different catalysts behave. So, the underlying idea is that if you have a catalyst, what you can do is you can basically simulate the chemical reaction using a method called DFT. And this works great, it’s just very computationally intensive, so it can take, you know, many days to simulate one reaction. And now if you have to choose from millions or even billions of possible candidate catalysts, right, there’s just no way that you can run all those simulations in a reasonable time.
00:31:04
And so the idea is like, can we train machine learning models to predict the outcomes of those simulations in a much more effective way? And the way we go about this is we run those simulations on a bunch of examples. So now we have examples, right, of like, okay, this is a catalyst, and if you use it in this reaction, this is the outcome of that reaction. And those training examples we can use to train machine learning models, right? And in practice in this project, we use graph neural networks. And so now we can train up these models, and if we have a new catalyst, we can make a prediction really quickly without running the expensive chemical simulation.
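The simulate-then-learn workflow described here can be sketched with a toy surrogate model. Everything in this snippet, the simulator function, the scalar "descriptor", and the nearest-neighbour predictor, is an invented stand-in; the actual Open Catalyst work trains graph neural networks on DFT outputs over atomic structures. The sketch only illustrates the three steps: simulate a few candidates, fit a cheap model, then score new candidates without simulating.

```python
import math

# Toy stand-in for an expensive DFT run: maps a catalyst "descriptor" to a
# simulated outcome. (Real DFT takes days per reaction; this is a formula.)
def expensive_simulation(x):
    return math.sin(3 * x) + 0.5 * x

# Step 1: run the expensive simulation on a small batch of candidates to
# build a training set of (descriptor, outcome) pairs.
train_x = [i / 20 for i in range(21)]
train_y = [expensive_simulation(x) for x in train_x]

# Step 2: a deliberately simple surrogate -- return the outcome of the
# nearest simulated example instead of running a new simulation.
def surrogate(x):
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]

# Step 3: score a brand-new candidate instantly.
print(surrogate(0.52))
```

The design choice the transcript highlights is exactly this substitution: the surrogate's answer is approximate, but it costs microseconds instead of days, so millions of candidates can be screened and only the best few sent to the real simulator.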
Jon: 00:31:55
Yeah. Very cool. So, it’s DFT, is that-
Laurens: 00:31:59
DFT, yes. 
Jon: 00:32:00
Yeah, yeah, yeah. Is the expensive simulation, and then, yeah, this makes perfect sense. So, you develop some kind of, it sounded like a graph neural network, to estimate the mapping from input to output much more compute-efficiently. Yeah, it makes so much sense, a great application of that kind of technology. And then, yeah, augmented reality technology, that is something that we associate with Meta. That’s, you know, something we closely associate. So, yeah. So, how does AI relate to AR?
Laurens: 00:32:34
So, I think it can be in different ways. So, one I think is sort of like material design, right? 
Jon: 00:32:42
Right, right. The wearables, right? 
Laurens: 00:32:43
Similar problems that we’re trying to apply similar methods to. But then there are also things related to sort of contextual AI, right? So, like, how can these wearables understand what it is you’re trying to do and help you perform those tasks? And this is something that Michael Abrash at Meta has been talking about a lot, in sort of rethinking basically how human-computer interaction works in the metaverse, right, like in the world of augmented reality. It seems pretty clear that AI is going to play a big role in that. And so we’re trying to pioneer that field as well. Yeah. 
Jon: 00:33:36
Nice. Very cool. Nice. So, I am sure it’s clear to all of our listeners that you’re working on so many different fascinating research areas. In order to be working on all these different projects, you’ve got teams at multiple US locations and European locations. How do you manage all of this? Like, how do you strike the balance that you need to strike in order to be getting deep in the weeds on some things and helping out, or, you know, just dealing with the regular administration of management? You know, like what’s it like being such a senior person overseeing so many different projects with so many different people in so many different locations?
Laurens: 00:34:15
Right, right. I mean, it’s about finding a balance, right? So, there are projects that, you know, I cannot get very deeply involved in. So, for example, the CICERO project, I didn’t have very deep technical involvement in the project, but what I’m trying to do is create the environment in which the teams working on these projects can succeed, right? And that entails, you know, making sure that these teams have the right resources to be successful, but also encouraging the teams to take bigger risks. That, I think, is a big part of my role in supporting those teams.
00:35:02
The geographic distribution? I think being in New York is a huge advantage, because what I’ve found is that since Covid, you know, remote work and so on has become a lot easier, right? And so, like, having distributed teams in a way isn’t that hard. The only thing that’s difficult and that you cannot really fix is time zone differences, right? And so, like, working between Europe and the West Coast of the US is just really hard because of the eight- to nine-hour time difference. But being in New York is kind of an advantage there. Because in the morning you can work with folks in Europe, and then in the afternoon or evening you can work with folks on the West Coast, so that really helps. 
Jon: 00:35:49
Yeah. Yeah. Yeah. I couldn’t agree more. I experienced the same thing working with, yeah, with teams in Europe as well as in Africa. A lot of our team, half of our team actually, at my company Nebula is in Sub-Saharan Africa.
Laurens: 00:36:04
Awesome. 
Jon: 00:36:04
And so same time zone as Europe, and it’s really nice being able to work with them. But then we also have a lot of clients, prospective clients, investors that are on the West Coast. And so, yeah, I think there’s a lot of great reasons to be in New York and time zone is another one.
Laurens: 00:36:17
Gotcha. I actually go to Africa every year to teach in a program called the African Masters in Machine Intelligence, which is a program that was started five or six years ago by Moustapha Cissé, who used to be a researcher at Meta. And it’s a program to bring AI education to Africa. It ran in Rwanda for the first couple of years, and now it’s running in Senegal. And I go there every year to teach sort of an introduction to computer vision. So, really excited to hear, you know, about the employment opportunities for those students that your company’s creating.
Jon: 00:37:03
Yeah, it’s very cool. It was something that, you know, for us as well, prior to the pandemic, we were a hundred percent in person in New York, and with having to go to this distributed model, you know, there’s things that I don’t love about it. Like I don’t love not being around a whiteboard with my team and seeing them every day. And also just those social elements of kind of knowing what’s going on in the lives of everyone on my team and being able to celebrate those with them and help them through things that are tougher in their personal lives as well. You know, people don’t want to, or at least I think data scientists and engineers, you know, with all the time that we tend to be spending in Zoom meetings already, you rarely spend extra time just getting caught up on people’s personal lives.
00:37:48
But one of the really great things has been being able to leverage talent all over the US and the world. And we’ve found amazing engineering talent in Sub-Saharan Africa. So, yeah. So, if other folks are out there having a tough time finding the right engineers, that’s another place to look. All right, Laurens, another topic area that you’ve covered at Meta, which I guess kind of sits across all of these different topic areas, because privacy is something that concerns every possible area. And I imagine, particularly with Meta dealing with so much personal data, privacy’s got to be top of mind. So, you’ve been involved, you’re part of the team that developed CrypTen, C-R-Y-P-T-E-N, the privacy-preserving machine learning framework. So, what’s this project all about?
Laurens: 00:38:47
So, CrypTen, you can think about it as basically PyTorch. So, like, as a developer, what you would write is essentially PyTorch code. And the goal that we were trying to get to, and got really, really close to, is that you can take your PyTorch code and you can replace import torch by import crypten as torch and your code just runs, except the computations that you’re performing are performed on encrypted data. So, like your whole neural network training is happening on data that you don’t really understand, and the weights you don’t really understand either. And what’s implemented under the hood is a set of techniques called secure multi-party computation. And the basic idea is that it enables two different parties, for example, you and me, to together perform computations on data that we don’t understand. 
00:39:46
And sort of the key principles of those approaches are pretty easy to understand. So, let’s say you have some data, let’s say the number 12, and you want to do computations on the data with me, but you want to keep your data private, so you don’t want to tell me about the 12 that you have.
Jon: 00:40:07
I certainly wouldn’t. 
Laurens: 00:40:08
No, no, don’t want to know your 12, right? So, what you do is you draw a random number, let’s say 7, and you give the seven to me. So, now I have a random number, I don’t understand anything. And what you do is you compute 12 minus 7. So, now you have 5. So, now together we have a representation of the 12, but I don’t know that the underlying number is 12. And actually you don’t know it either, right? Like if you had sort of thrown away the 12. So, why is this an interesting representation? Well, it’s an interesting representation because you can still do computations on it. So, for example, if we wanted to multiply your 12 by 2, I can multiply my 7 by 2, you multiply your 5 by 2, and together we have a representation of 24, right? “Secret-shared” is sort of the term that’s used to describe this way of sharing data.
00:41:10
If we had two of these numbers, so let’s say you had another number, say 6, and I got 3 and you got 3, and we want to add this 6 and the 12. Well, I can add my 7 and the 3, and you can add your 5 and the three. So, now I have 10, you have 8, and together we have 18, right? So, we have the sum of 12 and 6, right? So, we can perform addition as well. And we can even multiply these kinds of secret-shared numbers. The protocol to do that is a little bit more involved, so I’m not going to explain it in full detail, but you can do that as well. Right? So, now what do we have? We have a representation of numbers where, you know, both of us don’t understand the underlying data, but we can do addition and we can do multiplication. And really that’s all you need, right?
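The share-splitting arithmetic in this walkthrough can be sketched in a few lines of plain Python. This is a toy: two parties, additive shares modulo a fixed range (the transcript's 12 = 7 + 5 example, generalized so shares stay uniformly random), and only multiplication by a public constant. CrypTen's real protocols, including private-by-private multiplication, are considerably more involved.

```python
import random

MOD = 2**16  # work modulo a fixed range so each share is uniformly random

def share(secret):
    """Split a secret into two additive shares: a + b == secret (mod MOD).
    Either share on its own is just a random number and reveals nothing."""
    a = random.randrange(MOD)
    b = (secret - a) % MOD
    return a, b

def reveal(a, b):
    """Combine the two shares to recover the underlying value."""
    return (a + b) % MOD

# Your 12 and your 6, each split between the two parties.
a1, b1 = share(12)
a2, b2 = share(6)

# Addition: each party adds the shares it holds, locally, no communication.
print(reveal(a1 + a2, b1 + b2))   # 18

# Multiplication by a public constant: each party doubles its own share.
print(reveal(2 * a1, 2 * b1))     # 24
```

Nothing either party holds on its own ever equals 12 or 6; only the combination of both shares does, which is the property the cryptographic security proofs formalize.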
Jon: 00:42:00
Yeah, yeah for neural networks, for sure. 
Laurens: 00:42:03
Well, I mean, you need other functions, right? Because you need to be able to compute values and you need to be able to do normalization and so on. But essentially you can build all of those up from addition and multiplication, sort of in the same way that, you know, CPUs do that, right? Using all sorts of approximations, or by doing the multiplication and addition at the binary level, which enables you to implement things like, you know, logical operators, which you need in order to do comparisons and so on. And so what CrypTen is implementing is basically all of that. It implements these super primitive operations, and then on top of that, it enables you to evaluate all the functions that you need in order to do deep learning. And then on top of that, it builds autograd, in the same way that you expect from PyTorch. And then on top of that, it implements a full neural network library that uses the same API as PyTorch.
00:43:08
And what’s cool is that at this lowest level of implementation, it is actually calling into PyTorch again, which means that we can do these computations on GPUs as well, for acceleration. And so really what this whole framework is enabling us to do is deep learning on that 12 of yours and a whole bunch of other numbers that you don’t want to share with me, but together, we can still get the value out of that data, in particular out of the union of the data that we have. 
Jon: 00:43:44
Very cool. So, I guess it would allow different counterparties to even collaborate together. So, you could have two corporate entities that, maybe two pharmaceutical firms that want to be able to pool together research that they’ve conducted separately, but they want to keep the research proprietary to the individual organizations. They could pool their data together and train a neural network model, do inference with that neural network model, even on GPUs without ever having to share the data with each other. 
Laurens: 00:44:16
That’s right. That’s right. And all of this is cryptographically secure. So, there are very strong sort of theorems to prove that this is actually secure and that I will never be able to learn about your 12. Yeah. 
Jon: 00:44:35
Yeah. Very nice. And is there any impact on like computational efficiency?
Laurens: 00:44:39
Yeah, so I mean, this is definitely a lot slower, right? Like that is the sort of big price that you pay. And so we are able to do sort of modern neural networks, but we wouldn’t be able to train a model like Llama 2 in this way. Typically, the computational overhead is in the order of like a 100x, right? 
Jon: 00:45:02
Oh, wow. Yeah. Wow. 
Laurens: 00:45:02
So, it’s quite substantial, despite all the optimizations that we did. And that is mostly because the protocols that we’re using, in particular the private multiplication protocol, which we similarly use in things like convolution or in attention mechanisms and so on, those are quite computationally intensive. And more importantly, they require us to communicate. Like we are going to communicate intermediate results, and that communication takes time. 
Jon: 00:45:32
Yeah. Yeah. Yeah. It makes perfect sense. It would’ve been wild, I guess now in retrospect, having asked my question, if you’d said, yeah, it’s just as fast. 
Laurens: 00:45:40
Yeah, that would be amazing. I mean, then everybody would be using it for every possible use case, right? Yeah. So, that is, I think, not quite feasible, but it is unlocking use cases like the one you just described, where the alternative is to not do the work at all. Right? And so that is really the baseline you’re comparing with. Like, you’re not getting the value at all, or you’re getting it and, you know, you pay a computational cost for it. 
Jon: 00:46:06
Very cool. And CrypTen is open-source. So, our listeners right now can pip install CrypTen and use it just like the PyTorch interface that they’re used to for building, training, and deploying deep learning models.
Laurens: 00:46:17
Yeah, for sure. 
Jon: 00:46:18
Super easy. That’s amazing. Thank you. Great resource there and lots of applications, no doubt as well. 
00:47:07
This episode is brought to you by Grafbase. Grafbase is the easiest way to unify, extend and cache all your data sources via a single GraphQL API deployed to the edge closest to your web and mobile users. Grafbase also makes it effortless to turn OpenAPI or MongoDB sources into GraphQL APIs. Not only that but the Grafbase command-line interface lets you build locally, and when deployed, each Git branch automatically creates a preview deployment API for easy testing and collaboration. That sure sounds great to me. Check Grafbase out yourself by signing up for a free account at grafbase.com, that’s g-r-a-f-b-a-s-e dot com 
00:47:11
With a bit of a technical question here: our researcher dug up that something called Fisher Information Loss is a better alternative to Differential Privacy. So, thanks to Serg for digging that one up. Do you want to dig into what that means? 
Laurens: 00:47:28
Sure. Yeah. I don’t know about better, I mean, it’s definitely different. You know, I don’t want to upset all the Differential Privacy fans in your audience. So, what this is about: what we just talked about is cryptographic security or cryptographic privacy, right? So, here we do computations together and we don’t change the computations or the data or the results in any way, but all the computations are encrypted, right? And so even the end result of the computations that we did on your 12 is encrypted, right? And so at some point we need to open up that result, right? Like, we will need to decrypt something in order to be able to see what the result is, right?
00:48:14
And at that point there is information leaking, right? Like information about sort of the training data that went in and so on. And so that’s where you want to use statistical privacy, so things like Differential Privacy, to measure and control the amount of information leakage that happens there. And the sort of standard way of doing that is through Differential Privacy, which is basically an approach that asks: if I have a data point, can I answer the question, was this data point used in the computation that I just performed? Which is called membership inference. So, like, was this data point part of the data that was used to, you know, train our deep network, or whatever it is we were computing together.
00:49:13
And that is an interesting definition, but it’s also a little bit weird, because it sort of assumes that you already have the data point that we cared about keeping private, right? But if we cared about keeping it private, like, why did you already have it? Right? And so there’s a question of, like, is that the right formulation of statistical privacy? And so what we did in Fisher Information Loss is we propose an alternative definition of statistical privacy. That says more like, “Hey, if I give you, for example, the trained model weights, can you infer, like, can you back out any of the training examples?” So, like, can you reconstruct those directly from the model weights, or from the parameters or the outputs of the computation that you’ve just performed?
00:50:03
But we’re not assuming that you have that data point already. And so this is a different type of attack on the privacy mechanism, which is called data reconstruction. And Fisher Information Loss provides a strict guarantee on your inability to do that type of data reconstruction, in the same way that Differential Privacy provides a strict guarantee on your inability to do membership inference. And so these are two different definitions of statistical privacy. And people should use the one that’s most suitable for the privacy use case that they care about.
Jon: 00:50:47
Right? Right, right. Okay. Very cool. And yeah, nice for you to be able to get deep in the weeds there for a bit on these differences in statistical privacy options. So, it sounds like, with Fisher Information Loss, we might want to use that in a scenario where we’re less worried about someone confirming some specific data point from the training dataset, and it’s more about using the model weights to reconstruct.
Laurens: 00:51:23
It’s essentially about protecting against a different concern. And the reason that is important is because in statistical privacy there’s always a tradeoff between privacy, sort of how much privacy you are able to guarantee, and what is the utility, so how much of a price do you pay in terms of the correctness [inaudible 00:51:47] of the app, right? Because in practice, how you guarantee the privacy is you add random noise in a sort of very specific way. And so, like, you could do this, for example, when you make a prediction, right? So, let’s say you have an image recognition model, you make a prediction on, you know, sort of like, is this a dog or a cat? And now you can ask the question, how much can I learn about the training data from this prediction, right? And I can reduce the amount that you can learn by adding noise to my prediction, right?
00:52:19
But now my prediction gets worse, obviously, right? So, I lose some utility, right? And so there’s a fundamental trade-off there, right? And that’s why it matters what your definition of privacy is. Because what Fisher Information Loss will be able to give you is a better trade-off of privacy versus utility, if you feel that the Fisher Information Loss definition of privacy, so this data reconstruction definition, is sufficient for you, right? Like it’s sufficient to feel good about the results that you’re sharing with the world. 
Jon: 00:53:00
Gotcha, gotcha. Yeah. So, it’s about these trade-offs, and how much noise you add, and, yeah, the impact that it has on your ability to accurately reconstruct results. In addition to privacy being a really important issue across machine learning, and I guess for privacy, it’s not just machine learning, but computing in general, another big machine learning issue that people confront a lot is this concern about adversarial examples. But I understand that you don’t think it’s nearly as big of an issue as some other related issues. 
Laurens: 00:53:44
I think adversarial examples are real, right? Like they exist. I worked on them myself, but then also got a little bit disillusioned with working on them. Like, there’s a lot of work out there, but to me, it started feeling a little bit artificial, and sort of the main- 
Jon: 00:54:04
Maybe we should also quickly, in case there are listeners that don’t understand what it is, we should quickly define it. 
Laurens: 00:54:10
Sure. Yes. So, an adversarial example is basically this idea of, like, let’s say you have an image recognition network and you give it a picture of a dog and it predicts dog. And what people realized is that if you perturb the pixels in the image by an imperceptible amount, by really small amounts, but you do it in a clever way, you can change the prediction of the network into kind of whatever you want. And so, like, you still see a picture of a dog, and you have this network that you think is working really well, but suddenly it predicts cat or guitar or, you know, whatever, right? And so adversarially, you can change the inputs to change the output of the network. And there’s a lot of work that people have been doing in designing these kinds of attacks, as well as trying to design defenses against these attacks. And it seems like the attackers are kind of winning in this sort of back and forth. 
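The "perturb in a clever way" idea can be shown on the simplest possible classifier. This toy attacks a hand-made linear model (the weights, "pixels", and labels are all invented), nudging each input against the sign of its weight, the same signed-perturbation idea as the fast gradient sign method used against deep networks. In a real image with millions of pixels, each nudge can be far smaller and still flip the label, because the tiny pushes all add up in the same direction.

```python
# Toy linear "classifier": score > 0 means "dog", score < 0 means "cat".
w = [0.4, -0.3, 0.2, -0.1]   # classifier weights (made-up)
x = [0.5, 0.4, 0.6, 0.7]     # the original four-"pixel" image (made-up)

score = sum(wi * xi for wi, xi in zip(w, x))
print(score > 0)             # True -> classified "dog"

# Nudge every pixel by eps against the sign of its weight: each step is
# small, but they all push the score in the same direction.
eps = 0.2
x_adv = [xi - eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))
print(adv_score > 0)         # False -> now classified "cat"
```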
00:55:13
And I worked on this myself as well, right? So, like, you know, I got excited about this as well, because it seems like it tells us something about, you know, sort of the limitations of some of the models that we’re developing. But I kind of stopped working on it when I realized that, you know, there are much bigger, more natural limitations of our current systems. In particular, in 2019, we did a study where we looked at, “Hey, if we take an image recognition model and we use it in different parts of the world, does it work just as well?” And you may wonder, well, why does it matter where I use my image recognition network? And the reason is that things look different in different parts of the world, right? So, like, if you’ve ever been to an Indian wedding, it’ll look very different from a Western wedding. It’ll be very colorful, and there won’t be a white dress, right? 
00:56:08
And so we did a study: if we try to recognize these kinds of things in different countries across the world, do we get the same recognition accuracy? Right. And what we found is that these models typically work better in Western countries, so in the US and Europe, than they do in Africa or Southeast Asia or Latin America. Right. And so what that suggests is that these models haven’t captured sort of the full, you know, natural variation in the world. Right. And so I didn’t have to be adversarial to break the systems that we had at the time. I just had to fly to another country. And for me, that was a little bit like, okay, you know, in the grand scheme of things, this just seems like a more important problem to work on, and it motivated a lot of work inside Meta to address these issues in our image recognition and other systems.
Jon: 00:57:14
Yeah. So, these adversarial examples, where some potentially bad actor is deliberately trying to confuse your machine learning system, that’s probably, in the real world, something that we confront far less often than a machine learning algorithm just being irrelevant, or being completely off base.
Laurens: 00:57:34
Exactly. 
Jon: 00:57:35
Yeah. Yeah, yeah, yeah. So, yeah. So, that’s a big point to make. So, yeah. Are there any other big kinds of issues, privacy, robustness, other big ethical issues, that you’re tackling or that you think we need to tackle as a machine learning community? 
Laurens: 00:57:52
I think these are big ones, yeah. Like, I’m sure there are many, many issues that we need to tackle. And I think that’s why at Meta AI, we’ve always taken an open approach to research, right? Like, that is the reason why we’re open-sourcing models like Llama 2, so that the community out there can test those models, find the problems in them, and work with us together to solve some of these problems.
Jon: 00:58:26
Nice. Great explanation, great summary on that topic area. Moving on to what you were doing before Meta, you had a storied career even before then, so eight and a half years at Meta. But you were an academic researcher at Delft University of Technology, UC San Diego, Tilburg University, and last but not least, the University of Toronto, my hometown, where you worked with the legendary professor Geoff Hinton. And while you were at Toronto, you developed the revolutionary, truly, dimensionality reduction algorithm t-SNE, t-distributed Stochastic Neighbor Embedding. And so this is an algorithm that I use all the time, my data science team uses all the time, that I teach about all the time.
00:59:15
Because it’s absolutely fabulous when we’re working with a lot of kinds of machine learning models. But for me in particular, it’s often natural language data. We were talking earlier, when Llama 2 is trained, we have this pre-training step where we take basically all the language on the internet, clean it up, and then convert all of these tokens, typically we’re working with these kinds of sub-word tokens, into a high-dimensional space. And exactly how many dimensions are in that space is a hyperparameter of the training that you’re doing. But you might have a hundred or a thousand of these dimensions. And so every word, every sub-word in this case, every token gets placed in a location in this space. And as soon as we’re going over three dimensions, it’s impossible for our mind to visualize what the relationships are like between these sub-words.
01:00:25
And it can be interesting, it can be very interesting indeed, and kind of give us a sense of how well our model is trained, for example, if we could see what words are nearby a given sub-word, because it’s a property of these vector spaces that sub-words that are more similar in meaning are closer together. And even, I imagine, if we have the same kind of vector space in, you know, an image recognition algorithm, then you can kind of traverse the space, and locations in the space nearby each other are probably going to have a similar-looking kind of person or a similar-looking kind of dog. And you can move along a particular axis in this many-dimensional space. So, let’s say we have a thousand dimensions, we could move along a particular axis in that thousand-dimensional space.
01:01:17
And it might happen to correspond to, I don’t know, we’re creating these spaces using resumes and job descriptions. So, for example, one of these dimensions could be an axis of whether somebody’s hands-on or not, managerial or not. It could be whether, you know, they’re more a backend developer versus a frontend developer. In the visual world, it could be something like age. So, you could have a similar-looking kind of face, but as you move along a particular axis in the space, the face appears to age. So, these dimensions can have real meaning, but there are far too many of them. If we have a thousand dimensions, we can’t possibly wrap our heads around, you know, visually what those look like.
01:02:11
And so t-SNE, your algorithm, allows us to compress that high-dimensional space down to a much lower-dimensional space. So, in my case, that’s typically two or at most three dimensions, so that we can visually explore and get a sense of how, in my case, often the language model has converted sub-word tokens into its representation of meaning. So, thank you so much for creating this. And really, the thing that makes it so powerful is that despite taking it from, say, a thousand dimensions down to two, or a thousand dimensions down to three, it preserves as much as it can the distances between these sub-word tokens, which is a mind-bending thing to me. I can’t really, I mean, I suppose there’s no way to visualize how that works, but- 
Laurens: 01:03:06
Yeah, and it doesn’t preserve everything, right? Like it’s sort of fundamentally impossible, right, if your points really fill up a thousand-dimensional space, to preserve all the structure that is in that high-dimensional data. But it does a really good job at capturing the key parts of it. And much better than prior techniques, like things like Principal Component Analysis, that people used to use for this purpose. 
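What "capturing the key parts" means can be made concrete with a tiny hand-rolled version of t-SNE's objective. This sketch (three invented points, unit bandwidths, no perplexity tuning or optimization loop) just scores two candidate low-dimensional layouts by the KL divergence between neighbour-probability distributions, which is the quantity t-SNE actually minimizes; in practice you would call a library implementation rather than write this.

```python
import math

def neighbour_probs(points, kernel):
    """Turn pairwise squared distances into a normalized similarity distribution."""
    n = len(points)
    sims = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                sims[(i, j)] = kernel(d2)
    total = sum(sims.values())
    return {pair: s / total for pair, s in sims.items()}

gaussian = lambda d2: math.exp(-d2)       # kernel used in the original space
student_t = lambda d2: 1.0 / (1.0 + d2)   # heavy-tailed kernel used in the map

# Three high-dimensional points: two close together, one far away.
high_d = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
P = neighbour_probs(high_d, gaussian)

def kl_divergence(P, Q):
    """Mismatch between the original and embedded neighbour distributions."""
    return sum(p * math.log(p / Q[pair]) for pair, p in P.items())

# A 1-D layout that keeps the close pair close scores far better than one
# that splits it up -- t-SNE's optimizer searches for low-scoring layouts.
good = [(0.0,), (0.1,), (10.0,)]
bad  = [(0.0,), (10.0,), (0.1,)]
print(kl_divergence(P, neighbour_probs(good, student_t)) <
      kl_divergence(P, neighbour_probs(bad, student_t)))   # True
```

The heavy-tailed Student-t kernel in the low-dimensional map is the "t" in t-SNE: it lets far-apart points sit even further apart in the map without paying much penalty, which is part of why the crowding you'd get from a plain Gaussian embedding is avoided.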
Jon: 01:03:39
Yeah. And I’ll remember to include in the show notes, I’ve got a Jupyter Notebook that people can use to implement t-SNE, so you can get a really simple sense. Although, I mean, it’s not very hard, it’s typically a one-liner, you know, there are lots of implementations of it, and then you just specify how many dimensions you want the space to be compressed down to. But I do have a Jupyter Notebook that I will put in the show notes for folks. And just to give a sense of the impact of this, how many times, Laurens, do you think that your t-SNE paper has been cited? 
Laurens: 01:04:19
I don’t know. I mean, tens of thousands of times. 
Jon: 01:04:21
Yeah. Over 35,000 times. That’s wild. 
Laurens: 01:04:25
And what’s interesting is that most of those papers, I don’t understand. Because, you know, sort of the problem that you described, right, you run into it as a data scientist, but you run into it in any other field of research as well, right? Like, very often you find yourself with very high-dimensional data, right? It could be gene expression data, if you work in the biological domain. That’s actually, I think, probably one of the biggest drivers of these citations: the combination of, you know, sequencing of that type of data becoming really cheap. And then what you have is these like 20,000-dimensional vectors, right? Where each number that you have corresponds to a gene, right, and sort of how expressed that gene was in the sample that you were measuring.
01:05:23
And you kind of want to know, what’s going on in this data, right? That’s the question that you want to answer. And then something like t-SNE is kind of a go-to tool. It’s like, okay, well, let’s make one of these visualizations and at least we’ll get a first sense. And often it’ll give you interesting information right off the bat. So, I think often what happens in those kinds of experiments is that you’ll find that you recorded data on two days and you’ll see two clusters, and one is day one, and the other one is day two. And it points at something in your experimental setup being, you know, different between the two days, right?
01:06:03
And so then it tells you, oh, okay, the first thing that I’ll need to do is figure out how to correct for this, right? Is there some kind of normalization I need to do in order to change this? Or do we need to redo the experiment, right? Worst-case scenario. So, I think that’s where a lot of those use cases are. And so with most of those papers, I can look at the titles and, basically, I understand half the words.
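A toy illustration of the kind of normalization Laurens describes, using NumPy. The data, the offset, and the per-batch mean-centering are all purely illustrative, a minimal sketch rather than a real batch-correction pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated measurements from two days; day two has a systematic
# offset, the kind of batch effect a t-SNE plot would reveal as
# two separate clusters.
day1 = rng.normal(loc=0.0, size=(50, 5))
day2 = rng.normal(loc=3.0, size=(50, 5))

# One simple correction: center each batch on its own mean, so the
# day-to-day shift disappears before any downstream analysis.
day1_corrected = day1 - day1.mean(axis=0)
day2_corrected = day2 - day2.mean(axis=0)

# After centering, the two batches are no longer separated by the offset.
gap_before = abs(day1.mean() - day2.mean())
gap_after = abs(day1_corrected.mean() - day2_corrected.mean())
print(gap_before > 1.0, gap_after < 1e-9)
```

Real gene expression pipelines use more careful batch-correction methods, but the principle, identify and remove a systematic between-run difference before analysis, is the same.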
Jon: 01:06:31
Yeah. Yeah. I mean, research is like that in general. There are so many different things that people get deep in the weeds on. But yeah, this exploratory data analysis tool, I pretty much only think about it for natural language data. But as you’re pointing out, tons of different areas, yeah, allowing you to explore your data, explore these vector representations, whatever the field is. Very cool. So, thank you. And yeah, in general, you’ve had a huge number of citations. There’s this metric called the H-index, and you have an H-index of 53, which is a very high number. You, before we started recording, said it’s just because you’ve been publishing for so long, it’s like a- 
Laurens: 01:07:13
It just means I’m old. That’s what it means. 
Jon: 01:07:15
Well, it would correlate with age, yes. But other factors as well. You know, I think 53 is huge. And for our listeners who aren’t aware of this metric, it means that Laurens has at least 53 papers that have each been cited 53 times or more. It’s tricky to have a high number because it’s hard to have a large number of papers all with a large impact. So, yeah. Amazing. And so, yeah, I don’t know, do you have advice for people who would also like to make a big impact? You know, they’re getting involved in data science, they might like to be an AI researcher, they’d like to make a big real-world impact with AI in their careers? What kinds of things should they be doing to try to emulate the kind of success you’ve had? 
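Jon’s definition of the H-index can be made concrete with a small function. This is just a sketch for illustration; the function name and example citation counts are made up:

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations."""
    # Sort citation counts from highest to lowest, then walk down:
    # the i-th paper (1-indexed) keeps h growing while it still
    # has at least i citations.
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# A researcher with papers cited 10, 8, 5, 4, and 3 times:
# 4 papers have at least 4 citations, but not 5 papers with at least 5.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

By this definition, an H-index of 53 requires 53 separate papers with 53+ citations each, which is why it is hard to inflate with one or two hit papers.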
Laurens: 01:08:11
Yeah, I get that question a lot, and honestly, I find it really difficult to answer. The first thing I tell people is, you know, whoever you ask this question to, don’t trust the answer. And the reason I’m saying that is because the time when I got into AI research was so different from today that whatever recipe I used may not be the right recipe today, right? So, just for context, the first time I went to the NeurIPS Conference, which is one of the main scientific AI conferences, it was in Vancouver, and there were maybe 700 or 800 people at that conference. At the end of the conference, you would have talked to everyone who was there, right? Whereas right now, if you go to NeurIPS, it’ll be in a massive conference center, it’ll be 10,000, 15,000 people. And so the environment has completely changed.
01:09:10
But I do think there are some things that carry over, always, right? Which is, in your career you always have to take some risks at certain times. You cannot play it safe and also deliver big impact in the process. And so, one question that I like to ask people about the projects that they’re doing is: what if you succeed, right? The thing that you’re working on right now, explain to me, what is the best-case scenario? Everything pans out, everything works great, what impact does it have? And if you don’t have a good answer to that, then you’re not working on the right projects, because then, you know, you’re upper-bounding the impact that you can have, right?
01:10:08
And so I think for all the projects that we talked about in the episode so far, the answer to the question, what if you succeed, is: it changes everything, right? It changes how we do drug design, for proteins. Or it changes our ability to build AI systems that do reasoning and planning and conversing in natural language, for Diplomacy. Or it changes the way in which we can make deep learning private, for a project like CrypTen. And so you always need to have an answer to that question: what if I succeed? 
Jon: 01:10:51
Very cool. I love that way of framing the problems that you tackle, as well as, I imagine, what you tell the people you manage when they ask for career advice. If you were to try to extrapolate forward, and obviously this is a very tricky thing to do with how quickly everything’s moving in the space, what would you like to see, what’s the kind of upper bound on what we can achieve with AI in our lifetimes, over the coming decades? Do you ever, you know, think about how the future might unfold and what the benefits could be to humans? 
Laurens: 01:11:31
I don’t think about it much, to be honest. And the reason is that it’s just really hard to answer those questions, right? At times it can feel like you’re in a time of exponential progress, which is a little bit how AI feels right now. It definitely feels exponential, but you always might be on a sigmoid, right?
01:11:56
Like, the first half of a sigmoid looks like an exponential, right? And so you never quite know when you’ll hit a fundamental boundary, a fundamental limit, in terms of your ability to make progress. So, I find it really hard to predict, but there are definitely things that I’m really excited about. In particular, in the large language modeling space, one of the big things that stands out for me is that I think it will change the way in which we build AI systems, or even systems more generally. 
01:12:37
So, one of my favorite papers maybe this year was a paper out of the Allen Institute called VisProg, where basically what they do is they take a library of different models, different components, for visual recognition in this case, and they basically use a language model to figure out, based on a natural language description of the task that you want to solve, how to piece together the right components, right? And so it’s kind of like you’re programming these systems in natural language from these really high-capability components. And to me that seems like a really promising paradigm in terms of how you’d really want to build systems, right? 
01:13:29
If your systems are going to be more complex, then you cannot always be like, okay, we need to train this whole thing end to end. You need to think in a more compositional and modular way, in the same way that we’ve always done in software engineering, and those kinds of ideas need to come into the development of AI systems as well. And I think large language models give us a path to doing that in a way that’s really intuitive. 
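The compositional idea Laurens describes can be sketched as a toy in plain Python. To be clear, this is not VisProg’s actual API; every name below is illustrative, and in a real system a language model would emit the program from the natural language task description:

```python
# A small registry of "components", standing in for pretrained models
# (detector, classifier, captioner, ...) in a real system.
components = {
    "strip": str.strip,
    "lowercase": str.lower,
    "exclaim": lambda s: s + "!",
}

def run_program(program, value):
    """Execute a 'program': an ordered list of component names.

    In a VisProg-style system, a language model would produce this
    list from a natural language description of the task; here we
    hand-write it to show the compositional execution step.
    """
    for step in program:
        value = components[step](value)
    return value

# Pretend an LLM translated "clean up this text and make it excited"
# into the following plan:
plan = ["strip", "lowercase", "exclaim"]
print(run_program(plan, "  Hello World  "))  # hello world!
```

The point of the sketch is the design choice: capable components stay modular and reusable, and the natural-language-driven planner only decides how to wire them together, rather than everything being trained end to end.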
Jon: 01:13:57
Yeah. Very cool. VisProg, was that- 
Laurens: 01:13:59
VisProg, Visual Programming.
Jon: 01:14:01
VisProg, yeah, yeah. I was like, that sounds a lot like Facebook. And I was thinking maybe like an interface between programming languages, VisProg. Very cool. We’ll be sure to include a link in the show notes to that. And, I mean, we’re kind of tying back to Llama 2 again now, but that ability for a big open-source LLM like Llama 2 to provide a natural language interface between lots of different components lends itself nicely to this. Something we were talking about with Professor Joey Gonzalez from Berkeley in episode 707 was his Gorilla project, which took the original Llama, and I suspect they’re now working on the same kind of thing with Llama 2, and they allow it to call APIs. So, like the ChatGPT plugins kind of analog in open-source. 
01:14:56
And this kind of thing is super, super powerful, because all of a sudden, any kind of programming interface, whether it’s a model on the other end or some other kind of service, like being able to do math in Mathematica, or being able to book a trip in Kayak or whatever, gets this natural language interface. You know, there certainly are things that happen in our space that seem exponential and we end up discovering that we’re on a sigmoid curve and it flattens out. But I think there are still a lot of legs in these kinds of natural language applications. There are tens of thousands of startups creating them, where people can now intuitively use the kind of language that they grew up with, and call on machines to do work for them seamlessly. 
Laurens: 01:15:52
Yeah. It’s a really exciting time to be in AI. 
Jon: 01:15:54
Yeah, for sure. And then I guess, yeah, even Llama 2 is probably also a really great resource for asking for AI career advice as well. Going back to that-
Laurens: 01:16:03
I guess so. I haven’t tried it. [crosstalk 01:16:05] I should. I imagine it could do a great job. 
Jon: 01:16:06
Curious what it tells you. Yeah. You could have a lot of back and forth with it on that. All right, this has been an amazing episode, Laurens. Thank you so much for coming and filming it with me. Before I let my guests go, Laurens, I ask them for a book recommendation. Do you have anything for us?
Laurens: 01:16:23
A book recommendation? A book that had an impact on me, and that it surprises even me that I’m recommending, is a book called Reframing Organizations by Bolman and Deal. It’s a book about how organizations work. There are many of those types of management books and, even though I’m in a manager role, I’ve always stayed away from them, because they’re often of the type, oh, you do these four things and then your organization will run amazingly, right? And in practice, I never quite found that to be the case. There’s a lot more to it. And this was the first book that I found to be actually useful in terms of looking at an organization in different ways, being able to reframe problems in different ways, and being more effective in solving those problems.
01:17:28
So, if there are any people out there who are managing teams, or who have aspirations to manage teams, this is the one book that I would recommend. And then, you know, don’t read any of the others. 
Jon: 01:17:41
Nice. Very cool. That’s a great recommendation probably even for me. And very last thing, Laurens, what’s the best way to follow you? Do you use social media a lot?
Laurens: 01:17:51
Not a whole lot to be honest, which is a little bit surprising, but I have been on Threads. You know, that’s the new cool thing. So, follow me on Threads. My handle is @lvdmaaten. I think that’s where I’ll be most active. 
Jon: 01:18:07
Nice. That’s great. You know, we ask that at the end of every show, and I think you’re the first guest that has given that answer, but it-
Laurens: 01:18:14
Sign of the times.
Jon: 01:18:15
Yeah, but I mean, it took over as the fastest-growing consumer application by far. We were blown away by how quickly ChatGPT spread around, and Threads blew it out of the water. So, very cool innovation there. Laurens, thank you so much for coming on. I have no doubt that the audience enjoyed this a lot. I certainly learned a lot from our conversation. Thank you so much. 
Laurens: 01:18:35
Thanks so much for having me.
Jon: 01:18:41
Yes, yes, yes, another fascinating conversation for me. I hope you enjoyed it too. In today’s episode, Laurens filled us in on Evolutionary Scale Modeling, which uses transformer-based language models to design new biological protein structures. He talked about fighting climate change by building an AI model that can efficiently simulate catalyst reactions to identify compounds that could reduce pollutants in the air. He talked about the CrypTen privacy-preserving machine learning framework, which has PyTorch syntax, but all of the computations are seamlessly encrypted under the hood. He filled us in on web-scale weakly-supervised image recognition, using billions of images to create what was at the time a state-of-the-art machine vision system. He talked about the simulation of materials for new augmented reality wearables. And he talked about his t-SNE approach, how it compresses high-dimensional vectors down, which then allows visualization of, for example, natural language token similarity, though it’s also widely used with biological data.
01:19:42
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Laurens’ social media profiles, as well as my own, at SuperDataScience.com/709. Thanks to my colleagues at Nebula for supporting me while I create content like this Super Data Science episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the Super Data Science team for producing another exceptional episode for us today. You can support this show by checking out our sponsors’ links, by sharing, by reviewing, by subscribing, but most of all, just by continuing to tune in.
01:20:19
All right. I’m so grateful to have you listening and I hope I can continue to make episodes you love for years and years to come. Until next time, my friend, keep on rocking out there and I’m looking forward to enjoying another round of the Super Data Science podcast with you very soon. 