Bioengineering and Generative AI converge under the visionary leadership of Dr. Pierre Salvy at Cambrium GmbH, propelling material science into uncharted territories. He sits down with Jon Krohn live at Merantix A.I. Campus in Berlin to discuss how he’s transforming material design, exemplified by his swift development of NovaColl, a vegan collagen crafted within two years.
About Pierre Salvy
Pierre Salvy, PhD, is Head of Engineering at Cambrium GmbH. There, he coordinates the vision and the team on everything related to automation, computational biology, and machine learning, to program the sustainable materials of the future. Cambrium uses generative AI to design and develop the sustainable performance biomaterials of tomorrow, and biotechnology to manufacture them at scale.
Overview
In a dynamic discussion at the Merantix A.I. Campus in Berlin, Jon engages with Dr. Pierre Salvy, Head of Engineering at Cambrium GmbH, who is leading a groundbreaking revolution in material design. Dr. Salvy’s fusion of generative A.I. and bioengineering, which he calls the “double deep tech sandwich,” underpins Cambrium’s creation of biological proteins for a broad range of applications.
The conversation spotlights Cambrium’s data-driven R&D prowess, exemplified by NovaColl, a vegan collagen for cosmetics developed in just two years using genetic engineering. Dr. Salvy underscores the strategic integration of data science to mitigate risks in DNA manipulation, leveraging machine learning for swift and successful product development. They envision Cambrium evolving into a company pioneering materials to supplant plastics, framing protein construction as akin to programming and unveiling a protein programming language, drawing intriguing parallels between protein and natural language.
This episode delves into the nexus of material science, biology, and AI, showcasing the transformative potential of advances in natural language processing (NLP) on protein engineering. Dr. Salvy also sheds light on the challenges faced by cutting-edge science startups, particularly in pitching disruptive concepts to investors, offering a candid glimpse into the realities encountered at the frontier of innovation.
Did you enjoy the podcast?
- In blending A.I. and bioengineering for material design, what ethical considerations arise regarding the manipulation and creation of biological substances?
Podcast Transcript
Jon Krohn: 00:02
This is episode number 738 with Dr. Pierre Salvy, head of engineering at Cambrium.
00:19
Welcome back to the Super Data Science Podcast. Today’s episode was filmed live in-person at the Merantix AI campus in Berlin with Dr. Pierre Salvy. He’s an absolutely brilliant mind who’s revolutionizing material design by using generative AI to create biological proteins that are used as materials for a broad range of purposes. Crazy, right? He calls this the double deep tech sandwich because it blends cutting-edge AI with cutting-edge bioengineering.
00:46
Pierre has been at Cambrium for three years, initially as head of computational biology and then head of engineering for the past two years, growing the team from two to seven to bridge the gap between wet lab biology, data science, and scientific computing. He holds a PhD in biotechnology from EPFL in Switzerland, and a master’s in math, physics and Engineering Science from Mines in Paris. Today’s episode touches on technical machine learning concepts here and there, but should largely be accessible to anyone. In it, Pierre details how data-driven R&D allowed Cambrium to go from nothing to tons of physical product sales inside two years, and how his team leverages large language models, LLMs, to be the biological protein analog of a ChatGPT-style essay generator. All right, let’s jump right into our conversation.
01:28
Pierre, welcome to the Super Data Science Podcast. It’s nice to have you here in person with me in Berlin at the Merantix AI campus. How are you doing today?
Pierre Salvy: 01:38
I’m doing pretty good. I actually had a nice sunny walk in the park to come here, so.
Jon Krohn: 01:41
A sunny walk in the park.
Pierre Salvy: 01:43
Yeah, it was five minutes during the day, and I was here under five minutes, but you have me in my best mood.
Jon Krohn: 01:48
It’s been gray the whole time I’ve been here, in fact, my whole Euro tour so far. I’m in week three now, and it’s been gray everywhere I go in Europe in November, which I guess shouldn’t be that surprising. So you lead data science R&D at Cambrium?
Pierre Salvy: 02:06
Engineering.
Jon Krohn: 02:06
Engineering.
Pierre Salvy: 02:07
Which encompasses indeed data science, machine learning, robotics, and cloud operations.
Jon Krohn: 02:12
Nice. And so what does Cambrium do?
Pierre Salvy: 02:15
We had several ways to capture that in a single tagline that everybody would understand, but I think the one that works right now is we use generative AI to create biomaterials for tomorrow. And there’s a lot to unpack in there, but basically we use machine learning and cool science to make materials that are sustainable to displace unsustainable materials that we have around us using technology and a lot of deep tech stuff that I’d love to talk about.
Jon Krohn: 02:40
Nice. Yeah, let’s get into that. You were talking about a double deep tech sandwich.
Pierre Salvy: 02:45
Yeah, I think that’s correct. So really the first attempt that we had at describing our technical stack was of course a disaster, but it’s like we use generative AI to create proteins, then we transform these proteins into DNA, we order the DNA, and then we translate this DNA into strains with genetic engineering, including molecular editing tools, and then we produce that in big vats, purify that, and then sell that as a chemical, right? That’s a lot of complicated terms. So yeah, generative AI for biomaterials.
Jon Krohn: 03:14
Yeah, I mean, that’s wild. So let’s dig into that more step-by-step. So you use data-driven processes throughout all of your R&D. Typically when we’re thinking about AI, the end product is a model, but in your case that is just an intermediate step to create some new product. And so you mentioned to me before we started recording that in two years, Cambrium went from zero, nothing, founding team to tons of products sold. So can you give us an example of the kinds of products that you sell and then how you got from zero to a hundred miles an hour, 60 miles an hour, or whatever the expression is in two years?
Pierre Salvy: 04:01
I think it’s worth noting that at least in Europe, this is the fastest ever recorded in our industry, right?
Jon Krohn: 04:05
Oh, really?
Pierre Salvy: 04:06
So that’s pretty exciting. The reason why this worked out is because we used AI as a de-risking tool for our biology. Oh, yeah, and the product. So we have one product right now, it’s called NovaColl. It’s collagen, it’s vegan collagen for your skin that is a human match and basically has a really good effect to prevent wrinkles and anything that aging and stress can do to your skin.
Jon Krohn: 04:35
So it’s like a topical thing.
Pierre Salvy: 04:35
It’s a topical application. Basically it’s what skincare vendors want to sell, but that’s what almost nobody has. And how can we make vegan collagen? Because there’s a general problem about collagen, which is it’s inherently an animal protein. So the previous way to make collagen was to dispose of dead animals and boil them and then extract the collagen, and then people don’t really want to put dead animals on their skin. So if you can come up with a value proposition of we have animal-free, cruelty-free collagen that you can use, then people are willing to buy this. So that’s basically the short story of how we chose this product. How do we actually build this, right? So it’s a protein, so it’s a product of a biological process. And biological processes, we’ve been learning to tame them for maybe the last 50 years to produce a bunch of things, including insulin, for instance, as a therapy against type 1 diabetes.
05:35
So we use similar technologies, genetic engineering, which is basically you take a microbe, it’s GMO. You take a microbe, you change the DNA, so it produces what you want. And I like to use this image because I think it works well. It’s also technically accurate. It’s like when you make beer, you have a microbe that eats sugar, transforms the sugar into ethanol and then spits that out. We have a very similar microbe that basically eats a form of sugar, and inside the microbe, instead of having the instructions to make beer, it has the instructions to make collagen. It spits out the collagen, we scoop it up and sell it. Roughly, that’s what happens.
06:09
In practice, biology is a very messy science. A lot of things can go wrong, and when they go wrong, it costs a lot of money because it’s a lot of expensive stuff. Like DNA in itself is expensive. Handling DNA is expensive. The robots to handle the DNA are expensive. So there’s a lot of risks, technological risks, that are associated with this type of project. And that’s where data science, machine learning and the whole thing comes into play. You can use it to maximize the chances that your experiments are going to work out.
06:38
And it’s actually with this philosophy that we built Cambrium and our data-driven R&D. We built the lab first on the computer. We actually did the full lab simulations of our processes to understand what the throughput was, the error rate we could expect, and so on. And then we actually ordered the robots and so on. And what the data science allows us to do is to make sure that whatever experiment we run has the maximum chance of working out. And that’s how we managed to be so fast. We basically one-shot every single feasibility gate for our product to go from idea to an actual thing we could sell on the market.
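To make the idea of building the lab "first on the computer" more concrete, here is a minimal Monte Carlo sketch of how throughput and error rate could be estimated before any robot is ordered. It is not Cambrium's simulation: the function `simulate_campaign`, the three workflow steps, their success probabilities, and the 96-construct batch size are all hypothetical values chosen for illustration.

```python
import random


def simulate_campaign(n_constructs, step_success_probs, rng):
    """Count constructs that pass every step of a hypothetical lab workflow.

    step_success_probs holds one assumed success probability per step;
    these are illustrative numbers, not measured values.
    """
    passed = 0
    for _ in range(n_constructs):
        # A construct counts only if it survives every step in sequence.
        if all(rng.random() < p for p in step_success_probs):
            passed += 1
    return passed


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical steps: DNA synthesis, strain construction, protein expression.
steps = [0.9, 0.8, 0.7]

# Simulate 1,000 batches of 96 constructs each.
runs = [simulate_campaign(96, steps, rng) for _ in range(1000)]
mean_yield = sum(runs) / len(runs)

print(f"average working constructs per batch of 96: {mean_yield:.1f}")
print(f"worst batch: {min(runs)}, best batch: {max(runs)}")
```

Changing one step's probability and rerunning shows which step is the bottleneck, which is the kind of question a simulation like this can answer before the equipment is bought.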
Jon Krohn: 07:16
That’s wild, man. That is so cool. It seems like such a complex undertaking. It’s wild to me that you were able to kind of think through all of these steps eventually that you need to have and kind of model them. So did you or other members on the founding team have experience in this kind of genetic engineering before?
Pierre Salvy: 07:39
So Mitchell and Charlie, the two co-founders have PhDs in this field. I initially trained in the applied mathematics or general engineering in France, and I studied a bunch of things from material science to microeconomics to nuclear physics. And then I actually landed into biotechnology and I was like, oh, that’s exciting because it’s a very unsolved problem. And then I actually worked for almost two years on the West Coast with Total and Amyris for making jet fuel using genetic engineering.
Jon Krohn: 08:08
On the US West Coast?
Pierre Salvy: 08:09
Yes. In Emeryville, California. And then I got a PhD in this field because I thought it was exciting. And that’s when I started working at Cambrium as employee number two. They hired me right after my PhD. So it was the three of us with a strong technological background. And then we have Ruben with [inaudible 00:08:29] genius. And then we were joined later by Lucille, the current head of R&D, who basically has a PhD in chemistry and worked in cosmetics and also digitization consulting. And Aya, who is our head of finops, and basically was employee number one in a startup that did personalized skincare. So with this sort of crew of six, we started the whole thing, and that’s where we’re like, okay, we’re going to do this project, this product, and we are going to try to sell it in this market and this is how we’re going to execute.
Jon Krohn: 09:05
Very cool. It’s amazing what you guys are doing to have this double deep tech sandwich. It’s such a cool thing to be doing. So the deep tech of genetic engineering combined with the deep tech of machine learning. Let’s dig a bit more into the machine learning. So you already described how you were able to get from zero to lots of NovaColl developed and sold within two years. So we have some idea kind of generally of the data science there. Can you dig a bit more into the machine learning behind some of these processes? Maybe even in particular you mentioned that generative AI, like large language models can be involved.
Pierre Salvy: 09:47
So I think there are maybe two verticals here. One is the general data cycle in order to make sure that your experiments run smoothly. And the other is what machine learning can do in order to improve your design of your product. So starting with the data-driven R&D, there are things for which we can model as much as we want, but nothing beats actual data. I think a very good example at which we got pretty good is the protein sequence to DNA sequence translation. So skipping all the biology, basically proteins are made out of building blocks. These building blocks correspond to DNA sequences, but it’s not a one-to-one, it’s a one-to-many, right? And depending on how you make your DNA sequence, you might have problems producing it, and you might have problems having the microorganism read it. So there’s a little optimization problem here when you have a protein you want to design to find the right DNA.
Jon Krohn: 10:40
Yeah. So I’ll break this down even a little bit more for the audience so that they get kind of this visualization in their minds of how sequential this is. So DNA, I think a lot of people in their mind, they have this idea of a double helix. And so it’s these two kind of winding strands, and it’s double because one is a copy of the other so that if you have damage, you can repair it hopefully correctly. But basically you can think of, even though there’s two strands, there’s really one message. And so you can think of each chromosome. And so you have 23 chromosomes as a human being?
Pierre Salvy: 11:20
I think it’s, yes, correct. 23 pairs.
Jon Krohn: 11:23
23 pairs of chromosomes. Yeah, that’s right. Yeah, that’s a key detail. So on each one of these chromosomes, it’s one continuous string of what we can think of as letters.
Pierre Salvy: 11:39
Correct.
Jon Krohn: 11:40
And there’s four possible letters in DNA, at least in humans. There are a few edge cases in other species. So you have these four letters that make up the sequence. And then you were describing this one-to-many mapping to proteins where it’s, if I remember correctly, it’s every three DNA letters corresponds to one amino acid in a protein.
Pierre Salvy: 12:06
Right.
Jon Krohn: 12:07
And so proteins which do all of the work in your body. And so all of the various kinds of functionalities, from the way that your eyes work, your brain, your liver, your skin, there’s just different proteins. And these proteins, they are made up of similarly strands, one dimensional strands, like how a DNA is a one dimensional strand. The proteins are made up of one dimensional strands of amino acids. But with amino acids, we have more. So we only had four DNA letters, but with amino acids there’s like 20 something amino acids.
Pierre Salvy: 12:45
Right. Correct.
Jon Krohn: 12:50
So that leads to this one to many mapping because you have these three character sequences of DNA that lead to a single character, one of these amino acids, of the protein structure. So yeah. So I don’t know. I interrupted you.
Pierre Salvy: 13:08
No, you do pass the biology exam. Congratulations. But that’s exactly this. So the proteins we’re making are a strand of a bunch of 20 letters, and the DNA is also a strand of a bunch of four letters. And the mapping of one to the other is not one-to-one. So it is something for which you can learn from the failures of your experiments. I try to order this DNA, either it doesn’t produce or my cell doesn’t accept it. You get the data and then you can actually train the machine learning algorithm saying, okay, this is actually the best way to translate from protein language to DNA language. And to give you an order of magnitude, we increased our success rate up to a hundredfold in making the right type of DNA for the proteins we design.
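The one-to-many mapping discussed above can be sketched in a few lines of Python. The codon lists below are a small excerpt of the standard genetic code covering four amino acids. The example peptide, the `PREFERRED` table, and the function names are hypothetical. Choosing which synonymous codon to use is the part that, as described in the conversation, would be learned from experimental data, and the fixed table here merely stands in for that learned choice.

```python
from math import prod

# Excerpt of the standard genetic code: amino acid (one-letter code)
# mapped to its synonymous DNA codons. A full table covers 20 amino acids.
CODONS = {
    "G": ["GGT", "GGC", "GGA", "GGG"],  # glycine
    "P": ["CCT", "CCC", "CCA", "CCG"],  # proline
    "A": ["GCT", "GCC", "GCA", "GCG"],  # alanine
    "M": ["ATG"],                       # methionine has a single codon
}


def count_encodings(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(len(CODONS[aa]) for aa in protein)


def back_translate(protein, preferred):
    """Build one DNA sequence by picking a single codon per amino acid."""
    return "".join(preferred[aa] for aa in protein)


# Hypothetical codon preferences, standing in for a data-driven choice.
PREFERRED = {"G": "GGT", "P": "CCG", "A": "GCG", "M": "ATG"}

peptide = "MGPAGP"  # made-up six-residue example
print(count_encodings(peptide))            # 1 * 4**5 = 1024
print(back_translate(peptide, PREFERRED))  # ATGGGTCCGGCGGGTCCG
```

Even this six-residue toy peptide has 1,024 valid DNA encodings, which is why picking the right one for a full-length protein becomes an optimization problem.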
Jon Krohn: 13:52
Wow. It’s a big multiple.
Pierre Salvy: 13:53
It’s a big multiple. It also saves us a lot of money because DNA is really expensive. But I think it’s just one flagship example of how you integrate machine learning on your day-to-day R&D operations. Every single step that could be a bottleneck in terms of efficacy or efficiency, then you try to apply tiny little machine learning algorithms to smooth them out. And I think that’s touching to a key question that I often get asked by investors. So what is your machine learning algorithm doing, singular? And you have to do a lot of education to explain like, no, actually, it’s a tool and we apply this tool at many places. So I think that’s one of the cool parts of what we do. Arguably the more hyped part right now is everything related to generative AI.
Jon Krohn: 14:38
Because, sorry, just really quickly, you were probably just about to do this, so I’m sorry that I’m interrupting. So it’s the same thing. It’s these sequences. It’s with the DNA. We had sequences of four possible characters in our DNA sequence. In the protein, we had 20-odd possible characters. And so collagen, for example, to make this a little bit concrete for the audience, the collagen has a specific sequence of amino acids specified by maybe a few different possible DNA sequences. And so in order to get that sequence that specifies the collagen protein just right, you’re using machine learning to be able to optimize, get this 100x multiple, save a lot of money. And the obvious comparison here and why you’re now going to talk about generative AI is that generative AI is also generating sequences.
Pierre Salvy: 15:35
Correct.
Jon Krohn: 15:36
Sequences of characters.
Pierre Salvy: 15:37
Correct. So we’re getting maybe to the double deep tech sandwich part, right? So there’s this whole question of sequence to sequence translation, but now there’s this question of how do you generate the initial sequence? And actually saying we make a collagen is already a complicated statement because there are order of magnitude 10 to the 18 possible collagens that we could choose from, accounting for natural mutations and so on in the ones that we looked at for NovaColl. And so NovaColl in particular, we made before the generative AI was commoditized as it is right now, and we had an intermediate solution. We actually built a protein programming language. It is actually a solution that bridges computer language and natural language. We were like, okay, building a protein is actually a programming problem. We built this, which actually has a compiler and so optimizes the right sequence for your protein.
16:31
So that’s how we made NovaColl. But we’re also working on new materials because we don’t want to be a cosmetic company. We want to be a company that replaces plastics in the future. And so that’s where generative AI comes in. There is actually a natural analogy between the protein language and the natural language. We just said proteins are basically sequences of letters. And much like in the English language, if you just take a bunch of letters and randomly put them together, you’ll hardly ever get a word that makes sense. And even less so an actual essay or a piece of poetry, right?
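For a sense of the scale involved, the short calculation below counts how many sequences can be written with a 20-letter amino-acid alphabet. It is only an illustration of combinatorial growth. The figure of 10 to the 18 mentioned in the conversation refers to candidate collagens and was not necessarily derived this way, and the function name `sequence_space` is made up for this sketch.

```python
AMINO_ACIDS = 20  # size of the standard amino-acid alphabet


def sequence_space(length):
    """Number of distinct amino-acid sequences of the given length."""
    return AMINO_ACIDS ** length


# Find the shortest sequence length whose space exceeds 10**18.
length = 1
while sequence_space(length) <= 10**18:
    length += 1

print(length)                  # 14
print(sequence_space(length))  # 1638400000000000000, about 1.6 x 10**18
```

A chain of only 14 freely chosen residues already gives more than 10 to the 18 possibilities, so exhaustively testing sequences in the lab is out of reach and some form of guided generation is needed.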
Jon Krohn: 17:03
Great analogy.
Pierre Salvy: 17:04
Exactly the same with proteins. You put random amino acids together, you’ll get a spaghetti ball, but you won’t get a functional protein. And so actual transformative advances in NLP, in natural language processing, have directly translated into advances in protein engineering. And so AlphaFold was one such example in 2018 and then 2020. But basically what AlphaFold is is a sort of grammar checker, but it’s not something that will write your own essay. And recently with the LLM boom, there are actually LLM-derived architectures that allow us to write our protein essays for us. And that’s actually some really cool technology we have at Cambrium. We literally have an image generator, like a Midjourney equivalent for proteins. We can specify in mathematical terms what we want the protein to do. And out of noise, the protein will emerge and just be like, Hey, this is a candidate protein you should try in your lab.
Jon Krohn: 18:04
Whoa.
Pierre Salvy: 18:05
And that’s pretty cool.
Jon Krohn: 18:06
That is really cool. And so that also, that’s visual?
Pierre Salvy: 18:10
I mean, in practice it’s a bunch of atoms, but you can transform that into a video. Right? That’s also pretty cool.
Jon Krohn: 18:16
Right. Yeah, that sounds amazing. Super, super cool. It’s amazing what you guys are doing. You really are, I mean, this is what you’re doing is like a Holy Grail in terms of people when they’re thinking about careers in data science or machine learning, making such a tangible impact in the way you are on really cutting edge science that is changing the world for the better. It’s awesome.
Pierre Salvy: 18:49
It is really exciting. There are some unexpected drawbacks. One of them is when you try to pitch that to a venture capitalist, it is so hard because you’re not a B2B SaaS. You are a chemistry company using cutting edge science in order to make impossible products. And that requires, on both sides of the negotiation table, a lot of understanding of all the world. But we are literally at the frontier between material science, biology, AI, and then trying to find people that are able to grasp all of that at the same time. I mean, it takes a lot of talking, and VCs don’t necessarily have much time. So I think that’s maybe one of the unexpected struggles of cool stuff.
Jon Krohn: 19:36
Yeah, it’s interesting. That wouldn’t be something that I’d think of right away, but of course you’re living through that, I guess. And so there’s got to be some VCs out there. I mean, maybe even the ones who say that they specialize in deep tech, they probably still don’t have all that much time. And it would be rare that you would find a kind of deep tech specialist investor that would specialize in this kind of chemistry. And then if they do, maybe in the odd chance that you find one that specializes in the chemistry, they’re not going to have the generative AI part.
Pierre Salvy: 20:05
Correct. And that’s why, I mean, we have to do a really good job at explaining what we do and what are the shortcomings and the technology landscape of what we’re doing. But I do think there is a more hidden problem, which is a lot of the VCs that brand themselves as deep tech are VCs that have invested in deep tech that is a software as a service solution with a deep tech backend. So the business model is very well-characterized, but then you come in and you’re like, no, we are actually a chemical company. And all of a sudden it looks different, right? Because we buy robots, we need plants, we need scale-up facilities with literal cubic meters of capacity. So the CapEx versus OpEx distribution is so different. And so you often have this sort of aha moment. It’s like, oh wait, this is actually not the company I thought I was interviewing.
20:56
And I think it’s been going better, right? Over the last years, there have been a lot of successes in our field that have educated a little bit every player on the ground. But it is a problem because you say high CapEx and lower agility on scaling the OpEx. And then people are just like, maybe I’m less keen on investing, which is a problem because at the end of the day, if you do want to change the world, you can’t just do that with apps. At one point, you have to get your hands dirty and just build things, and that’s what we’re trying to do.
Jon Krohn: 21:29
Yeah, beautifully said. And so yeah, hopefully we’ve got some VCs out there listening and realizing the opportunity to make a really big impact here. It’s got to help that you are getting success so early, that you’re seeing these great returns on the data science and machine learning investments that you’re making into this generative chemistry process.
21:52
Very cool, Pierre. Thank you so much for taking the time. And yeah, maybe we can check in again in a couple of years and see how the journey’s coming along.
Pierre Salvy: 21:59
We’d love to. Thanks, Jon.
Jon Krohn: 22:02
Amazing and engaging explainer of concepts, isn’t he? In today’s episode, Pierre covered how data-driven R&D has enabled Cambrium to develop NovaColl, a commercially popular vegan collagen for cosmetic products, inside of two years. And how, while AlphaFold acts as a kind of grammar checker for proteins, Cambrium’s LLMs act as an analog of an essay generator for proteins.
22:24
All right, that’s it for today’s episode. If you enjoyed it, consider supporting this show by sharing, by reviewing or by subscribing, but most importantly, just keep listening. And until next time, keep on rocking it out there. I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.