SDS 643: A.I. for Medicine

Across the world, healthcare systems are overstretched. How might AI and machine learning help alleviate the burden on our hospitals and clinics? On this week’s episode, Jon Krohn talks to Chief Scientist of Biologics AI for Exscientia Charlotte Deane about a potential new era for medicine with the help of machine learning and AI.

About Charlotte Deane

Professor Charlotte Deane is Chief Scientist of Biologics AI for Exscientia, driving the application of artificial intelligence, machine learning, and the design of protein structures in the discovery and development of novel drug candidates. Professor Deane is one of the UK’s most accomplished bioinformaticians. She has held numerous senior roles at the University of Oxford, where she is currently Professor of Structural Bioinformatics and leads the University’s Protein Informatics group. Prior to this, Professor Deane was Deputy Executive Chair of the Engineering and Physical Sciences Research Council at UK Research and Innovation (UKRI). Professor Deane holds a B.A. in Chemistry from the University of Oxford and a Ph.D. in Biochemistry from the University of Cambridge.

Overview

Antibodies are critical to the human body in overcoming viruses. When exposed to a virus, the body reacts by raising antibodies against the invader. A human antibody molecule is a Y-shaped protein that can neutralize a virus by binding to a specific site on it, such as the receptor-binding domain of a spike protein. Viruses have different structures, so the immune system refines its response through a process called somatic hypermutation: antibodies that bind weakly mutate rapidly, and the variants that bind more strongly are kept and produced in greater numbers.

That is, provided we already have those antibodies in our system. Even if we have been exposed to a specific virus in the past, it still isn’t clear exactly how long the antibodies we have raised to combat it actually last in our immune system, and their duration varies from disease to disease.

With these problems in mind, this week’s guest, Charlotte Deane, and her team are working on ways to design antibodies on a computer, replicating what the human body’s natural processes do. The artificial intelligence program AlphaFold can already predict the structures of proteins, and Charlotte’s ambition is to specialize this capability for antibodies. Her team has done so in part by changing how the loss function is weighted in their structure prediction models.

The end goal for Charlotte is to be able to put information about a disease into a computer and for that computer to propose the right sequence for an antibody that could bind to it. Charlotte notes that achieving this will require experiments to be designed with the computational technique in mind, so that medical datasets are generated with algorithms in mind from the start. Currently, the necessary statisticians are brought into the project too late: after the data has already been gathered.

Jon also asks Charlotte about her experiences as Deputy Executive Chair of the Engineering & Physical Sciences Research Council and how they responded to the pandemic in the UK. She gives the example of a project that tested wastewater to detect viruses. The advantage of testing sewage was that researchers could find localized levels of COVID that were not connected to personal information. Charlotte explains that this method can help local clinics prepare in advance for a wave of COVID cases, as the signs of COVID can be detected in waste before patients feel symptoms.

In this episode, you’ll also find Charlotte’s favorite deep learning methods and what she looks for in new recruits!

In this episode you will learn:

What does Biologics AI mean? [03:48]
How to use AI to predict protein structures [07:37]
What antibodies are [14:00]
Personalized Medicine is slow but A.I. can speed it up [24:36]
The future of predicting 4D protein structures [44:30]
Applications of machine learning during the pandemic [53:27]

Podcast Transcript

Jon Krohn:

This is episode number 643 with Dr. Charlotte Deane, professor of structural bioinformatics at Oxford University and chief scientist for biologics AI at Exscientia. Today’s episode is brought to you by Kolena, the testing platform for machine learning.

Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now, let’s make the complex simple.

Welcome back to the SuperDataScience Podcast. We’re incredibly fortunate to have the real life superhero, Professor Charlotte Deane with us on the show today. Charlotte is a global leading expert on using machine learning for designing therapeutic drugs. She’s been faculty at the University of Oxford for over 20 years, where she serves as a Professor of Structural Bioinformatics and heads the 25-person protein informatics lab. She’s also Chief Scientist for Biologics AI at Exscientia, a NASDAQ listed pharmatech company that uses computational approaches to drive drug development in a fraction of the time of traditional drug companies. On top of all that, she was COVID Response Director for UK Research and Innovation, resulting in Queen Elizabeth honoring her as a member of the Most Excellent Order of the British Empire.

Today’s episode should appeal to technical and non-technical folks alike, as it features an absolutely brilliant scientist and communicator, describing how we can use AI to speed the discovery of new molecules that help our body fight off ailments as diverse as viruses and cancer. In this episode, Professor Deane details how your immune system works, what biologics are and why they’re such an important class of drugs, what’s holding back the widespread use of precision medicines that are pinpoint customized to a specific tumor in a specific person, what the celebrated AlphaFold algorithm does exquisitely and where it and all other computational models of protein folding still need to improve, how she used data to so effectively marshal the UK’s scientific response to COVID, and how data and machine learning will transform drug development over the coming years. All right, you ready for this epic episode? Let’s go.

Charlotte Deane, welcome to the SuperDataScience Podcast. I’m so excited to have you here. You are an absolutely brilliant guest. We’ve had so much excitement from the audience that you’re going to be here, hundreds of engagements on social media, tens of thousands of views, and now it’s happening. Charlotte, welcome. Where in the world are you calling in from?

Charlotte Deane:

So, I’m sitting in Oxford, which obviously is my favorite place to be in the whole world.

Jon Krohn:

Especially in the winter, I bet.

Charlotte Deane:

No, it’s [inaudible 00:03:04]. It’s very cold here today.

Jon Krohn:

Oh, yeah. Yeah, some listeners who’ve been listening for a while would know that I spent five years in Oxford doing my PhD, or as we actually properly call them a DPhil, but I seldom mention that, that just generates more confusion. And yeah, on a nice day it is actually paradise. On a wintery day…

Charlotte Deane:

No. There are far, far worse places to be, but it isn’t a warm place to be right now.

Jon Krohn:

Nice, all right. Well, we’ve got lots of deep technical questions for you, so let’s jump straight into those. Let’s start with your recent role that you took on at Exscientia. So, you’re Chief Scientist of Biologics AI, which I understand is a newly created role for you. And so, let’s start with that word biologics. What does that mean, and how does AI relate to it?

Charlotte Deane:

So, maybe step back slightly. Exscientia is a company that’s made its name using AI to do small molecule drug discovery, and everyone’s very comfortable that they know what a drug is. It varies slightly what we mean by the word drug there, but biologics are just another form of drug. But instead of thinking about small chemical molecules, we think about proteins. And in particular, the easiest one to talk about is antibodies, because everybody’s heard of those now because of the pandemic.
So antibodies are natural in your system, but they can also be developed into very, very successful forms of drugs. They’re the most successful class of drugs currently, for treating lots and lots of diseases. So that’s the biologics bit and really, this whole job was created for me to start building off what has been done in AI for small molecules to see how far you can push that, so that you design using AI to make these much larger protein-like molecules, so the antibodies and things like that.

Jon Krohn:

Wow, that sounds super interesting. Yeah, so you’re focused on applying artificial intelligence in general, I guess, if we can define artificial intelligence neatly, machine learning and other kinds of computational statistical techniques to do things like predict protein structures, allowing Exscientia to discover and develop new drug candidates, right?

Charlotte Deane:

Yeah, exactly that. I like the way you said all the different techniques, because people always talk about AI, but what you really mean is using the data we have with all the tools now at our disposal. And that runs from some really quite, I don’t want to say standard statistics, but it is, through to really quite complex AI techniques in deep learning. And the idea is to take that as far through the process as we can. How much can computation speed this up, make it easier, make it less expensive, so that we get better drugs that are more effective, to people faster?

Jon Krohn:

Yeah. This is something that actually, I spent a lot of time thinking about in my PhD. So in my Oxford PhD I had a collaboration at the University of Edinburgh where we were developing machine learning techniques, I should say the Edinburgh people, they were the really smart AI people. And so they were mostly developing the techniques and then I was applying them to real world data. And so we were trying to identify causal pathways in biological organisms. So genetic data are interesting because we’re not aware of a mechanism by which the world around DNA can impact DNA in some systematic way. So that means that we can infer that if there’s a correlation between some genetic pattern and some biological pattern, that the biological pattern isn’t causing the DNA sequence changes.
And so you can use that information to look at multiple correlated biological molecules and using conditional probabilities say, “Oh, it seems like biological molecule X is impacting Y.” As opposed to the other way around because of the way that the genetics impacts it. So I don’t know, this is stretching back 10 years ago. So if you quiz me on anything I’ll probably do terribly. But that kind of area, this idea of using biological information to try to find drug candidates is something that I guess 10 years ago I was thinking about a fair bit. So do you have any interesting case studies? Obviously you can’t go into proprietary things that you’re doing at Exscientia, but yeah.

Charlotte Deane:

Yeah, I should say, obviously I have two jobs still. So I’m CSO at Exscientia, but I’m also a professor in the Department of Statistics at Oxford. And so I can talk some about our work there, but they’re obviously related. So the easiest recent example to talk about is when you mentioned that concept of protein structure prediction. Everybody has heard of AlphaFold, it made quite a lot of noise about being able to predict protein structures. We’ve done a lot of work on if you want to predict the structures of antibodies, you might want a model which is instead of being general for all proteins, you make the model specialist for antibodies. And the reasons you want to do that, there’s the obvious reason, and of course, if you can specialize a model you might be able to make it more accurate.
But it’s also about the types of data you have available to you, so you can change the way you structure the model.

Some of the things that become important in this are things like, we don’t use some of the functions that are used in AlphaFold, so we can go much faster. So it’s much, much more rapid. But the interesting internal machine learning parts of this are thinking about how you use the loss function, because the majority of the antibodies are always the same shape. So if you use a standard loss function, you might not actually collapse to the correct structure because you just get most of it right. And that’s fine if you’re AlphaFold, because you’re mostly getting most proteins right.
Whereas, I’m really interested in the bits which we’re getting wrong currently, because the rest of it I could have predicted anyway, because they’re all the same shape. And the bit which is different is where the binding happens. So you want to refocus the way you use these types of functions to do that kind of thing. So it’s that kind of work. So this is a paper we’ve put out recently if people want to look at it, called ImmuneBuilder. But it’s using those kinds of ideas about how you take the data in, how you use your loss functions, and it actually changes the kinds of results you can get or how you focus your computational work.

Jon Krohn:

So do you start from scratch? If you know some of the structure, does that mean that you can have, I’m probably mixing terminologies here, but in Bayesian statistics we have priors, and so do you have that kind of prior information that you can provide to the model?

Charlotte Deane:

So this is actually interesting. When we built our first versions of this, we built something which is called [inaudible 00:10:14], and effectively we did provide it with, here is the right answer for the majority of the structure and I just want you to predict what are called these complementarity-determining regions, this difficult bit to predict. And that is really fast actually, that’s the fastest prediction software we’ve written. We then went on to test, well, what if we let the model predict the whole thing, because it will find it easy to predict the main part, and we play with these ideas about loss functions, and that’s where we moved on to ImmuneBuilder.
And I still don’t completely understand why, we have some theories of why, but the second version is more accurate, even though we’re asking it ‘a harder question’. Now it might be because it has better control of the overall function, because it’s not set up like a prior here, you actually just literally provide it with the information of what this piece is and it has to fit onto that. And that’s possibly why this happens. But it turns out, even though it’s a harder question to answer and it slows us down in terms of making predictions, you make a bit better predictions if you do the whole thing as one concerted prediction inside these types of methods.

Jon Krohn:

Nice. So then is there an easy way to explain, without having to have a whiteboard or other kinds of visuals for our audio only listeners? Is there a way to intuitively explain how we can set up a loss function to handle this situation that you’re describing, where we know some of the structure and we want to give clues to that part?

Charlotte Deane:

So it’s almost like not giving clues to that part, it is changing the weighting of the loss function. So in some sense you’re going to get that bit right. This is explaining it very poorly I’m afraid, but if you imagine that if you had a loss function that said how good is my overall structure? Then it drops straight down very quickly because the overall structure is really good, yeah. And then most of the changes it makes as it’s trying to learn don’t affect improving the CDRs. So the loss function doesn’t move. So the bit that was predicted badly doesn’t improve as much as you want it to. Whereas if you reweight the loss function, so the set of atoms that you’re going to get right, you actually don’t give it much reward for doing so because even if you barely train, that sounds awful, it will get them right anyway.
So it’s basically balancing out the parts. Another part of the loss function we found, which was good to be able to do, was this idea of actually, you could make the loss function less strict.

And there are various reasons behind this compared to the way AlphaFold was trained. Some of it is because however strict you make the loss function in terms of trying to make the ideal structure, you never actually make something that’s physically completely accurate. So you always have to do a bit of physics on the end. You do a sort of bit of minimization to make all the bonds and angles. So you can waste, it sounds weird, waste less time trying to make it get really great bonds and angles because you know you’re going to run this kind of minimization on the end, so you have a loss function that’s slightly different there as well.
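The reweighting idea described above can be illustrated with a small sketch. This is not the loss used in ImmuneBuilder or AlphaFold; it is a toy weighted mean squared distance over atom positions, and the function name, the `is_cdr` mask, and the weight values are all assumptions made for illustration. It only shows how down-weighting the easy framework atoms lets an error in the CDR atoms dominate the loss.

```python
from math import dist


def weighted_coordinate_loss(predicted, reference, is_cdr,
                             cdr_weight=5.0, framework_weight=1.0):
    """Weighted mean squared distance between predicted and reference atoms.

    predicted, reference: lists of (x, y, z) tuples, one per atom.
    is_cdr: list of booleans marking atoms in the hard-to-predict regions.
    The weights are hypothetical values chosen only for this illustration.
    """
    if not (len(predicted) == len(reference) == len(is_cdr)):
        raise ValueError("all inputs must have one entry per atom")
    weighted_sum = 0.0
    total_weight = 0.0
    for pred_xyz, ref_xyz, in_cdr in zip(predicted, reference, is_cdr):
        weight = cdr_weight if in_cdr else framework_weight
        weighted_sum += weight * dist(pred_xyz, ref_xyz) ** 2
        total_weight += weight
    return weighted_sum / total_weight


# Three framework atoms predicted perfectly, one CDR atom off by 2 units.
reference = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
predicted = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 2, 0)]
is_cdr = [False, False, False, True]

uniform = weighted_coordinate_loss(predicted, reference, is_cdr, cdr_weight=1.0)
reweighted = weighted_coordinate_loss(predicted, reference, is_cdr, cdr_weight=5.0)
print(uniform, reweighted)  # 1.0 2.5
```

With uniform weights the well-predicted framework dilutes the CDR error; with the CDR atoms up-weighted, the same error contributes more to the loss, so a training procedure would have more incentive to fix it.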

Jon Krohn:

Very interesting.
And so this is going to show how little I know about antibiotics. So are antibiotics formed by a single strand of amino acids or are there several that have to fold together?

Charlotte Deane:

Okay, so you want to say antibodies when you ask that question?

Jon Krohn:

Yes, that’s right. I wrote it down in my notes wrong immediately at the onset of the… I’m trying to make an effort to always be listening, but also jotting things down to have a great summary at the end of the episode and sometimes I’ll do things like that. Yes, the antibodies please.

Charlotte Deane:

So actually a true human antibody is like a giant Y shape and is made up of four chains. But the only bit we really focus on for all of this is, if you imagine the very tip of the Y, you have two chains at that point. So there are two chains of amino acids and they’re called the VH for variable heavy and VL for variable light. And this adds a level of complication because of course, one of the most important things to predict is how they orientate with respect to one another, because the binding site is sitting on the very tip of the Y. So it’s like the end, it’s the bits of both chains pointing out together that make the binding site. So a small change in angle between them will change the binding site a great deal.

Jon Krohn:

Nice. And the binding site is critically important because that’s what’s actually doing the work inside your body, right?

Charlotte Deane:

Yeah, yeah. So that’s what will bind to your target of interest. So if you’re trying to make this a drug, that’s what needs to bind to the protein you want it to bind to, that’s what needs to hit the target, those kinds of things.

Jon Krohn:

So I’ve managed to completely fluff my lines already in this episode and it gives me the idea that maybe we should take a step back on the idea of antibodies and antigens and explain that to the audience.

Charlotte Deane:

Okay, so now we’re in my sort of favorite territory. The easiest way to think about this is you would all be dead if you didn’t have antibodies in your system. So that’s a good way to start. They’re one of your major lines of defense against invading molecules in a natural system as a human. Effectively, in your body, if there is some form of infection or some non-native protein, what your immune system does is it raises antibodies against that, that bind specifically to that target. So if you have caught flu recently or had a cold or had COVID, and you’ve probably had at least one of those in the last few months because most people have, your antibodies have reacted to that infection. And by a process, it’s like a series of mutations, they mutate very, very rapidly to find ones that bind very specifically to that protein.
And there are various different immune mechanisms, but basically they mark them out to be destroyed, so your body knows, or in some way block them from functioning. The average circulating diversity in your body is about 10 to the 12 different antibodies. That’s an estimate, might be a bit less, but this gives you an idea of the numbers. Different sequences of antibodies in the human, naturally circulating around. The estimate of the actual diversity possible is somewhere around 10 to the 14.

So these are estimates, so there might be a factor out here or there. So when you are trying to design an antibody, what you’re thinking about is I need to make something, and this is really important given what I’ve just told you: if I inject a protein into you, even if it’s an antibody, if it isn’t a human antibody, your own antibodies will go, “Ah, kill it.” Yep, because that’s what they’re really good at.
So it won’t be very effective as a drug and it could also cause you to have a massive immune response, which isn’t very good for you. So we want to operate within that natural set of antibodies, all these human antibodies and find ones that will bind specifically to targets of interest. And the really cool thing about antibodies is they are incredibly good at this. So you can do it with laboratory experiments. You can always, I should say almost because there’s bound to be exceptions, but it is incredibly easy, in some sense, to find antibodies that bind to a target because that’s what they do because they’ve got this huge sequence diversity ability to do this.

The problem is to make them good medicines, they have to both bind to the target, bind where you want them to, to the target, there’s no guarantee of that. Be human, so that they will actually be safe inside your body. And then a load of other properties which are more about being able to manufacture them. So the easiest one to imagine is if I want to make something into a medicine, I’ve got to keep it in a bottle, it’s a bottle in a fridge for an antibody. So they mustn’t aggregate because I mustn’t inject you with aggregating protein either, that’s very bad for you. So there’s a whole set of these conditional properties that you’ve got to work out whether a sequence will have, and also you have to work out how do I, on a computer, design an antibody that will hit that site.

Jon Krohn:

Wow, that sounds really fascinating. All right, so let me try to summarize back for the audience, in my own words, some of the key points here. So your body is full of a very large number of antibodies, 10 to the power of 12 roughly, that are floating around. So they can conform to all manner of different kinds of shapes. So you’re saying that all the kinds of possible shapes that they could conform to were a couple of orders of magnitude more, something like 10 to the power of 14. So you’ve got all these antibodies floating around, just looking for random shapes to attach to. And those random shapes could be things like a Coronavirus particle, a specific part of it, you probably know off the top of your head-

Charlotte Deane:

Like the receptor binding domain, that’s the one that’s served or on the spike, those kind of things, yeah.

Jon Krohn:

Yeah, those are the kinds of words that I was like I know those are around. So you get a little bit of spike protein from a COVID virus that you’ve never had before, and your body is very likely to somewhere amongst those millions and millions… Oh yeah, go-

Charlotte Deane:

Well, it may not even have one that binds already or binds particularly well. So what happens is, because it’s got that much diversity, one of those may or may not attach. And as soon as something does, your immune system goes through this process, it’s called somatic hypermutation. So things that bind weakly mutate and it keeps things that bind stronger and stronger. So it’s like a self-proliferation system, improving binding in your immune system. And then I suppose the really cool thing, which I didn’t say, is after you’ve had an infection, antibodies that bound something useful are kept in a sort of memory bank. So when you get the same thing again, exactly the same thing again or a very similar thing, and this is really how vaccines work, you have antibodies that are ready to bind to it, not thousands of them, so it has to ramp up and make lots of them to get rid of it.
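The mutate-and-keep-the-better-binder loop described here can be sketched as a toy simulation. This is not a model of real immunology: the affinity score (fraction of positions matching a made-up "ideal binder" sequence), the sequences, and all function and parameter names are assumptions for illustration. It only shows the shape of the process, where a weak binder is repeatedly mutated and stronger variants are retained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_affinity(sequence, ideal_binder):
    """Toy score: fraction of positions that match a hypothetical ideal binder."""
    matches = sum(a == b for a, b in zip(sequence, ideal_binder))
    return matches / len(ideal_binder)


def mutate(sequence, rng):
    """Return a copy of the sequence with one randomly chosen position replaced."""
    position = rng.randrange(len(sequence))
    return sequence[:position] + rng.choice(AMINO_ACIDS) + sequence[position + 1:]


def toy_affinity_maturation(start, ideal_binder, rounds=200,
                            clones_per_round=20, seed=0):
    """Repeatedly mutate the current best binder and keep any stronger variant."""
    rng = random.Random(seed)
    best = start
    history = [toy_affinity(best, ideal_binder)]
    for _ in range(rounds):
        clones = [mutate(best, rng) for _ in range(clones_per_round)]
        top_clone = max(clones, key=lambda s: toy_affinity(s, ideal_binder))
        # Selection step: only a strictly better binder replaces the current one.
        if toy_affinity(top_clone, ideal_binder) > toy_affinity(best, ideal_binder):
            best = top_clone
        history.append(toy_affinity(best, ideal_binder))
    return best, history


# Made-up sequences: a weak starting binder and the hypothetical ideal binder.
ideal = "ARDYYGSSYWYFDV"
start = "ARGGGGGGGGGGDV"
matured, history = toy_affinity_maturation(start, ideal)
print(history[0], history[-1])
```

Because a variant is only kept when it scores strictly higher, the affinity history never decreases, which mirrors the "keeps things that bind stronger and stronger" description in a very simplified way.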

Jon Krohn:

Nice, okay. Yeah, so that fills in a few more gaps. So you have an antibody that could weakly bind, just bind to this new COVID spike protein that showed up in your body for the first time. And then as a result of that weak binding your body goes through this somatic hypermutation process that creates very rapidly, lots of mutations close to that spike protein shape that it weakly bound to. And then we’re likely from that process going to end up with something that does bind very well. And then I guess once that happens then you have some separate process that scales up, so it says-

Charlotte Deane:

Yeah, yeah. And basically, the immune system is really complicated, but it’s this idea that as things bind better to a foreign invader, more and more of it is made. But the making process contains this error system in its somatic hypermutation, so then you get better and better binders through that process. And then in what are called your memory B cells, you store the kind of things that you’ve had previously. So of course it’s there when you get a new infection, something that will already bind really quite well. So the process is much faster. So if you imagine when you get a vaccine, what they’re doing is effectively giving you nothing that is dangerous about what’s in a disease, but some example proteins from it that won’t really do you any harm. But then it can learn how to bind to those, so that when you actually get the real virus, very quickly you respond and can clear out the disease.

Jon Krohn:

Nice. And then I guess you’re also well prepared for slight variations on that say COVID spike protein. So the first time that your body encounters it, it’s quite a foreign shape, whether it encounters it for the first time from a vaccine or from getting an infection. But then in the future, even as the coronavirus mutates and has slightly different spike proteins, you’ll be better prepared because you’re more likely to weakly bind, yeah.

Charlotte Deane:

Yeah, but one of the things that makes this complicated is of course how long that kind of memory lasts in your immune system. And once again, I’m not actually an immunologist, but this varies or seems to vary for different diseases in terms of how long it does that. And this is only one part, there are lots of other parts of your immune system, but the other part that’s really important, and this is TCRs. But let’s not go there yet, but they operate in a similar way, but they do it for proteins that are inside the cell and there are other complications.
But it’s the same kind of thing about recognizing non-self as fast as possible, and then once you have recognized non-self, you can do this. So that’s how the natural system is working, and the idea in making them into biologics is that quite often what we want to bind to might even be a self protein. Say for example, if you’re targeting cancer, your own immune system tends to not target cancer because actually, cancer is primarily self, you’re replicating something too much in your own body. Or it’s very close to self, or there are reasons why your own immune system is not doing something. So what you then have to do is, outside, in a lab obviously, try to generate an antibody that will bind to that.

Jon Krohn:

Yeah, so that brings up a really interesting point. So up until this point in the conversation, we’d been talking about the biologics, these big drug molecules that you are focused on synthesizing and studying. And so all of the examples that we covered up until this point were with respect to antibodies. But you just also mentioned there the really interesting case of cancer, which is a different kind of problem for… Yeah.

Charlotte Deane:

But the point is here what we do is we design antibodies that will bind to say a specific protein in cancer and then they become a therapeutic. So my design problem is to design the antibodies against now a disease target. And the reason that we know they’re so effective is that our own immune system shows us that antibodies work because that’s how they work. And so what we have to do is work out, one, how you replicate that first in a laboratory experiment, which is what most people have done up till now. And what I’m interested in doing, which is how do you replicate that on a computer such that I can design the antibody that will bind to the target that you want it to hit.

Jon Krohn:

Gotcha. So these biologics, it’s very much state of the art. It’s not something that is widely used in humans today?

Charlotte Deane:

Oh yeah, no, in terms of biologics, I can’t remember exactly, I think probably seven or eight of the top 20 drugs are biologics. So they’re very widely used for lots of different diseases. There were several antibodies that were anti-COVID antibodies that were used and many are used for things like cancer treatments and that kind of thing. I think one of the things to think about them in terms of treatments though, it’s an important distinction because people get very excited, is that unlike a small molecule, to deliver this, it’s always going to be an injection or an IV drip because you can’t take this as a pill because your stomach would destroy a large molecule of this type. So you’re only going to design them for quite serious conditions because I probably don’t want to have to go and have an IV drip every time I have a headache. But I would be quite happy to do that if I was a little more seriously ill.

Jon Krohn:

Right, right, right, yeah. So I’m beginning to get the impression that I personally have not been administered any biologics.

Charlotte Deane:

I’m going to go for a no. They’re mostly for quite serious conditions. There’s a lot of them targeted at some various cancer types, that’s probably the biggest space they’ve been used in. There’s a few now, there are COVID antibodies that were used, particularly for people who were unlikely to raise their own antibodies against COVID. So people who had other conditions [inaudible 00:27:01] help them there. And there are some for some other diseases as well.

Jon Krohn:

Can they be personalized to an individual’s cancer?

Charlotte Deane:

I think the answer to that would be yes because you’re designing for a specific site. And to be honest, in terms of antibodies themselves, I don’t think that’s been done. Biologics more generally, which opens a massive can of worms if I go too far, but if there are things particularly, so I mentioned TCRs earlier, certainly with those, there is work where people personalize those to be used as particular medicines for an individual. So you take samples from an individual to create a medicine that might help them.

Jon Krohn:

Right, that sounds like something that could be super promising in the future. I imagine it’s the kind of thing that today is extremely expensive.

Charlotte Deane:

I think it is both expensive, but also quite slow today. And I guess that’s one of the reasons why I’m so interested in the computational methods. If you have to experimentally work out what you’re going to do, that takes time. And so time costs money as well, but that is almost more serious if somebody is seriously ill. Taking six months, nine months, a year to get the right medicine is probably not optimal.

Jon Krohn:

We’ve got help, it’s coming.

Charlotte Deane:

Yeah.

Jon Krohn:

Just sit tight. Best treatment ever is going to be here in a couple years, just wait, yeah.

Charlotte Deane:

And to be honest, I don’t know how far those kinds of treatments have got towards actually being a licensed treatment, but there’s definitely work where people are doing that kind of specialized, trying to make them personalized in that way.

Jon Krohn:

That’s very interesting. So even though these kinds of biologic administrations via IV are relatively rare and only done in serious cases, they are nevertheless, you were mentioning a stat there of something like them being a half dozen of the 20 most-

Charlotte Deane:

Yeah.

Jon Krohn:

Yeah.

Charlotte Deane:

I can’t remember the exact numbers, but they’re just a massive class of-

Jon Krohn:

They’re not rare.

Charlotte Deane:

… and they’re not rare at all, no. And some of it is because they can target some of these, I can’t think of the polite words for it, but basically diseases that are very serious. And another reason for this is also, and this is another reason why I want to do this on a computer, is that they’re expensive to make currently. So it’s once you know what small molecule you want to make, it might have been very expensive to do all the research to get to that small molecule. But actually, small molecules are quite cheap to manufacture. It’s not like totally zero, but it’s not that expensive.
Whereas biologics to actually make them at the end, you have to basically express them in cells. That’s how you make them. So you’re imagining a massive factory to do this. And so they’re considerably more expensive to make. And as I said, this is a biological molecule, so you’ve got to work out a way of keeping it safely as well. So they’re a much more expensive type of medicine and that means that that limits their use. But of course it can make them very profitable for very important diseases and things they want to do. So it’s really important to work out ways of actually making them cheaper to discover in the first place and cheaper to make once you’re doing that, so that you can really use them in more places as well.

Jon Krohn:

Got it. So if we don’t use data and computing to try to find these molecules, how do you do it then? You’re just kind of guessing, right?

Charlotte Deane:

Yeah, well there are two main ways to get these molecules. And the first one, the squeamish should not listen to the next bit because the most obvious way to do this is to use another animal’s immune system. So I’ve just described how good we are at it. So what’s the best way to raise an antibody? Well, one way to do it is, and actually, what’s done commonly here is there are mice now that exist, but mice would be a good example here, where we’ve ‘humanized’ their immune system. Because obviously I need it to be human, not mouse. But you could just use normal mice and then you have to work out how to humanize it. And then you inject the animal with very large quantities of whatever it is you want to raise the antibody against. Obviously that’s for step one.
And then the way to describe this is then you have to harvest the animal’s immune system to see what it’s done. Or you have to take lots of extract samples from the animal to see what antibodies it’s raised. So what you’re doing is using the natural immune system of say a mouse to identify antibodies that bind to your target-

Jon Krohn:

Right.

Charlotte Deane:

… so that’s one way. The other way is really the best way to imagine this and it’s kind of the way I want to think about doing it on a computer, but you do it as an experiment. What you have is a massive library of antibodies. Usually we’d say something like a phage display library. And what you’ve done is artificially made yeast cells for example, doesn’t have to be with, but that’s the easiest way to describe this, and each one expresses loads of one type of antibodies. They’ve all got different sequences. Now these libraries can get quite big, so this is a big experiment you’re running. So imagine you’ve got this and then you flood it once again with your thing you want it to bind to.
So I would say the antigen, the thing you’re trying to get.

And then effectively, you find all of the cells that bound to it. So now you have some binders and then what you do is you sort of refish. So you take those cells, you make them do mutations in a similar way to in the body, but obviously this all ends up being a bit smaller scale because the numbers to be able to do these experiments are smaller. And you repeat that experiment until you get binders to your target of interest. So the phrase I use is before you do it on a computer, the way you do it is you go fishing. Because effectively what you do is you go, here’s the antigen, fish in the pool. Oh I’ve got something that sticks, excellent.
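The "go fishing, then refish" loop described here can be sketched as a toy simulation. This is a minimal illustration only, not a model of a real display experiment: the `binding_score` function, the hidden ideal sequence, the library size and the mutation rate are all invented stand-ins for what would be measured in the lab.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binding_score(seq, ideal):
    """Toy stand-in for affinity: fraction of positions matching a hidden ideal binder."""
    return sum(a == b for a, b in zip(seq, ideal)) / len(ideal)


def mutate(seq, rng, rate=0.1):
    """Each position has a `rate` chance of being redrawn at random."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a for a in seq)


def pan_library(ideal, library_size=500, rounds=5, keep=50, seed=0):
    """Repeatedly select the best binders, then diversify them by mutation."""
    rng = random.Random(seed)
    # Start from a random library: every member has a different sequence.
    library = [
        "".join(rng.choice(AMINO_ACIDS) for _ in ideal) for _ in range(library_size)
    ]
    best_per_round = []
    for _ in range(rounds):
        # "Fish": keep the sequences that stick best to the antigen.
        binders = sorted(
            library, key=lambda s: binding_score(s, ideal), reverse=True
        )[:keep]
        best_per_round.append(binding_score(binders[0], ideal))
        # "Refish": carry the binders forward and add mutated copies of them.
        library = binders + [
            mutate(rng.choice(binders), rng) for _ in range(library_size - keep)
        ]
    return best_per_round


# Hypothetical 10-residue "ideal" binder, used only to score the toy library.
scores = pan_library("ACDEFGHIKL")
print(scores)
```

Because the selected binders are carried into the next round unchanged, the best score can never go down from one round to the next; in a real experiment the measurement is noisy and no such guarantee holds.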

Jon Krohn:

Right, and in order to do that, so you’ve got a million different kinds of lure that will get stuck to a very specific kind of fish.

Charlotte Deane:

Yes.

Jon Krohn:

Got it, okay. So that’s the traditional way of identifying biologic drug candidates. But then, as we discussed when we kicked the show off, we can use data, machine learning, AI to be able to predict what the drug candidate would be, so that you’re, I guess, maybe fishing from a much smaller pool. Or are you sometimes just able to say that’s the fit, we know exactly what the right lure is for this particular kind of fish?

Charlotte Deane:

I think the answer there is can’t do the second one yet, want to get there in the end. So at the moment what you want to do ideally is for somebody to say this is the target, this is the thing that is disease-causing and I want you to produce an antibody that binds to that. And I’d love to just be able to effectively put the target into a computer program and it would give me the sequence of the antibody that would bind, okay. That’s the big end goal and there are lots of pieces you need along the way to be able to do that. Things like thinking about understanding the structure and shape of the antibody. So what shape will it make? As I change the sequence, it will change the shape and that changes whether it would bind or would not bind to my target.
What we are really at the stages of doing now, and this is what everyone’s doing is working out how can I make those experiments much, much more efficient? So given I have a small amount of data about what might bind weakly for example, could I work out things that would bind much more strongly?

And the other thing about this, and retain all those other properties I mentioned earlier about it’s got to be human, I don’t want to inject you with something that’s not human. I’ve got to be able to keep it in a bottle in a fridge, so there’s a lot of physical properties it must have. I want it to express really well. So that’s that question about manufacturing. If it doesn’t express well enough, there’ll never be enough antibody, it won’t make a good drug. So you kind of have this big multi-optimization problem that you’re trying to do as you do this.
And I hope it’s fairly obvious to everybody, big multi optimization problem. What I want is lots of data that describes these kinds of things are human, these kinds of things will be able to be kept in a bottle. These kinds of things express well, yep. How do I feed all that information into an algorithm such that it makes good decisions about what you should do. And in one sense, at the moment, what you want to do is connect that to a lab so that you have algorithms that make good decisions about what you should test next in the lab to learn as much information as possible to design the antibody better and better to bind specifically to your target of interest and keep all those properties I’ve just talked about.
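One simple way to picture the multi-property trade-off described here is a Pareto front: keep every candidate that no other candidate beats on all properties at once. The sketch below is illustrative only. The candidate names and scores are made up, and the four properties are assumed to be already scaled so that higher is better; it is not the method used by any particular group.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    binding: float      # how strongly it binds the target
    humanness: float    # how human-like the sequence is
    stability: float    # can it be kept in a bottle in a fridge
    expression: float   # how well cells produce it


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    a_props, b_props = a[1:], b[1:]
    return all(x >= y for x, y in zip(a_props, b_props)) and any(
        x > y for x, y in zip(a_props, b_props)
    )


def pareto_front(candidates):
    """Candidates that are not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Invented scores for four hypothetical antibodies.
candidates = [
    Candidate("ab1", 0.9, 0.4, 0.7, 0.6),
    Candidate("ab2", 0.7, 0.8, 0.8, 0.7),
    Candidate("ab3", 0.6, 0.7, 0.6, 0.5),  # worse than ab2 on every property
    Candidate("ab4", 0.5, 0.9, 0.9, 0.9),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

An algorithm choosing what to test next in the lab would need more than this, for example a model of how uncertain each predicted score is, but the front makes the point that there is rarely a single "best" molecule.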

Jon Krohn:

Nice. So it sounds like there’s lots of fertile ground here for your research for quite a few years to come. So why is it that people like Demis Hassabis and I’m probably mispronouncing his name and given how long I’ve been hosting the show, I shouldn’t mispronounce his name, but let’s go with Demis Hassabis, in fact Google DeepMind CEO. He recently claimed that their AlphaFold algorithm had ‘solved’ the protein folding problem. But in talking to you, it seems crystal clear that we’re nowhere near it.

Charlotte Deane:

I think, well I’ve got to be careful here because I was at dinner with him last night, so I must be [inaudible 00:36:41].

Jon Krohn:

How much did I butcher his name?

Charlotte Deane:

A bit. So he gave a lecture in Oxford, he was doing the Kavli Prize Lecture yesterday in Oxford and then a group of us went out to dinner with him afterwards. And I think well there’s several things I should say. The first one he knows I would say, which is I don’t like the phrasing of solving the protein folding problem. So that is the way people talk about it. Actually, what AlphaFold does is it solved protein structure prediction. It hasn’t even completely solved that, and he wouldn’t say it was completely solved either, but it has sort of. And what that means is if you have a sequence, what AlphaFold does is tells you what the end point structure is. What is not clear, and my group’s done some work on showing this, is it doesn’t actually know how it folds to get to that structure.
It’s like it’s learnt the end points and that’s really useful. So I don’t want to do it down at all. I think that’s an amazing achievement. It’s an incredible piece of software.

But what it has done is moved us from a realm where we thought doing this kind of structural prediction was really difficult to a realm where actually a lot of the time you get it very right. The problem is that, and maybe I should have explained this a bit better, when we were talking about the structure prediction at the beginning, antibodies are in inverted commas… Well no, just slightly different in the whole way that they mutate and change from standard proteins. So the standard proteins in your body, most people are comfortable, they evolve through time. So we have, let’s take hemoglobin is a really good example, things making your blood red, carrying the oxygen around your body. We have hemoglobin, yeah, so does every other mammal, sperm whales have hemoglobin.
Their hemoglobin looks quite different from ours in terms of sequence. But the shape is the same because it’s carrying out the same function. So what it’s evolving to do is maintain that function. So the sequence will change, but the structure is staying quite stable.

And AlphaFold makes use of that when it does its predictions because it takes in as its starting point, this what’s called a multiple sequence alignment. So if I talk about hemoglobin again, it would take in the hemoglobins from all of the different species, but of course, what it’s predicting is the structure of one of those. But the information from all of those is helpful because they’re all roughly the same shape. In fact, more than roughly the same shape, they’re very much the same shape. Antibodies are not evolving in that way. They’re deliberately evolving to change the shapes of the loops on their surface to bind to different things.
Actually, I should be fair here, AlphaFold-Multimer is actually really rather good at predicting antibodies.

But the sequence alignment probably doesn’t help it very much because it’s taking in information that is, if you imagine these loops make one change in sequence and it completely changes the shape that you observe there. And that’s what they’re meant to do. And they’re not really evolving. They’re undergoing somatic hypermutation driven towards specific targets, which are different for each antibody. But we would align them all together because globally most of the sequence is very similar. It’s just these loops that are really making big changes. And so one of the reasons in our structure prediction program, we don’t use a sequence alignment. Now one, that allows us to go much faster. But two, because the antibodies don’t evolve in the same way as a standard protein, it’s probably non-information, at least to a much greater degree to use that.
So I think that’s the first stage.
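To make the alignment point concrete, here is a small sketch that measures how variable each column of an alignment is, using Shannon entropy. The four sequences are invented for illustration (they are not real antibody or hemoglobin sequences): six columns are identical across all of them, like a conserved framework, and two columns differ in every sequence, like a hypervariable loop.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def alignment_entropy(sequences):
    """Per-column entropy for equal-length, already aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


# Invented toy alignment: columns 3 and 4 play the role of a variable loop.
TOY_ALIGNMENT = [
    "ACDGYWKL",
    "ACDSFWKL",
    "ACDTHWKL",
    "ACDNRWKL",
]

for position, bits in enumerate(alignment_entropy(TOY_ALIGNMENT)):
    label = "conserved" if bits == 0 else "variable"
    print(position, label)
```

Columns with zero entropy say "this position is always the same"; columns with high entropy carry little shared signal about shape, which is the situation described above for the binding loops.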

The second stage, and this is something that he was talking about yesterday, is of course that we’re at the stage where we’ve maybe solved this for the individual units. So if you think, well the antibody’s already got two chains, but we haven’t really solved the problem of this kind of interactions thing, which is the bit I need to solve, which is how does the antibody interact with its target? And that’s the next stage on. And I think it’s something they’re obviously thinking about too, about in the general sense, how do you predict if two proteins will interact with one another? Because it’s not always interacting, it can be transient. So it binds and then it comes off again, binds, comes off again. So how do you predict that and what systems do you need to do that? And there’s actually some really exciting research papers coming up about trying to do that kind of prediction at the moment.

Jon Krohn:

Yeah, that is exactly what I was going to ask you about next. And so before we get to that kind of 4D prediction problem, to really quickly explain why antibodies are… I’m going to try to paraphrase back to you what you said and then you can explain to me why I’m wrong. So with things like hemoglobin, across lots of different species, there’s this kind of convergent evolution towards a particular-

Charlotte Deane:

So it’s the other way around, it’s a divergent evolution. If you imagine there was an ancestral mammal that had hemoglobin, we all have hemoglobin and we’re all divergent from there.

Jon Krohn:

Ah, got it, got it, got it, got it. But because it’s so efficient, it stays mostly the same across all these different ancestors. Yeah, so I was thinking about it the wrong way where we’ve all separately evolved, yeah.

Charlotte Deane:

And there are examples of convergent evolution, but primarily, this is a divergent evolution thing.

Jon Krohn:

Right, right, right. That makes sense because it would obviously be a lot more rare for that to happen just by chance. But then with antibodies, with the tricky bit and say the name of the tricky bit again.

Charlotte Deane:

I say CDRs, it’s complementarity-determining regions.

Jon Krohn:

CDRs.

Charlotte Deane:

Which sort of makes sense because it’s what determines their binding. So complementarity-determining regions, but CDRs is what people call them.

Jon Krohn:

And so the trick that you were describing of having all this information from lots of different species on how hemoglobin sequences correspond to hemoglobin shapes, that isn’t very useful information in the case of antibodies because these CDRs are deliberately trying to be as diverse as possible. And you’re using that 10 to the power of 14 number, there’s so many different possible orientations. And so you end up with just lots of sequences where you’re like, how can we predict what shape this would make because we haven’t seen this before?

Charlotte Deane:

Yeah, and they don’t have that relationship where keeping that shape is useful for their function. So if they all bound to the same target, then they would all have the same shape. In fact, that’s a trick that we use when doing prediction. You can have very different sequences, if we know they bind into the same thing, they probably have the same shape, even with antibodies. But generally speaking, antibodies all bind into different things, so they will have a different shape. It’s obvious as you say it out loud, but it changes the way you might write the software and think about how to use the data.

Jon Krohn:

Super interesting. All right, so now let’s go back to that 4D problem. So I’m so glad that it is something that we have some research on because it’s something to me, that was immediately fascinating when I read about the AlphaFold 2 successes earlier and how it had done so well at predicting 3D protein structure. In my mind, then I was like, well yeah, the next big problem to solve then is going to be computationally much more complex, is going to be this 4D problem of how does this 3D structure change over time as it receives the antigen that it binds to? Or as it does and I think yes, that’s the specific to antibodies. But as the protein does work, how does its shape change?

Charlotte Deane:

So this is one of the interesting things, actually antibodies are one where this is not so much of a problem. Well we don’t know that for definite, I should be clear here. All of the evidence suggests that antibodies, when they bind to their target, when they are high affinity, so when they bind strongly, which is what we’re interested in, their shape doesn’t change very much at all, yeah. And the reason for that is pure physics, energetics. So if you have to change a lot on binding or have to fix your conformation on binding, there is usually an entropy cost for doing that.
So if you are rigid and already in the shape you need to do binding, it’s easier to have a higher binding energy because you are losing less energy to do the binding as well as that which you gain. Whereas there are examples of other proteins which do change. Now there are some antibodies that do change on binding, just to be clear, and it’s not completely a jigsaw piece. So some of it is moving, the side chains will move, there’ll be changes in conformation. But interestingly for antibodies, that actual change in shape, for things that are strong binders, is thought, because we don’t have complete evidence, to be not so much of a problem.
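The entropy argument can be written out with the standard thermodynamic relation for binding. The split of the entropy term into a conformational part and "everything else" is a simplifying assumption made here for illustration, not something stated in the conversation.

```latex
% Free energy of binding: more negative means tighter binding.
\Delta G_{\text{bind}} = \Delta H_{\text{bind}} - T\,\Delta S_{\text{bind}},
\qquad
\Delta S_{\text{bind}} = \Delta S_{\text{conf}} + \Delta S_{\text{other}}

% A flexible loop that must freeze into one shape on binding loses
% conformational entropy, which adds a positive penalty to the free energy:
\Delta S_{\text{conf}} < 0
\;\Longrightarrow\;
-T\,\Delta S_{\text{conf}} > 0

% A binder that is already rigid in the bound shape has
% \Delta S_{\text{conf}} \approx 0 and so pays almost none of that penalty.
% The dissociation constant follows from the free energy
% (c^{\circ} is the standard concentration, R the gas constant):
K_d = c^{\circ} \exp\!\left(\frac{\Delta G_{\text{bind}}}{RT}\right)
```

With the same contacts formed at the interface, the pre-organized binder therefore ends up with the more negative free energy and the smaller dissociation constant, which is the "strong binders don’t change shape much" observation in physical terms.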

Jon Krohn:

Nice. Okay, I got it, I got it, I got it. Well it’s still interesting to me that this will be, yeah, as you said, people at DeepMind, and I’m sure in other groups are thinking about this 4D problem and how we can be yeah, adapting algorithms to be able to handle these shape changes. And I guess we keep being surprised by how quickly problems get solved. Maybe that’s one that will take a few years.

Charlotte Deane:

It’s been amazing over the last few years how quickly all these types of problems have been solved. And I think one of the interesting things here in terms of a lot of this is thinking about what problems we could still solve where we actually do have, if you like, the data that would allow us to solve it as well as it’s a well formulated problem? The way to think about it is where DeepMind showed their first success was on games. And the reason they picked games and they’re very open about this is because games are a multi-optimization problem, that’s what machine learning’s really good at, with well-defined rules, machine learning likes those. And with a way of being able to generate or have lots of data that describes the entire space you’re going to work in.
And so it’s interesting to think, particularly in terms of drug discovery biologics, but also in all sorts of other areas, what problems fulfill those requirements because many, many that we would like to do, don’t.

And so then it becomes either we need very new techniques to solve them or we need to think how do we define the rules such that it would be helpful. That’s usually the biggest problem. The rules of the game, or the data, which is the other really huge problem that people miss out very frequently. Actually, the volume of data required, and it’s not just lots of data, it’s data spread well across the space you want to search. If you only have data in a tiny area of the space, none of this will work. It’s very difficult for it to learn the rules over here, miles away, if I’ve only got data in just a little circle around this tiny point over here.
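The "little circle of data" problem shows up even in the simplest possible model. The sketch below is a deliberately tiny illustration with an invented target function: a straight line is fitted to samples of a curve taken only from a narrow interval, then asked about a point far outside it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def true_fn(x):
    """Invented ground truth that the model never sees outside [0, 1]."""
    return x * x


# Training data covers only a tiny region of the space: x between 0 and 1.
train_x = [i / 20 for i in range(21)]
slope, intercept = fit_line(train_x, [true_fn(x) for x in train_x])


def predict(x):
    return slope * x + intercept


err_inside = abs(predict(0.5) - true_fn(0.5))   # inside the sampled region
err_far = abs(predict(10.0) - true_fn(10.0))    # miles away from any data

print(f"error inside region: {err_inside:.3f}")
print(f"error far away:      {err_far:.3f}")
```

The fit looks reasonable where there was data and is badly wrong far away from it. More flexible models change the details but not the basic issue: without samples spread across the space, there is nothing to learn the distant rules from.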

Jon Krohn:

Yeah, I see what you’re saying. And then so yeah, we obviously need really good training data for any of these models to be possible. And to have really well-defined data to work with on these 4D problems would be very hard to come by and extremely expensive to create.

Charlotte Deane:

Yeah, and I think people will think about other ways to think about that. So you might use lots of simulated data from using physics simulations for example, or other ways to attack this. So it’s like you’ve got to think about what machine learning is good at, if that’s your preferred hammer. And then work out can you set the problem up such that hammer will work. Or are you just, no, don’t use a hammer right now, think of something else?

Jon Krohn:

And there’s probably some way that I’m thinking about this incorrectly, but I imagine in a situation like that, where you’re trying to do physics simulations of 4D structure, of course if you could do those physics simulations perfectly, you’d have already solved the problem, so-

Charlotte Deane:

That is almost exactly it. That is actually a phrase I use quite a lot when people go, “Well it’s all right, we’ll run the simulations and then we’ll train the machine learning model to do them.” And I’m like, “Well if the simulations worked, I wouldn’t…” You still might train the machine learning model because it might be a lot faster. So yeah, you can make an argument that that’s a good thing to do because it’s much quicker. But if you know that the simulation is incorrect, you have to decide is it worth training something that gets me to the wrong answer much faster, or is it correct enough that I might be able to use it to help me generate the types of data that will make good models for doing this kind of thing? So there are positives I can put on this, but it’s a bit more nuanced than just yeah, if you can simulate it, then you could train it and then we’d answer it. Because, as you say, if the simulation’s wrong, it’s not quite clear what you’ve trained yourself to do.

Jon Krohn:

Right, so for this 4D problem, it sounds like a thorny one that we’ll be tackling for a while. What breakthroughs do you think we might have in your space in the next few years that’ll make headlines?

Charlotte Deane:

I think we’ll see lots of things and you’re sort of starting to see them in research papers already because the field’s moving so fast. I think, in the next few years, you will see biologics that are designed on a computer that make it through to clinical trials. I think that’s coming, there’s already one company that’s talked about doing that, but they designed only part of their molecule. But they did design some of their molecule and it’s going through to clinical trials. So I think we will continuously see the rise of the computational technique being the driver of the experiments that are chosen to be done and then the driver of what happens in the clinical trials. Hopefully that will translate into lots of other good things in terms of drugs becoming easier to make and more available.
I think we will start to see, and this is something, maybe this is a hope rather than I know it’s going to happen, but I think it’s a really key hope for somebody like me, is that we will start generating datasets with the algorithms in mind.

So in my space at the moment, the data we have is this historical data that people have collected because they were interested say in a specific antibody or a specific target and they publish this. And you collate all that together and try to use that to train algorithms. But if I was designing a dataset, those are not the experiments I would’ve carried out. Now experiments are expensive and they’re hard, but we are moving to a much more roboticized setup of all of these types of things. So I think one of the things that we will see, and it will change the types of algorithms we can even write is if we collected the right data in the first place.
It’s a bit like lots of my colleagues in this department, they’re statisticians and one of their biggest complaints is when somebody turns up after they’ve done all their experiments and says, “Can you do some stats on this to show that this is significant?” And it’s like, well I could have done, if you’d talked to me before you ran the experiment. It’s quite hard now to do anything with your data because you set that up as you thought about it, but not in a way that computationally or statistically would allow me to make those kinds of statements.

And I think we are starting to get to the stage where people are thinking, how do I generate data that really drives machine learning algorithms? And that, I hope, what we will see as the big breakthrough is that will be a virtuous cycle in the sense that once you do that, the algorithms will get noticeably better, then the data collection will get noticeably better and you will really, you’ll see a massive acceleration in what’s achieved, I think.

Jon Krohn:

Nice, super cool answer. So we have talked about COVID in this episode, but something that well, you and I haven’t talked about yet though listeners might remember from my introduction to the episode, which I record separately from our conversation, Charlotte, some of our listeners might recall that you were recognized with a Queen’s honor, when you were appointed member of the Order of the British Empire, so an MBE, which is a tremendous honor in the UK. And you were awarded that for your key role in leading UK’s response to COVID-19. So how did everything that we’ve talked about so far today, how does all of the science that we’ve talked about today, all the machine learning that we talked about today, relate to the way that you were able to respond to the COVID pandemic and make an impact in the real world?

Charlotte Deane:

So in some ways it’s separate and in some ways it fits together. So my stuff was for helping UKRI, UK Research and Innovation, set up their very rapid grants response to allow people to apply for money, both in industry and academia, to do research into COVID. At the time when COVID hit, I happened to be working for part of UKRI, I was deputy executive chair of what’s called the Engineering and Physical Sciences Research Council. And I was asked to, if you like, lead this stream of work to be able to get these grants going, which is very rapid grants to people to be able to fund COVID research.

And I think some of the reason I was asked to do that was because there are a very, very small number of, if you like, academics or people with academic background inside UKRI and they span the whole discipline.
So some of them are historians, some of them are social scientists, those kind of backgrounds as well. It’s very small number of these people and I’m one of them. And I was one of the few people who worked in a field which was quite close to the area, in the sense that this is a virus. I knew a little bit about antibodies, a little bit about drugs, a little bit about… Actually, I suspect if we’d had a true, I don’t know, probably more useful if it had been an epidemiologist in some way, been sitting there. But really, in a way, a lot of my science wasn’t the most helpful thing there, it was everything else.
And it was this kind of, my job there was to have a broad enough understanding of science to try and get the right people together to work on these problems, to try and make sure that we were quickly funding lots of research projects on relatively short time scales. And persuading my colleagues they have to make their results available immediately.

This isn’t about publishing a paper, you can do that. This is about telling us what now, even if you’re not sure. First time in my life, the sentence, no, people will die if you don’t do this, actually meant something. And so this was more about that kind of organizing, coordinating, persuading the world to behave, and by this I mean the scientific world, to behave in a different way in terms of responding to what was needed in terms of research for that. So really it was that part of my life. And it was a very different experience, I have to say. Not something I ever thought I’d end up doing.

Jon Krohn:

Right, yeah. A lot of us didn’t see the pandemic coming, but yeah. Amazing that you were in the right place at the right time and able to play that key role in UK research and innovation. Were you able to at least, I don’t know, was there some kind of data or data analytics element to the UKRI response?

Charlotte Deane:

I would say it was almost all of everything that I heard about was about data analytics. It was a huge part of what was going on. So if you think about obvious things that, I’ll use one story as an example, and it’s almost a silly story, but it’s a really good way of describing this kind of thing. One of the projects that we funded very early on as UKRI was a project where you could test waste water to see what viruses were in it, now that’s a known thing, yeah. But could we test it and find out how much COVID was in an area, as opposed to having to get loads and loads of people to do COVID tests, persuading people to do testing? And some people were more averse to testing than others. So of course if you just look at numbers of people who have COVID based on COVID tests, it’s not actually a very good representation of what’s happening across the population.

Because some people over tested, some people never tested and it was easier for some people to get tests than others, all the obvious things.
And so that data, there was a lot of work to try and make that clean and how you deal with it. Whereas here, what you got was data from the sewage plants. I love this kind of, the whole concept is just, it has this element of humor to it, so I really like the whole grant in this sense. And it turned out to be really accurate because what you would see, and actually sewage plants are relatively local things, so you knew if there was a major COVID outbreak happening in a part of the country, sometimes ahead of any other data. But also, this data was very unbiased compared to a lot of other data because it didn’t depend on lots of the factors for others. So we could use that data and overlay it with data, say for example, from COVID testing or from other incidence rates that we might have or from hospitalization data. But most importantly, it could do things like warn you because it would tell you there’s a large amount of virus.

And I’m sure most people are aware, there was quite a long time between people catching the virus and people actually being seriously sick. And so you could start to prep the hospitals in an area, saying, “The numbers are clearly high in your area even if you can’t see it right now, sewage plant says yes,” yeah? And that’s really, I think, a really good real-world example of thinking about where your data comes from, thinking about your data sources, thinking about how you get unbiased data. And then of course you could do every kind of other test to lay over it. But it also helped us work out how to normalize the testing rates that are going on and how much COVID is actually happening in different places. So I didn’t do any of the calculations for this, but I was helping to make sure that these projects all talked to one another, so that you actually had those kind of numbers available for people to make decisions and think about the best ways to use that data.
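The early-warning idea, where the wastewater signal rises before hospitals see patients, can be illustrated by asking how many days one series leads another. Everything in this sketch is synthetic: the bump-shaped wastewater curve, the seven-day delay and the 10% scaling are invented for the example and are not figures from the UKRI project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_lead_days(signal, outcome, max_lag):
    """Lag (in days) at which `signal` best lines up with the later `outcome`."""
    def score(lag):
        # Compare the signal on day t with the outcome on day t + lag.
        return pearson(signal[: len(signal) - lag], outcome[lag:])
    return max(range(max_lag + 1), key=score)


def viral_load(day):
    """Synthetic outbreak curve peaking on day 30."""
    return math.exp(-(((day - 30) / 8) ** 2))


days = range(60)
wastewater = [viral_load(d) for d in days]
# Synthetic assumption: admissions echo the wastewater curve 7 days later.
admissions = [0.1 * viral_load(d - 7) for d in days]

print(best_lead_days(wastewater, admissions, max_lag=14))
```

On real data the relationship would be noisy and the delay would vary by area and over time, so a lag estimated this way is a rough guide for planning rather than a fixed constant.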

Jon Krohn:

Right, it’s super fascinating. And what’s the British TV show that you just made reference to by saying, so the computer says no, skip. That’s [inaudible 00:58:54] understood.

Charlotte Deane:

Yes, the computer says no, that’s in, I can’t even remember. Yes, it’s a comedy show. I don’t think I’ve even ever watched it. I think it’s just something [inaudible 00:59:03].

Jon Krohn:

But that’s [inaudible 00:59:04] is wastewater says yes. And there’s an interesting, to me, this whole area of using wastewater to investigate what’s going on with COVID, that also, that’s being done regularly with lots of different kinds of diseases and just community health issues in general, right?

Charlotte Deane:

Yeah. I have to say, before I was involved in this, I had no idea that wastewater was used in this way. And then when this project was suggested and colleagues who worked in what’s called NERC, so that’s the sort of environment funding part of UKRI, were like, this is a really good idea and we’re talking it through. And they talked about other things that have been measured within wastewater and it’s good because it’s no kind of personal information attached to this. So I don’t know, it feels like a really good way of doing a lot of these things because you can see something that is helpful for the health service to know about and for people to be able to look after people well. But you’re not invading anyone’s privacy because, well, literally it’s the sewage from everybody in the area, which tells you nothing apart from the area.

Jon Krohn:

This is a situation where somebody saying that your research is crap means it’s quite a good thing.

Charlotte Deane:

Yeah, I think we said it was one of the most excellent in that sense.

Jon Krohn:

And I had to quickly look up on my machine here, some of the details, but this reminded me that a lot of the early epidemiological work in the western world revolved around detecting cholera. So someone named John Snow, I’ve looked up here, detecting epidemics of cholera and localizing them to wells. And so it seems it’s related here in a way, is basically people were pooping and it was getting into the well water and then people were drinking the poop and that’s where that [inaudible 01:01:09].

Charlotte Deane:

[inaudible 01:01:11] issue, yeah. But here, as opposed to nobody was drinking poop, I hope nobody was drinking poop. But instead what you’re saying is what you’ll be able to see by measuring something, is where there is a hotspot and for epidemiological modeling that’s really helpful because it tells you how… Now it’s not detailed enough data, but once you have that kind of data, of course you can then work out what the testing means in that area. Because if you only have negative tests in an area where the sewage system is saying actually, there’s a lot of COVID, you realize that you must have a lot of population that’s not testing, for example. And so you can build that into your epidemiological models as you try to work out what’s happening with the spread.
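Editor's illustration: the reasoning described here (a strong wastewater signal alongside few positive tests suggests a population that is not testing) can be sketched as a simple ratio check. This is a minimal sketch under stated assumptions, not the method used by the projects discussed. The function name `flag_possible_undertesting`, the `ratio_cutoff` parameter, the area names, and all numbers are hypothetical.

```python
from statistics import median


def flag_possible_undertesting(areas, ratio_cutoff=0.5):
    """Flag areas whose reported cases look low relative to their wastewater signal.

    `areas` maps an area name to (reported_cases, wastewater_signal).
    An area is flagged when its cases-per-unit-of-signal ratio falls below
    `ratio_cutoff` times the median ratio across all areas.
    """
    # Skip areas with no measurable signal to avoid dividing by zero.
    ratios = {
        name: cases / signal
        for name, (cases, signal) in areas.items()
        if signal > 0
    }
    if not ratios:
        return []
    typical = median(ratios.values())
    return sorted(name for name, r in ratios.items() if r < ratio_cutoff * typical)


# Hypothetical toy data: (reported cases, wastewater viral signal) per area.
areas = {
    "area_a": (120, 100.0),
    "area_b": (95, 90.0),
    "area_c": (10, 110.0),  # strong signal, very few reported cases
    "area_d": (60, 50.0),
}
flagged = flag_possible_undertesting(areas)
print(flagged)
```

A flagged area is only a prompt to look closer; as noted in the conversation, the wastewater data on its own is not detailed enough to settle what is happening.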

Jon Krohn:

Yeah, it’s going the opposite way around. With the cholera thing, it was like, oh wow, these wells are really important. In this case we’re saying okay, we already know this is where we should be looking for the problems. Yeah, so fascinating, fascinating research that we’ve covered in this episode and I’ve learned an absolute ton. Are there any particular kinds of data tools or software or techniques that you use regularly or you’re excited about that you think our listeners should be aware of at home?

Charlotte Deane:

I guess I start from this that I have no hammer. I have questions I want to answer. So I get excited about loads of different techniques. There are new techniques that people are getting really excited about in terms of sort of deep learning methods for the kind of problems we’ve been talking about, this binding problem. So there’s a lot of excitement around diffusion models at the moment and things like, I like it because it’s a beer name in my head IPA. So those kinds of models. I think one of the big things that also is becoming very powerful in biological data in general is the natural language models.
So people might have seen the models that have come from Facebook where they’ve been building these giant natural language models that seem to be, they’re a really interesting way of representing sequences. And then what you can do with those.

I should say that something that’s really important to remember is that the basic statistical techniques are often really powerful to start with. It’s easy to forget to check what your data tells you just by checking whether things correlate. And it always sounds a bit silly, but those starting points are really important in what we do. And then these really cool methods allow us to extract more information or potentially pull in more types of data into a single model to be able to make these more complex predictions. So yeah, it just keeps going. And I should also say, I think that if you ask me in three months time, there’ll probably be newer techniques because the field is moving so fast at the moment.
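Editor's illustration: the "just check if they correlate" step can be as small as a Pearson correlation computed by hand. This is a minimal sketch using only the Python standard library. The helper name `pearson_r` and the toy numbers are hypothetical and are not taken from any dataset mentioned in the episode.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one input feature and the quantity to be predicted.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(feature, target)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 or -1 suggests that a simple linear relationship already explains much of the signal, which is worth knowing before reaching for a larger model.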

Jon Krohn:

Yeah, this is something. So in the last few years, actually starting in the pandemic, I started creating a lot of content, particularly on YouTube, on the foundational knowledge of the data science field or machine learning. So things like linear algebra, calculus, probability theory, statistics and fundamental computer science concepts. And as I dug more and more into that, I was blown away at how many complex problems can be solved with very simple linear algebra or yeah.

Charlotte Deane:

Yeah, another thing that we found recently is a lot of machine learning papers don’t bother to work out what a boring baseline would be. So if I used a really standard technique, how well would I do at the problem as I have set it up here? Because we have discovered quite often, that for many of these methods, the boring baseline is almost as good as the machine learning fancy, fancy thing. And if that’s the case, don’t do the fancy, fancy thing. Because you’ve got a lot more parameters. So I feel really strongly that it’s important to understand all that underpinning stuff, so that you can actually make good choices. Because there’s real power in these really complex algorithms, but we should know when it has real power and when it’s effectively just a very fancy toy.
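Editor's illustration: one way to make the "boring baseline" comparison concrete is to score a trivial predictor (always predict the training mean) on the same test data as the complex model, and to ask how much error the complex model actually removes. This is a minimal sketch under stated assumptions. The function names, the `min_relative_gain` threshold, and all numbers are hypothetical, and the "fancy" predictions are stand-ins for the output of whatever model is being evaluated.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_baseline(train_targets, n_predictions):
    """Boring baseline: predict the training mean for every test example."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_predictions


def worth_the_complexity(fancy_mae, baseline_mae, min_relative_gain=0.1):
    """True if the complex model cuts baseline error by at least the given fraction."""
    if baseline_mae == 0:
        return False  # the baseline is already perfect on this data
    return (baseline_mae - fancy_mae) / baseline_mae >= min_relative_gain


# Hypothetical toy data.
train_targets = [2.0, 4.0, 6.0]
test_targets = [3.0, 5.0, 4.0]
fancy_predictions = [3.2, 4.7, 4.4]  # stand-in for a complex model's output

baseline_mae = mean_absolute_error(
    test_targets, mean_baseline(train_targets, len(test_targets))
)
fancy_mae = mean_absolute_error(test_targets, fancy_predictions)
print(baseline_mae, fancy_mae, worth_the_complexity(fancy_mae, baseline_mae))
```

The threshold is a judgment call. The point of the sketch is only that the baseline number is reported next to the complex model's number, so the extra parameters have to earn their place.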

Jon Krohn:

For sure. So I’m sure there are a lot of listeners out there who are excited about the kinds of problems that you’re tackling. What kind of background would they need to have to be either working with you at Oxford University or at Exscientia?

Charlotte Deane:

I guess the background is very variable, it’s easier to talk about my research group. But it varies from people who study things like biochemistry and chemistry, study computer science, mathematics, engineering, statistics, physics, all across the board. And that’s partly because I work at the interface of all these areas. The skills that I want them to have as they develop through my research group in some sense is, actually to do this, you’d have to enjoy programming and get pretty good at it. You don’t have to be the world’s most brilliant programmer, but you’ve got to be happy on the computer and you’ve got to want to program a computer and play with that. And then the kind of things that I want them to be good at is really this excitement and interest in the problem because this is a hard problem.
Working on this, you will spend a lot of time doing things that don’t work. And so if you are not interested in it to a very large extent, it’s really quite depressing at that point. And being able to be independent about how you think about it.

So independent thinking is really important and I think there’s lots of different potential hard skills, as I said. It’s really useful if somebody understands tons of immunology. In fact, some people who join my group know more of that than I do because I’m not trained in it originally. And then I have others who come in who are really qualified, if you like, on the mathematical side and can really explain the algebra that’s underpinning a lot of these machine learning algorithms. So I think there are many different ways in, if you see what I mean.

Jon Krohn:

Yeah.

Charlotte Deane:

But primarily it’s about, for me, it’s you’ve got to want to work on a computer because that’s what we do and be comfortable to do that. And yeah, accept that lots of things we try won’t work.

Jon Krohn:

Yeah, I like that. You have to enjoy programming, failure and be an independent thinker. That’s a great answer. Everyone always says communication, so I’m glad you didn’t say that.

Charlotte Deane:

Well, it’s not like that’s a bad thing. I would agree-

Jon Krohn:

No, [inaudible 01:07:32]. But it’s nice to get some other answers.

Charlotte Deane:

I could say I want them to be somebody who’s good at working in a team, because that’s true, and some other things. But yeah, let’s be honest, failure’s quite an important thing to enjoy.

Jon Krohn:

Yeah, yeah, yeah, I like that. So as I mentioned earlier in the episode, we had so much engagement when I announced that you’d be coming on the show and I asked if people had questions for you. We had a lot of comments. Somebody named Kirsty Allen, I don’t know if you know her, she’s a program manager. She said that you’re not only incredibly intelligent, but you’re also funny, which I’ve been able to enjoy.

Charlotte Deane:

So Kirsty and I, I do know her, not spectacularly well, but we have met and worked together a bit in Oxford. So yeah, that’s quite amusing.

Jon Krohn:

And then, so Adam Seroka, who works on AI platforms in the energy sector, his comment was that you are literally the definition of a super data scientist, which is-

Charlotte Deane:

Now I’m very embarrassed.

Jon Krohn:

And then we did also have a question. So Mathias Baldino, he’s a business intelligence analyst at a company called Brain Technology. And he said, what an amazing guest. I wasn’t familiar with her work until now. I am very intrigued about AI in structural bioinformatics. And I know that this is maybe beyond Charlotte’s area of expertise, but can this kind of machine learning models, so structural bioinformatics models, can they help us understand the kind of life that can be formed on other worlds? So can we use structural bioinformatics to make predictions about aliens?

Charlotte Deane:

I guess my starting point is, I don’t know. There are people who have, for a long time, worked on the concepts of what could other forms of life look like? So one of the questions is whether you can have a form of life which isn’t carbon based? So there are many really good reasons why carbon-

Jon Krohn:

Silicon based, right? Would be the-

Charlotte Deane:

Yeah, silicon based life, yeah. Those kind of things.

Jon Krohn:

I watch Star Trek, I know.

Charlotte Deane:

And there are people who work on even understanding how we got to complex organisms. So did we start in an mRNA world rather than a protein world? So was mRNA actually the first thing and then mRNA created DNA and mRNA created proteins? Those kinds of things. I would guess, and this really is a guess because we’re quite a long way from what I know about. But obviously if we understood much better how our proteins worked, how they bound to one another, basically how we ourselves work in terms of human life forms and natural life on earth. I would suspect that would massively help you to understand what potential forms an alien life form could take. Because it would be a much better understanding of the space that if you like, human life sits within. But really, a long way away from anything I really know about, so I probably shouldn’t have made any comment at all.

Jon Krohn:

Nice. Well, let’s jump back to something that you know quite a bit about, which is cycling. I’m not kidding. So for many years your team at Oxford, your research lab has been organizing a cycling tour that you call Le Tour de Farce. So you enjoy cycling a lot and it seems like your researchers do too. How did this come about and do these kinds of social activities help with scientific discoveries?

Charlotte Deane:

So the answer to the second one is definitely yes. Because going back to that, I didn’t say communication, but being relaxed with the people you work with, able to talk to them about it, being unafraid of making mistakes in front of them and being able to discuss with them in a broad scientific sense is always going to improve your output in that sense. It’s really good to be able to talk to people in that way. The Tour de Farce, so a long time ago, my group have always known that I like cycling. If you are watching this on video, you will see that my Brompton is behind me, on the floor there. That is one of my six bikes. Just to be clear, I really like cycling and my research group knows this. And so a student of mine called JP asked if I would be prepared to work out a cycle route we could take.

Now Oxford has a lot of pubs, you’ll know this Jon, a cycle route we could do where we would cycle round Oxford and stop at pubs on the way. But that it would be short enough, and so it’s a bit like a pub crawl, except we do it on bikes and that is how the Tour de Farce was born. And it was meant to be this sort of one-off thing we did because it would be, I was like, yeah, I can work that out. So we did this, well, we didn’t even cycle that far, but it was kind of fun and it was on a beautiful summer’s evening in Oxford, the long summer evenings you get. And we stopped off at a few pubs, some people drank beer, other people drank water. It was quite a normal evening, had a really good time. But it very quickly became clear this was not going to be a one-off. This was going to be something that my group did every year. And yeah, it’s well certainly one of my, but I think the rest of the group also enjoy it, favorite events of the year.

Jon Krohn:

It sounds super fun, I wish I was there in Oxford to enjoy. I heard that the Eagle and Child closed or something like that. Is that…

Charlotte Deane:

So no, the Eagle and Child is still open, the Lamb and Flag, which was the pub opposite. So that was actually shut for a while, it is now reopened. So a group of people in Oxford opened it effectively, sort of like a community pub. So a load of locals now own the Lamb and Flag and it’s back open again. And they’ve redone all the wood paneling inside it, so it’s very beautiful.

Jon Krohn:

Nice, all right. I look forward to checking that out. Did you know Jonathan Flint who was at Oxford for a long time?

Charlotte Deane:

I don’t think I did.

Jon Krohn:

No? Well then the rest of the story isn’t going to be very interesting and so we’ll just skip that one. He was my doctoral supervisor and he is in episode number 547 of this podcast. But I had a story about him cycling in Oxford and it ending really poorly. But we’re running out of time and so I’ll just skip to asking you instead if you have a book recommendation for us.

Charlotte Deane:

So my book recommendation is a book that actually I picked up by accident in a bookshop because it had a title I liked. It’s a fiction book, it was called Lessons in Chemistry. And I don’t know if you’ve heard of it. But the book is set, I can’t remember exactly when, probably sort of 1950s, 1960s, that kind of time. And the concept of the book is that the central character is a lady who wishes basically to be allowed to do research. But given the timing, you can guess that there are many problems with this occurring. She is working in a research lab when we first meet her. And the book is very funny, this is something I should say.
But the amazing thing about it is I suspect that male or female, if your listeners are people like me, who really like science and thinking about things in the scientific way, they will really like the central character in this book.

Basically, if you are at all of a scientific bent, you are rooting for her from the beginning to the end of the book. Partly because, I don’t want to spoil it, but in one sense she sees the world in a very particular way, which means that she’s not seeing what’s happening in front of her. You know what’s happening, but you also feel like that’s the way that most scientists also see the world. So I felt like, yeah, I know what she [inaudible 01:15:20]. And the book is amazing, that’s the best way to describe it. And there have been other people, and I’ve discovered after I had picked it up and read it, that it was quite well known and lots of people had read it and seen it before.

Jon Krohn:

Nice. Well it is the first that I’ve heard of it and I think it’s the first time it’s been mentioned on the show. So that’s a really cool recommendation. Sounds like a lot of fun. Charlotte, this has been an amazing episode and it’s mind blowing. We’ve gotten right to the very end now and we haven’t had a single retake, which is pretty rare. This has just been one continuous flow of wonderfully interesting conversation, at least for me. And so I’m sure there are lots of listeners out there who would like to be able to follow your work after the program. Are there social media channels that they should follow you on? Or how could they keep up to the latest on your work?

Charlotte Deane:

The easiest way to follow me in an academic sense, which is probably the best way to see what I’m up to, is I’m not really on Twitter myself, but my research group is. My research group is called OPIG, Oxford Protein Informatics Group. But our Twitter handle, because obviously the group is OPIG, so they are the OPIGlets. So if you can find the OPIGlets on Twitter, that is us. And the other place to see stuff I’m doing, if you want, is on LinkedIn because that’s where I put some of the stuff. But mostly follow the OPIGlets and they’ll tell you what I’m up to.

Jon Krohn:

We’ll be sure to include the OPIGlets in the show notes. Thank you so much, Charlotte, again, for taking the time. It’s been such a great episode. And yeah, maybe in a couple of years we can check in with you again and see how things are coming along. I’m sure you’ll have plenty more to fill us in on that.

Charlotte Deane:

Thank you. It’s been great to speak.

Jon Krohn:

Holy crap, what an extraordinary episode. In it, Charlotte filled us in on how biologics are big drug molecules like antibodies that help our immune system recognize foreign material. How personalized cancer medication is quite slow today, but how data and machine learning are dramatically speeding biologic drug discovery and perhaps soon personalization. How AlphaFold 2 predicts protein shape well in general, but antibodies have especially tricky complementarity determining regions that aren’t yet automatically solved across the board by any computational approach. She also talked about how data analytics was central to the scientific response to the COVID pandemic, such as epidemiological studies of wastewater and how undervalued simple old correlation is as a data science technique.

As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Professor Deane’s social media profiles, as well as my own social media profiles at www.superdatascience.com/643, that’s www.superdatascience.com/643.
If you too would like to ask questions of future guests of the show like several audience members did during today’s episode, then consider following me on LinkedIn or Twitter as that’s where I post who upcoming guests are and ask you to provide your inquiries for them. Another way we can interact is coming up on March 1st. I’ll be hosting a virtual conference on natural language processing with large language models like BERT and the GPT series architectures. It’ll be interactive, practical, and it’ll feature some of the most influential scientists and instructors in the large natural language model space as speakers. It’ll be live in the O’Reilly platform, which many employers and universities provide free access to. Otherwise, you can grab a free trial. Hopefully catch you then.

All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you.
And thanks of course, to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another epic episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors whom I’ve hand selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Last but not least, thanks to you for listening all the way to the end of the show. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.
