Transcript of The Bioinformatics CRO Webinar Series: AI-First Drug Discovery Pipelines
Disclaimer: Transcripts may contain errors.
Grant Belgard: Welcome to the inaugural seminar in The Bioinformatics CRO webinar series. At The Bioinformatics CRO, we help life science teams turn complex data into clear, decision-ready insights, providing flexible, expert bioinformatics support from study design through analysis and reporting.
With that mission in mind, we’re launching The Bioinformatics CRO Webinar Series, a practical forum for sharing tools, workflows, and real world lessons from the front lines of modern bioinformatics. Let’s kick off our first session and welcome Nick Wisniewski.
Nick is an expert in applying artificial intelligence to the life sciences. He earned his PhD in biophysics from UCLA, where he later served as a faculty member developing machine learning methods for imaging and multiomics data. In 2016, he joined the founding team at Verge Genomics, pioneers in AI-driven drug discovery, and has since helped launch more than four other biotech startups spanning diagnostics, a smart pill, and a cell therapy. More recently, he served as vice president of bioinformatics and data science at Stemson Therapeutics in San Diego. In this live webinar, Nick discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, shortening timelines from concept to clinic. Feel free to put your questions in the chat, as we’ll have a Q&A afterwards.
And Nick, over to you.
Nick Wisniewski: Thanks a lot, Grant. I’m really excited to be on this inaugural webcast and looking forward to the rest of the series. As always, I’m a big fan of The Bioinformatics CRO, so I’m very happy to support this development and look forward to more in the future. The talk I’m going to deliver today, as Grant mentioned, is on AI-first drug discovery pipelines. I think there’s a lot of movement happening in the space, clearly a lot of excitement and discussion amongst investors and techbio people, and so many new algorithms and methods coming out every day that it’s hard to keep up.
So the purpose of this talk is to kind of give an overview of all of that development that’s happening as well as an insight into how these machines are learning and where this is all going into the future.
So to start with, I think it’s good to point out that the current state of drug discovery is one in which less than one in 10 drugs succeed. The cost of developing a drug can be up to $2 billion, given the failures that occur and the portfolio approach that’s taken, and that money gets spent over a period of up to a decade.
So the impact that AI can have on the drug discovery process can happen in multiple ways. One is an increase in the accuracy, or the success rate, of the drugs, and the second is in the cycle time: if you can test more drugs faster, you can overcome some of this challenge. To understand where we’re at and what the impact of AI is, I think it’s important to start with a review of the traditional drug discovery pipeline as we know it.
It’s largely discussed as a waterfall-type process, where you have a left-to-right movement through a number of different phases: starting with target identification and target validation, compound screening to identify hits, then hit-to-lead, getting to the lead and optimizing it, and then all sorts of preclinical testing to understand toxicity and other things of that nature.
And then it goes into the clinical development stage, where you have the phase one, two, and three trials. And you know the loss rate as you go through: the chance of success starting out very early, when you’re still validating a target, can be such that only 3% of molecules are going to be successful in the clinic. So that’s quite a low rate.
If AI can improve that to 5%, it would make a huge difference. We don’t need to get to 100%, although that would be even better. But I think the other important part of this pipeline is that embedded into it are a number of design-make-test-analyze (DMTA) cycles. We often think of these in terms of synthesizing molecules, but they form the standard feedback loop in the molecular optimization process. It mainly happens between hit discovery and lead optimization, with each iteration lasting maybe weeks to months, and to get to an optimized outcome you might need three to 10 iterations. So that can really be a bottleneck that AI can address in the drug discovery process.
There are a number of traditional computational tools that are being used and have been being used for quite some time.
So in the initial stage of target identification, we ask the question: which protein, which gene, what is the target? A lot of the early tools you’ll recognize, things like Ingenuity Pathway Analysis and WGCNA, matured mainly in the mid-2000s and have been used with a fair degree of success since then. Once you get more into the drug development stage, you already have a target and now you’re trying to design molecules to hit that target.
This is where the rest of the tools come in.
So things like virtual screening and docking have also been concepts that have been around for quite some time. This is the question of which molecules are going to bind to that target. Tools like AutoDock and Schrodinger’s Glide emerged starting in the 1980s, growing more popular towards the late 90s. And then another question is which molecules are bioactive? Maybe you can bind but you can’t get any sort of activity out of it.
This is where quantitative structure-activity relationship (QSAR) models come in, which go back to the 1960s. They’re largely just regression models, and they’ve been updated over time to integrate machine learning methods, but they’re a mainstay of the process. Alongside those, there’s pharmacophore modeling and shape matching, which is about understanding the geometry of the molecule that’s required for bioactivity; 3D shape matching and distance metrics between molecules are all quite useful and allow us to filter candidates, going back to the 90s again. More recently there’s been a lot of movement in molecular dynamics and free energy calculations. These are more physics based, trying to understand how energetically or thermodynamically favorable the ligand-target complex is, and they are simulation techniques that matured maybe 15 years ago or so for understanding the stability of these bindings. And then, quite importantly, once you can simulate a lot of these things and think you’ve identified a molecule, what is extremely important is the properties of that molecule once it enters a body: will the molecule be absorbed and safe, will it cross the blood-brain barrier? These are known as ADMET properties, covering absorption, distribution, metabolism, excretion, and toxicity.
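To make the classic QSAR idea concrete, here is a minimal sketch of a descriptor-plus-regression model of the kind described above, assuming RDKit and scikit-learn are available; the SMILES strings and activity values are placeholders rather than real assay data.

```python
# Minimal QSAR-style sketch: predict bioactivity from simple RDKit
# descriptors with a regression model. All molecules and activity
# values here are placeholders, not real assay data.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles):
    """Compute a small descriptor vector for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # lipophilicity
        Descriptors.TPSA(mol),           # topological polar surface area
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]

train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
train_activity = [5.2, 6.1, 4.8, 5.5]    # e.g. pIC50 values (made up)

X = [featurize(s) for s in train_smiles]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, train_activity)

# Predict activity for an untested candidate
print(model.predict([featurize("c1ccccc1CCN")]))
```

Modern variants swap the hand-picked descriptors for learned fingerprints or graph embeddings, but the train-on-measured, predict-on-untested pattern is the same.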
Now I’d say the first stage of the development of AI has been to start modularly replacing each of those different phases with, let’s say, deep learning components. Maybe the one we haven’t talked about so far is imaging, which is very useful in the target ID step, where you can do more phenotypic-level understanding and protein localization within cells, and I think those models have been very powerful and very influential. In terms of target ID, we’ve now got the Geneformer class of inference algorithms. And in terms of protein structure,
AlphaFold has broken that field wide open. And then the rest of them often come with straightforward replacements. So DiffDock is now a big replacement in molecular docking. Maybe newer ones have to do with de novo molecule generation, like MegaMolBART, which is now integrated, I think, in Nvidia’s BioNeMo. We have some deep learning tox predictions and retrosynthesis planning, which is good to help you find the easiest path to synthesize a molecule. But maybe some of the more exciting ways to think about things have to do with experimental planning, and I’m going to talk a little bit more about that in a few slides.
But things that aren’t represented here are, I would say, the latest developments. One came out just yesterday: Claude for Life Sciences. And this is, I think, very exciting. If you’re a bioinformatician or a programmer, you’ve likely been using Claude for a while now for programming and tasks of that nature.
So extending that now into integrations with common lab tools like Benchling, and partnering with institutions like the Broad Institute and 10X Genomics to help facilitate access to data in those platforms and to the algorithms in those pipelines, as well as PubMed to really facilitate searching the literature and getting back good intelligence on targets you find and on drugs that you design, that’s going to be highly influential. It promises right now to be able to analyze single-cell RNA sequencing data, which is going to be great for democratizing access to that data source. And really interestingly, it’s promising to help prepare regulatory documents, which may be one of the biggest bottlenecks in the real world in putting together a pipeline and accelerating it. This stuff takes a lot of time.
Similar developments are coming out of partnerships with Nvidia, more and more every day, again with Benchling maybe leading the way. Benchling launched Benchling AI recently, and as part of that Nvidia is integrating its NIM microservices into Benchling. So this offers access to all the optimized GPU implementations of things like OpenFold2 for protein structure prediction, and I think the other models, like the ADMET models, are coming shortly. So that’s also very exciting.
But to return to some of the other bottlenecks that are being addressed by AI, let’s go back to the DMTA cycles. Going from design to make is the first half of the cycle, where you present a chemist with a bunch of designs and they’re tasked with making those molecules, and it may not be immediately obvious how to make them. You may have some information on how it’s done, but along the way, what you learn is the feasibility of synthesis and any constraints that might exist for future design choices.
In learning that, you can already take the first step into thinking: well, what would I do with a machine learning model like that retrosynthesis model? You can update it based on your learnings from that step and retrain your generative design models with those new constraints. And likewise, when you go to test these molecules in a series of biological assays, ADMET profiling, and so on, you learn a lot about the potency, selectivity, toxicity, off-target effects, everything you can measure about these drugs. You may have had predictions for all of that from the QSAR, docking, pharmacophore, and ADMET models, but now you have new data, and you can go back and update all of those models in real time to improve the predictions on the next iteration of the cycle.
So you start to see how a more continuous learning framework can arise from the existing cycles in the drug discovery pipeline. And this starts to hint at the next transition that’s coming in the field. We currently think of AI as more of a tool, which we’ll call augmented AI, where you have modular assistance for each of these different steps of the pipeline, and it’s still being controlled by humans, informing humans, empowering humans.
The next step that things are moving towards is this AI-first regime, where you have some sort of orchestrated autonomous learning cycle. Here the AI acts as the central control architecture, orchestrating not just the DMTA loop; you can extend that into different feedback loops and start thinking about these closed-loop continuous learning cycles, combined with automation: wet lab automation, bioinformatics automation, and everything you need to be self-contained. This is, I think, the crux of the idea that we hear a lot in terms of lab-in-the-loop.
This is a concept being popularized across a number of different institutions — at Genentech, Aviv Regev has put together a team that is exploring a lot of lab-in-the-loop operations. Nvidia is highly supporting lab-in-the-loop architectures and this is kind of the main goal of getting to a continuous learning closed loop architecture where the AI proposes novel molecules. It synthesizes and tests them automatically, gets the assay results and feeds them back to update the model for subsequent iterations.
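As a rough illustration of what that closed loop looks like in code, here is a hedged sketch in Python: a property model proposes candidates, a stubbed-out "lab" stands in for synthesis and assays, and the results are folded back into the model before the next round. All function names, features, and numbers are illustrative assumptions, not any particular platform’s API.

```python
# Toy lab-in-the-loop sketch: propose -> make/test -> feed back -> retrain.
# The "assay" is a random stand-in for an automated wet lab, and each
# "molecule" is just a small feature vector, purely for illustration.
import random
from sklearn.ensemble import RandomForestRegressor

def propose_candidates(model, pool, n=8):
    """Design step: rank the untested pool by predicted potency, take top n."""
    return sorted(pool, key=lambda x: model.predict([x])[0], reverse=True)[:n]

def synthesize_and_assay(candidates):
    """Make/test step (stub): pretend to measure potency in the lab."""
    return [sum(c) + random.gauss(0, 0.3) for c in candidates]

random.seed(0)
X = [[random.random() for _ in range(4)] for _ in range(20)]      # measured so far
y = [sum(x) + random.gauss(0, 0.2) for x in X]
pool = [[random.random() for _ in range(4)] for _ in range(200)]  # untested

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for cycle in range(5):                      # closed-loop iterations
    picks = propose_candidates(model, pool)
    results = synthesize_and_assay(picks)   # new assay data
    X += picks
    y += results
    model.fit(X, y)                         # analyze: update before next design
    pool = [p for p in pool if p not in picks]
    print(f"cycle {cycle}: best measured potency so far = {max(y):.2f}")
```

In a real system the design step would be a generative model and the make/test step would be the automated lab and assay pipeline, but the control flow of the loop is the same.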
I think as we’re moving into this regime, it’s important to understand some of the key machine learning paradigms. I’m going to talk a bit more about those in the next slides, but I’ll introduce them here: you may have heard of things like active learning, Bayesian optimization, reinforcement learning, and so forth. And then the third component, which I mentioned earlier, is the automation component.
So right now there’s a range; you don’t necessarily need automation in order to build these loops. You can have a human in the loop doing the experiments.
But of course the hope is that by having automated experimentation you reduce some of the variability in the experiments and increase the reproducibility, as well as the speed at which you can experiment. You can run all day and night and highly parallelize things, so it’s going to scale a lot better.
So thinking about how we’re making this transition I think organizational principles are one of the big bottlenecks. There’s a big issue that we all face with adopting new technology in terms of understanding how it works and deciding to what extent we can trust the decisions that it’s making.
As we work as programmers with these AI tools like Claude, Cursor, and Codex, we see and get immediate feedback on how well they solve problems: how many iterations and corrections we need in order to keep them on track and doing what they need to do. And we can gain some sense of how much we can trust the decision-making that’s happening.
It’s a little bit harder in drug discovery, primarily because the cycles are so long. Benchmarking these tools is very difficult if it takes five years to get something up and running, a molecule created, and then get it through trials before you know that it works. It’s a very long feedback cycle, and it can take quite a while to develop that kind of trust.
Moreover, we’re handing more and more decision-making over to the AI, where traditionally humans, maybe director-level people, are making these decisions, and that introduces some accountability questions and other organizational problems. So I think one of the most important things we can do to help facilitate trust in the decision-making is to understand at a basic level how these decisions are being made. If we’re going to let AI determine what experiments to do next and where to allocate those resources, it probably helps to understand a little bit about how it’s making those decisions.
So again, I’ll briefly introduce the concepts of active learning, Bayesian optimization, and reinforcement learning as the three main techniques you see in these sorts of systems right now. Active learning is one in which the AI understands a bit about what it doesn’t know, or what it’s most uncertain about, and then it targets experiments in order to learn the most it can from the next set of experiments. This is a fairly straightforward concept; I think scientists think in much the same way.
Bayesian optimization is maybe more product focused. You’re trying to optimize some property of a molecule and you kind of have to navigate a search space, do some hill climbing on a landscape that you’re inferring while you’re climbing it. And this is a method that’s used in order to find kind of the most potent drug out of a large set of molecules without having to test them all.
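Here is a small sketch of that idea, assuming scikit-learn and NumPy: a Gaussian process is fit to the molecules measured so far, and an upper-confidence-bound acquisition picks the next one to assay from a discrete candidate pool. The true_potency function is a toy stand-in for the wet-lab measurement, and the descriptors are random placeholders.

```python
# Bayesian optimization sketch over a discrete candidate pool.
# The hidden "assay" below is a toy function standing in for the lab.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
pool = rng.uniform(0, 1, size=(300, 5))       # toy molecular descriptors

def true_potency(x):                          # placeholder wet-lab assay
    return -np.sum((x - 0.3) ** 2)

measured = list(rng.choice(len(pool), size=5, replace=False))
X = pool[measured]
y = np.array([true_potency(x) for x in X])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
for step in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(pool, return_std=True)
    ucb = mu + 1.5 * sigma                    # surrogate mean + exploration bonus
    ucb[measured] = -np.inf                   # don't re-test what we've measured
    nxt = int(np.argmax(ucb))                 # next molecule to send to the lab
    measured.append(nxt)
    X = np.vstack([X, pool[nxt]])
    y = np.append(y, true_potency(pool[nxt]))

print("best potency found:", y.max())
```

The main design choice is the acquisition function, which trades off exploring uncertain regions of the landscape against exploiting molecules already predicted to be potent.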
And then reinforcement learning, something we read a lot about these days, particularly with LLMs, is a method of learning that’s really finding trajectories through that space. It’s trying to optimize a sequence of decisions, or what’s called a policy, in order to optimize long-term gains. And that’s very computationally expensive, maybe not as efficient. But it has some strengths over the previous two, particularly Bayesian optimization, in terms of parallelization capabilities and the ability to explore the space in a more efficient way, while also having some drawbacks in terms of difficulty learning in sparse settings and so forth.
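To show the policy idea in its simplest form, here is a tiny REINFORCE-style sketch, purely illustrative: a softmax policy over a handful of discrete "fragment" choices is nudged toward the choices that earn higher toy rewards. Real molecular RL systems operate over long sequences of such decisions, but the update rule is the same flavor.

```python
# Minimal policy-gradient (REINFORCE-style) sketch on a toy problem:
# learn which of 4 "fragment" choices yields the highest noisy reward.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                              # parameters of a softmax policy
true_reward = np.array([0.1, 0.2, 0.9, 0.4])      # made-up reward per choice
lr = 0.5

for step in range(300):
    probs = np.exp(logits) / np.exp(logits).sum()     # current policy
    action = rng.choice(4, p=probs)                   # sample a decision
    reward = true_reward[action] + rng.normal(0, 0.05)  # noisy feedback
    grad_logp = -probs
    grad_logp[action] += 1.0                          # grad of log pi(action)
    logits += lr * reward * grad_logp                 # policy-gradient update

print("learned policy:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))
```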
So I’ll just show a graphic example of active learning; I’ve got the other two, but for the sake of time we’ll skip over them. Imagine you’re trying to do a classification task where you’ve got the red team on the right as class one and the blue team on the left as class zero, whatever you want to call them, toxic versus non-toxic, and then you’ve got a bunch of unmeasured molecules.
So each dot is a molecule here. The ones in white are ones that we don’t have any data on yet. The ones in orange are also ones we don’t have data on yet. But in fitting the boundary between red and blue, we find there’s a bunch of unmeasured molecules along that boundary, and we color these orange to point out that these are maybe the most uncertain in the whole model.
From the model’s point of view, learning what these are would likely have the most impact on what that decision boundary is. And so you go forth and you test those with the idea that you’re really trying to choose the next experiment in such a way that it can have the maximal impact on your prior beliefs about what that boundary should be. It maximizes your information gain.
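In code, that uncertainty-sampling step can be as simple as the following sketch, with synthetic two-dimensional data standing in for the red/blue toxic versus non-toxic molecules; scikit-learn is assumed, and all numbers are placeholders.

```python
# Active learning by uncertainty sampling: fit a classifier on the
# labeled molecules, then request labels for the unlabeled points the
# model is least sure about (the "orange" points near the boundary).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
offsets = np.array([[1.5, 0.0]] * 20 + [[-1.5, 0.0]] * 20)
X_labeled = rng.normal(size=(40, 2)) + offsets      # measured molecules
y_labeled = np.array([1] * 20 + [0] * 20)           # red = 1, blue = 0
X_unlabeled = rng.normal(scale=1.5, size=(500, 2))  # untested molecules

clf = LogisticRegression().fit(X_labeled, y_labeled)
p_red = clf.predict_proba(X_unlabeled)[:, 1]
uncertainty = 1.0 - 2.0 * np.abs(p_red - 0.5)       # 1 at the boundary, 0 far away
next_batch = np.argsort(uncertainty)[-10:]          # molecules to assay next
print("indices of the most informative molecules:", next_batch)
```

Labeling those points, refitting, and repeating is the loop; each round typically moves the decision boundary more than a randomly chosen batch of the same size would.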
And so I think that may help a little bit in understanding how these lab-in-the-loop systems work. As a result, I think we’re going to start seeing that waterfall topology of the standard drug development pipeline change a little bit. It’s going to become possible to flatten and collapse different stages into each other, where they all share a group of objective functions that you can optimize simultaneously, which can dramatically shorten certain stages of the pipeline.
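One way to picture such a shared objective is a simple composite score over the different models’ predictions; the weights and numbers below are arbitrary assumptions, just to show the shape of the idea rather than any published scoring scheme.

```python
# Illustrative multi-objective score: reward predicted potency while
# penalizing predicted toxicity risk and synthetic-accessibility cost.
def composite_score(pred_potency, pred_tox_prob, pred_sa_cost,
                    w_potency=1.0, w_tox=2.0, w_sa=0.3):
    """Higher is better; the weights encode how the objectives trade off."""
    return w_potency * pred_potency - w_tox * pred_tox_prob - w_sa * pred_sa_cost

# Three candidates with (potency, toxicity probability, SA cost) predictions
candidates = {
    "mol_A": (6.2, 0.10, 2.1),
    "mol_B": (7.0, 0.45, 3.5),
    "mol_C": (5.8, 0.05, 1.8),
}
ranked = sorted(candidates, key=lambda m: composite_score(*candidates[m]),
                reverse=True)
print(ranked)   # the order a collapsed pipeline would prioritize
```

In practice each of those inputs would come from the QSAR, ADMET, and retrosynthesis models discussed earlier, and the scalarized score is often replaced by a Pareto front, but the point is that several stages can share one optimization target.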
At the same time, we can also merge and parallelize certain loops. So you can run all sorts of different DMTA loops at the same time, integrating all of that feedback into what’s called a continuous feedback mesh, where you have a bunch of models all conditionally dependent on each other, all being updated whenever new data comes in, and concurrently influencing each other’s predictions for the next cycle. And one of the most important changes in this process is the shift of the human role into that of a supervisor.
So as humans shift from being the decision makers and the gatekeepers to being the supervisors, they’re going to start overseeing these autonomous loops, monitoring them to make sure things are on track, and only intervening strategically while the AI handles the rest of the routine iteration and optimization.
We’re starting to see real-world examples and commercial platforms. So, if you want to build a startup and design one of these systems yourself, of course, you’re free to do so. Many of those tools are open source, but putting them together can require a lot of effort and a lot of engineering. There are commercial platforms that can be licensed, and there are places that are using these that we can use to benchmark success rates. So I think Insilico Medicine is maybe at the forefront of this. They have a fully automated robotic lab and 31 active programs, and they claim to achieve concept to phase one in under two and a half years. And there’s a platform, Pharma.ai, that you can license from them, or you can form partnerships. Similarly with Iktos, another big player in the field, which has a similar licensable software-as-a-service offering. Recursion and Exscientia are big players in the field that everybody’s watching to understand how the progress is going and whether we can actually speed things up.
Isomorphic Labs of course is deep in this, and there are all the commercial platforms and tools, from Schrodinger to the Nvidia and AWS tools and the Hugging Face models. I’ll even note new lab-automation-as-a-service offerings like Strateos, which provides cloud labs where you can control the automation.
So to conclude with a future outlook, I think we’re moving towards this period where autonomous AI scientists are going to start leading a lot of the process. They’re going to be able to design, synthesize, and test molecules in these closed-loop cycles. And this is going to improve data quality and integration with every step. The data sets are going to become more unbiased and more accessible. And I think again the key component here is human trust and collaboration, which is definitely going to take some time to develop. And I think that may be the most interesting part of the path forward that we’re going to experience in the coming years.
So with that I’ll conclude the talk and bring it back to Grant and The Bioinformatics CRO. Again, I’m happy to kick off this inaugural webinar. There are going to be three more following me over the coming months, and I’ll turn it back over to Grant to introduce those speakers and tell us what’s coming.
Grant Belgard: Nick, thank you so much. Yeah, so our next webinar will be broadcast at the same time, 11 am Eastern, on Tuesday, November 11th. We’ll be joined by Ania Wilczynska, senior director of Bioinformatics and AI at Bit.bio. So hope to see all of you there. But Nick, on to questions. Everyone watching, you can put questions in the chat, and we can kick off with: what do you think of the new foundation models for single cell analysis? Are they having an impact on drug discovery?
Nick Wisniewski: Yeah, they’re very interesting. I was very excited when I saw Geneformer and scGPT come out, and I think there’s been a lot of adoption of these at new startups. This is a big part of the new phase of AI target discovery.
So I think the things that they bring to the table that are fantastic are moving things into the transfer learning paradigm, where you can bring in a whole bunch of knowledge and do zero-shot predictions on your data without having to train on or learn from external data sets yourself; that’s already been done for you.
It also gives you a good way of doing representation learning, so it gives you a bit of a new representation by which to learn things. I think the benchmarking of these things hasn’t shown much more than maybe moderate increases in performance in cases like predicting drug perturbations. I think the benchmarks are still showing no clear improvement over linear methods, which is a bit surprising, and I think it’s important to look at that and wonder whether or not that’s telling us something about the data, about the algorithms, and about biology. I think there’s something to learn there.
My guess is we’re probably coarse-graining somewhere, whether that’s in the molecule set that we’re using. There have been some recent studies showing that maybe you need to know the phosphorylation state of every intermediate molecule, what that chemical mess actually happening in the cell looks like, and that by just measuring broad activity you may be coarse-graining too much. The other possibility is that we’re coarse-graining in time, and there are dynamics that need to be learned that aren’t being captured by our snapshots. These are very often focused on steady state.
Grant Belgard: Speaking of recent developments, what do you think the impact of Claude for Life Sciences will be in drug discovery?
Nick Wisniewski: Yeah. I spent a lot of time looking at it yesterday after I saw the launch. I don’t know if you’ve had much time to explore it.
Grant Belgard: Few minutes.
Nick Wisniewski: Yeah. I mean, the integrations it’s made, I think, are fantastic.
You know, we tend to think, particularly in bioinformatics, in terms of some of the scientific questions, these foundation models and things like that. But when you actually work in the pipeline and in the lab, you notice the overhead in terms of connecting different systems, particularly ELNs like Benchling, and the inconsistent metadata that you might find across experiments. Access to data is a real bottleneck for bioinformaticians: getting the data, synthesizing it, harmonizing it, and moving forward.
So I think it has the capability to really have a huge impact on the way that bioinformaticians work, as well as biologists, because it gives them access to a lot of this data. There are probably other questions having to do with reproducibility that come out of these tools. Every time in the past when we’ve seen easier access to tools, whether it was button-click testing of p-values or being able to throw models at everything, you saw an increase in p-hacking and a loss of reproducibility and things like that. So it’s going to be very interesting to see the impact Claude has on actual science.
My impression of that largely comes from experience: working with Claude when you’re programming, you get a lot of “you are absolutely right!” And I can say, most of the time, I’m not absolutely right. In my 20-year career working in bioinformatics and biology, I don’t think I’ve ever really said those words aloud in the practice of doing biology. A lot of biology comes from pushing back, creating counter-scenarios, and debunking ideas, rather than narrative-driven science. So we’re going to see, I think, how Claude navigates that space and whether it makes a positive contribution in that sense.
Grant Belgard: Yeah, that’s a really good point. You can certainly imagine someone running the same query a few times until they get their favorite gene showing up in a list and running with that, right? So we have a question from the chat. What do you think the timelines are on the transition to AI scientists?
Nick Wisniewski: This is a great question. So you know, of course, predictions are always hard, especially when they’re about the future. And these timelines are, of course, maybe the most contentious part of the AI field because there’s so much hype around them and the fundraising that goes into things. There’s a number of different influences on the timelines that I think go beyond just the development of the tools.
The adoption of these tools is going to be slower than they can be developed, particularly at large institutions, which use most of the resources in the field. For good reason, big pharma is going to be slower to adopt these systems than the startups.
So I think we’re going to see more development happening in the startup space than in the big players, probably with a continued pattern of acquisitions whenever somebody’s successful. Given the current funding environment for startups, factoring that in, there may be a delay, and that delay, particularly in the States, may be overshadowed elsewhere. I read a lot these days about how far ahead in automation places like China are in terms of biotech research, so I wouldn’t be surprised if we start seeing the first successful closed-loop continuous learning labs emerging from somewhere like there rather than San Francisco.
But in terms of guessing an overall timeline, given those factors, and given the need for further development in automation robotics and in manufacturing them so that we can get them into labs cheaply here, I think we’re still looking at 5 to 10 years before we get to these systems, even though the capabilities to do this may come a lot sooner.
Grant Belgard: And what’s the one misconception about AI first pipelines you’d like to correct before we wrap?
Nick Wisniewski: Yeah, that’s a great question. I think again the idea that they may be a magic bullet. I think there’s a lot of hope that it’s going to improve reproducibility, reduce variation, and accelerate the speed of research.
But also, given that we’re seeing only modest improvements in performance over linear models and things like that, it still depends on having the right set of molecules and knowing whether you need dynamic, real-time data as opposed to the snapshot data we’ve been using. So I think if we institute it right now, with the same tools we’ve been using, it may fall flat in terms of delivering on its promises, and I think we also need to incorporate the ability to question whether or not the problem being posed to it is well posed. From a scientist’s point of view, this is often the ground floor when you approach a problem: asking, is the problem well posed?
And until we build that base-level intuition into these systems, it’s easy to start optimizing, or over-optimizing, something that shouldn’t be optimized in the first place. I think that’s a really good cautionary note.
Grant Belgard: Well, Nick, thank you so much for joining us and all our viewers, thank you for joining. We’ll see you November 11th. Bye-bye.