The BCRO Webinar

The Bioinformatics CRO Webinar Series

January 21, 2026: James Opzoomer – Biophysics-Informed Spatial Transcriptomics Approaches to Identify Cytokines Causally Driving Downstream Gene Programs

James Opzoomer is a Senior Scientist in the Innovation Lab at Relation, where he develops single-cell and spatial genomics platforms to accelerate drug discovery. His projects span high-throughput multimodal single-cell sequencing and spatial transcriptomics technology development, generating ML-ready datasets that power novel therapeutic insights.

In this live webinar, he discussed BISTR (biophysics-informed spatial transcriptomics regression) as a computational toolbox for building biologically plausible predictive models from spatial transcriptomics by combining RNA dynamics as a readout of changing gene programs, and paracrine cytokine diffusion as a physically constrained model of cell–cell communication. By linking inferred cytokine secretion, a spatial propagation diffusion model, and receptor-associated changes in mRNA maturation, BISTR aims to suggest cell-type-specific, testable causal relationships between extracellular signals and downstream transcriptional responses.

Transcript of The Bioinformatics CRO Webinar Series – Biophysics-Informed Spatial Transcriptomics Approaches to Identify Cytokines Causally Driving Downstream Gene Programs

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the next talk in The Bioinformatics CRO webinar miniseries. At The Bioinformatics CRO, we help life science teams turn complex data into clear, decision-ready insights, providing flexible expert bioinformatics support from study design through analysis and reporting. As part of that mission, our webinar series features practitioner-focused talks with concrete takeaways you can put to work right away. Today's talk is by James Opzoomer. James is a Senior Scientist in the Innovation Lab at Relation, where he develops single-cell and spatial genomics platforms to accelerate drug discovery. His projects span high-throughput multimodal single-cell sequencing and spatial transcriptomics technology development, generating ML-ready datasets that power novel therapeutic insights. Today James will be presenting on biophysics-informed spatial transcriptomics approaches to identify cytokines causally driving downstream gene programs. After the talk, we'll host a live Q&A session. This is streaming both to YouTube and LinkedIn, and on either platform you can put your questions in the chat or the comments at any point during the talk and we'll bring them into our discussion afterwards. James, over to you.

James Opzoomer: Thank you and hello. I'm delighted to be speaking at this BCRO webinar, and I'd like to thank Grant and the BCRO team for inviting me to speak with you today about Relation and some of our spatial transcriptomics work within the Innovation team. I'm going to start today by giving you an overview of Relation and our approach to data generation, and then I'll dive into a novel spatial transcriptomics analysis method that we're developing called BISTR and provide a worked example at the end. So, first I'd like to start with a question. What are some of the main challenges with the current model of drug development? And why is now a uniquely good moment to deploy large-scale patient genomics to solve this problem?

James Opzoomer: Shown here are four major trends that define the future of drug development and healthcare. On the left we have two negative trends. First, the cost of drugs is ever increasing: we spend more money on healthcare, but we don't see commensurate increases in life expectancy. This is also demonstrated by the ratio of healthcare spend to life expectancy on the left. The two good trends on the right are that the cost of sequencing is drastically decreasing (you can now do whole genome sequencing for about $100) and that the cost of compute, driven by titans like Nvidia, has made it more accessible than ever before. So really the problem that Relation is trying to deal with is the first one: decreasing the cost of drugs. And what we want to ask is, can we use the two trends on the right to solve those on the left?

James Opzoomer: Now this slide represents a simplified overview of the drug development funnel, which I'm sure you're all well aware of. On the left we start with maybe 20 programs, 20 ideas for new medicines, we invest on the order of 1 to 3 billion dollars across this funnel, and after all that work we typically end up with just one marketed drug. Most of the attrition here is because we were wrong about the biology. So although every stage in the funnel is important, the decisions we make right at the beginning in target discovery echo all the way through this funnel to the clinic, where failure is acutely expensive. And that's why we believe target discovery is really the most important problem in drug discovery.

James Opzoomer: So at Relation our ambition is to transform target discovery into an engineering discipline. This means building systematic, repeatable processes powered by large-scale patient data and ML models.

James Opzoomer: So the funnel that I previously showed you is another representation of the statistic on the top left: over 90% of drugs that enter clinical trials ultimately fail. So how do we transform R&D so that this number looks very different in the future? Over the last few years several large analyses have given us an important clue, and on the right there are two examples. The first is a recent Nature paper from Matt Nelson's group that looked across many clinical programs. They show that when a drug target is supported by human genetic evidence, the probability of success in the clinic is increased compared to targets without that evidence. In other words, genetics gives us causal anchors in human biology. The second work shows that single cell RNA sequencing of human tissue sharpens that picture: by knowing which cells in which tissues express a genetically supported target, we can better predict efficacy.

James Opzoomer: So how do these approaches fall into historical data collection strategies? On the left we have large-N, low-value, high-dimensional observational data. These are things like the Human Cell Atlas and large biobank cohorts. There's a lot of it, but it's noisy, confounded, and often only weakly connected to the clear interventions that we want to make in drug discovery. On the right we have small-N, high-value but low-dimensional interventional data: mechanistic experiments in model systems, but in small numbers, with few readouts and few perturbations. What we actually need for AI-driven target discovery is bespoke multimodal perturbation data that links interventions to rich molecular and cellular readouts across diverse biological systems related to primary patient material. That missing data layer is what enables us to train models that actually learn the consequences of perturbing a target in a specific cell type and tissue.

James Opzoomer: Overall, we believe that current models and data in the public domain are nowhere near sufficient to deliver meaningful impact in target discovery. We therefore have to build the right data and the right models, applied where they make the most sense.

James Opzoomer: So now that I've talked about why we care so much about genetics and single cell data, I wanted to give a quick overview of how Relation is actually set up to do this in practice. This slide represents a high-level map of our platform. On the left you see human tissue profiling. This is where we generate deep multimodal data directly from patient samples: whole genome sequencing, single cell transcriptomics, spatial transcriptomics, and proteomics. All of this is connected to the cellular modeling teams, who run perturbation experiments on patient-derived primary cell systems to generate bespoke data for the models, and this connects to translational pharmacology, who take the prioritized drug targets and turn them into drug discovery programs. This is all connected to both data science and our three main machine learning platforms: ROSALIND, which identifies genetically validated drug targets; ADA, which focuses on reversibility; and TURING, which provides drug discovery context for our targets. I'm not going to go into these platforms in detail today, because I really want to focus on the spatial genomics data that we generate in human tissue profiling and some of the new analysis methods that we're developing to better use our spatial transcriptomics data in drug discovery.

James Opzoomer: As an example of the type of primary patient data that we collect, I just wanted to show a case study of Osteomics. This is our flagship observational clinical study focused on osteoporosis and bone disease. In this study we partner with orthopedic surgeons across London to collect human bone waste from key surgeries: total joint replacements, which are elective surgeries associated with osteoarthritis, and hemiarthroplasty, non-elective surgeries resulting from osteoporotic fracture, really the end stage of osteoporosis.

James Opzoomer: So from each patient we build a genuinely multimodal data set. That's whole genome sequencing to identify variants and genes that causally influence bone density, fracture risk, and response to therapy, and this feeds into our genetic discovery platform, ROSALIND. We also generate single nucleus RNA-seq of bone and joint tissue to map those genetically supported targets into specific bone, stromal, and immune cell types and states within the tissue, and this sharpens our view of where these targets are expressed. We also collect blood-based proteomics to find circulating biomarkers that report on pathway activity and can later be used for patient stratification. And in addition to this, rich clinical metadata, including BMD, or bone mineral density, to anchor everything back to quantitative phenotypes. This lets our models learn how genetics and cell state translate into real clinical outcomes.

James Opzoomer: In addition to the single cell RNA-seq, we generate spatial transcriptomics data with the Xenium and VisiumHD platforms on human bone and other tissues associated with the other disease programs we're working on. And this is really important because single cell data tells us what cell types and states are present within the tissue, but we lose where they sit in the tissue and how they interact and communicate with other cells within the spatial context.

James Opzoomer: So together these genomics and single cell data sets give us a dense, patient-centric view of disease biology, and in particular we in the Innovation Lab are interested in how we can utilize this spatial transcriptomics data to disentangle the causal microenvironmental signals: the cell communication pathways that drive cell state and cellular response to the microenvironment. And this has led us to develop a new analysis method called BISTR, or biophysics-informed spatial transcriptomics regression, that I'd like to share with you today.

James Opzoomer: So spatial technologies are key for preserving the in situ cellular context present in tissues, providing a contextual perturbation system of sorts for understanding some of the microenvironmental signaling factors that may be driving a particular cell state within a tissue or a particular disease. We're often attempting to model our disease states in less complex in vitro systems like some of the ones shown here: 2D cell models and 3D organoids or organ-on-chip models. The motivating questions for the BISTR package are these: can we identify cytokines responsible for cell identity and behavior in primary patient tissue, and could we then stimulate cell models to mimic some of these disease-relevant or patient-relevant microenvironmental niches? We hope that this can add value to the drug discovery process and to our efforts in in vitro cellular modeling, by using this knowledge to build experimental systems with greater disease relevance in vitro.

James Opzoomer: A lot of this work is enabled by the advancements in the resolution of spatial genomics technologies, which is rapidly changing. We recently published a review in Cell Genomics tracking these technology trends, called scTrends. It summarizes the historical development of spatial omics technologies as well as some of the analysis packages available. And we also comment on these developing spatial technologies in real time, since it's such a fast-moving field, at our blog sctrends.org. So I encourage you to check it out if you're able to.

James Opzoomer: The work that I'm going to show you today is really focused on 10x Genomics' VisiumHD platform. This is one of the sequencing-based spatial transcriptomics technologies where the increased resolution of this generation of platform, now two micrometers, has really enabled subcellular resolution, allowing us to track several biophysical processes shown here on the right: RNA abundance, RNA localization, and also RNA splicing at the subcellular level. And we can use these two-micron pixels, based on image segmentation tools in the imaging modality, to reassemble approximately single cell data.
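
To make that reassembly step concrete, here is a minimal sketch of the general idea (not Relation's actual implementation): given a per-bin cell label from nuclear segmentation plus expansion, the two-micron bins are summed into an approximately single cell count matrix. All names and values here are hypothetical placeholders.

```python
# Minimal sketch of the reassembly idea (not the BISTR implementation):
# sum 2-micron bins into approximately single-cell counts using the
# cell label assigned to each bin by segmentation plus expansion.
import pandas as pd

# One row per 2-micron bin: the gene detected, its UMI count, and the
# cell_id from the expanded segmentation mask (-1 = outside any cell).
bins = pd.DataFrame({
    "cell_id": [0, 0, 0, 1, 1, -1],
    "gene":    ["VEGFA", "KDR", "VEGFA", "KDR", "ACTB", "VEGFA"],
    "umis":    [2, 1, 1, 3, 5, 1],
})

# Drop unassigned bins, then aggregate to a cells x genes count matrix.
cells = (bins[bins.cell_id >= 0]
         .groupby(["cell_id", "gene"])["umis"].sum()
         .unstack(fill_value=0))
print(cells)
```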

James Opzoomer: This slide positions BISTR among other spatial modeling approaches. On the left are simple heuristic-based approaches, like using a radius around a specific cell or a k-nearest-neighbors neighborhood and computing some summary statistics. They're fast, but the spatial scale is often somewhat arbitrary, and the tissue is treated more like a set of discrete bins than the continuous space that it is. On the right, we've got deep learning based approaches. These can be powerful, especially when they leverage analysis pipelines from the image space or are paired with single cell data, but they're typically more data hungry and sometimes less mechanistically interpretable. So BISTR sits in the biophysical model space in between. We encode the process of intercellular signaling via ligand diffusion as a diffusion-decay problem with boundary exchange, to generate interpretable per cell exposure features without choosing an ad hoc neighborhood. It runs with more modest compute and also sets up a clean entry point for ML once the inverse problem is well posed.

James Opzoomer: This is a schematic representation of the BISTR computational pipeline. You have your underlying biological system and you generate subcellular spatial transcriptomics data, say 10x VisiumHD data. We then use an image segmentation model, a vision transformer for instance, to identify nuclei and cell boundaries and infer subcellular compartments. You then quantify the transcripts at the nuclear and cellular level, construct the extracellular domain (the space between the cells) as a finite element triangulation mesh, and model paracrine signaling fields per ligand across this mesh using a finite element method. This allows us to extract per cell signaling features, which we identify with receptor gating: understanding the concentration of the ligand at a cell boundary and whether the cell expresses the cognate receptor to this ligand. And from that we can characterize which ligands predict certain gene expression via a GLM-based model.

James Opzoomer: This is another schematic that represents the data flow within the BISTR Python package. You have your VisiumHD data. You identify nuclei with a vision transformer and perform a morphological expansion of cells to create a cell cytoplasm boundary, giving you approximately single cell data. You then build the FEM triangulation network. You use public ligand-receptor databases to look up the ligand and receptor genes that are expressed within your cell types of interest, and you solve the FEM network across all of your ligands within the extracellular space. This gives you the FE solution at the cell boundary. We also look at ligand flux, which is the relationship between the expression of a ligand within the cell and the FE solution at the cell boundary, effectively identifying whether a cell is a source of a particular intercellular communication ligand or a sink that is just receiving the signal. And then we use a GLM to identify which ligands are most predictive of certain gene expression programs downstream.

James Opzoomer: So now I want to show you a worked example on a publicly available VisiumHD data set. This is the BISTR package applied to the 10x Genomics colorectal cancer data set, a 10x VisiumHD FFPE data set that was published as part of the preprint released along with the VisiumHD product launch in 2024. Here you can see a high-level view of the image of the tissue that has been assayed, and zooming in onto a smaller subsection of the tissue, you can see the individual cells. We use a vision transformer model to perform nuclei segmentation and then morphological nuclei expansion, and we follow this expansion to assemble the two-micron spots into approximately single cell data, which we annotate with cell types, giving us a tissue representation of single cell data that looks like this, colored here by cell type annotation.

James Opzoomer: On the left here you can see we construct the extracellular domain and mesh. We triangulate the extracellular space between the cells, using a tissue mask to limit the triangulation to the space underneath tissue. Starting with ligand expression per cell, we formulate an FEM problem with diffusion and decay parameters plus membrane coupling, which allows us to solve a sparse linear system per ligand and get the FE solution across the tissue space. Here the cells themselves are colored by the expression of the ligand VEGF-A, and you can see the FE solution in the extracellular space colored in this white-to-red heat, showing that cells expressing high VEGF-A are predicted to secrete VEGF-A into the extracellular tissue space. We model this diffusion with decay throughout the tissue. This ultimately gives us an FE solution across each of the communicating cells within the tissue, which we gate, basically binarizing cells based on whether they express the receptor to a particular ligand or not. If they do express the receptor, then we calculate the FE solution across the cell membrane of each cell and also the flux, the average membrane exchange signal. Effectively this is the relationship between the ligand expression within the cell and the ligand at the boundary of the cell from the extracellular space: is this cell a source or a sink of this intercellular communication signal?
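
To illustrate the class of problem being solved, here is a toy version of the per-ligand solve. BISTR itself uses a finite element method on a triangulated extracellular mesh with membrane coupling; purely for illustration, this sketch discretizes the same steady-state diffusion-decay equation, D*laplacian(u) - lam*u + s = 0, on a regular grid with finite differences. All parameter values are illustrative.

```python
# Toy steady-state diffusion-decay solve: one sparse linear system per ligand.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n, h = 64, 2.0        # grid side and spacing (arbitrary illustrative units)
D, lam = 10.0, 0.1    # diffusion and decay rates (illustrative values)

s = np.zeros((n, n))  # secretion sources: "cells" expressing the ligand
s[20, 20] = s[40, 45] = 1.0

# 2D Laplacian with Dirichlet boundaries, built from 1D stencils.
L1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
lap = sp.kron(sp.eye(n), L1) + sp.kron(L1, sp.eye(n))

# (lam*I - D*lap) u = s  ->  the ligand field u across the domain.
A = (lam * sp.eye(n * n) - D * lap).tocsc()
u = spsolve(A, s.ravel()).reshape(n, n)
print(u.max())  # peak ligand concentration near the sources
```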

James Opzoomer: In order to understand what ligands might be affecting certain cell types, we found that the coefficient of variation, and relatedly the mean ligand flux versus the standard deviation of the ligand flux, is informative for understanding the most variable intercellular communication ligands across a tissue and cell type. In this particular example we're looking at VEGF-A in tumor cells, which has a relatively high mean flux across this tissue section.
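
A sketch of that ranking step, under assumed data structures: a hypothetical cells-by-ligands DataFrame of per-cell flux values and a matching cell type annotation, neither of which comes from the actual package.

```python
# Rank ligands by flux variability (coefficient of variation) within one
# cell type, to surface the most variable communication signals.
import pandas as pd

def rank_ligands_by_cv(flux: pd.DataFrame, cell_type: pd.Series,
                       target: str) -> pd.DataFrame:
    sub = flux[cell_type == target]
    stats = pd.DataFrame({"mean": sub.mean(), "std": sub.std()})
    stats["cv"] = stats["std"] / stats["mean"].abs()  # coefficient of variation
    return stats.sort_values("cv", ascending=False)

# Usage: rank_ligands_by_cv(flux, cell_type, "tumor").head(10)
```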

James Opzoomer: We then use a negative binomial GLM fit to the per cell gene counts, with predictors such as receptor-gated ligand exposure. The coefficients of this model quantify how exposure shifts expected expression, and we can see which ligand exposures are related to specific genes and then gene programs. Here we can see that our model captures the directionality of many genes known to be associated with VEGF-A exposure in tumor cells, and this indicates that we're capturing known biological processes associated with this ligand-receptor interaction in this tissue.
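
As a minimal sketch of this regression step using statsmodels conventions: the data below are simulated, the variable names are hypothetical, and the real model would include more covariates than shown.

```python
# Negative binomial GLM of per-cell counts for one gene on receptor-gated
# ligand exposure (FE solution at the membrane, zeroed where the receptor
# is absent). Simulated data for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exposure = rng.gamma(2.0, 1.0, size=500)   # per-cell ligand exposure (FE solution)
receptor = rng.random(500) < 0.5           # does the cell express the receptor?
gated = exposure * receptor                # receptor-gated exposure predictor
counts = rng.poisson(np.exp(0.2 + 0.5 * gated))  # synthetic target-gene counts

X = sm.add_constant(gated)
fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial()).fit()
print(fit.params)  # a positive slope means exposure predicts higher expression
```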

James Opzoomer: In closing remarks, we often find that sequencing-based spatial transcriptomics technologies have a lower UMI coverage that's somewhat sparser than single cell RNA-seq. This has motivated us to develop novel tools to understand the relationship between intercellular ligand-receptor signaling and downstream gene expression. So this tool that we developed, BISTR, converts spatial transcriptomics counts, in coordination with segmentation, into physically constrained extracellular ligand fields, and then into per cell exposures for downstream modeling of the effect of ligand exposure on gene expression. We believe that modeling ligand-receptor interactions like this, with biophysical constraints, gives more interpretability into the intercellular signaling process. We've designed BISTR as a flexible toolbox deployed as a Python package, which we hope to make publicly available sometime soon. The goal of this approach is really to generate more tissue-contextual, experimentally testable hypotheses, especially where simple in vitro systems miss microenvironmental signaling contexts, so we can better understand the intercellular signaling processes that drive cell states in patient tissue, and in particular to better understand disease. We will be publishing this approach, hopefully as a preprint, soon, so I encourage you to keep your eyes out for it. Thank you for listening today.

Grant Belgard: James, thank you very much. So does the BISTR package work with spatial transcriptomics technologies other than VisiumHD?

James Opzoomer: Yeah. So the core methods are designed to be spatial transcriptomics platform agnostic. I hope I highlighted that what you need is subcellular-resolution spatial transcriptomics data; from there you can reassemble approximately single cell data, and whatever compartments you can segment from your image layer, into that form of data. VisiumHD is great for that, but we're excited to get our hands on, hopefully, the new Illumina spatial transcriptomics technology that's coming out, which appears to be at around one-micron resolution. So yes, it should work across different spatial transcriptomics technologies, although we have only tested it with VisiumHD; we hope to expand that outwards soon. Thanks.

Grant Belgard: Now, what makes the BISTR package biophysics-informed rather than just a spatial regression?

James Opzoomer: That's a good question. The BISTR approach explicitly models paracrine signaling as a spatial field within the extracellular space, using this diffusion-with-decay FEM approach solved over the finite element mesh that we build from the native tissue geometry, from the imaging data and the spatial transcriptomics data itself. We believe this builds a more representative intercellular communication space than just representing cells as nodes on a graph, without understanding the distances or some of the tissue-specific features that might exist within the tissue. For instance, a future direction we hope to go in is to use image segmentation within tissues to create different tissue zones, which you can identify from H&E and other types of immunofluorescent staining, where ligands might have difficulty passing through. We work a lot on bone, as I touched on, and we're thinking in particular about using that to create more representative data.

Grant Belgard: Great. Well, thank you, James, and thanks to everyone for joining us. Join us for our next webinar on February 18th at 11:00 a.m. Eastern, when Phil Ewels from Seqera will discuss reproducible bioinformatics at scale, nf-core, and Nextflow. Thanks.

James Opzoomer: Thank you.

Ania Wilczynska - The Bioinformatics CRO Webinar

The Bioinformatics CRO Webinar Series

November 11, 2025: Ania Wilczynska – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks  

Ania Wilczynska

Dr. Ania Wilczynska is Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in Bioinformatics and Data Science and over two decades in molecular, developmental and cancer biology.  Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, MRC Toxicology Unit and the CRUK Beatson Institute (now CRUK Scotland Institute).

In this live webinar, she explores how modern bioinformatics must evolve from one-off analyses toward robust, interoperable platforms capable of integrating multi-study, multi-modal data at scale. Drawing on best practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure, she discusses evolving omics pipelines into a discovery engine.

Transcript of The Bioinformatics CRO Webinar Series – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks

Disclaimer: Transcripts may contain errors.

Grant Belgard: At The Bioinformatics CRO we help life science teams turn complex omics data into decision-ready insights, providing flexible expert bioinformatics support from study design through analysis. As part of that mission, our webinar series features practitioner-focused talks with concrete takeaways you can put to work right away. Today's session features Dr. Ania Wilczynska presenting "Thinking beyond the single data set: pragmatic solutions for scalable, AI-ready bioinformatics frameworks". Ania is the Senior Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in bioinformatics and data science and over two decades in molecular, developmental, and cancer biology. Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, the MRC Toxicology Unit, and the CRUK Beatson Institute. In this live webinar, she will explore how modern bioinformatics must evolve from one-off analyses towards robust, interoperable platforms capable of integrating multi-study, multimodal data at scale, drawing on best practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure. She will discuss evolving omics pipelines into a discovery engine. We're live streaming this on YouTube and LinkedIn. Please drop your questions in either chat or email them to [] and we'll bring them into the discussion. Ania, over to you.

Ania Wilczynska: Thanks very much, Grant. It's great to be here. Are we sharing? Oh, here we go. Okay. Right. So, welcome everybody. Today I'll be talking to you about how teams can move beyond single bioinformatics data sets toward scalable, AI-ready bioinformatics frameworks. The talk is grounded in our experience building such systems at bit.bio, but it should be equally applicable across academia and industry. We'll talk about principles that have guided our thinking over the years and highlight cultural changes in how we need to think about data and workflows. Everyone in bioinformatics faces scaling challenges, and this is going to be about practical ways to solve them. As an overview of the talk: we'll first start by stating the problem of data growth; we'll discuss some principles of scalable design; we'll talk about infrastructure, so automation and building a platform; we'll cover integration of data, focusing a lot on metadata and using SOMA objects as an example; and then we'll go into AI workflows, human in the loop, how we integrate bioinformatics data in the new AI world, and really how we move into creating AI-native data sets.

Ania Wilczynska: A lot of bioinformatics still operates on single studies and ad hoc analyses, and of course modern AI and ML require scale, structure, and reproducibility. So the question is really how we evolve bioinformatics platforms to address this. Data generation currently outpaces analysis, and every omics, imaging, and metadata stream grows exponentially. Classical ML and now large language models give us tools that turn data into insight, but only if our systems are consistent and reproducible. Thus, we can't treat these data sets as isolated projects anymore. We need to think about platforms that integrate data, automate quality control (obviously the first and very important step), and enable us to use all of the information throughout our organization, be it in industry or academia. This is what will enhance discovery, precision, and scalability, and this is the area where reproducibility, standardization, and machine intelligence intersect. So we need to treat data systems as long-lived infrastructure, not one-off workflows. By building structured and automated systems, first of all (this is very simple but very important) we reduce costs, we accelerate discovery, and we create data that, as a consequence, AI can actually learn from. This is building into the future. If analyses can't be repeated, they can't be automated. So platform thinking means designing for reuse: every data set, every model, every workflow should be modular and interoperable, because reproducibility means scalability.

Ania Wilczynska: So now, moving on to what scalability can mean. The first principle that we use is this: simplicity scales. We build modular pipelines where the complexity is contained within the module, but the interfaces between the modules are really clean and clear, which means that the complexity is localized to a module that can be easily interchanged, and new platforms can be plugged in easily. What's really key to highlight, and we'll go into more detail on this shortly, is that consistent language and naming and shared hierarchies are really important here, because this is how we have clarity both across data sets and across teams.

Ania Wilczynska: And finally, by designing for API and cloud integration, we can future-proof our systems so that new technologies can be onboarded very quickly. So the take-home from here is: design for evolution.

Ania Wilczynska: This is how it can look in practice. End-to-end automation, in our case, connects experimental metadata, compute, analysis, storage, and dashboards in a continuous loop. It allows multiple analyses to run in parallel and is instantly reproducible when new data sets arrive. The outcome is that scientists, including bioinformaticians, spend more time interpreting results and less time essentially babysitting pipelines. This infrastructure supports hypothesis generation and iteration, and it creates a complete cycle from data to new insights. The automation of course doesn't replace scientists; it amplifies them. By automating the flow from data ingestion to reporting to API access, we iterate faster and keep quality consistent. The schematic on the right shows how we automate the full bioinformatics cycle from data to insight. We start with metadata capture in Benchling as well as our in-house-built app, and we ensure that every sample and condition is traceable. This is really key. Next, pipelines run automatically with the use of AWS Batch and Lambda, and these scale as new data volumes arrive. The results are stored in AWS S3, then linked through APIs to feed dashboards and AI tools; LLM agents can summarize the results and flag patterns and QC issues, and the scientists interact with the data through dashboards. So the key idea is that the loop (data in, analysis, insights out) runs reproducibly at scale, and it frees our time to focus on interpretation rather than execution.
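
As a hedged sketch of the kind of glue this loop implies (the queue and job definition names below are invented for illustration, not bit.bio's actual configuration), an AWS Lambda handler might submit a versioned pipeline to AWS Batch whenever a new run lands in S3:

```python
# Invented illustration: trigger a pinned Batch pipeline from an S3 put event.
import re
import boto3

batch = boto3.client("batch")

def handler(event, context):
    for record in event["Records"]:            # S3 put-event records
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]    # e.g. runs/RUN123/samplesheet.csv
        job_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:128]
        batch.submit_job(
            jobName=job_name,
            jobQueue="omics-queue",            # hypothetical queue
            jobDefinition="rnaseq-pipeline:3", # pinned pipeline version
            containerOverrides={"environment": [
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
            ]},
        )
```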

Ania Wilczynska: Right. So, I've mentioned metadata quite a lot already, because in our view it really is the connective tissue behind all of the data sets, and capturing rich technical, biological, and provenance metadata early allows us to integrate across studies, perform batch correction, and reuse analyses efficiently. We employ FAIR principles: define once, reuse everywhere. This is really essential. Standardized metadata allows for not just traceability, but also creates a central data store that ensures findability and reuse, and structured data feeds directly into AI and ML tools. Metadata can of course be vast. One thing that we found has been really important for automation of pipelines and tracking of samples is a unified sample naming system. It sounds pretty trivial, and it takes a little bit of engineering and a little bit of cultural change to deploy, but it's been extremely important for us. So again, just to reiterate: metadata turns messy data into machine-readable knowledge.
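
A trivial sketch of what a unified naming system buys you; the PROJECT-EXPERIMENT-SAMPLE-REPLICATE scheme below is invented for illustration, but the point holds generally: once names conform to one grammar, they parse directly into metadata fields.

```python
# Invented sample-naming scheme, enforced and parsed with a single regex.
import re

SAMPLE_ID = re.compile(r"^(?P<project>[A-Z]{3})-(?P<experiment>E\d{4})-"
                       r"(?P<sample>S\d{3})-(?P<replicate>R\d)$")

def parse_sample_id(name: str) -> dict:
    m = SAMPLE_ID.match(name)
    if m is None:
        raise ValueError(f"non-conforming sample name: {name!r}")
    return m.groupdict()

print(parse_sample_id("NEU-E0042-S007-R1"))
# {'project': 'NEU', 'experiment': 'E0042', 'sample': 'S007', 'replicate': 'R1'}
```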

Ania Wilczynska: A slightly busy slide, but this is just an overview of what our platform looks like in terms of integrating research, again using metadata as a foundation, as well as automation. Our in-house bioinformatics platform connects this metadata with sample tracking, analysis pipelines, QC, and reporting through a unified database. It provides live links, data sanitation, and API access, enabling researchers to explore, analyze, and develop AI workflows directly. As a result, we have a self-service, interactive research environment where data flows seamlessly from experiment to model. The way we think about this is that we move from a data set to really a living research system. We use this platform internally to handle everything from single cell to genotyping to plasmid design, and we emphasize empowering users and bridging all these systems.

Ania Wilczynska: Okay. So now we start moving into making data AI-ready. Of course, data alone isn't enough; it needs to be transformed into structured knowledge. We do that by explicitly extracting relationships from our experimental data and metadata as well as publications. A lot of our work relies on external open source data, and we codify these relationships into a knowledge graph, and now we can start to infer new connections using AI. This structured understanding supports predictive biology, which is what we as a company do. And we hope that this will give us the ability to anticipate the outcomes of experiments rather than just to measure them.
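
A minimal sketch of codifying extracted relationships into a knowledge graph, using networkx purely for illustration; the entities, relations, and identifiers are invented examples, and a production system would more likely sit on a graph database.

```python
# Toy knowledge graph: typed entities and edges carrying provenance.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("TF:NEUROD1", "gene:INSM1",
            relation="activates", source="internal_experiment_042")
kg.add_edge("gene:INSM1", "program:neurogenesis",
            relation="member_of", source="publication")

# A simple inferred connection: everything reachable downstream of a TF.
print(nx.descendants(kg, "TF:NEUROD1"))
```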

Ania Wilczynska: So AI now helps us to ask better questions about the data we are generating in house and the workflows we're developing using our data, as well as, like I said, external data from publications and open source data sets. The loop consists of first defining a question, then aggregating lots of data, moving through AI agent synthesis, then human review, and I cannot stress enough how important it is at this stage to have the human in the loop; we'll talk about that again in a second. And finally updating the knowledge graph and iterating again. The human element is very important for ensuring quality and scientific rigor, and of course for eliminating faulty or hallucinated information. So again, to emphasize: we're not at the stage of replacing the scientist yet; we're augmenting their ability to work. And of course this reduces the lag between experiments and insights. Every iteration enriches the knowledge base and therefore improves the outcomes for the next round.

Ania Wilczynska: So, the more we automate retrieval and summarization, the more time our scientists have to focus on creative reasoning and high-complexity tasks. This is really what the simple schematic on the right is showing. We are a lot less focused on the medium- and low-complexity tasks, thanks both to our automated modular pipelines and to plugging in agentic AI, so we can essentially be better scientists, moving a little bit away from the engineering into the more creative science. And this of course implies huge efficiency gains.

Ania Wilczynska: So, once again emphasizing the human in the loop: on the engineering side of things, we implement all these principles through a retrieval-augmented generation, or RAG, stack. It allows the AI models to query internal data safely, without retraining or exposing sensitive information. Again, the architecture is modular (this is obviously a theme), and each of the agents, be it a specific bioinformatics agent, an imaging agent, or a developer agent, specializes in a particular domain. This is all coordinated by the human-in-the-loop scientist. It dramatically accelerates tasks that used to take hours and can now be done in seconds. The structure is secure and modular, with replaceable components. So again, this is how we think about scalable AI in pragmatic production.

Ania Wilczynska: All right. Circling back to something a little more formally bioinformatics focused, I'm going to talk a little bit about how we scale to multi-data set and multimodal analysis. This will be mainly focused on single cell data sets, because this is really an area where the concept of scale is quite obvious and quite a pain point for a lot of researchers. Single cell data sets scale to millions of cells, and integration of these data sets becomes a core challenge for many reasons, but a lot of it is because it is a data engineering problem, not just a problem of statistics. Internally at bit.bio, we routinely handle data sets of millions of cells across multiple studies and modalities. To integrate these data sets successfully, we have to normalize technical variation while preserving biology, of course, and build models that scale efficiently. Data sets need to be aligned across batches, labs, and modalities, be that RNA or ATAC-seq, spatial data, protein data, imaging, what have you. New algorithms do scale to millions of cells and to multiple modalities; there are of course integration tools such as Seurat and Harmony, all available open source and scaling to unprecedented levels. But the volume of the data is still a huge challenge, even when it comes to just loading the data objects for analysis.

Ania Wilczynska: The way we're currently addressing this is with SOMA, which stands for "stack of matrices, annotated". It's a new open standard from the Chan Zuckerberg Initiative designed exactly for this challenge. It provides an array-based data format that supports multimodal data sets (again, RNA, ATAC-seq, and so on) at massive scale. It's fully interoperable across R, Python, and C++, and it enables out-of-core access to data aggregations much larger than single-host main memory, allowing distributed computation over data sets. SOMA provides a building block for higher-level APIs that may embody domain-specific conventions or schemas around annotated 2D matrices, like cell atlases. For us, adopting SOMA has meant that we can store huge amounts of data in one place, slice it quickly any way we like, and share the data reproducibly, preparing it for AI training or, more simply, retrieval.

Ania Wilczynska: What this looks like in practice, as one example: we've integrated a number of perturbation screens where both the sequencing technology and the conditions were very different, about 2 million single cells in total from our own data. We've also been able to put this together with pseudobulks for each of the screens, as well as pseudobulks from the open-source 44-million-cell CELLxGENE Census data set. This is all integrated in a unified SOMA data layer, which means that for all of these data we have a consistent schema for metadata. One important thing to highlight is that we're using a unified gene annotation, which does require some data wrangling, as external data sets especially can use very different annotations. Now that everything is put together, we can query slices of the data in seconds across different data sets, instead of waiting minutes or sometimes even hours for our data to load (the sketch after this section gives the flavor of such a query). This is really the first step towards truly AI-native data sets, because the data is structured, standardized, and ready for automated reasoning.

Ania Wilczynska: And with that whistle-stop tour of our thinking, I'll end with three principles to summarize. Treat data as infrastructure, not a byproduct. Make metadata-first design non-negotiable. And realize that AI readiness is an outcome that emerges naturally from reproducibility and structure. The goal is not to automate everything, but to build systems that let us scale scientific discovery. We need to work towards these scalable, interoperable systems instead of just thinking about individual scripts and creating silos.
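
As a concrete taste of the out-of-core slicing SOMA enables, the public CELLxGENE Census exposes exactly this pattern through the cellxgene-census package; the filters below are arbitrary examples, and the slice streams back as AnnData without loading the full Census into memory.

```python
# Slice the public CELLxGENE Census (a SOMA collection) out of core.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'astrocyte' and is_primary_data == True",
        var_value_filter="feature_name in ['GFAP', 'AQP4']",
    )
print(adata)  # an AnnData slice assembled from the SOMA arrays
```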

Ania Wilczynska: And thank you very much.

Grant Belgard: Thank you, Ania. As a reminder, live viewers can submit questions to the live chat on YouTube or LinkedIn. To kick us off: how do you ensure reproducibility across so many heterogeneous pipelines?

Ania Wilczynska: Yeah, so we have, I think, over 30 different sequencing technology pipelines as of my last count, and so of course deliberate design is very important. In terms of particular pipelines, I cannot emphasize the need for containerization enough: having versioned pipelines, fixed parameters, and, just to sound like a broken record, very well structured and deliberate metadata capture.

Ania Wilczynska: Having a centralized way of submitting samples has also been really important for us being able to very quickly version our pipelines. We have relatively rigorous testing approaches as well. But yeah: containerization, versioning, and metadata, first and foremost.

Grant Belgard: How do you align or normalize data sets across modalities and platforms?

Ania Wilczynska: Yeah. So batch correction is obviously a big nightmare, and we've already spoken about tools like Harmony and Seurat. There are plenty of publications that talk about the various pitfalls of just the single cell integration tools. But again, metadata is extremely important. For example, in our hands, thinking about integrating imaging with transcriptomics is a nontrivial problem, especially in the absence of spatial data. We do single cell but we don't do spatial transcriptomics, and we've spent a lot of time thinking about how we can integrate ML approaches to image analysis with our transcriptomics data. Once again, unsurprisingly, metadata, excellent sample tracking, and integration of systems, including ELN systems like Benchling, have been really key to this. There's also an element of deliberate experimental design, again thinking beyond a single experiment. I appreciate that academic labs will be less naturally used to thinking about consistent experimental designs, because that's not the core way of thinking in academic labs. However, I still think that asking the question "does it scale" is extremely important in academia too. In my academic career I found that not asking "does it scale" meant we often missed opportunities to integrate data sets, because everything about them was just incompatible.

Grant Belgard: Yep. What’s the advantage of SOMA over existing HDF5 or anndata approaches?

Ania Wilczynska: Yeah, so it's really the out-of-core scalability, the multimodal support, and the interoperability. From everything we've seen so far in actually implementing these huge SOMA objects, it's the next step towards big data rather than single experiments. This may sound like a plug, but we're really excited about how it's enabling us to iterate through computational experiments, if you like, very quickly.

Grant Belgard: How is AI readiness different from just automation?

Ania Wilczynska: Well, AI readiness really means a different way of thinking about structure and about semantics. With automation, reproducibility is your main output and your main gain, whereas AI readiness means that the data is discoverable and learnable. So it requires cross-experiment but also cross-function thinking. I think that's another thing people often disregard: how important it is to think about data and about computational biology not just within the computational biology function, but also to make sure that the wet lab scientists, or indeed other functions in industry, understand how everything fits together in a data stream.

Grant Belgard: So related to that what cultural or organizational changes are required for this to be successful?

Ania Wilczynska: Primarily cross-functional collaboration. And I think a mix of evangelizing and education can really go a long way. We work both on embedding AI workflows into bioinformatics, and we also work with other teams, for example the commercial team in our company, to create AI workflows. That of course helps the business, but it also means that there is a lot more understanding across the company as to why such workflows are important, why data is important, and why data structures are important. Again, this does require a little bit of outreach, a little bit of evangelizing. Structured metadata and deliberate experimental design can at first seem like a bit of an overhead in the lab, because "oh, it's another thing I need to capture."

Grant Belgard: We’ve never seen that before, have we?

Ania Wilczynska: Indeed. But show people the value of that: rather than going, "Oh, ping, my whizzy machine gave you a new result," going, "Well, because of that overhead, we've now been able to bring three data sets together, one that we did three years ago and one that we did now, and now we have a better outcome." Again, this is all a kind of cultural shift, but I've found that a little goes a long way in that respect. So yeah, a bit of showing by example, a bit of evangelizing, and also treating colleagues as partners.

Grant Belgard: What’s the next step beyond AI native data sets?

Ania Wilczynska: Well, of course, in the utopian brave new world, it is AI scientists. I don't think we're quite there yet, or at least that's what we're all telling ourselves, or we'll all be out of jobs. But really it's closed feedback loops: data, models, experiments, self-improving hypotheses. And I think that's really where things are heading, very quickly.

Ania Wilczynska: Well, I guess everyone's talking about it now: there's also a lot of hype about what AI can do. But I think a lot of what we're all in a way promising ourselves is really bottlenecked by data.

Grant Belgard: And so I’m hearing it’s really essential to train people to think in terms of cycle time, right?

Ania Wilczynska: Yeah.

Grant Belgard: So we have an emailed question: how can these platforms be translated into the clinic, or are there regulatory requirements that need different setups of the platforms?

Ania Wilczynska: Yeah. So everything I've talked about, just to be clear, is in an R&D, preclinical setup. The important thing to remember about regulatory requirements is that everything is extremely slow, for good reason, and there is very little existing regulation around AI tools, at least as far as I'm aware. I think that will probably take quite some time, which again brings me back to this idea that, while we're all blown away by the AI tools every day, because the regulatory principles don't really exist yet, the experimentation is still necessary. So thinking about the fact that it's not just the tool (the data has to come before it, and will for some time come after it) is very important. On the other hand, in terms of just building automated modular pipelines, a lot of the cloud platforms provide certain standards, so we're sort of working towards it, but I think we shouldn't expect the really novel solutions to be adopted all that quickly.

Grant Belgard: Well, Ania, I think we're at time, but thank you so much for joining us. The series will resume January 21st at 11:00 a.m. Eastern with Jake Taylor-King from Relation Therapeutics, followed by Phil Ewels from Seqera on February 18th at 11:00 a.m. Eastern. Mark your calendars, and thank you everyone for joining us today.

The Bioinformatics CRO Podcast

Episode 68 with Caspar Barnes

Caspar Barnes, founder and CEO of AminoChain, tells us about his mission to make biospecimen sourcing transparent, ethical, and efficient.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Caspar Barnes

Caspar Barnes is founder and CEO of AminoChain, a decentralized biobanking protocol with a mission to make biospecimen sourcing more transparent, ethical, and efficient.

Transcript of Episode 68: Caspar Barnes

Disclaimer: Transcripts may contain errors.

Coming Soon…

Nick Wisniewski

The Bioinformatics CRO Webinar Series

October 22, 2025: Nick Wisniewski – AI-First Drug Discovery Pipelines

Nick Wisniewski

Dr. Nicholas Wisniewski is an expert on AI in drug development and regenerative medicine.

In this live webinar, he discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, thus shortening timelines from concept to clinic.

Transcript of The Bioinformatics CRO Webinar Series: AI-First Drug Discovery Pipelines

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the inaugural seminar in The Bioinformatics CRO webinar series. At The Bioinformatics CRO, we help life science teams turn complex data into clear, decision-ready insights, providing flexible expert bioinformatics support from study design through analysis and reporting.

With that mission in mind, we’re launching The Bioinformatics CRO Webinar Series, a practical forum for sharing tools, workflows, and real world lessons from the front lines of modern bioinformatics. Let’s kick off our first session and welcome Nick Wisniewski.

Nick is an expert in applying artificial intelligence to the life sciences. He earned his PhD in biophysics from UCLA, where he later served as a faculty member developing machine learning methods for imaging and multiomics data. In 2016, he joined the founding team at Verge Genomics, pioneers in AI-driven drug discovery, and has since helped launch four more biotech startups spanning diagnostics, a smart pill, and a cell therapy. More recently, he served as Vice President of Bioinformatics and Data Science at Stemson Therapeutics in San Diego. In this live webinar, Nick discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, thus shortening timelines from concept to clinic. Feel free to put your questions in the chat, as we'll have a Q&A afterwards.

And Nick, over to you.

Nick Wisniewski: Thanks a lot, Grant. I'm really excited to be on this inaugural webcast and looking forward to the rest of the series. As always, I'm a big fan of The Bioinformatics CRO, so I am very happy to support this development and look forward to more in the future. The talk I'm going to deliver today, as Grant mentioned, is on AI-first drug discovery pipelines. There's a lot of movement happening in the space, clearly a lot of excitement and discussion amongst investors and techbio people, and so many new algorithms and methods coming out every day that it's hard to keep up.

So the purpose of this talk is to kind of give an overview of all of that development that’s happening as well as an insight into how these machines are learning and where this is all going into the future.

So to start with, I think it's good to point out that the current state of drug discovery is one in which less than one in 10 drugs succeed. The cost of developing a drug can be up to $2 billion, given the failures that occur and the portfolio approach that results, and that money gets spent over a period of up to a decade.

So the impact that AI can have on the drug discovery process can happen in multiple ways. One is an increase in the accuracy, or the success rate, of the drugs, and the second is in the cycle time: if you can test more drugs faster, you can overcome some of this challenge. To understand where we're at and what the impact of AI is, I think it's important to start with a review of the traditional drug discovery pipeline as we know it.

It's largely discussed as a waterfall-type process, where you have a left-to-right movement through a number of different phases: starting with target identification and target validation, then compound screening to identify hits, then hit-to-lead, getting to the lead and optimizing it, and then all sorts of preclinical testing to understand toxicity and other things of that nature.

And then it goes into the clinical development stage, where you have the phase one, two, and three trials. As for the loss rate as you go through: starting out very early, when you're still validating a target, perhaps only 3% of molecules are going to be successful in the clinic. So that's quite a low rate.

If AI can improve that to 5%, it would make a huge difference; we don't need to get to 100%, although that would be even better. But I think the other important part of this pipeline is that embedded into it are a number of design-make-test-analyze (DMTA) cycles. We often think of these in terms of synthesizing molecules, but they form the standard feedback loop in the molecular optimization process. It mainly happens between hit discovery and lead optimization, with each iteration lasting maybe weeks to months, and to get to an optimized outcome you might need three to 10 iterations. So that can really be a bottleneck that AI can address in the drug discovery process.

There are a number of traditional computational tools that are being used, and have been for quite some time.

So in the initial stage of target identification, we ask the question: which protein, which gene, what is the target? A lot of the early tools you'll recognize, things like Ingenuity Pathway Analysis and WGCNA, matured mainly in the mid-2000s and have been used with a fair degree of success since then. Then you get more into the drug development stage, where you already have a target and now you're trying to design molecules to hit that target.

This is where the rest of the tools come in.

So things like virtual screening and docking have also been concepts that have been around for quite some time; this is asking which molecules are going to bind to that target. Tools like AutoDock and Schrödinger's Glide emerged starting in the 1980s but grew more popular towards the late 90s. And then another question is: which molecules are bioactive? Maybe you can bind the target but you can't get any sort of activity out of it.

This is where quantitative structure-activity relationship (QSAR) models come in, which go back to the 1960s. They're largely just regression models, and they've been updated over time to integrate machine learning methods, but they're a mainstay of the process. Alongside those there's pharmacophore modeling and shape matching, which try to understand the geometry of the molecule that's required for bioactivity; 3D shape matching and distance metrics between molecules are all quite useful, and they allow us to filter candidates, again going back to the 90s.

More recently there's been a lot of movement in molecular dynamics and free energy calculations. This is more physics-based, trying to understand how energetically or thermodynamically favorable the ligand-target complex is, and these simulation techniques for understanding the stability of these bindings matured maybe 15 years ago or so.

And then, quite importantly, once you can simulate a lot of these things and think you've identified a molecule, what is extremely important is the properties of that molecule once it enters a body. Will the molecule be absorbed, and is it safe? Will it cross the blood-brain barrier? These are known as ADMET properties: absorption, distribution, metabolism, excretion, and toxicity.
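
As a toy QSAR in the spirit described: a handful of RDKit physicochemical descriptors regressed against activity with a plain linear model. The molecules and activity values below are made-up placeholders, not a real assay.

```python
# Toy QSAR: descriptors from RDKit, regression with scikit-learn.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.linear_model import Ridge

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
activity = [5.1, 6.0, 7.2, 4.8]  # made-up pIC50 values for illustration

def featurize(smi: str) -> list:
    mol = Chem.MolFromSmiles(smi)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

X = [featurize(s) for s in smiles]
model = Ridge().fit(X, activity)  # classic QSAR: a simple regression model
print(model.predict([featurize("CCOC(=O)C")]))
```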

Now, I'd say the first stage of the development of AI has been to start modularly replacing each of those phases with, let's say, deep learning components. Maybe the one that we haven't talked about so far is imaging, which is very useful in the target ID step, where you can do more phenotypic-level understanding and protein localization within cells, and I think those models have been very powerful and very influential. In terms of target ID, we've now got the Geneformer class of inference algorithms, and in terms of protein structure:

AlphaFold has broken that field wide open. And then the rest of them often come with straightforward replacements. DiffDock is now a big replacement in molecular docking. Maybe newer ones have to do with de novo molecule generation, like MegaMolBART, which I think is now integrated into NVIDIA's BioNeMo. We have some deep learning tox predictions, and retrosynthesis planning, which helps you find the easiest path to synthesize a molecule. But maybe some of the more exciting ways to think about things have to do with experimental planning, and I'm going to talk a little bit more about that in a few slides.

But the things that aren't represented here, I would say, are the latest developments. One came out just yesterday: Claude for Life Sciences. This is, I think, very exciting. If you're a bioinformatician or a programmer, you've likely been using Claude for some time now for programming and tasks of that nature.

So extending that now into integrations with common lab tools like Benchling, and partnering with institutions like the Broad Institute and companies like 10X Genomics to facilitate access to the data and algorithms in those platforms and pipelines, as well as PubMed, to really facilitate searching the literature and getting back good intelligence on the targets you find and the drugs you design: that's going to be highly influential. It promises right now to be able to analyze single cell RNA sequencing data, which is going to be great for democratizing access to that data source. And really interestingly, it's promising to help prepare regulatory documents, which may be one of the biggest real-world bottlenecks in putting together a pipeline and accelerating it. This stuff takes a lot of time.

Similar developments are coming out of partnerships with NVIDIA, more and more every day, with Benchling maybe leading the way again. Benchling launched Benchling AI recently, and as part of that, NVIDIA is integrating its NIM microservices into Benchling. So this offers access to the optimized GPU implementations of things like OpenFold2 for protein structure prediction, and I think the other models, like the ADMET models, are coming shortly. So that's also very exciting.

But to return to some of the other bottlenecks being addressed by AI, let's go back to the DMTA cycles. Going from design to make is the first half of the cycle: you present a chemist with a bunch of designs and they're tasked with making those molecules, and it may not be immediately obvious how to make them. You may have some information on how it's done, but along the way, what you learn is the feasibility of synthesis and any constraints that might exist for future design choices.

In learning that, you can already take the first step toward thinking: what would I do with a machine learning model like that retrosynthesis model? Well, you can update based on your learnings from that step and retrain your generative design models with those new constraints. Likewise, when you go to test these molecules in a series of biological assays, ADMET profiling, and so on, you learn a lot about the potency, selectivity, toxicity, off-target effects, everything you can measure about these drugs. You may have had predictions for those from the QSAR, docking, pharmacophore, and ADMET models, but now you have new data, and you can go back and update all of those models in real time to improve the predictions on the next iteration of the cycle.
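
A minimal sketch of that feedback step, with hypothetical helper names: run_assays stands in for the wet lab work, and the model is any scikit-learn-style regressor such as the QSAR sketch above:

```python
import numpy as np

def dmta_round(model, designs, run_assays, X_train, y_train, n_make=8):
    """One design-make-test-analyze pass: rank, synthesize, assay, retrain."""
    scores = model.predict(designs)                   # "design": score candidates
    top = designs[np.argsort(scores)[::-1][:n_make]]  # "make": pick the most promising
    measured = run_assays(top)                        # "test": potency/ADMET readouts
    X_train = np.vstack([X_train, top])               # "analyze": fold new data in
    y_train = np.concatenate([y_train, measured])
    model.fit(X_train, y_train)                       # updated model for the next cycle
    return model, X_train, y_train
```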

So you start to see how a more continuous learning framework can arise from the cycles that already exist in the drug discovery pipeline. And this starts to hint at the next transition that's coming in the field. We currently think of AI as more of a tool, what we'll call augmented AI, where you have modular assistance for each of these different steps of the pipeline and it's still being controlled by humans, informing humans, empowering humans.

The next step that things are moving towards is this AI-first regime, where you have some sort of orchestrated autonomous learning cycle. Here the AI acts as the central control architecture, orchestrating not just the DMTA loop; you can extend that into different feedback loops and start thinking about closed loop continuous learning cycles. Combined with automation, wet lab automation, bioinformatics automation, and everything you need to be self-contained, this is, I think, the crux of the idea we hear a lot in terms of lab-in-the-loop.

This is a concept being popularized across a number of different institutions. At Genentech, Aviv Regev has put together a team that is exploring a lot of lab-in-the-loop operations, and NVIDIA is strongly supporting lab-in-the-loop architectures. The main goal is getting to a continuous learning, closed loop architecture where the AI proposes novel molecules, synthesizes and tests them automatically, gets the assay results, and feeds them back to update the model for subsequent iterations.

I think as we're moving into this regime, it's important to understand some of the key machine learning paradigms. I'm going to talk a bit more about those in the next slides, but I'll introduce them here: you may have heard of things like active learning, Bayesian optimization, reinforcement learning, and so forth. And then the third component, which I mentioned earlier, is the automation component.

So right now there's a range: you don't necessarily need automation in order to build these loops. You can have a human in the loop doing the experiments.

But of course the hope is that by having automated experimentation, you reduce some of the variability in the experiments and increase the reproducibility, as well as the speed at which you can experiment. You can run all day and night and highly parallelize things, so it's going to scale a lot better.

So thinking about how we’re making this transition I think organizational principles are one of the big bottlenecks. There’s a big issue that we all face with adopting new technology in terms of understanding how it works and deciding to what extent we can trust the decisions that it’s making.

As we work as programmers with these AI tools like Claude, Cursor, and Codex, we see and get immediate feedback on how well they solve problems: how many iterations and corrections we need in order to keep them on track and doing what they need to do. And we can gain some sense of how much we can trust the decision-making that's happening.

It's a little bit harder in drug discovery, primarily because the cycles are so long. Benchmarking these tools is very difficult if it takes five years to get a molecule created and through trials before you know that it works. It's a very long feedback cycle, and it can take quite a while to develop that kind of trust.

Moreover, we're handing more and more decision-making over to the AI where traditionally humans, maybe director-level people, are making these decisions, and that introduces some accountability questions and other organizational problems. So I think one of the most important things we can do to help facilitate trust in the decision-making is to understand, at a basic level, how these decisions are being made. If we're going to let AI determine what experiments to do next and where to allocate resources, it probably helps to understand a little bit about how it's making those decisions.

So again, I'll introduce briefly the concepts of active learning, Bayesian optimization, and reinforcement learning as the three main techniques you see in these sorts of systems right now. Active learning is one in which the AI understands a bit about what it doesn't know, or what it's most uncertain about, and then targets experiments so as to learn the most it can from the next set of experiments. This is a fairly straightforward concept; I think scientists think in much the same way.

Bayesian optimization is maybe more product focused. You're trying to optimize some property of a molecule, and you have to navigate a search space, doing some hill climbing on a landscape that you're inferring while you're climbing it. This is a method used to find the most potent drug out of a large set of molecules without having to test them all.
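
A minimal sketch of that loop, with a hypothetical 1-D assay function standing in for the potency landscape and scikit-learn's Gaussian process as the surrogate:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def assay(x):
    """Stand-in for an expensive wet-lab potency measurement."""
    return np.sin(3 * x) + 0.5 * x

X_pool = np.linspace(0, 2, 200).reshape(-1, 1)   # candidate molecules as 1-D descriptors
X = X_pool[[10, 120]]                            # two initial measurements
y = assay(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):                              # ten assay rounds
    gp.fit(X, y)                                 # refit the inferred landscape
    mu, sigma = gp.predict(X_pool, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = X_pool[np.argmax(ei)].reshape(1, -1)         # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, assay(x_next).ravel())               # "measure" it, repeat
```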

And then reinforcement learning, something we read a lot about these days, particularly with the LLMs, is a method of learning that's really about finding trajectories through that space. It's trying to optimize a sequence of decisions, what's called a policy, in order to maximize long-term gains. That's very computationally expensive and maybe not as sample efficient, but it has some strengths over the previous two, particularly Bayesian optimization, in terms of parallelization capabilities and the ability to explore the space, along with some drawbacks, such as difficulty learning when rewards are sparse.
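
To make the policy idea concrete, here's a toy REINFORCE-style sketch; the fragment action space, reward function, and hyperparameters are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy setup: build a "molecule" by picking one of 5 fragments at each of 3 steps;
# the reward arrives only after the full sequence is assembled (long-term gain)
n_fragments, seq_len = 5, 3
theta = np.zeros((seq_len, n_fragments))   # policy logits, one row per step

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(actions):
    """Stand-in scoring function: best when the fragment indices sum to 7."""
    return -abs(sum(actions) - 7)

alpha, baseline = 0.05, 0.0
for episode in range(3000):
    actions, grads = [], []
    for t in range(seq_len):
        p = softmax(theta[t])
        a = rng.choice(n_fragments, p=p)             # sample an action from the policy
        actions.append(a)
        grads.append(np.eye(n_fragments)[a] - p)     # grad of log pi(a | step t)
    R = reward(actions)
    baseline += 0.01 * (R - baseline)                # running-average baseline
    for t in range(seq_len):
        theta[t] += alpha * (R - baseline) * grads[t]  # REINFORCE policy update
```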

So I'll just show a graphic example of active learning; I've got the other two, but for the sake of time we'll skip over them. Imagine you're trying to do a classification task where you've got the red team on the right as class one and the blue team on the left as class zero, whatever you want to call them, toxic and non-toxic, and then you've got a bunch of unmeasured molecules.

So each dot is a molecule here. The ones in white are ones we don't have any data on yet. The ones in orange are also ones we don't have data on yet, but in fitting the boundary between red and blue, we find there's a bunch of unmeasured molecules along that boundary, and we color them orange to point out that these are maybe the most uncertain in the whole model.

From the model's point of view, learning what these are would likely have the most impact on where that decision boundary sits. So you go forth and test those, with the idea that you're choosing the next experiment in such a way that it has the maximal impact on your prior beliefs about where the boundary should be. It maximizes your information gain.
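
A minimal sketch of that selection rule (uncertainty sampling) on hypothetical 2-D molecular descriptors, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# hypothetical pool: descriptors for 500 unmeasured molecules, with a hidden
# toxic / non-toxic ground truth we only observe by "running an assay"
X_pool = rng.normal(size=(500, 2))
y_true = (X_pool[:, 0] + 0.3 * X_pool[:, 1] > 0).astype(int)

# seed the loop with a few labeled examples from each class
labeled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])

for assay_round in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], y_true[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)       # near 0.5 = on the decision boundary
    ranked = np.argsort(uncertainty)[::-1]   # most uncertain ("orange") first
    picked = [i for i in ranked if i not in labeled][:10]
    labeled.extend(picked)                   # "run the assays" and add the labels
```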

And so I think that may help a little in understanding how these lab-in-the-loop systems work. As a result, I think the waterfall topology of the standard drug development pipeline is going to start to change. It's going to become possible to flatten and collapse different stages into each other, where they all share a group of objective functions you can optimize simultaneously, which can dramatically shorten certain stages of the pipeline.

At the same time, we can also merge and parallelize certain loops. You can run all sorts of different DMTA loops at the same time, integrating all of that feedback into what's called a continuous feedback mesh: a bunch of models, all conditionally dependent on each other, all being updated whenever new data comes in, concurrently influencing each other's predictions for the next cycle. And one of the most important changes in this process is the shift of the human role into that of a supervisor.

So as humans shift from decision makers and gatekeepers to supervisors, they're going to start overseeing these autonomous loops, monitoring them to make sure things are on track, and only intervening strategically while the AI handles the rest of the routine iteration and optimization.

We're starting to see real world examples and commercial platforms. If you want to build a startup and design one of these systems yourself, of course, you're free to do so. Many of those tools are open source, but putting them together can require a lot of effort and a lot of engineering. There are commercial platforms that can be licensed, and there are places using these that we can use to benchmark success rates. I think Insilico Medicine is maybe at the forefront of this: they have a fully automated robotic lab, 31 active programs, and claim to achieve concept to phase one in under two and a half years, and there's a platform, Pharma.AI, you can license from them or form partnerships around. Similarly with Iktos, another big player in the field, which has a similar licensable software-as-a-service offering. Recursion and Exscientia are big players that everybody's watching to see whether the promised speedups are actually materializing.

Isomorphic Labs of course is deep in this, as are all of the commercial platforms and tools, from Schrodinger to the NVIDIA and AWS tools and the Hugging Face models. I'll even note the new lab-automation-as-a-service providers: Strateos, for instance, offers cloud labs where you can control the automation.

So to conclude with a future outlook, I think we're moving towards a period where autonomous AI scientists are going to start leading a lot of the process. They're going to be able to design, synthesize, and test molecules in these closed loop cycles, and this is going to improve data quality and integration with every step. The data sets are going to become more unbiased and more accessible. And I think again the key component here is human trust and collaboration, which is definitely going to take some time to develop, and that may be the most interesting part of the path forward that we're going to experience in the coming years.

So with that, I'll conclude the talk. Again, bringing it back to Grant and The Bioinformatics CRO, I was happy to kick off this inaugural webinar. There are going to be three more following me over the coming months, and I'll turn it back over to Grant to introduce those speakers and tell us what's coming.

Grant Belgard: Nick, thank you so much. Yeah, so our next webinar will be broadcast at the same time, 11 am Eastern, on Tuesday, November 11th. We'll be joined by Ania Wilczynska, senior director of Bioinformatics and AI at bit.bio. So I hope to see all of you there. But Nick, questions. Everyone watching, you can put questions in the chat, and we can kick off with: what do you think of the new foundation models for single cell analysis? Are they having an impact on drug discovery?

Nick Wisniewski: Yeah, they're very interesting. I was very excited when I saw Geneformer and scGPT come out, and I think there's been a lot of adoption of these at new startups. This is a big part of the new phase of AI target discovery.

So I think the things they bring to the table that are fantastic are moving things into the transfer learning paradigm, where you can bring in a whole bunch of knowledge and do zero-shot predictions on your data without having to train on or learn from external data sets; that's already been done for you.

It also gives you a good way of doing representation learning, so it gives you a new representation by which to learn things. But I think the benchmarking of these models hasn't shown much more than maybe moderate increases in performance in cases like predicting drug perturbations. The benchmarks are still showing no clear improvement over linear methods, which is a bit surprising, and I think it's important to look at that and wonder whether it's telling us something about the data, about the algorithms, or about biology. I think there's something to learn there.

My guess is we're probably coarse graining somewhere. One possibility is in the molecule set we're using: there have been some recent studies suggesting that maybe you need to know the phosphorylation state of every intermediate molecule, what that chemical mess actually happening in the cell looks like, and that by just measuring broad activity, you may be coarse graining too much. The other possibility is that we're coarse graining in time, and there are dynamics that need to be learned that aren't being captured by our snapshots, which are very often focused on steady state.

Grant Belgard: What do you think the impact of Claude for life science will be in drug discovery? Speaking of recent developments.

Nick Wisniewski: Yeah. You know, I spent a lot of time looking at it yesterday after I saw the launch. I don't know if you've had much time to explore it.

Grant Belgard: Few minutes.

Nick Wisniewski: Yeah. I mean, the integrations it's made, I think, are fantastic.

We tend to think, particularly in bioinformatics, in terms of the scientific questions, these foundation models and things like that. But when you actually work in the pipeline and in the lab, you notice the overhead in connecting different systems, particularly ELNs like Benchling, and the inconsistent metadata that you might find across experiments. Access to data is a real bottleneck for bioinformaticians: getting the data, synthesizing it, harmonizing it, and moving forward.

So I think it has the capability to really have a huge impact on the way bioinformaticians work, as well as biologists, because it gives them access to a lot of this data. There are probably other questions having to do with reproducibility that come out of these tools. Every time in the past when we've seen easier access to tools, whether it was button-click testing of p-values or being able to throw models at everything, you saw an increase in p-hacking and a loss of reproducibility. So it's going to be very interesting to see the impact Claude has on actual science.

My impression of that largely comes from experience: working with Claude when you're programming, you get a lot of "You are absolutely right!", and I can say that most of the time I'm not absolutely right. In my 20-year career working in bioinformatics and biology, I don't think I've ever really said those words aloud in the practice of doing biology. A lot of biology comes from pushing back, creating counter-scenarios, and debunking ideas, rather than narrative-driven science. So we're going to see, I think, how Claude navigates that space and whether it makes a positive contribution in that sense.

Grant Belgard: Yeah, that's a really good point. You can certainly imagine someone running the same query a few times until they get their favorite gene showing up in a list and running with that, right? So we have a question from the chat. What do you think the timelines are on the transition to AI scientists?

Nick Wisniewski: This is a great question. So you know, of course, predictions are always hard, especially when they’re about the future. And these timelines are, of course, maybe the most contentious part of the AI field because there’s so much hype around them and the fundraising that goes into things. There’s a number of different influences on the timelines that I think go beyond just the development of the tools.
The adoption of these tools is going to be slower than they can be developed, particularly at large institutions, which use most of the resources in the field. For good reason, big pharma is going to be slower to adopt these systems than the startups.

So I think we're going to see more development happening in the startup space than among the big players, probably with a continued pattern of acquisitions whenever somebody's successful. Factoring in the current funding environment for startups, there may be a delay, and that delay, particularly in the States, may let other regions pull ahead. I read a lot these days about how far ahead in automation places like China are in terms of biotech research, so I wouldn't be surprised if we start seeing the first successful closed loop continuous learning labs emerging somewhere like there rather than San Francisco.

But in terms of guessing an overall timeline given those factors, and given the development still needed in automation, robotics, and their manufacturing so that we can get them into labs cheaply here, I think we're still looking at five to ten years before we get to these systems, even though the capabilities to do this may come a lot sooner.

Grant Belgard: And what’s the one misconception about AI first pipelines you’d like to correct before we wrap?

Nick Wisniewski: Yeah, that's a great question. I think again it's the idea that they may be a magic bullet. There's a lot of hope that it's going to improve reproducibility, reduce variation, and accelerate the speed of research.

But given that we're seeing only modest improvements in performance over linear models and the like, it still depends on having the right set of molecules and knowing whether you need dynamic, real-time data as opposed to the snapshot data we've been using. If we institute it right now with the same tools we've been using, it may fall flat in terms of delivering on its promises. We also need to incorporate the ability to question whether the problem being posed to it is well posed. From a scientist's point of view, this is often the ground floor when you approach a problem: is the problem well posed?

And until we build that base-level intuition into these systems, it's easy to start optimizing, or over-optimizing, something that shouldn't be optimized in the first place. I think that's a really good cautionary note.

Grant Belgard: Well, Nick, thank you so much for joining us and all our viewers, thank you for joining. We’ll see you November 11th. Bye-bye.

The Bioinformatics CRO Podcast

Episode 67 with Manos Metzakopian

Manos Metzakopian, co-founder and CEO of CellCodex, joins us to discuss CellCodex’s mission to provide high-quality, scalable cellular perturbation data, ready to train advanced AI models for biology.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Manos Metzakopian

CellCodex is a CRO that generates AI-ready perturbation data at scale. Our founder and podcast host, Grant Belgard, is also a co-founder and the CTO of CellCodex.

Transcript of Episode 67: Manos Metzakopian

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 66 with Eva-Maria Hempe

Dr. Eva-Maria Hempe, who leads NVIDIA’s healthcare and life sciences business across Europe, the Middle East, and Africa, joins us to discuss her work at NVIDIA, the gaps that AI can fill in healthcare research, and the future of drug discovery.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Eva-Maria Hempe

Eva-Maria Hempe leads NVIDIA's healthcare and life sciences business across Europe, the Middle East, and Africa.

Transcript of Episode 66: Eva-Maria Hempe

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to The Bioinformatics CRO podcast. I'm your host, Grant Belgard. Today, we're joined by Dr. Eva-Maria Hempe, who leads NVIDIA's healthcare and life sciences business across Europe, the Middle East, and Africa. Eva-Maria trained as a physicist, earned a Bill and Melinda Gates-funded PhD in healthcare service design at Cambridge, and has since moved through roles at the NHS, Bain & Company, VMware, and the World Economic Forum before joining NVIDIA. She now guides strategy for applying accelerated computing and generative AI (think BioNeMo, Parabricks, and DGX Cloud) to genomics, drug discovery, medical imaging, and more. Eva-Maria, welcome to the show.

Eva-Maria Hempe: Hey, great to be here.

Grant Belgard: So what do you do day-to-day at NVIDIA?

Eva-Maria Hempe: I think in general, my day-to-day oscillates between two major poles: working in the business and working on the business, or playing the short game and the long game. On the one hand, I am responsible for the business, and that means we have to deliver revenue, because if you don't deliver revenue, you're not a business, you're a hobby. But on the other hand, NVIDIA is all about the long game. We are creating markets. We are building things that haven't been built before. And so it's really about striking this balance. What that means, very practically, is on the one hand, as I said, working in the business. So I have customer meetings.

Eva-Maria Hempe: I work with my team. We're discussing strategies and tactics: what should our sales plays be? How are we going to work with startups? How are we going to work with this customer? I check our KPIs to see whether we're on track to deliver the revenue that's expected of us. I do a lot of talks and evangelizing to spread the message that NVIDIA is so much more than just GPUs, that we have all this great software out there as well, which is super helpful and super valuable to our ecosystem, and that people can save a lot of time by building on top of what we put out there. So that's the operational part. And then there is the working on the business: really the more strategizing, making decisions on whether we should focus on enterprises or startups, and where within healthcare we should focus.

Eva-Maria Hempe: To whom do we talk about which topics? To what degree are we focusing on the sale, versus where do we see new areas emerging which maybe aren't driving a sale or even a lot of compute initially, but where we really believe they're making an impact? And if they make an impact, eventually it will turn into revenue. That's one of the real beauties of working at NVIDIA: the company is set up in this way to build, to disrupt, to change. You have this luxury, and it's a bit crazy to call it a luxury, but in a lot of businesses you don't get to really work on your business rather than just working in the business.

Grant Belgard: So BioNeMo just went open source. Can you tell us about that and what pain point it solves?

Eva-Maria Hempe: Yeah, so in general, as I said before, at NVIDIA we're trying to lift up the field. We're not looking for the quick buck, so we're not going to change the field by collecting license revenues on BioNeMo. We think BioNeMo is a super interesting, super valuable tool for the community, and by putting it out there as open source, we can make it much more available to a lot more people. We can also increase the number of people contributing to it with their ideas, making it into something that is a lot more valuable to the community, more powerful, and much more in line with the community. Around the same time that we made it open source, we actually also changed it.

Eva-Maria Hempe: It has two pieces these days: one is the BioNeMo Framework and the other is NIMs. The Framework is a collection of microservices that you need to train and deploy models. It has a curator, an evaluator, and a guardrailing component, and you can use any of these, whatever helps you put out models in a better way. And then we have NIMs, NVIDIA Inference Microservices, and some of them are biology specific. We have some for folding, some for generation, some for docking, and you can put these together into reference workflows, which we call blueprints.

Eva-Maria Hempe: I often say it’s a bit like, if you think of a big box of Legos, it’s like the building plan, how you build the most basic thing out of them and then you can play with it and turn it into all sorts of other things. But in general, what we’re trying to do with BioNeMo is really solving the main pain points of drug discovery. So drug discovery is slow, it’s expensive and then also quite technically challenging if you want to use computer aided drug discovery. And so here we’re giving researchers tools to handle complex data, to collaborate and just in general, we wanna have an advanced biomolecular research framework out there that people can use and that they can do their best work with.

Grant Belgard: And for our listeners who aren’t already familiar with BioNeMo, can you give a quick primer on what they can do with it?

Eva-Maria Hempe: So, as I said, it is mostly about computer aided drug discovery. One way I usually explain it: we have another framework called NeMo, and that's not by coincidence. NeMo is all about training and deploying models that have to do with language, though by now it's actually also multimodal, and BioNeMo is that for the language of biology. If you think about it, a sentence has words and observes grammar; in the same way, a molecule has atoms and observes the laws of physics and chemistry. So that's the analogy. And the same way that with a language model you might have proprietary data and want to train a model on it, or fine tune a model with new data, you can do the same thing with biological data.

Eva-Maria Hempe: If you have data coming in, you can curate it; that's the curator part. Then you can evaluate it against certain benchmarks: how good is my model? And finally, you can make sure it has certain guardrails, so it doesn't do things you don't want it to do. So that's it in a nutshell. It's about training, deploying, and serving biological models for drug discovery.

Grant Belgard: So AlphaFold has made a huge splash in the structural biology world. What do you think is the next big thing that would be GPU enabled in biology?

Eva-Maria Hempe: For me, AlphaFold is really something. I'm a physicist, and I know when I did my PhD, which in my mind hasn't been that long ago, we locked up PhD students for three years in a basement to find out the 3D structure of a protein. And now you can just do it on a computer. You can go to build.nvidia.com, where we host the NIMs I mentioned before, and we have a model there that lets you fold a protein live in about a second on your computer. It's just mind blowing. It even works on my phone; I've done it during presentations, folded a protein on my phone in less than a second. In general, though, there are certain gaps around AlphaFold. It has problems with dynamics. It has problems with multiple conformations. It can't do disordered proteins.

Eva-Maria Hempe: And 60% of human proteins have at least one intrinsically disordered region. It's also not great with protein-ligand and nucleic acid interactions. So there's a whole lot of things it cannot do, and these are actually also the areas where we see a lot of work going on in the field. As NVIDIA, we're doing some research ourselves, in the spirit I mentioned before of trying to lift up the field, trying to show what's possible, and trying to inspire other people to go further down that path. We're also doing a lot of research in collaboration with all sorts of other people. Sometimes we're open about this, sometimes it's not disclosed, but yeah, we're seeing a lot of things going on.

Eva-Maria Hempe: And what we’re seeing in particular in terms of frontiers, I would say, are four things. So we see how do you deal with larger complexes and assemblies? How do you deal with post-translational modifications? How do you deal with dynamics, molecular dynamics? And then also how do you deal with protein design? Like how can you turn AlphaFold around? Like with AlphaFold, you have the sequence and you want to know the 3D structure. Can you have a 3D structure and figure out what is the sequence behind it? So there’s a bunch of work going on in the space and I think it’s going to be super exciting to see what will come out of that.

Grant Belgard: How do you see DGX Cloud changing the barrier to entry for academic labs?

Eva-Maria Hempe: DGX Cloud is an interesting part of what we offer, and maybe it's easier to understand in the greater context of what we offer. In general, we are very much agnostic about which NVIDIA GPU you're running your workloads on, and that is a huge advantage for people working with our software, because we don't want to lock anybody in. The only commitment you're making is that you're going to work on GPUs, which I think is not a bad lock-in; you're not locked in any other way. And the answer to which GPUs are the right ones for you will again very much depend on your situation. Do you have a data center? Is your data center big enough? Does it have liquid cooling?

Eva-Maria Hempe: Does it have enough electricity? Do you even want to run a data center? Or do you have big spikes where you need really high performance computing capacity in a short amount of time? DGX Cloud follows our reference architecture, so it's really all the different components, the GPU, CPU, and networking, perfectly aligned with each other. And it's in the cloud, on demand. So what we see it used for quite often is spikes. If an academic lab is trying to train a huge model, it can be the right thing for the lab, and it can be a great way to showcase the power of it, but it's not always the right solution. Sometimes it's also worthwhile to build your own on-prem capacity or to go with more conventional cloud capacity.

Eva-Maria Hempe: So I think it's an element of a larger compute discussion, but it definitely allows academic labs, if they have the funding, if it's basically baked into the grants, to get really top-notch, performant GPU computing on really short timescales.

Grant Belgard: And at what stages in the process does AI assist drug discovery today?

Eva-Maria Hempe: Pretty much along all of them, though I think we see different levels of activity. We see a lot in really early discovery. It starts with things like finding new targets, which I think is an interesting one, and one where I would hope for even more activity. Somebody told me the other day how big the overlap is between people working on the same targets; it's mind blowing. And, as we talked about before, intrinsically disordered proteins are a super interesting area for finding new targets, for being able to address proteins, or parts of proteins, which so far have been undruggable.

Eva-Maria Hempe: And we're working with a company there called Peptone, and with AI support they have actually found a method to figure out the structures of disordered proteins, which I think is super exciting. So we're starting there. And then of course we have all the virtual screening workflows: you have a target, you fold the target, then you have something like MolMIM, a generative model which, starting from a particular small molecule, creates all sorts of variations of that molecule. And then you take your protein and the multiple variations of small molecules you generated, and you use another AI model that can calculate how well they fit together. As I said, that's an area of active research as well.

Eva-Maria Hempe: How well can you really calculate those bindings? And again, another company we've worked with, called Inoform, can actually generate molecules that fit into a particular cavity. So there are a lot of interesting things there at the real fundamental level. But then there's even more to it: companies are also figuring out how to apply AI to the preclinical stage.

Eva-Maria Hempe: And then even in clinical research, the clinical stages of drug discovery and drug development, there is still so much that can be done, because so many drugs fail not necessarily because the biological mechanism isn't there, but often because you can't recruit patients, or can't recruit the right patients. Again, AI can make a huge contribution to solving these kinds of problems. And then you can go into manufacturing and selling drugs. So I always tell my clients that AI is a topic along the entire value chain, and we're seeing applications today along the entire value chain. At every single step, there is somebody working on something, and a lot of progress is being made.

Eva-Maria Hempe: You still have the whole issue that things just take a very long time, because clinical studies take the time they take. You can win back a bit of time by optimizing the recruitment of trial participants, which is usually a big delaying factor, or you can use AI to speed up the data analysis, regulatory writing, clinical writing, and submission processes. So there is some speedup you can get there. But the bigger speedup is happening in the earlier phases of drug discovery. In development, it's really more about figuring out where drugs work, and a lot of the work I see in that area is around biomarkers.

Eva-Maria Hempe: Again, figuring out what works for which patients, so that it feeds back into the early stages, but also so that once you're in trials, you have the right patients in your trials and a better chance of actually making it through phase three and showing efficacy. I mentioned all those different ways AI can help with the preclinical part, and there is actually really good data on that by now. Insilico is really famous for this, and they have been smashing it: they had 22 developmental candidates between 2021 and 2024, and on average they got to a developmental candidate within 13 months, with around 70 molecules synthesized per program. The fastest was nine months and the longest was 18 months.

Eva-Maria Hempe: And this is just a huge, huge speedup compared to what you usually see; these kinds of processes normally take years. So that's the preclinical phase, where it's really about the speedup, and you can also go from target and lead identification through lead optimization in 46 days these days. All of this is amazing. And, as I said before, in the clinical studies it's then really about being better. There was a paper that came out last year where they looked at AI-discovered drugs, and for phase one, the probability of success was twice as high as for conventional drugs. It was still pretty bad, but it was twice as high. For phase two, it was in line with the averages, though there the numbers started to become quite small.

Eva-Maria Hempe: And for phase three, there wasn't enough data. But if we assume this holds, if you assume you're twice as successful in phase one, which is not unrealistic because phase one is all about safety, and with better models we get a better idea of target effects, while phases two and three are about efficacy and dosing, then this actually means we're going from one in ten drugs making it to market to two in ten. That's still a lot of failure, but it basically halves our cost per drug. And if a drug these days costs on average $2 billion to get to market, that's a billion dollars saved per drug. So the potential is huge, which I think is why we're all still working on this despite all the problems we talked about, the long timelines and the difficulties in getting funding.
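
To make that arithmetic explicit, a rough back-of-envelope using the figures as quoted, and assuming the phase two and three success rates stay unchanged:

```latex
\text{cost per approved drug} \;\propto\; \frac{1}{P(\text{approval})},
\qquad
P(\text{approval}):\; 0.10 \to 0.20
\;\Rightarrow\;
\$2\,\text{B} \to \$1\,\text{B}.
```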

Grant Belgard: Where are the biggest talent gaps in bio AI today?

Eva-Maria Hempe: I think it's really about speaking multiple languages. And the question is also: talent where, and what keeps things from reaching impact? If you look at a lot of biotech and techbio, we still have the issue that the entire pharma ecosystem is set up in a particular way. Somebody put it this way: it's a coin flip, and we know that the coin is unfair; we know heads is going to come up with a 10% probability. Now, what these companies are doing is trying to improve the coin minting process. By using AI, we're trying to mint better coins, a coin which has a 20% chance of landing heads up. But this is really hard to prove.

Eva-Maria Hempe: And the entire system, the people at the VCs, their whole mindset is a biotech investor mindset, and they're evaluating things around a 10% coin flip probability. It's really hard to evaluate: is this really going to give us that lift or not? And it's different from other areas of AI, like quant trading, where you have immediate feedback: you change something, you make more money, great, let's do more of this. Here, it's almost the complete opposite of quant trading. You have something like ten years until you see whether it works or not. And I think that's actually one of the biggest gaps.

Grant Belgard: Even with the ten years, it's small n, right? So it trickles through after ten years.

Eva-Maria Hempe: And so, yes, I think we need more people who speak the multiple languages of AI, of data science, and of biology, and I think we're starting to see some of that. But I think it's really more the system as a whole, the incentives and the structures, and just the fact that we're dealing with biology, where answers take ten years to come. But I'm still optimistic.

Grant Belgard: What are your thoughts on community standards such as OpenFold and so on? Are there areas where there are glaringly obvious missing standards or areas that you think are still being held back by a lack of standards?

Eva-Maria Hempe: At NVIDIA, we are big believers in open source. So we think it’s the one way to really harness the power of community. And we are big believers in the community. NVIDIA is all about communities, about ecosystems and us doing our part to help the ecosystem develop, which is why so much of our software is actually open source because we believe in the power of this approach. And we really wanna support it to come to full fruition.

Grant Belgard: Well, it's essential to save biotech and pharma, right? The internal rate of return on R&D has been abysmal, below the cost of capital, for many years now. Hopefully at last that turns up.

Eva-Maria Hempe: It's actually interesting, because of those $2 billion per drug, or one and a half billion dollars per drug, only around $300 million or so is the actual direct cost. All the rest is the cost of the failed drugs and the cost of capital, because the capital is locked up for such a long time and you have so many failures all around. And the other thing, I don't know if you've seen it, is called Eroom's Law. If you look at how many drugs $1 billion in research spending buys you, it's been a steady logarithmic decline over the last 70 years. This is not recent. This has been going on forever, but it's just starting to get into territory where you simply can't continue this way. We need a different way of doing things.

Eva-Maria Hempe: We just can’t continue spending more and more and more and getting less and less and less.

Grant Belgard: So shifting gears, let’s talk about your own journey. What pulled you from physics to health?

Eva-Maria Hempe: It was the impact. So I was sitting there in my lab. I was doing quantum optics, which means I was sitting in a dark lab, because I was dealing with optics and lasers, and you don't want daylight messing up your experiments. So you go in in the morning, it's dark. You leave in the evening, it's dark. And during the day, it's dark. And I was just thinking to myself, what is this going to do for the world? Back then we kept saying, oh yeah, this could be used for quantum computing, but I was like, well, this is going to be at least 15 years until anything useful. And I have to say, that was more than 15 years ago by now. So I was just like, okay, is this really it? But then, as with those decisions, usually two things have to come together.

Eva-Maria Hempe: And the other part, which was the ignition to really change tack, was just meeting the right person at the right time. I met this girl who was an electrical engineer by training, and she studied how procurement processes at the hospital affect patient safety, with this very scientific, engineering frame of mind. And I just thought that was fascinating. I really liked the scientific method and the way I'd been trained to think, but here it was being applied to real world problems. And that's how I got to study healthcare service design.

Grant Belgard: Are there any insights from your PhD that you still use?

Eva-Maria Hempe: Yeah, I think it's really that organizations are an interplay between structure and people. That sounds very simple and very obvious, but if you're designing an organization, you're not actually designing an organization; you're designing almost a scaffolding for the organization to grow around. You're giving some structure, but an organization isn't the org chart. It isn't the policies. It isn't the trainings. It's the people who populate those structures, who interact, who meet each other or don't meet each other. And I think that was a really important insight; it pops up everywhere. Now, one of my big challenges at work is: how do I get enterprises to adopt AI?

Eva-Maria Hempe: That's again an organizational question as much as a technological question; actually, the technological question is maybe not even half of it. A lot is really about how you get people to adopt it, how you get people to use it. What are the incentives they're listening to? Who has power in this organization? How is this organization really structured? So yeah, I still use some of the things I learned and studied.

Grant Belgard: And what did you learn in your time with the NHS that you think tech sector often misses?

Eva-Maria Hempe: I think in the tech sector, it’s easy to look at everything through a technological lens that, oh yeah, we can improve this, we can do this. But a lot of my research and my work was about design thinking, which is very much empathy. You start with the end user, you immerse yourself into the end user. Ideally you get to observe, you get to shadow, but you get a real idea of what are people doing and what’s the real problems and how can technology help that? I think this empathy, this user-centric view is sometimes a little bit missing in tech. I think what we also discussed before, you’re creating a great tool and maybe the people you tested it with like it, but it has to fit into the workflow. It has to fit into the real life. It’s all about minimizing friction.

Eva-Maria Hempe: I was saying the other day, if you want to drive real value in an organization, it's about having something that has as little friction as possible and as much immediate value as possible. Then you're going to see adoption. If it's high friction, it has to have even higher value. If it's low value, it has to have even lower friction, but ideally it has both.

Grant Belgard: Can you tell us about your time at the World Economic Forum and how that impacted the work you do today?

Eva-Maria Hempe: Yeah, the Forum really is about multi-stakeholder work and the role policy plays, and again about what the right incentives are and how you can align the incentives of multiple different parties towards a common goal. What I did there was about the future of health: how do you make staying healthy a business, versus having people get sick first and then making them healthy again? I mean, that's an established business model, but why are we there? Why can't we just keep people healthy in the first place? And there it's really about thinking through the food industry. How can we make it a better business for the food industry to sell healthy food? How can we make it better for doctors to be paid to keep their patients healthy?

Eva-Maria Hempe: There are models for that where they basically get paid per patient in their catchment area: they don't get paid for the procedures they do, they get a fixed fee. It has all its pros and cons, but it's really about thinking things through from a joint value and joint incentive point of view. And like I said, when you're trying to change big systems, whether it's an organization or, as here, a multi-organizational system, that's really important. I couldn't imagine a better place to learn how you navigate these things, how you deal with politicians, how you deal with all the different lobbyists and interest groups, and really try to drive towards a common goal. I think there's no better place than the Forum to learn that.

Grant Belgard: Can you tell us about your time rowing in Cambridge and did that develop you in any way that’s useful today?

Eva-Maria Hempe: Yeah, I got to Cambridge twice. The first time I went to Cambridge, it was for summer research as part of my master's thesis. I knew people, and they made some connections for me, so I was at Cambridge during the summer before the freshers arrived. Then the freshers, the first year students, all came in, all the clubs started recruiting, and the rowing club tried to recruit me. And I was like, yeah, no, I'm only here for a few more months, it doesn't make sense. And I didn't do it. Then I came back to Germany, where I was finishing my studies, and everybody was like, oh, you were in Cambridge, did you row? I'm like, no. And then I really regretted it. I was like, well, I really should have.

Eva-Maria Hempe: So I promised myself, if I made it back for my PhD, I'd give rowing a go. And so I did, and initially I wasn't that good. I was in the second novice boat; I didn't even make the first novice boat. But then I just kept at it, and I barely made the first boat in the next term. There are three terms in Cambridge, and by the third term I was still in the first boat of my college, the part of the university I was at. And then I was around for the summer, so I thought, okay, the university team is doing a summer program, I might as well try that. So I did. And then they try to funnel you into joining the team full time. And I was like, well, Cambridge rowing.

Eva-Maria Hempe: My first year, I had watched the Cambridge boat races and thought, wow, it must be so nerve wracking. And then they were like, yeah, you did the summer program, don't you want to trial for the university? And I was like, okay, well, what's the worst that could happen? I'd taken that lesson from when I hadn't rowed and regretted it; I didn't want to regret it again. So I just went for it. And then I found myself on the starting line of that boat race, which I had just watched a year before. So I went within 18 months from never having rowed in my life to rowing in and winning a boat race. And I think there are a few lessons here; as I said, there's the one about no regrets.

Eva-Maria Hempe: I think the second one is that you're capable of a lot more than you give yourself credit for. And the third one is about the power of habits, the power of persistence, and the power of community. There are nicer things than getting up every single morning at five o'clock, going to the train station, going rowing, barely making it back by nine o'clock to get to your lab and do your work, and then at five o'clock going back to row. But it's incredibly disciplining, because you only have from nine to five. There is just no "oh, I'll do this later." You have to be done at five, because then you have to leave and go train, and you have to be there for training. You can't skip training.

Eva-Maria Hempe: And so I thought it was actually really useful to fall into this rhythm and go along with it, and also to shape your environment in a way that helps you do the things you want to do. Because, like I said, it's not that I wanted to get up at five, but I just had to, and once you're back from training, you actually feel pretty good. And of course winning the race, nothing feels as good as that. But even if I had lost the race... it was interesting, because just before the race, about an hour or two before the start, I remember we were in the boat bay and did a little circle with the whole crew. Until then I had a bit of nerves, but from that moment on, I was just calm. All the nervousness, all the nerves were just gone.

Eva-Maria Hempe: And I was just like, well, I put everything into this I could, I have no regrets. So whatever happens now on the water, I can look back at this day and I’m proud because I did whatever I could to get to this point. And I think that was interesting because the year before I thought those people must be so nervous when they sit on the start line. But actually when I sat on the start line, I was just calm, I was just ready to do this. And basically put in the work.

Grant Belgard: Why NVIDIA, what sealed the decision for you to join?

Eva-Maria Hempe: It's because we are a $4 trillion company. No, of course not. Actually, when I joined NVIDIA, it wasn't a $4 trillion company. No, it's just that I couldn't imagine another place right now where you have this impact on the entire ecosystem of healthcare. We work with everybody. We're the one AI company which works with everybody else. So I get to work with startups, I get to work with established companies, and I'm at the forefront of what's possible, and at the same time at the forefront of what's possible to do in an organization, like we talked about before. I mean, on the one hand, we're looking at models which can design proteins based on 3D structures.

Eva-Maria Hempe: But on the other hand, we're also looking at rolling out procurement agents, because that solves a real problem in the organization today. So it's just a really exciting place to be, at the center of the action around AI and healthcare. And in general, it just felt like a place where a lot of the things I've done in the past all came together: the multi-stakeholder management of the Forum, the strategizing of almost ten years in consulting, and the operational side of leading a team, helping people, and creating strategies and tactics to make your number, which I did at VMware. It just wrapped into this one package of doing something really exciting, in a field I'm super passionate about.

Grant Belgard: For early career computational biologists who are looking at entering industry, what three skills should they cultivate now?

Eva-Maria Hempe: It's a bit difficult to say, because I'm not a computational biologist, but I think it's also maybe not so much about the computational and the biologist; I just assume people are well trained in those fields. I think what's really important is for them to listen: to listen for where the problems are, what's being done, what people struggle with. The other thing is to really understand value. There's a lot of interesting work out there, and maybe it's a bit controversial, but if you want to do really cool and interesting work, academia is the place to be. If you're just in it for the cool, by all means, that's what academia is supposed to be. If you're going into industry, then you need to have a nose for value. You need to start to understand what value is.

Eva-Maria Hempe: And value can be very different things. Value doesn’t necessarily mean the biggest-grossing drug. It can also just be in line with the research portfolio of the organization, or in line with the individual values of particular managers, but you need to understand value. I think the last thing is teamwork, because so many of these problems have become so difficult that you just can’t solve them alone. You’re dependent on working with others who bring complementary skills and complementary experiences. So I would say the three things are listening, understanding value, and working well in a team.

Grant Belgard: For life science founders, when is it worth building their own models versus taking existing models or platforms?

Eva-Maria Hempe: So I think you have to be smart: do you really have an edge? And AI, in my mind, I always think about in three elements: data, compute, and algorithms. On compute, there are some people who have an edge because they can just buy compute for billions of dollars, but that might not be your edge as a founder. So that probably leaves either algorithms or data, and if you have something there, yeah, you might want to go for it. But very often, you don’t actually need to build a model from scratch. You might not even have enough data to build a good model from scratch. And coming back to the point of value: what is the value you’re creating? For what you’re trying to do, something else might be much more worthwhile.

Eva-Maria Hempe: It might actually be better to stand on the shoulders of giants and just take a foundation model and retrain it. And in general, I would always advocate for using the frameworks out there, because they make your work easier. So BioNeMo is not a model per se; it’s a framework which helps you build your models better. And I think you shouldn’t write your own data loader, and you shouldn’t try to configure guardrails from scratch. As a founder, you’re massively resource constrained. So try to think about what the things are where you can really differentiate, focus on those, and then use platforms and existing tools for all the rest.
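
[Editor’s note: as a concrete illustration of the “take a foundation model and retrain it” point, here is a minimal sketch of fine-tuning a small public protein language model with a new classification head instead of training from scratch. This is not BioNeMo or any NVIDIA API; the checkpoint name, sequences, and labels are illustrative assumptions.]

```python
# Minimal sketch: fine-tune a small pretrained protein language model
# (ESM-2, 8M parameters) for a binary classification task, rather than
# training a model from scratch. Sequences and labels are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze the pretrained backbone; train only the new classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERM"]
labels = torch.tensor([0, 1])  # hypothetical binary labels

inputs = tokenizer(sequences, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

model.train()
for _ in range(3):  # a few illustrative training steps
    out = model(**inputs, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

[Because only the small classification head is trained, a sketch like this can run on modest hardware with modest data, which is the resource-constrained founder’s trade-off described above.]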

Eva-Maria Hempe: And one thing I hope people take from this podcast is that there is so much out there that we’re putting out, usually as open source. We have frameworks and libraries and NIMs, and all of this is intended to help you avoid reinventing the wheel. If you’re doing medical imaging, you don’t need to write your own segmentation tool; this is all out there. Take it and then build a killer application on top of it. But be smart, look at what’s out there. NVIDIA can offer so much, and if you ask your favorite AI engine, “I have this particular problem, what are the latest NVIDIA frameworks?”, it should give you a whole list of libraries and frameworks you can use, whether it’s for data science or data frames, et cetera. There’s just so much out there.

Eva-Maria Hempe: I think the last thing for life science founders is to also look into Inception. Inception is NVIDIA’s free virtual accelerator. It gets you access to NVIDIA experts, who help you find the right tools and frameworks, which makes your money last longer. It gets you into a community of like-minded people, and there are also programs for cloud credits or discounts on hardware. So join Inception, look at what NVIDIA and other people have put out there before you build it yourself, and just be really smart about what really drives value.

Grant Belgard: What’s your boldest prediction for AI and drug discovery over the next five years?

Eva-Maria Hempe: I don’t know if it’s five years. I would hope it’s five years, but I think at some point we will look back at the way we do drug discovery today, and it will seem as archaic and, plainly said, as stupid as the alchemists trying to turn lead into gold. Today, if you tell kids, “Back in the Middle Ages, you had all those alchemists, and they were cooking away, and the idea was that lead is this less noble material and you can turn it into a more noble material like gold,” people are like, why? And I think we’ll look the same way at a lot of the things we do today in drug discovery and just ask, why did anyone ever think this was going to work?

Eva-Maria Hempe: On a more practical level, there are really smart and really interesting things going on around virtual cells, and better predicting the link between the genome and how cells actually behave. And not just cells, because we’re not just cells; we’re whole tissues. So I think we’ll see a lot more understanding of biology, at least to some extent, and I think that will get us to this point of looking back, as with alchemy, and asking how we could have been so stupid.

Grant Belgard: What’s a learning resource you would recommend for every trainee?

Eva-Maria Hempe: I think it’s not a learning resource in the conventional sense, but I would really encourage going to build.nvidia.com, because it just shows you what’s possible. You have all those different models, and you can play with them and get an idea of what they can do. And then you can also go to the blueprints and see how these are put together. So I think that’s a great resource. And then I would maybe pair that with, I’m a big fan of Perplexity, but also any other LLM agent of choice. I think they are great teachers; they can teach you anything. The other day I used Perplexity in voice mode while I was making dinner, just having this really natural conversation. And there is no stupid question. There is no judging.

Eva-Maria Hempe: You can like ask it anything like just, can you please explain to me again how this works? And I sometimes also use it for some of the NVIDIA stuff. I’m like, okay, can we go deeper on RAPIDS? Can you explain the different libraries? Like how does this work? Why does this work? So I think it’s a great tool to learn about AI, but also just anything else you wanna learn. And it can also challenge you. You can actually also ask it to quiz you and to make sure you really understand things and you explain it back to the machine. The machine actually gives you feedback whether you got it right or you need to brush up a bit more.

Grant Belgard: Yeah, I was actually doing the same thing with a bit of yard work yesterday. Also highly recommend that; voice mode is great. Eva-Maria, thank you so much for joining us. It was great.

Eva-Maria Hempe: Thank you, I really enjoyed it.

The Bioinformatics CRO Podcast

Episode 65 with Jeff Bizzaro

Jeff Bizzaro, founder and long-time president of bioinformatics.org, discusses the importance of open source tools and open access in the life sciences.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Jeff Bizzaro

Jeff Bizzaro is the founder of bioinformatics.org, which is committed to hosting resources for open science, bioinformatics webtools and data, and open source software development.

Transcript of Episode 65: Jeff Bizzaro

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 64 with Afshin Beheshti

Afshin Beheshti, director of the University of Pittsburgh’s new Center for Space Biomedicine, discusses the importance of space biomedicine to understanding human health both in space and on Earth.

Afshin Beheshti

Afshin Beheshti is the Director of the University of Pittsburgh’s new Center for Space Biomedicine in the McGowan Institute for Regenerative Medicine, Associate Director at the McGowan Institute, and Professor of Surgery at the Pitt School of Medicine.

Transcript of Episode 64: Afshin Beheshti

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 63 with Kenny Workman

Kenny Workman, co-founder and CTO of LatchBio, discusses his experience building a cloud platform for modern biology and how Latch has grown since our 2022 episode with his co-founder Alfredo Andere.

Kenny Workman

Kenny Workman is the co-founder and CTO of LatchBio, a cloud-based data infrastructure solution for working with molecular data.

Transcript of Episode 63: Kenny Workman

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 62 with Don Alexander

Don Alexander, founder and president of GeneCoda, discusses the current climate in hiring for life sciences, trends in remote and hybrid work, and the impact of AI on expectations for candidates.

Don Alexander

Don Alexander is the founder, president, and managing director of GeneCoda, an executive search firm focused on the life sciences sector, including biotech, pharma, med tech, and diagnostics.

Transcript of Episode 62: Don Alexander

Disclaimer: Transcripts may contain errors.

Coming Soon…