The Bioinformatics CRO Podcast

Episode 70 with Joanne Hackett

Dr. Joanne Hackett, VP of Health Systems Services at IQVIA and Chair of the Board at eLife, discusses her hopes for the future of healthcare.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Dr. Joanne Hackett

Dr. Joanne Hackett is VP of Health Systems Services at IQVIA and Chair of the Board at eLife.

Transcript of Episode 70: Joanne Hackett

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the Bioinformatics CRO podcast. I’m your host Grant Belgard. And today we’re joined by Dr. Joanne Hackett, vice president of health system services at IQVIA and chair of the board at eLife. We’ll explore what she’s building now at IQVIA, her career path across science and industry, and her most practical advice for people working at the intersection of genomics, data, and healthcare. Dr. Hackett, welcome.

Joanne Hackett: Great. Thank you for having me. And I was just saying before we started the podcast, we have known each other for a very long time and haven’t seen each other in a very long time. So it’s very nice to be with you again.

Grant Belgard: Great to see you again as well. So how do you describe your role today to someone outside of healthcare?

Joanne Hackett: Yes. So this is always the interesting thing when you’re sitting around a dinner table and somebody says, what do you do? And the people who do things with their hands are usually the ones who get the greatest following because you can explain exactly what it is that you do. But the rest of us who do more with our brains have a little bit of a harder time, either convincing ourselves that we do something useful or those around us that we’re doing something interesting. But I personally think that I have a very interesting role at IQVIA and IQVIA is one of these companies that brings together a lot of different components. There’s advances in data science, technology, and healthcare expertise. And the ultimate aim is to help customers make better decisions and ultimately improve patient outcomes.

Joanne Hackett: And the nice thing about working in a company whose mission is to help create a healthier world is that we actually do get to do that, whether or not we’re the individuals who are accelerating the innovations or making those intelligent connections across the healthcare ecosystem, we really are making progress in changing the way healthcare is being delivered. And I’m very fortunate that my role, I spend most of my time across Europe, Middle East, Africa, and South Asia, but we are a global company. And because of that, we can take best practice from one part of the world and see whether or not it can be useful in other parts of the world as well. And what I find probably the most satisfying is that we get to work across health systems.

Joanne Hackett: And for me, that’s so important because all of us are patients, whether we’re talking to someone at our pharmacist to either pick up plasters because we scraped our knee or because we’re picking up a prescription for something that’s a long term condition. We are in many aspects of the health system and we rely so much on needing that connectivity and that data. And my role is to staple that together in various different ways. A big part is understanding where governments are finding issues with regards to healthcare spend, trying to understand how they can maybe attract more clinical trials and also to make sure that the institutions that sit within these countries are fit for purpose. And that doesn’t always mean complicated systems. It means trying to map out that patient journey. Again, maybe it starts in the pharmacy or it ends up in a very sophisticated tertiary hospital.

Joanne Hackett: But what is it that we actually need that pulls these different components together? So that’s really my role is to try to help pull that together with both the local focus as well as the global experience. And because I am a geneticist, of course, I always bring a genomic and precision perspective into the way that we solve problems.

Grant Belgard: So when you look across your region, what outcomes are health systems asking for most urgently?

Joanne Hackett: Very often, most health systems are trying to find ways to be more efficient. And it’s not that it’s trying to cut costs, to try to do something cheaper because they’re trying to cut corners. There is a lot of waste in the system because it just quite literally has not been mapped out. And by the time it does get mapped out, the world has changed. So very often it’s trying to understand more about efficiencies, trying to understand what data actually needs to be collected. We spend an awful lot of time thinking more is better. It’s not necessarily always the case. So the questions are around, should I be trying to attract clinical trials to this country? Do I even have the right patient population that have certain mutations that industry is looking for?

Joanne Hackett: Are there particular types of software that will work better in my country because it’s got a particular module that’s necessary for pharmacy integration into a hospital? It’s all of the stuff that’s both local, but at the same time thinking about the efficiencies that you can pull from various different parts of the world that would make it make sense as well.

Grant Belgard: Where do you see the biggest near-term opportunities to improve patient journeys end to end?

Joanne Hackett: A lot of people think that it has to be this one-size-fits-all approach and you have to build a platform that takes in 15 different data questionnaires or whatever. It’s really not that complicated. For me, at the end of the day, I went into genetics, as you probably remember when we first met. I went into genetics because I was a child with a rare disease. And to this day, it still fascinates me that individuals can’t put themselves in the shoes of someone who’s been ill or has had an experience in a hospital. It’s not complicated. When people say that they’ve got a rash because they’ve taken particular medication, if you were to ask, did that happen two weeks ago or when it happened with the rash, if you don’t, we are alive, they’re busy. But the last thing we remember is the absolute detail. Empower the individual to take a bit more responsibility about their health as well.

Joanne Hackett: So the near-term opportunities in my mind are treating individuals, because we are all patients, like sophisticated individuals and giving them a bit more responsibility about their health care. Give them the tools that actually connect to something useful. If you want to know that I had a rash and it happened four hours after I took the medication, if I’ve logged that and you have access to it, surely if that’s integrated into my record and you can see that, it’s going to make that next step a lot easier. In addition to that, there are so many people doing research now and real-world studies, real-world evidence. Why can’t that be much more accessible so that you start to realize that the individuals who have early onset Parkinson’s were also in the majority, the individuals who had miscarriages, they also had teenage acne.

Joanne Hackett: They also were the kids who was a bit clumsy and fell off their bike. If we can start to find those patterns, we can start to treat earlier and allow people to have longer and healthier lives. So for me, it’s not necessarily about big policies and changes and sophisticated technology. A lot of that stuff will help, but I really think that we still, many, this is not just one country, this is several countries, we still just haven’t taken it back to the basic steps of I am a human being, I am feeling ill, I am ill, I need to be, I need someone to see me and I need the information to be collected and I’m going to need treatment. Really not that complicated.

Grant Belgard: How do you envision that being implemented in practice? What’s the, if you’re looking on the timescale of a couple of years, what do you think is the most feasible path towards collecting that?

Joanne Hackett: I’m really delighted to see, especially in Europe, there has been a lot of funding made available through the COVID Recovery and Resilience Fund, and a lot of countries are applying for quite creative solutions. Again, it doesn’t have to be complicated. I say creative, which is a slightly different word. And what they’re trying to understand is if we were to tackle a certain type of cancer, for example, or cardiovascular disease, they’re not trying to do everything. They’re trying to do a smaller population or a particular niche area that they’re trying to work on and solving that and then thinking about funding it further. So for me, it’s not, it’s not the point solutions that are the shiny thing that people were thinking more of the solution of five, 10 years ago.

Joanne Hackett: It’s thinking more about that, the broader aspect of what’s needed to pull all this together, but instead of waiting for that perfect ecosystem, it’s starting to carve out parts of it and think about a particular therapeutic area, for example, and then start to solve that. And what I’m also delighted to see, it’s not just in isolation. Oh, let’s build a registry because it’s helpful. The thought process is that registry is going to be extremely valuable if we’re collecting real-world data, if we have consent for research and recontact, if perhaps we have a sample that’s linked to a biobank, it’s not just solving problem that the individual is facing today. It’s again, being able to do research into what would be a perspective and retrospective data, which is also fantastic because scientific endeavors are changing on a daily basis.

Grant Belgard: What does a credible digital thread between lab clinic and home look like?

Joanne Hackett: I think they really and truly are becoming closer, which is fantastic. I love this whole concept of virtual hospitals. There’s very little that needs to be done in an actual physical institution today. We learned this through the pandemic, not that long ago, that a lot of stuff can be done remotely. I love the fact that there’s also been a lot more research going into sensors. I think it’s fantastic that you can look at people’s temperature and you can understand infection is happening way before it’s actually happening, especially from vulnerable or older individuals who don’t necessarily understand what certain symptoms are telling them. And also for many of us, we wake up in the morning feeling a little bit unwell. And probably there’s been something that’s been happening that we didn’t necessarily know about.

Joanne Hackett: So if, genuinely speaking, healthy people are catching the symptoms early, we’re no wonder we’re waiting for people to get sick before they get better. That connection between allowing the patient and the individual to be more empowered about their own health. We’re seeing more people take advantage of wearable technology that they themselves are purchasing, not just for their own health, but also to monitor their fitness levels and things like that, that connectivity with a virtual hospital and that connectivity with the physical hospital, I do hope at some point in the future, there’s no such thing as accident and emergency for people with [?]. I just, why is this happening? How can we not triage this better? So I do see that connectivity happening mostly based on the fact that we have smarter ways to collect data.

Joanne Hackett: And also individuals are a bit more curious about their own health now, because we’ve demystified the fact that something weird is lurking and you’re going to find out something very strange. If you do a genetic test, for example, you’re not going to find that someone has had an affair or they drink too much wine, you’re not going to find that you’re going to find a better, more personalized way to treat them. So I do think that connection between the precision approach and also the generalized precision public approach is starting to get closer and closer together.

Grant Belgard: How do you think about the balance between privacy and utility, especially when working with data across multiple countries and jurisdictions and regulatory requirements?

Joanne Hackett: That’s definitely one of the things that I think we got wrong about 15 or 20 years ago. And I say we as in the general, we in healthcare, we scared a lot of people into letting them think that if somebody found out something about them, it would be frowned upon or it would be a bad thing, or they would be marginalized in some way. It doesn’t really happen like that. And in fact, you start to see very creative ways that insurance companies are trying to understand protecting people from getting sick before they get sick, because it’s much more beneficial for them to do that. You’re starting to see some companies, employers saying, oh, if we could offer you some health testing and making your best version of yourself, would you like that? Of course they’re being nice, but ultimately you’re a much better employee if you’re alive and healthy.

Joanne Hackett: So we’re starting to see that the responsibility for the health of the individual is not just of the interests of the individual. It’s by other factors that are sitting around it as well. So we’re starting to see that migration a little bit differently, which I think personally is quite exciting. And that’s making people feel a little bit more comfortable about the data sharing aspect of it and that brokerage of data. The other thing that we’re starting to realize, and again, I do think that people were scared for a long time thinking that every single thing would be attributed back to them. You can do a lot of research on data that doesn’t have to be identifiable. We do not need to know my postcode to know that I’m 172 centimeters. You probably need the 172 centimeters. You don’t need the postcode.

Joanne Hackett: So what does that core data that we actually need to be able to do the research and innovation? And I just think that as an industry, we’ve gotten smarter about what that core data looks like as opposed to it’s not always more is better. And the more that individuals realize that they can be part of a study by not even ever having to give their name or the characteristics that they’re describing is actually much more interesting than the fact that they go on holiday in Spain, for example. You don’t need to share all that. And I do think that there are much better regulations now about how the data is collected and who the data processor is. And also I think in the very near future, especially in Europe with the advent of the European health data space, this will change the way you use data for primary and secondary purposes.

Joanne Hackett: And I think that will give confidence to individuals to allow that data to be collected and used for the appropriaries.

Grant Belgard: What use cases for genomics do you see are crossing from boutique to routine the fastest?

Joanne Hackett: So many people got very excited about doing consumer genetics, which is great. I think that’s a good way to get exposed to it. And the nice thing about that is it demystified for a lot of people that genomics was some weird, scary, invasive thing. So we’re starting to see that now translate much more into even some of the health testing that’s routinely rolled out is looking at some of the celiac disease, for example, things that don’t seem as scary as some very advanced, rare disease that somebody doesn’t know anything about, and we’re making it a little bit more mainstream, which I think is also helping people. And the thing that to me is going to change the way that people view genomics and healthcare is pharmacogenomics. It’s such an easy thing to implement.

Joanne Hackett: And the minute somebody realizes you shouldn’t take this medication because you can’t process it, or you have to take half a dose or double a dose, people listen to it because it’s science backed and that you get a very different outcome when people are told, everybody knows they should probably in some way exercise more or not drink as much done so. Hearing that thing over and over again is following, but being faced with the reality that you actually cannot process certain medications and they will hurt you or they won’t work for you at all and don’t bother taking them. It’s a very different response and it’s so cheap. So to me, the pharmacogenomics era is just taking off now. I can definitely see that almost being a screening mechanism for most individuals.

Joanne Hackett: You get something back about your own health very quickly, even if the answer is you don’t have anything and there’s none of these genes and drugs that you need to worry about, even if it’s only that you still have some information back. So there’s that trade off. So I think the pharmacogenomics space is the one that’s segwaying into routine healthcare very quickly.

Grant Belgard: What evidence do payers still want to see before they embrace broader precision medicine approaches?

Joanne Hackett: It’s very strange in my mind when you have to think about changing the landscape so fundamentally just for the sake of a couple of dollars. I think that’s so sad, but that is the world that we’re living in. So let’s park that to one side. I do think it has to be more about the fundamental difference that it can make in outcomes. And to me, I’ve always had the belief that earlier is better. And this is where I’m starting to be quite delighted by seeing much more interest in some of the real world studies, starting to understand these retrospective data sets. What are they actually telling us? And trying to using artificial intelligence and different technologies to find those patterns, to try to then map that back. I do think that’s where payers don’t want to be paying for something that’s not going to work. I wouldn’t either.

Joanne Hackett: Many of us are going to want to go out and buy something that’s not going to work. It just, that doesn’t work like that. So how can they get the best value? And a lot of people we need to remember think that they are amateur healthcare professionals because of this wonderful thing called the internet. And it’s actually quite, I think it is extremely frightening for healthcare professionals to be told full stop that they don’t know what they’re doing because someone has run a search and he’ll come up with something completely different and their friend’s grandmother’s sister’s brother is on this particular medication and they should be on it as well. We’ve almost got to that tipping point where the face in the healthcare practitioner has been taken away because the patient, if you will, the human being is trying to make those decisions on behalf of themselves.

Joanne Hackett: I do think that balance between taking responsibility for your own health, doing better research is useful, but ultimately the experience of a healthcare professional has to be married up at the same time. So for me, trying to understand how decisions get made, the science that sits behind it, and then most importantly, if it’s not going to work, don’t prescribe it, don’t do anything like that. That to me is the evidence piece. And we’re getting much better about looking at different types of evidence in order to be able to prove that. But genomics is a key thing to making that work.

Joanne Hackett: There’s just a lot of stuff that it’ll never, you don’t take certain medications if they’re just not going to work and why would you possibly do a cell and gene therapy on something that’s not going to have any risk and any outcome, you just wouldn’t do that and finding the right patient population, stratifying by genotype, we’re getting a lot smarter now, which to me is going to help get some of these orphan designations across and approved and actually have a much better outcome for individuals as well.

Grant Belgard: What do you consider fit for purpose, real world evidence? What makes it cross the line from merely interesting to really decision grade?

Joanne Hackett: For me, it has to do more about the quality of the data, or again, those back to the simple principles of if not necessarily more is better. I would much rather look at, I don’t know, a hundred data points that are very deep, especially if that’s what I’m looking for as opposed to 10,000, a data point, which tell me hardly anything. The other thing that is getting a lot more traction than even probably five years ago is companies spending more time looking at the diversity of data. And it’s not just a throwaway term anymore. I think for a while, people thought it was the right thing to say or do the same way, you know, getting the patient voice was something that was just thrown into an application several years ago. It’s very different now.

Joanne Hackett: And with diversity and data, the reason why is actually mainstream today is because we have it, we didn’t have it five years ago, people didn’t build the registries, they didn’t have the data. So we’re, we’re seeing, it’s just a very different, it’s a very different time now. And to me, that’s a very positive thing because it’s a very rapidly evolving area and the data is coming thick and south, which is great, which then just allows better decisions to be made. So having more data, deeper data and more diverse data is allowing the real world evidence studies to have, to be a cot above than what they, where they were even two or three years ago.

Grant Belgard: Where are decentralized or hybrid trials generally improving access to the trials or speed?

Joanne Hackett: It’s a combination between getting individuals who wouldn’t necessarily, sometimes you have to travel to a site and you have to travel because there was no other option previously. The healthcare landscape has changed tremendously. So that, that has changed in the sense that a lot of the different things that were rather being collected, whether it was just a blunt sample or monitoring something, a lot of that stuff can either be sent to a patient’s home or it can be done in a community center or a pharmacy. If you look at the physical aspect of getting people to a particular site, that has changed tremendously. In addition to that, many of the things that were being collected, you would have to come in to have a little chat with someone to go over your symptoms.

Joanne Hackett: Electronic, that was allowing people who didn’t, again, a big part was getting individuals to a physical site more than anything, and now there’s more people who are going door to door, doing things in a very different way. So the physical side of it has changed tremendously.

Joanne Hackett: In addition to that, the way that you’re able to find, especially for rare diseases, individuals across many countries, because very rarely are you going to find enough individuals from one particular country to be able to do the study, the fact that there are easier ways to share that data today, to be able to recruit across many different countries is also allowing the right individuals to be recruited into the trial, to basically run the study from many different countries, which again, even five, 10 years ago, the sheer cost in bed alone, because the infrastructure wasn’t there, was the main reason why it just didn’t happen.

Grant Belgard: Which countries are really bright spots for digital maturity and why?

Joanne Hackett: So I’m slightly biased clearly towards anything that sits in Europe, Middle East Africa, and South Asia, because I spent almost all of my time supporting and growing business in those countries. But I do probably have a very special thought in my heart for the Middle East. I’ve spent a lot of my time there. There’s a huge amount of investment in healthcare as a whole, digital maturity is just, it’s growing so quickly. Every time I, from one month to the next, something different has changed. So the sheer growth and expansion in the Middle East is just absolutely fascinating and I’m very pleased to see that happening. Where I’m equally delighted to see a lot of progress is in Africa.

Joanne Hackett: And I know it’s a struggle way to discuss a lot of different countries, but there are several countries that have been working very closely together to share against practice, to think about doing clinical studies and to even share data in a different way. And just the amount of frugal innovation that you’re able to see in Africa is again, just very fantastic because it’s changing that landscape in a way that a very small incremental change is making a massive impact. And so I think the two areas that are probably, so digital maturity of 100% for the Middle East, the access and the change in the way that healthcare is being delivered, perhaps not necessarily digital maturity, but for Africa it is happening in a very fast way as well.

Joanne Hackett: Now, I would also highlight, going back to the comment about the COVID Recovery Resilience Fund, Europe, and again, that’s a fairly broad statement covering many different countries, has tapped into some very creative ways to change the way healthcare is being delivered. And a lot of that is about investing in the infrastructure that’s needed for digital transformation. So those are the three hotspots, if I will. And I’m sure if I had to, if they think hard, I could pull out a couple of named countries, but I wouldn’t want, I wouldn’t want to do that on the spot.

Grant Belgard: That’s interesting. Thanks. Now pivoting to our second major topic, which is you.

Joanne Hackett: Yes.

Grant Belgard: What drew you into working at the Interface of Science Data and Health Systems in the first place?

Joanne Hackett: So I was one of these people who definitely wanted to be an academic. I was 100% sure that’s what I wanted to do. And starting my PhD, I was 100% sure that was exactly what I wanted to do. And during my first postdoctoral fellowship, I was introduced to the commercial world and I could see them as a way to make data or assets accessible. It wasn’t about money. It was, had nothing to do with that at all, to be with the access side of things. And that was new and exciting that as an academic, the only thing that you have is your brain, and you can only think about, you know, your next grant for your next publication, it’s not necessarily as collaborative. And I’m a trained geneticist and a tissue engineer. So this commercial world helping me to make discoveries more accessible was quite interesting.

Joanne Hackett: So then I ended up thinking about a way that I could collaborate, do things differently, which again, is not necessarily a typical academic mindset per se, and then I ended up working where I say that the triple helix, if you will, which is the intersection between academia, business, and the clinical communities, and I did love the academic world. And then when I worked at Pfizer and combined that with a very fast-paced industry job, I thought this is really quite exciting and I could see the parallels in both. And I loved that section of my career as well. Worked for the UK government, which was a very strange but interesting place as well. No one grows up as a geneticist expecting to work for the government. Actually, you’ve just did a professor of regenerative medicine, all very strange, but it was a really interesting way to see how decisions were made.

Joanne Hackett: And healthcare decisions, strangely enough, that are being made for a government or for a hospital, but of course that’s directly related to how research gets done and how industry works with governments as well. So seeing that all come together was extremely interesting. And then for me, working at IQVIA, effectively, I was, I actually elaborated with IQVIA during two of those stages of my career. And I realized that, you know, if you were to think about a global genomics dream, it can really only be achieved if you actually combine all of those different forces together. So for me, it was, it was a no, if I had to, somebody that, Oh, you have to pick one of these three sections and go back and only work there. I would go back to all of them very happily. And each of one of them was extremely fulfilling in different ways.

Joanne Hackett: But the fact that I can weave between them now is just, it’s delightful.

Grant Belgard: Looking back, what were the two or three inflection points that most shaped your path?

Joanne Hackett: The, the biggest thing that happened to me was getting access to the commercial world and that happened not because I was someone who knew what I was doing and was very progressive about that way of thinking. As I said, I was a hard and fast academic. The fact that I had a postdoc supervisor who encouraged me to think differently, who allowed me to think outside the box and expose me to that. If I didn’t have someone basically pushing me for that opportunity, I never would have been able to see that. And that kind of ended up then allowing me to be exposed to slightly different individuals. It was the job at Pfizer that got me those to the UK government. So it were these things that kind of, it was the overlap in the intersection as opposed to the hard and fast decisions in one particular role.

Joanne Hackett: But to be honest with you, I’m also that annoying person that always asks questions, wants to know what comes before, what comes after, why is this fitting together and you just, I think maybe people just get tired of dealing with people like me and say, gosh, we just got to give this person something different to do that keeps out their energy contained because otherwise they’re going to end up driving us crazy. But being curious and asking the questions gets you noticed and people start to realize that you may think of it differently, which is sometimes not a bet.

Grant Belgard: How do you decide when it’s time to take on a new remit versus deepening where you are?

Joanne Hackett: I have had to become much more selective as time goes on, mostly based on the fact that they said yes to everything, which I definitely said yes to a lot of things. When I was younger, again, so the exposure for the experience, and it was absolutely fantastic. I wouldn’t do it any differently. The thing is with certain responsibilities now, it’s not just, I have to get something back from it as well. It’s not just, I can constantly give, I want to learn. I’m not too old to learn. I’m not, you know, I’m to pasture yet. I want it to be a transaction more so than me just being able to help someone else and there’s so much to learn. And for me, understanding how I can, I sometimes can learn more from a 30 minute reverse venturing experience with a young, you know, second year economics student who’s doing an internship, for example, then I can be sitting on a board.

Joanne Hackett: So it’s all about how I think that I can both help the individual, but how the individual can help me as well.

Grant Belgard: What have you changed your mind about the last five years?

Joanne Hackett: What have I changed my mind about, gosh, so many things. I think, yeah, for me, health has always been the thing that is zero compromised. If I was told I wasn’t able to go to the gym or if I wasn’t able to exercise when I was traveling or something like that, it would just, that’s not going to happen. I never compromised my fitness and my health. That’s always been something that’s been extremely important to me. I’ve probably changed my mind a bit on how much effort I need to put into that side of things as well. You can still be quite healthy and well-rounded without putting too much energy and emphasis into it. And I think because I am someone who does have a rare disease, I think I thought if I put so much energy now, I’m almost building up a little bit of collateral for later in life when I may need it and clearly that’s not the case.

Joanne Hackett: So I’m probably slightly more relaxed about that. And also I’ve probably changed my mind a bit more on, I’ve definitely, I’ve always been a very critical person, both of myself and the people who, you know, work for me, things like that, like I have very high expectations. I’ve probably learned to be a bit kinder because we’ve all, we all have something going on in our lives and you never know if the person in front of you has just received bad news and yes, they might be sitting there taking a few extra minutes, getting their bank card out, but there’s probably something you don’t necessarily know and I think that comes with either lived experience from an individual having some something happened to them or something happened to their family.

Joanne Hackett: But I’ve probably become a little bit more tolerant towards not necessarily understanding why, but just accepting the fact that what you see is not necessarily what you get.

Grant Belgard: Which early career habits aged well and which did you have to unlearn?

Joanne Hackett: I’ve always been someone who has put a hundred percent of my effort into something I do that’s a characteristic that one of the first things people will probably always say, very hard working, that’s never served me wrong. If I’m going to do something I’ve followed through with that, that’s never been a bad thing. And if I’m going to do it, it’s going to be done well. I’m not just going to slap it together just to say that it’s done. So the hard work, dedication and doing it well has worked extremely well in my favor. Probably trying to get people to like me in something that hasn’t aged so well. We have to realize that not everybody is going to like everyone. It sometimes has nothing to do with the person. It sometimes has everything to do with the person. It’s just not worth it.

Joanne Hackett: It’s not, you have to learn very quickly how to work more professionally sometimes, as opposed to try to be the buddy of an individual. So that, that’s not something I spend a lot of time thinking about anymore. People can respect you and not like you, and I would much rather than respect me than like me. There’s that point in trying to win that fight a bit over. And the things that also probably have, has been extremely useful for me, which I’ve perhaps adapted, is how to be a lead. So some of the ways that I, and I think anybody can be a leader, you don’t have to be senior in your career. You can be quite junior and still lead people. And I think the characteristics of leadership have changed for me, but that’s probably more based on the roles that I’ve had as time has gone on, as opposed to the actual characteristics of how to lead.

Joanne Hackett: And can you share a specific failure that ended up redirecting your trajectory? People who say that failure is the best thing that’s happened to them are telling the truth. There are so many things that we fail at that we never want to talk about. And sometimes maybe as it’s happening, it’s maybe not the right time to talk about it for a variety of different reasons, and we only wait until a certain time in our lives to be able to share that, which again, maybe there’s particular reasons for that, but for many years, I didn’t tell people that I had a rare disease and I’ve suffered through some of the different consequences that were happening because of that, I didn’t want them to think I couldn’t do the job or I wasn’t good enough. So I feel personally as though I failed at being authentic very early in some of my roles.

Joanne Hackett: And it wasn’t great to feel that I was scrambling to try to make it up or to try to be a different sort of person than I was, I think that was terrible that I did that and I don’t think it would have changed anything had I just been honest and had an open conversation. I didn’t have to do anything any different. I don’t know why I just felt embarrassed about the whole entire thing, so that wasn’t great. And I felt certain companies that were hauling, they were terrible companies and it was so great that we realized it and we wrapped them up and moved on with it. And when I, the first company that I started myself, which I knew I didn’t want to leave this company, I didn’t want to be the person responsible for it, and I sold it as quickly as possible. And there’s so many people to this day that think, oh, that’s too bad.

Joanne Hackett: No, no, that to me, that wasn’t a failure to me, that was a massive success because I didn’t want to do it. So it’s strange how certain people’s failures are considered to be other people’s successes, but it’s also what you take away from it. And for me, to learn how to be my authentic self or to make the decisions that were going to be the best for me were way more important than what somebody was saying. Oh, gosh, what’s been so sad to sell your company? No, that was actually fantastic. Thank you very much.

Grant Belgard: On the topic of advice, what skill investments today will compound in the coming years?

Joanne Hackett: There’s enough, there’s, you can never take away the traits of hardware dedication, people being able to rely on you. Those are characteristics that take you an awful long way. And also being curious. It’s there’s, I can’t understand these people that we have the whole world in front of us. Ask questions like why, if you don’t know something, why just accept it in isolation? It’s find out why, what happens before and after, doesn’t this help you understand things a lot more? So I really think it’s important to be curious and dig in. And honestly, people are mean, bad things are going to happen. Cold life’s just, you can be upset about something, but honestly, you’ll only be able to be a better version of yourself. It’s grit, it’s determination. It’s just cracking on with it. We all have a huge amount to give.

Joanne Hackett: So why not put your best foot forward and take that the best possible opportunity, not just for yourself, but for others.

Grant Belgard: What would you deprioritize that’s often overrated on a CV?

Joanne Hackett: I don’t know. I don’t do all of these extra courses and brag about them and stuff like that. And these people, I think it’s hilarious when they talk about all these fancy numbers and try to, efficiency is at 4% and this and that, you’re a person. I just don’t understand these sorts of things. I don’t buy into any of that stuff. I know that a lot of people are very, I don’t know if they’re necessarily competitive with themselves or for other people, but just do the best version of you. It’s not that complicated. And I never, I get very, when I see these TVs and people are trying to take credits or turned around a complex organization in 60 days or whatever, there’s no way. You didn’t do it. And if you did do it, you had a team. And it’s the fact that you won’t take that step back and reflect on the fact that the team helped you support this.

Joanne Hackett: You’re probably somebody who I wouldn’t want to work with anyway. It’s not that hard to share the credit. There’s always enough to go around. I don’t like that thing very much.

Grant Belgard: And for startup founders, how should a new product team validate real buyer demand inside a health system?

Joanne Hackett: Yes, this is something that I think we could do a whole podcast on its own because it’s quite shocking how I will occasionally see this pitch deck come across my desk and you think, well, then it’s scary that someone has put this together and has worked on it for several months when no one will buy it. And the biggest thing that, there’s loads of things out there that could be created. There’s a lot of different things that will help. Going back to the question earlier about what evidence to payers need for things, ask your thought who’s going to pay. And it’s not all about money, but if you’re planning on selling a product, someone’s going to have to buy it. So why would they buy it? How is it going to be rolled out? There’s different regulations in different countries. Do you want to be across several different countries, different types of institutions, who is going to pay for this?

Joanne Hackett: And whether you’re a biotech, a med tech, a digital health company, you have to have a value proposition that’s going to add value as opposed to just, oh, it’s great that we’ve decided to round the edges of the door knob, great, but no one’s going to go out and commission 5,000 more of them. I know it looks better and it’s nicer, but you need to find out, find a thing that’s going to make the difference, change it, and even if it is expensive, if it’s worth it, people buy it. Look at the cell and gene therapies that are out there today. There are millions of dollars. They’re bought for the obvious reason that they were. So it’s not a cost issue. It’s a more about is it, is there actually a need for this and will someone pay for it?

Grant Belgard: When you hear a pitch about AI and healthcare, what signals seriousness to you?

Joanne Hackett: I don’t think I’ve seen one yet. I’m sorry. That’s probably not the appropriate answer. But the thing is, I guess I’m a geneticist. We’ve been using AI, quote unquote, for years now. There’s no one who can look at the human genome and understand the many different, you just, you can’t. So there’s always been tools to make our lives easier and faster. And being able to have tools that are going to do that, enhance it in a, in a way that you’ve got the right information. There’s very few algorithms that have been trained with the right type of data or the right amount of data. I think it’s fantastic that there are going to be things that will be rolled out in hopefully due course, but if we’re not there yet, why, I just understand why people get so hopped up about this.

Joanne Hackett: AI and healthcare to me would be that one of the best use cases will be for us to be able to use our phones to triage healthcare and whether it’s an emergency or whether it’s just basic healthcare needs, why can’t we think about the practical aspects of healthcare, the AI and pulling together data for research, predictive mechanisms and things like that, that is exactly where I’d love to be able to see it to go. But so many people are obsessed about the device or the whizzy thing that they can talk about that’s going to happen today when I just don’t know if the data is in the right format, in the right place, diverse enough and being pulled together by the right type of an agent to be able to make that make sense. So I personally haven’t seen it yet and therefore I’ll remain skeptical until the right thing lands on my desk, let’s do it that way.

Grant Belgard: And for health system leaders, where can modest investments in data infrastructure yield outsized returns within a year?

Joanne Hackett: A lot, a very simple thing is curating data. And it’s so boring to even say that. I’ve fallen asleep just saying that line, but it really is structuring data. If you had these people who brag about the databases they have and, oh, but we see 10,000 cardiac patients a year. And what information do you have about that? Can I collect that and cross-reference it with people with metabolic disorders? Can I then cross-reference it and look at something else? You don’t have [bone-lock sterilization?] or something like that. What use is it? So the modest thing for data as a whole is making sure that it’s actually collected in a consistent way, it’s structured in the right way, and it’s accessible. And those are very small investments. And that data is then actually worth something as opposed to these people, oh, you know, data’s like the new oil. No, it’s not. It’s completely different.

Joanne Hackett: You cannot compare that because with oil you use it immediately, with data you can’t. So it’s not new oil. You have to refine it first before it’s actually useful. So we’re not at that stage where we’re actually capitalizing on the right type of data because we haven’t invested in it. And to be very honest with you, I have never seen the front cover of a magazine or a newspaper with anyone with a big pair of scissors cutting a data infrastructure for a change. You want to be standing in front of a ribbon in front of an Eberron machine.

Joanne Hackett: So until we get fast enough for the shiny tool is the thing that we want to invest in, and investment is a real piece of something, you actually have to invest a huge amount of time, effort, and energy into what happened behind the door, as opposed to the shiny machine that’s sitting inside the room and building in the business case for interoperability, data standards, and things like that. It’s still thought of as the fluffy thing that goes alongside of the MRI machine, and until we change that mentality, we’re still going to be struggling with the physical versus the thing that you just can’t see and touch, which scares a lot of people.

Grant Belgard: I think our bioinformatics listeners will agree enthusiastically with that, right? 80, 90% of your time is spent data cleaning, data munging, right?

Joanne Hackett: Completely.

Grant Belgard: So for our early career listeners, what questions should candidates ask during interviews, but rarely do?

Joanne Hackett: I very rarely find someone who’s read enough about a complicated question to answer it themselves. And they’ll usually turn and say, it’d be interesting to know how you would approach this, or what are you looking for? And that’s the line, but how would you answer it? Very rarely do they come with this solution themselves. And I think it’s because they want it to be a dialogue issue that they’ve come up with the creative question, but answer it for me, I’d be much more impressed with you answering your question, as opposed to flailing my take on it. And they probably have a better answer to be honest, because they’ll have different ways of thinking than I will have.

Grant Belgard: This has been fascinating. And for listeners who want to follow your work and your thoughts, what’s the best way for them to follow you?

Joanne Hackett: Most of the work that I do is on LinkedIn. It’s the only social media that I really engage with. So find me on LinkedIn.

Grant Belgard: Great. Thank you so much for joining us.

Joanne Hackett: Thank you for having me. It was a pleasure and really lovely to see you again.

Grant Belgard: Thank you.

Bioinformatics Done Right, Now

Academics, Don’t Wait on the Queue: A Faster Path from Data to Publication


The email arrives: “Your sequencing data are ready.”

It’s the kind of sentence that makes a lab buzz. But after the first rush comes a familiar pause: Who will analyze this, and how long will it take? If you recognize yourself in that moment, The Bioinformatics CRO is for you.

The Shortest Distance Between Data and Figure

Our promise is simple: fast, publication-grade bioinformatics for academics at core‑competitive pricing—without the long waitlist or the learning curve.

  • Speed without shortcuts. In-house cores do important work, but they’re often backed up. We keep our queue short and our response times tight. Projects start quickly once scope is set.
  • Expert time, not training time. Our team is staffed by senior scientists who have shipped many analyses. That experience compresses timelines and reduces rework.
  • Pricing in the same neighborhood as cores. Hourly rates are similar, but our model is built to reduce idle time and cut the “waiting cost.”

In other words: you move faster, often with lower all‑in cost once you account for delays, rework, and the hours you spend shepherding a novice through their first pipeline.

Why Not Just Use a Trainee?

Postdocs and graduate students are talented. They are also busy. Courses, journal clubs, teaching, competing projects, and grant work carve away their hours. If your timeline is tight—or the analysis is non‑standard—asking a trainee to learn on the fly can turn weeks into months. By the time they’ve written code, defended choices, and redone figures for reviewers, the “cheap” path has quietly become expensive.

Working with us is different. We’ve already navigated the edge cases, the batch effects, the parameter cliffs, and the “looks great, but reviewers won’t accept it” traps. We deliver defensible results, clean methods text, and reproducible code.

A Diplomatic Word About Cores and Collaborators

Cores are steady partners, but queues are real, and revisions can be slow. Collaborator labs can be great, yet authorship and priorities get complicated. We’re designed to be your surge capacity and your clean handoff: fast starts, clear deliverables, and no unnecessary authorship entanglements.

How to Work With Us (and Save Money Doing It)

Two engagement styles both work well. Choose the one that matches your project and bandwidth.

1) Clear Scope → Accurate Estimate & Fast Delivery

If you know what you need—say, bulk RNA‑seq differential expression with pathway analysis and four figure-ready plots—tell us up front.

What you get: a tight statement of work, a realistic budget window, and a start date you can put on your lab calendar.
Best for: projects with defined questions, revision letters, or datasets similar to your previous work.

2) Engage With Us as You Go → Exploration & Iteration 

Not every dataset announces its secrets on day one. If the plan is exploratory, we’ll move in measured steps—share early readouts, discuss directions, and refine.

What we need from you: real engagement. Quick feedback keeps momentum high and scope aligned.
Best for: new modalities, mixed cohorts, or “we’ll know it when we see it” figure discovery.

A Cost‑Effective Division of Labor

To keep your budget focused on analysis (not cosmetics or copy), split the work like this:

  • Your lab handles:
    • Data and metadata hygiene (sample sheets, consistent IDs, clear conditions).
    • Figure polish for final submission (fonts, colors, journal‑specific formatting).
    • Manuscript prose (introduction, discussion, and related literature).
  • We handle:
    • QC and rigorous analysis (e.g., DE, clustering/annotation, integration, modeling).
    • Reviewer‑proof choices and statistics.
    • Figure‑ready plots and tables.
    • Methods text and code so everything is reproducible.

This division keeps costs lean and lets trainees contribute meaningfully without spending their semester learning an entire toolchain from scratch.

What to Expect From Us

  • Fast kickoff once scope is set. We schedule starts promptly and keep you posted.
  • PhD‑level analysis you don’t have to babysit. We make choices transparent and document them.
  • Figure‑ready outputs and clean methods. Drop them into your manuscript with minimal edits.
  • Reproducible artifacts. Notebooks, parameter files, and pipeline manifests live with your results.
  • Plain-language updates. Short check‑ins, clear next steps, and no jargon walls.

When We’re the Obvious Choice

  • Data in hand; publication clock ticking. You need figures in weeks, not semesters. 
  • Major revision lands. A reviewer asks for extra analyses or different thresholds. We execute fast and clean. 
  • Grant support. You want a credible analysis plan and methods you can defend. We can provide a letter of support too.

Common Questions

  • “Isn’t a student cheaper?”
    On paper, yes. In practice, hidden costs pile up: learning time, your guidance time, reruns after critiques, and the risk of delays. Our rate is core‑range, but our experienced team and shorter queue often make the real cost—and the stress—lower.
  • “Will you take authorship?”
    Only if you want us to and if our intellectual contribution merits it. Otherwise, we provide clean acknowledgments and thorough methods so credit remains where you intend.
  • “What about compliance and reproducibility?”
    We assume de‑identified data by default and return a reproducible package: QC summaries, parameter files, methods text, and code that stands up to reviewer scrutiny. 

A Short Field Guide to Faster Projects

Before kickoff

  • Write one paragraph that states your central claim or question.
  • List your cohorts/conditions, sample counts, and any known pitfalls.
  • Clean your metadata: consistent sample names, tidy spreadsheets, no mystery columns.

During analysis

  • Respond quickly to interim results; momentum matters.
  • If the path forks, choose one clearly—or approve a bounded exploration.

Before submission

  • Have your trainee apply journal style to figures (fonts, colors, panel letters).
  • Paste our methods text and citations; adjust voice as needed.
  • Use our code and QC notes to pre‑empt reviewer concerns.

The Quiet Luxury: Time

The most expensive thing in your lab is not the hourly rate of an analyst. It’s time—time before a scoop, before a grant deadline, before a trainee defends, before the field moves on. The Bioinformatics CRO trades in time saved without rigor lost. That is the value we offer: publishable certainty, delivered quickly, at a price you already recognize.

If the next dataset is knocking, let’s make the waiting the shortest part of your story.

Ready to move? Send us a brief description of your data and desired figures, or tell us you’d like to work iteratively. We’ll match the approach to the moment—and get you from data to done.

The Bioinformatics CRO Podcast

Episode 69 with David Scieszka

David Scieszka, founder and CEO of Vertical Longevity Pharmaceuticals, tells us about VeLo’s pioneering senolytic vaccine approach to clearing senescent cells and his quest for longer, healthier lives for everyone.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

David Scieszka

David Scieszka is founder and CEO of Vertical Longevity Pharma, which is currently pioneering a senolytic vaccine approach to targeting atherosclerosis and aging.

Transcript of Episode 69: David Scieszka

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the Bioinformatics CRO podcast. I’m your host, Grant Belgard. Today we’re speaking with Dr. David Scieszka, founder and CEO of Vertical Longevity Pharmaceuticals, AKA VeLo Pharma. David’s team is pioneering a first-in-class senolytic vaccine that teaches the immune system to clear senescent cells, those dysfunctional zombie cells that accumulate with age. With a PhD in biomedical sciences, an MBA, and even a stint as a U.S. Army PSYOP specialist, David brings a uniquely interdisciplinary lens to the quest for longer, healthier lives. We’ll dive into how VeLo’s platform could reverse atherosclerosis, where the company sits in the fast-moving longevity landscape, David’s winding path from scientist to biotech CEO, and the advice he wishes he’d had earlier. David, welcome to the show.

David Scieszka: Thanks, thanks for having me. It’s great to be here.

Grant Belgard: So in 60 seconds, what problem is VeLo solving and how?

David Scieszka: That’s a good question. To be more specific than I usually am, we are initially targeting the disease of atherosclerosis, and we’re doing so by targeting a fundamental driver of aging. And so we can potentially unclog the arteries that have already been clogged, which is something that people are trying to do right now, but to this day, no one has been able to do yet. And so one of the things on our platform, we are targeting those zombie cells like you’re talking about. And from that, we can have multi-disease capabilities where we can intervene not only in atherosclerosis, that’s just the first step. Our larger goal is to impact healthspan, the number of healthy years that you’re alive. If we can extend that for every human on the planet, we’re in a really good spot, but we have to initially focus on atherosclerosis. So that’s the key that we’re targeting first.

Grant Belgard: The term senolytic vaccine is unfamiliar to many. Can you break down the mechanism in lay terms?

David Scieszka: Yes, absolutely. So like you said, a lot of people like to attribute senescent cells to zombie cells. They are pro-inflammatory, they can cause localized tissue dysfunction, but they also feed forward the senescence phenotype. So they excrete pro-inflammatory molecules, both locally, and then those pro-inflammatory molecules enter your circulation, those go systemic. And so they transform other senescent cells all across the body. There’s a low level of these in every single cell type that we can study, including neurons. But the senescent cells themselves, being hallmark of aging, causing that tissue dysfunction, there’s of course going to be a therapeutic push to clear them out. And that approach is called a senolytic approach. It’s a little bit of a misnomer. It’s actually the apoptosis or apoptosis approach rather than an actual lysis, because that would cause even more inflammation.

David Scieszka: So it really is activating that mechanism of self-death. That natural process is a senolytic approach. So people have tried with dasatinib, quercetin. Many different senolytic approaches are being investigated currently, but they suffer from dose dependent toxicity, off target effects, and sometimes limited efficacy. And so finding the right antigen to target or the right marker to be able to intervene at is critically important so that you’re not harming healthy cells as well as senescent cells. And that’s been a real push in the senescence field recently. The vaccine approach is to basically engage your immune system to clear out these aberrant cells, these pathologic cells, allowing your immune system to do the work for you, which is hopefully seen as a positive approach as opposed to a potentially hazardous one, because you have to select the right antigen, absolutely.

Grant Belgard: Could you tell us about your preclinical mouse data?

David Scieszka: Yeah, absolutely. So pre-clinically, we have investigated the vaccine on a standard black six model. So this is a mouse model, an aged model. So it was aged 18 months naturally, and then we vaccinated. We did that, we selected that age point, because if it works at that age, we know it’ll work at every age before that as well. Things like thymic and dilution, where your thymus is degrading with age so that your immune system is responding less robustly. So we selected the 18-month-old time point, and then we vaccinated, monitored, monitored lung function, so heart and lung function, and then visually as well, multiple different metrics. It was really surprising the responses that we found. Not only did we see healthspan and lifespan extension, that was pretty expected. We also saw hair loss recovery, which is expected. That’s known in the senescence field.

David Scieszka: Qualitatively arthritic reductions, that was not expected for me. We also saw cardiovascular rejuvenation. That was really unexpected. We, as a scientist, I always am pleasantly surprised when experiments go well, and incredibly excited when things go better than planned. And so we expected the heart function to decline at the same rate as normal aging. We didn’t expect that the senescent cells would be having that drastic of effect, but we rejuvenated the heart to a younger time point based off the parameters, and incredible results. So now we’re focused on atherosclerosis because of the potential impact on humans as well. But yeah, I could go on about the beta too, but happy to talk to anybody who’s interested in reaching out to me as well.

Grant Belgard: And for listeners who are interested in reaching out to you and following VeLo Pharma, how should they do that?

David Scieszka: I’m on LinkedIn all the time. I try to connect with as many people as possible there. My inbox on email is always inundated, so it’s much easier to connect with me via LinkedIn. And I’ve got a unique last name, so I’m gonna be, if not the only David Scieszka, one of the only David Scieszkas on LinkedIn. So it’ll be pretty easy to find me, I think.

Grant Belgard: And where does VeLo sit relative to other longevity players? What differentiates a vaccine approach from small molecule senolytics?

David Scieszka: Yeah, the small molecule senolytics, right? So specifically talking about like, Dasatinib, dasatinib is a different mechanism. And also don’t mean to inundate people with weird terms, but like cyclin-dependent kinase inhibitors, cell cycle processes, P53, if you’re familiar with cancer. And so those are messing around with the internal metrics of a cell. We’re trying to go after a surface protein after it has transformed senescence. So whereas a lot of people are concerned with the senolytic approaches like Novidoclax, like Unity Biotechnology’s previous approaches, because there is an inherent cancer risk. If you’re messing around with the nucleus, if you’re messing around with the internal processes, we might be stopping the senescence transformation process.

David Scieszka: We’re going downstream of that post-senescence transformation, only killing the senescence cells after they have since transformed. And so I would argue that this is a much safer approach than all previous approaches, and especially by selecting an antigen that is all eyes to a few specific processes on the surface of proteins, as opposed to a much broader antigen. We have selected a very safe way to move forward in the senolytic space. Again, that’s just my argument.

Grant Belgard: At what point did you select atherosclerosis as your first indication? Was that after you got back the mouse data?

David Scieszka: It was, so that’s an interesting one. If you look at the data, I have a background as a computational biologist as well. If you follow the data, it’s much more cash efficient to go chronic kidney disease. Senescent cells have been implicated in chronic kidney disease. There’s even clinical trials right now against senescent cells using piscidibic trisetin. That’s a cash efficient way to get to market, but we went through what’s called the I-Corps program, which is in three months you interview 100 people, including KOLs, doctors, people on the street. From that, I actually had that as part of my AD testing. Are you more excited about a vaccine that targets chronic kidney disease, or are you more excited about a vaccine that can unclot arteries, and resounding response from doctors and people alike was atherosclerosis. It wasn’t even a comparison.

David Scieszka: And so the data, we have to follow the data absolutely. We could have generated primary data on chronic kidney disease. Instead, we reallocated those dollars to echocardiograms and the cardiovascular measures because we anticipated the product market fit really. So it was truly an internal strategy from the get-go. How do we find the best niche to fill?

Grant Belgard: Can you outline the next 18 months for us on your roadmap?

David Scieszka: Yes, for the next 18 months, it’s all fundraising. No, just kidding, but it’s definitely a big part of it. We are fundraising currently. As soon as we receive sufficient funding, we can engage in primate study, which is going to be incredibly translational. I would say we’re partnering with academics right now and also trying to open up conversations with the NIH. Part of what’s akin to their tech transfer office, so that we’re trying to get our vaccine in the hands of investigators who have the animal models to be able to test this out in their different indications and their different ideas. So we have on the horizon a vaccine study in primates that can measure whether or not it works. And of course it will, because we know the protein exists in monkeys and humans. That’s been known for a long time. We have to validate it, show that proof of concept.

David Scieszka: In 18 months, depending on funding, we can initiate manufacturing and we can do both of our toxicology. So taking a step back, there’s steps that you have to go through before you get your drug approved. And part of that is doing toxicology. Part of that is doing manufacturing. Those are the less exciting, a little bit boring aspects of it, but that’s part of the process. So we can definitely do that in the next 18 months. And as well, we can get translational primate study done. So basically we, in the next 18 months, we could have everything ready for our submission to the FDA.

Grant Belgard: Best case scenario, how do you envision VeLo contributing to a reduction of morbidity?

David Scieszka: Like what are the next steps beyond that? Yes, if we were on the market today, best case scenario, we would be able to reverse atherosclerosis to the point where another organ system would fail first. And that would be the extension of health span. So we would be pushing out [?], longevity, escape velocity. We’d be pushing out that lifespan and health span a couple of years, and then a new organ would fail. And hopefully our vaccine is also able to positively impact that organ, say the kidneys, or say metabolic dysfunction associated with a different type of disorder. If we have the capabilities to impact that, then we’re doing multiple interventions simultaneously. We just have to make sure that we’re showing that through testing.

David Scieszka: And then afterwards, in the sense of where we find ourselves in the landscape, we’re going to transition this from an injectable into an oral formulation, because our vaccine platform has to produce this in a pill form that is shelf stable, greater than six months. So we can go into the driest deserts, the wettest jungles. We can get this in the hands of everybody. As soon as we show it’s safe, we can get this in the hands of everybody and at extremely reduced cost. And that’s part of the strategy that we have too. We intentionally chose our platform because it is safe and cheap to produce. We want this therapy to go completely different way than say CAR T-cell therapy, where it costs you hundreds of thousands of dollars. That is insufficient in my future. I will not be a part of it. It’s a great approach, don’t get me wrong.

David Scieszka: It needs to happen so that we can find better alternatives though. We chose our vaccine platform for the people. We want this to be in the hands of all, democratizing the longevity process. And that’s the longer vision of VeLo Pharma.

Grant Belgard: So you were a US Army PSYOP specialist before grad school. How did that shape your worldview and how you approached VeLo Pharma?

David Scieszka: It was an incredible opportunity, an incredibly formative process. The resilience that I gained to team management, leadership that I know many people don’t go that route and they’re afraid of what the US military does and can do. And I want to say that it’s not necessarily all that way. There are great guys there. They are doing their jobs and what you learn along the way is so beneficial. I learned philosophy before I joined the army. And then I kept reading philosophy up until this day. It reinforced many philosophical principles like you work faster and better in a team. Things like if you have the right tools at your disposal, you can be a force multiplier instead of an individual. So team leadership, facilitating, giving people the right tools in order to empower them to be better on your team. All of these different things.

David Scieszka: It was an opportunity just as a young man to solidify about a foundation of hard work, resilience, stick-to-it-iveness, and an ability to think on your feet. But outside of that, it actually, it inspired me to join biotech. We were on a deployment in the Philippines, which doesn’t get better than that, right? But we were interviewing local populations. We were census takers, basically, and [?] specialists in that deployment. We would ask, where are your hospital schools and supplies? And do you need more of them? And so then we were finding from our census, if I investigated this rural area and found a guy who was, say, 40 years old, he would look like he was 50. And then if we go into the city where hospitals are everywhere, I would ask, are you, okay, what age are you? He would say 30, and he would look like he was in his 20s.

David Scieszka: So he would look younger than he appeared, and it was access to healthcare, access to simple things like toothbrush, toothpaste, good food. And so that was my inspiration into aging, really, was through the army and as well into biotech, because I thought, wow, there’s a real-world example where we can intervene in biology and see this effect. It was, yeah. I know not everyone who’s in the military has just any experience, but for me, it was absolutely incredible, one of the more formative ones in my life.

Grant Belgard: And can you tell us about your decision to do both a PhD and an MBA, why you did that and at what point? Decided you wanted to be an entrepreneur?

David Scieszka: Yeah, it was during grad school. Wanted to do an MD/PhD route as part of my undergrad. I was in biotech, and I saw that the movers and shakers, a lot of them had a lot of letters after the name, and I came into contact with an MD PhD, my first mentor, actually, Dr. Marcelo Freire, if he’s listening, shout out. Amazing man and a brilliant investigator. But he had an MD PhD, and I wanted to emulate that because of his understanding of physiology and also basic science, spanning the gamut. Got into grad school, got to talking to as many people as I could, and it turns out that I was not looking for an MD, and that was only by the advice of somebody who I deeply respected, a chair of a department, and he said, do you really want to be working on your MD PhD for the next 10, 15 years? You’re gonna have to go into residency after this. You’re gonna have to be patient’s side for several years.

David Scieszka: Are you sure that’s what you’re looking for? And thankfully, he was able to steer me in the right direction. I said, no, I wanna translate science into therapies. That’s what I wanna do. And he said, you’re looking for an MBA. If you can stomach it, you want a PhD MBA. You don’t want an MD PhD. So I took his advice. I’m very coachable in that way that people who have been there, done that, you gotta listen to them. You gotta, as long as they are an expert in their field, I gotta qualify that statement. But yeah, with the respect that I have for him, I listened to his advice, and then I was able to take on the MBA at the same time. It’s always been about saving as many people as possible, intervening in lifespan in as many patients as possible. Yeah, during grad school, I would say is the shorter answer.

Grant Belgard: So we’ve discussed what got you interested in aging and longevity. What specifically convinced you that senescence was the way to go right there?

David Scieszka: There are many approaches within the longevity space, senescence, big one. Yes, it’s as we do when we’re going through advanced degrees, we look very deeply at particular mechanisms, pathways, in this case, hallmarks of aging. And so I did a deep dive. Identity dive into every hallmark that was available at the time that we had because of [Lopez-Otin?] paper. And so I was finding that, of course, they’re all interconnected, mitochondria, nutrient dysregulation, but I found a through-line reactive oxygen species that impacted more of the hallmarks than others. And then there was an obvious phenotype associated with it called senescence. So when we think about [?], aberrant [?] causing DNA damage, aberrant [?] causing misfolding, things of that nature, nutrient dysregulation, surface receptor dysregulation, a lot of it stems from inflammation.

David Scieszka: And so then taking a look at what senescent cells do, inside the cells, there are these lysosomes that are filled with acid. And when they permeable out, that acid leaks out and it affects, well, cytosolic pH, of course, but as well, it hits the DNA. It goes and hits every organelle and it starts having proteins misfolded. And so the senescent cells seem to be a more fundamental, and I still haven’t found out whether or not there is a more fundamental layer than senescence, but it appears to me that because of the obvious ability to track and target a phenotypic expression of one of these hallmarks of aging, it’s a much simpler intervention to be able to find out is fundamentally driving the others. It’s a much more difficult thing to say, to target a tRNA synthetase inhibitor, although people who are listening should check out Mark McCormick on this.

David Scieszka: He’s doing some great work in that avenue. But if you’re trying to focus on a target, you have to be able to intervene. And for me, the senescence field was a fundamental lifting of other hallmarks of aging and as well.

Grant Belgard: Can you tell us about an early failed experiment or startup lesson that still guides you?

David Scieszka: It’s hard to pick which one because there have been so many. The entrepreneurial process is always iterative. And I think that a lot of our PhD and master’s projects are that way too. And so we learned from an early age in our budding scientific careers, how to bounce back, I would say, from a failed experiment, but learn from it at the same time. From a failed experimental point of view, for this vaccine, surprisingly, we haven’t had any. That I got to knock on wood. But from an entrepreneurial standpoint, there has been numerous. I was initially completely misaligned with investor expectations. I went out too fast. Before I understood the landscape, I would say jumping the gun is something that I try not to do anymore. As an example of what happened, I went out trying to raise $5 million initially.

David Scieszka: I talked to some guy who finally sent me straight and he said, your company right now is not even worth that. You understand that. You would be selling 100% of your company. You would have no ownership of it. And that got me thinking, oh my God, I gotta figure out what this investor landscape looks like. And so then I joined the Life Science Angels, which is an incredible group as well, taught me a lot of both sides of the founder’s side of the table and the founder’s side of the table. And so that was a really great learning experience for that. And then you take that learning and you go out and then you learn where you were wrong again. The second time I went out, I was raising too little money. There were people saying that, oh, you haven’t thought about the long-term trajectory of your company. And it wasn’t that. We were raising in small tranches being say 100K here.

David Scieszka: The next round is gonna be 250K. The next round is gonna be 500K. But that’s completely misaligned with what investors do because it’s just as easy for a small stage investor to write a 250K check as it is for them to write a million dollar check. That’s for them, it requires as much legwork. And so for me to be going out raising 100K, they’re not gonna do it. You need to find a specific localized angel that’s gonna be willing to cut such small of a check. And so it’s a constant iterative improvement but I would say strategize first and then go out as opposed to just going out because you have the action. So try not to jump the gun. That’s gonna be something that sticks with me for a long time.

Grant Belgard: And speaking of fundraising war stories, what’s the hardest lesson you took from the first iteration of your pitch deck?

David Scieszka: I would say [?] is key. It’s the word of the century, especially for entrepreneurs. As a data scientist, I love hearing yes. I love hearing no is fine as long as there’s a reason. And as an experimental scientist too, if we have data to support why yes, and if we have data to support why no, there’s a way forward. But no data is awful. If there’s an experiment that goes haywire and can’t track down why, it’s a wasted effort and a waste of time. And so from our first pitch deck, no’s and then requesting feedback and having radio silence on the other side, that was, it required a different level of self-examination than I’ve had to do up to that point. It had a lot to do with probably what people feel during their master’s and PhDs a lot too, like what’s wrong with me? Why can’t I get this done? It’s great science. What am I doing wrong?

David Scieszka: And without the data to support it, it was incredibly difficult, but it took a supportive woman, my wife, she was able to set me straight. And she said, it’s a numbers game. It’s gonna be fine. You gotta stick to your guns. You know this better than anybody. And you know you. If you start trying to change who you are in order to placate to every single person that you meet, and if you just beat yourself over the head over every single set of non-existent data that exists, you’re not gonna get out of this alive. You have to be able to stick to your guns and stick to yourself. So out of the war story, I think actually came some positive growth, but it was, yeah, it was difficult at first.

Grant Belgard: And what role does bioinformatics play in your R&D?

David Scieszka: So far, I haven’t been able to touch R or Python in about a year. And it’s, I wouldn’t say killing me, but I wanna get back to it because of the AI revolution, because of everything that’s on plate right now in silico medicine, everything that’s coming down. It will play a role in the future. I know that to be true because there’s going to be other hallmarks of aging that we can potentially target after this one is commercialized. Right now, we could optimize potentially greater optimization of antigen selection, greater optimization of peptide sequence targeting. That could be a role in the bioinformatics pipeline. That has a lot more to do with the computational modeling, docking, as opposed to what I’m more familiar with, which is omics, multi-omic analysis and integration. And as well, I’m sure that’s something that you do all the time.

David Scieszka: But yeah, I would say less now, even with this agentic tidal flow that we impending see on the horizon, but that’s a misnomer in itself. And I don’t need to get off on a tangent there, but as far as I can tell, and as far as all the companies that I’ve seen, this agentic revolution is not as close as it may appear. So far as I could tell anyway, we are several years out. And even for simpler tasks, it’s been interesting to see some of the companies that made waves earlier in the year by laying off relatively low-skill staff to replace them with agentic AI has been quietly rolling that back as it hasn’t panned out as well as they had hoped.

David Scieszka: Yeah, and I feel for those employees, to be honest with you, I know that’s going down a completely different direction but I feel the employees right now who are being subject to this unreasonable layoff system, yeah, okay, you’re on unemployment now, that we should really think about who we’re allowing to have power over these people and how we think about when it’s time to hire an AI, when it’s time to hire a person so that we don’t keep messing with these people. Yeah, hopefully it gets better. Hopefully we come to our senses and not fire people as fast. The hiring system is broken but that’s a different conversation altogether.

Grant Belgard: So if you could replay one career decision, what might you do differently?

David Scieszka: I don’t think I would. And I know that’s an unsatisfying answer for some but we were in the same unit together and I actually asked him that if you could do anything over again, what would you do? And he said, if I do anything different then I’m opening up my future to the unknown. I’m here because of everything I’ve ever done before. And even if I don’t like it, if I don’t like today, there’s still tomorrow. If I don’t like the next week, look at all the good that we’ve done before this. If I say it’s also removing all of the positive momentum that we’ve gained up until this point. And so he said that he wouldn’t change anything. And it took me a while to come around to the idea but I don’t think I would anymore either. I used to think about it. I used to think, oh yeah, I wouldn’t stand up in the middle of class or I wouldn’t forget this at that time, but it forms us.

David Scieszka: It really is who we are.

Grant Belgard: That’s interesting. I don’t think we’ve ever gotten that answer before. So for PhD trainees eyeing entrepreneurship, what hard skills should they cultivate now?

David Scieszka: I would say self-examination, a thorough ability to understand the self. If, oh, that reminds me of a quote. I don’t remember him, but he’s a prominent entrepreneur turned venture capitalist. And he said, entrepreneurship is the worst thing that I’ve ever done. I don’t know why people do it. I will never do it again. And so that kind of implies the difficulties that are facing a lot of people who are getting into this. The intrinsic motivation needs to be high, the compulsion or the specific focus or whatever it is that gets you up in the morning and keeps you up at night. If that’s a driving force behind entrepreneurship or your specific focus or your specific task, then it’s definitely worth it. And you have to know yourself to be able to know if that’s true, because a lot of the times external influences can be confused with internal motivations.

David Scieszka: So a lot of people are fronted with, I’m a broke grad student. If only I had an additional 40K a year, I’d be doing better. It’s not necessarily going to help to have more money either. So if we can separate this internal motivation from external inputs as data guy, if we could do that, then we can have a greater understanding of what really needs to happen before we even jump into the entrepreneurship idea to strategize initially. Don’t jump the gun to be able to say, okay, this is a good fit for me. And that’s true of most things. I don’t know how you feel about that, but I would think that understanding what you’re good at, what you want to be good at, where the market’s going, because we’ve seen things like, oh, autophagy wasn’t a big thing until there was a Nobel Prize for it. And so people who were studying autophagy back in what, the 60s, they had no grant funding whatsoever.

David Scieszka: So if you’re studying a process that isn’t hot and it won’t be hot, you’re facing an uphill battle, are you willing to fight that hill? If you aren’t, if it’s not those things, if it doesn’t check those boxes, then it might not be worth it. But it all comes from Socrates’ self-examination. It all comes from self-examination and it’s a real understanding of soul.

Grant Belgard: So regarding building an interdisciplinary team, what do you look for in the first 10 hires?

David Scieszka: I’m gonna quote another person that I wish I was as smart enough as him to actually make this quote myself. He said, hire slow fire fast. And that’s not necessarily true, the firing part, but the hiring slow is very true. If you’re looking for a job or a task that needs to be repeated and internal functionality as opposed to external functionality, that could be worth a hire. If you’re looking for something as a one-off or if the tasks aren’t solidified in your mind, then I would not hire yet because it is so critically important. Once you hire somebody, it’s a relationship. I don’t wanna fire somebody. I’ve had to do it before. I still, it keeps me up at night. So hiring slow, find a job that you need that can’t be done externally and then outline those responsibilities and tasks in order to make sure that they’re functional to the organization.

David Scieszka: In the first 10 hires, as critically important as they are, alignment, culture, fit. So the culture is gonna be important for the rest of your organization. And if your first 10 hires aren’t culturally aligned, you’re setting yourself up for a bad culture. I know it’s a business word that a lot of people think, oh, that’s hokey, it’s culture. If you want it as an altruist, I want to set up a culture of people who are morally aligned with what I want my organization to do. If I hire somebody who’s going to a bottom line, dollars are all that matter, it’s going to be culturally misaligned, morally misaligned and ethically misaligned. And so the first 10 people set the culture. And if you want those 10 people to meaningfully follow you through, they have to be aligned with that for sure. So I hope I explained culture at least.

David Scieszka: And then the mission as well and the vision and the functionality. So I hired our CSO because she did her PhD on our specific vaccine platform. But that was after I had gone through another 10 people who I could tell were just not there. They wanted to commercialize this and then exit immediately. For people who don’t know, that means get out and take basically your bankroll. So get paid as fast as possible. That’s not who I want. I want somebody who’s in this for the long haul. I want somebody who’s in this because they care about humanity. And so that’s when I found our CSO and that’s the kind of all things fell into place. So it takes a long time. It takes a lot of legwork. I don’t know if you probably have some insights on this too because you’ve actually hired probably quite a few people and maybe some of them better than others. I don’t need to call it out like that, but it’s hard.

David Scieszka: It’s definitely hard and it takes a lot of time but a good hire is definitely worth it I would say. What do you, I don’t know, what do you think?

Grant Belgard: Yeah, I totally agree. And culture I would say would be up there for me as well because as the organization grows larger the culture starts to get out of your hands and becomes in the hand, it gets in the hands of your early hires who are more directly interacting on a day-to-day basis with your next 10, 20, 50 employees. So yeah, couldn’t agree more. So longevity is hot but crowded. How should founders pick a viable niche?

David Scieszka: I’m gonna have to quote Matt Kaeberlein on this one. That’s somebody who actually remember Matt Kaeberlein he says that longevity is like, or the hallmarks of aging is like longevity under a lamppost where we focus on what we can see, what’s outlined for us. And so if you, well the analogy is to take a step back. If you drop your keys in the dark and then all of a sudden you start looking for your keys where there’s light shining, it makes absolutely no sense. And so it’s probably based off of an old cartoon like, oh, I can see over here, but my keys are over there. It doesn’t make sense. And so in the longevity space, if you have a hot new thing that is specifically targeting a hallmark of aging and it’s defensible, I’d say go for it.

David Scieszka: Even as crowded as it is, if you can find a niche within that, that you are either better for some reason, more defensible than somebody for some reason, or you’re potentially your AI modeling who’s going to outpace the next AI modeler, I’d say go for it. The fear is that we get outpaced by somebody else and what a terrible life it would be if we had never tried. So I think that going for it, even in a crowded space, if that’s what you found you’re the best at, I would still do it. So senolytic spaces, the crowded, it doesn’t matter. It doesn’t matter because we found a better antigen on a different realm. If you find something that’s not in the hallmarks of aging, but you think that it is, by God, go for it. There’s two organelles. There’s an organelle that we’ve never really talked about called The Vault. It’s got an HDAC.

David Scieszka: It’s got a DNA repair system in it and it’s got a telomere extension system in it. The Vault, look it up. We don’t talk about it in longevity. Nobody does. We didn’t even talk about it in Bio 101 because we don’t really know what it does. If you’re studying something out there that nobody really talks about and you think it has a relevance to longevity, go for it. Same thing with that other organelle that I can’t even remember now. There’s two of them that recently came up that we never learned of in Biology 101 and now people are hopefully studying it, but same thing. If you find yourself in a position where tangentially you’re related to longevity and you can see yourself impacting a disease as opposed to aging, that still helps span too. Absolutely go for it. I would recommend everybody pursue their dreams regardless of whether or not it’s a crowded space, as long as it’s defensible.

David Scieszka: I don’t want to tell anybody to say, go waste 10 years of their lives.

Grant Belgard: What’s one bold prediction you have for the longevity sector over the next 10 years?

David Scieszka: I am one of those crazy guys that thinks longevity, escape velocity is nigh. I really believe it. I think that our senolytic vaccine is going to be able to push the boundary of lifespan and healthspan by at least five years. I do believe that. There are directives right now. The ARPA-8 is one of them. They want you to be able to, and also Peter Diamandis’ directive as well, they’re trying to help you reverse age by 20 years. Jeez, if you’re talking about sarcopenia, God, there are so many companies right now and there are shots on goal, to quote Mitch from Ora Biomedical, there are shots on goal for the longevity field that are getting FDA approval right now. And so we’re going to be able to extend healthspan by a couple of years. And then the next therapy is going to extend by a couple of years.

David Scieszka: And then we might be the last generation to have to choose whether or not we expire naturally. That is an incredible thing. I know it’s bold. And I know that I don’t have the data to back it up, but it’s fun. It’s fun to think about the possibility that what if our parents can live forever? What if they get to choose? What if we get to live forever? We could choose. So hopefully my prediction holds true, but that’s a longevity escape velocity in the next.

Grant Belgard: I did ask for a bold prediction. What closing advice would you have? Write this on a sticky note above your desk. What would you suggest?

David Scieszka: Either “you’re worth it” or “you’ve got this” because it’s hard. It’s a hard world out there and I hope it gets better. But right now, going through master’s, PhD, undergrad, it’s easy to view the world as sharp. And it’s easier still to look at a crutch, say a bottle or something like that. And maybe it’ll soften the world for a duration of time. But as soon as you let go of that crutch, you can face it. You’ve got this.

Grant Belgard: Where can our listeners follow your work and keep up with VeLo Pharma?

David Scieszka: Yeah, I gotta do a better job than this. And actually we should probably connect after this and maybe [?] or if you have any recommendations. So I’m trying to build up the LinkedIn. You can definitely follow us on there. We have at least a landing page. We’ve got a website, vertical-longevity-pharma.com with dashes in between the letters. We’re going to be starting up revamping our media presence with updates, especially fundraising updates, progress points, things of that nature. Yeah, if you wanted to reach out to me personally, I’m just a guy just like everybody else. Happy to talk to you. It doesn’t matter if you’re looking for advice or if you feel like you can help me. I’m a big believer of the mantra, find somebody to help and repeat. And so if you need help and I can help you, reach out. If you think you can help, I’m happy to reciprocate.

David Scieszka: So yeah, find us on LinkedIn, find VeLo Pharma on LinkedIn, Vertical Longevity Pharma. You can find us on now, hopefully Twitter in the future. Yeah, that’s probably the best.

Grant Belgard: Well, David, thank you so much for joining us.

David Scieszka: Thanks for having me. This has been a load of fun.

Phil Ewels

The Bioinformatics CRO Webinar Series

February 18, 2026: Phil Ewels – Reproducible Bioinformatics at Scale: nf-core + Nextflow

Phil Ewels

​Phil Ewels is Product Manager for Open Source at Seqera. He holds a PhD in Molecular Biology from the University of Cambridge, UK. Phil joined Seqera in 2022, previously working at the National Genomics Infrastructure (NGI) at SciLifeLab in Stockholm, Sweden, where he became involved in the Nextflow project and co-founded the nf-core community. Phil’s career has spanned many disciplines from lab work and bioinformatics research in epigenetics, through to software development and community engagement. He is passionate about open-source software and has a soft spot for tools with a focus on user-friendliness. He is the author and maintainer of tools like MultiQC and SRA-Explorer, and helps lead the nf-core and Nextflow development teams.

In this live webinar, he gives an overview of Nextflow and an introduction to some of its new and exciting features for bioinformaticians looking to scale up their pipelines.

Transcript of The Bioinformatics CRO Webinar Series – Reproducible Bioinformatics at Scale: nf-core + Nextflow

Disclaimer: Transcripts may contain errors.

 

Grant Belgard: Welcome to the final talk in The Bioinformatics CRO webinar mini-series. At The Bioinformatics CRO, we help life science teams turn complex data into clear decision ready insights, providing flexible expert bioinformatic support from study design through to analysis and reporting. As part of that mission, this webinar series features practitioner focused talks with concrete takeaways you can put to work right away. Today’s talk is by Phil Ewels. Phil is a senior product manager for open source software at Seqera where he helps lead the nf-core and Nextflow development teams. Today Phil will be presenting on reproducible bioinformatics at scale: nf-core and Nextflow. After the talk, we’ll host a live Q&A session. This is streaming both to YouTube and LinkedIn and on either platform, you can put your questions in the chat or the comments at any point during the talk and we’ll bring them into our discussion afterwards. Phil, over to you.

Phil Ewels: Thanks very much for the introduction and thanks Grant for the invite to come and speak today. It’s a pleasure to be as part of this webinar series and it’s always nice to have the opportunity to talk a little bit about Nextflow, a topic close to my heart. Um I don’t know if my slides are ready to come up but I yeah so basically my talk today is in two parts. I’m going to give a bit of an introduction to what Nextflow is and what nf-core is and why I think they’re good and useful for you and why I think you should care and then I’ll talk a little bit about some of the new features which have come out especially for Nextflow in the past kind of year or so, year or six months and this is particularly good for anyone in the audience who maybe has dabbled in Nextflow especially a little while ago because things are changing quite a lot and for the better. So I hope I convince you to really pick up Nextflow and see if it could help you in your work. So yeah, so my name is is Phil and I’ve been working originally in the lab and then kind of became a self-taught bioinformatician and went slowly moved from research into kind of core labs. So I worked at the National Genomics Infrastructure in Sweden developing new lab techniques and analysis and then started kind of accidentally getting into software design. Started writing pipelines, had my own pipeline tool. It was all the rage 10 years ago. And wrote software like MultiQC which I imagine many people will be familiar with. And got into Nextflow probably about eight years ago or so while I was in Sweden at the NGI. And we were running huge numbers of samples. It was a real step up from my previous work in Cambridge where now we were running hundreds of samples, hundreds of projects, sorry, thousands of samples. And we needed to the software I’d used previously wasn’t really up to the task. So I looked around and found Nextflow and we started building lots of different pipelines and because we’re a team of about eight people building pipelines and kind of we started to standardize and nf-core was born out of that standardization of our pipelines. I’ll talk a little bit about what made that possible.

Phil Ewels: So the background to the whole picture of why Nextflow exists is these classic statistics from this Nature paper quite old now 10 years ago. Where a simple study I think it’s the statistics resonate with many of us working in bioinformatics about this reproducibility crisis where it’s famously difficult to reproduce experiments that you find in the published materials and even reproducing your own experiments kind of what I refer to as your- one of your most important colleagues which is future you is notoriously difficult to do and reproducibility is the foundation of the scientific method. And so we were kind of in a bit of a bad place 10 years ago where data was really starting to scale. NGS was taking off. We had more data than we knew what to do with and we couldn’t really reproduce the analyses that we were doing and certainly we couldn’t transfer those analyses to other people. And it’s not surprising because it’s a really difficult problem. We’re running many different tools, each one of which might have a numerous different complex dependencies. Everyone’s running on a different system. And often, you know, even if it works on your machine, it might not work at a collaborator. Everyone is doing things in their own way and there was very little in the way of provenance, of knowing where data came from when your supervisor sent you an Excel spreadsheet with some results in. So, Nextflow set out to basically try and provide an answer for this. And it’s a workflow orchestration tool. So, it takes your analysis pipeline of multiple different steps and puts it together into a language. And it’s quite a unique syntax. It’s flow-based programming which kind of makes sense for what it’s doing. It’s got some real key features which make it very very popular. Something that’s really important is it’s got built-in support for software packaging. Docker was very new about when Nextflow was first launched and Nextflow supported it almost right away. And so you can package individual tools at the level of single processes within your pipeline. So the software effectively comes built in with the pipeline. So end users don’t have to worry about installing 20 or 50 different tools every time they run a new pipeline. And all those versions are pinned so you know you’re always running the same version of the software when you run that version of the pipeline. It’s multiplatform so Nextflow supports lots of different what it calls compute environments. It can submit jobs to all kinds basically anywhere you can run computing Nextflow will support. It has one of the most popular features is the ability to resume. So it’s got it’s quite clever with a cache of completed tasks. So if you’re, if you lose power halfway through your run and it’s been running for like three or four days, you haven’t lost everything. Nextflow was able to look back and understand which tasks already completed successfully and pick up where it left off. With this kind of dash regime, it’s massively scalable. Really, you know, I’ve put thousands here, but up to millions of jobs. We’ve seen truly enormous workloads passed through Nexftlow and it’s able to scale to really massive volumes of data and in the last 10 years it’s really grown an extremely active ecosystem and community which is one of the most attractive things to the system really is that there are lots of other people building with it. Okay, so for those unfamiliar with Nextflow how does it work? What does it do? There’s basically a few different steps to building a pipeline and running it. Firstly, you define kind of processes within your Nextflow code which are the building blocks. So, a single process usually corresponds to a single tool. You say what the data inputs are, what the expected outputs are, and then you have a script which could be a bash command. It could be a Python script, an R script, can be anything really, but that’s able to be resolved on the fly and that’s then submitted to your compute environment to be run as a single task. So you describe all the different processes in your pipeline and then you link them all together with what Nextflow calls channels, which is the data flow aspect of Nextflow. And you can have one but Nextflow handles all the data flow automatically when you run for the pipeline and it handles all the dependency and all the parallelization so that when you describe this flow then Nextflow automatically figures out basically how your pipeline should be run. Then once you have the pipeline logic and the code written you have a separate step which is to write configuration and then the configuration is importantly separate to the pipeline code and this is where you describe your specific setup. So your HPC, your cloud compute credentials, your laptop, whatever, and when once it’s configured you’re ready to go and you can execute it wherever you want to, basically.

Phil Ewels: The really key points if you remember nothing else is that Nextflow is not just one thing it’s several things. It’s a language. So it’s a language and a code syntax which is designed for describing workflows, the steps and also the data flow within workflows. It has separate configuration from that syntax so that you can separate the logic of the pipeline from how the pipeline should run and it’s also an orchestrator. So it’s the actual job that you run which actually passes that code and understands it and runs the pipeline for you. So it’s both a language and also an executor. The two things that Nextflow brings are reproducibility that you can run the same workflow and it can be years apart and as long as you run the same git versioned pipeline code which has pinned the exact same software for every step and you’re using the same version of Nextflow you’re almost guaranteed to get exactly the same results out which is really fantastic. And the other thing is this idea of it being portable. I can write one pipeline code and share it with different people in different places running on different systems and they can write their own config files but the pipeline code stays unmodified. And so for the first time really when Nextflow came out, it was possible to write one pipeline and run it anywhere, which now seems kind of obvious 10 years in, but at the time these two facets in Nextflow were really revolutionary and and groundbreaking.

Phil Ewels: And so what this means, this touches on this concept of scalability which was in my talk title. So you can write a single Nextflow pipeline and you can test it out on your laptop with one small test sample and and once you’re happy that it’s working properly, you can scale that same pipeline up without touching the pipeline code to tens or thousands or millions of samples. And you can also scale up the compute that it’s running on from just your laptop to maybe a slurm cluster somewhere or cloud computing basically any kind of cloud computing AWS, Azure, Google, um, Oracle. And because of the way that Nextflow is structured and architectured, it’s able to handle that scaling and basically grow grow with your needs.

Phil Ewels: Nextflow has become massively popular because of this. The figure on the left is from a recent paper that we did for nf-core community and shows just the number of citations for different workflow managers is a bit of a lagging metric but you can see that Nextflow has become more and more popular over recent years. And then on the right we just have the number of runs and you can see there’s there’s hundreds of thousands of runs of Nextflow pipelines every day. And this is probably undercounting it quite a lot as well. So Nextflow is arguably one of the most run workflow managers certainly in life sciences.

Phil Ewels: So that’s Nextflow. Quick introduction there for those who are unfamiliar. So that was how it works and why it was built the way it is. Because for the first time Nextflow was able to give us workflows which were portable between different systems. Back in 2017-18 we had this kind of light bulb moment where up until then everyone wrote their own RNA pipeline wherever you were in your core facilities, in your labs. You had to because other people’s pipelines didn’t work on your system. And they had hard-coded paths or maybe the environment module system with the software used different names. All these different things made it very difficult to collaborate. But Nextflow suddenly removed those blocks that we could now share code for running pipelines and we didn’t all need to write our own pipeline. And so back in around 2017-18 I started nf-core with some collaborators and friends and we started taking the standards that we we built in Stockholm and kind of opening it up to the wider world and based on those principles we founded this Nextflow community called nf-core. nf-core has exploded in popularity alongside Nextflow. The two have kind of formed a very symbiotic relationship and now we have over 140 different pipelines which is astonishing when you bear in mind that one of our key guidelines is we only have one pipeline per data type or analysis type. So we have only one RNA pipeline. So that’s 140 different types of data analysis that we have pipelines for. In the recent years we’ve also grown to be more than just pipelines. I’ll touch on this in a second, but we also now have shared modules, which are basically individual processes within the pipeline. And so these themselves are shared and can be reused across different pipelines and across pipelines outside of nf-core. And so every one of those is is a different tool and it comes bundled with its commands, its usage, and its software containers and everything. And there’s now over 1,700 of those, and that number is growing really, really fast. And then we have a community Slack where we have channels for every different pipeline for discussions. We have kind of a core team and a maintainers team and kind of some level of governance within that. And we have going on 14,000 community members in Slack now. So it’s an extremely active community and of course really kind of it’s built on this concept of best practice where we you can write Nextflow is a programming language and you can write your Nextflow pipeline in basically any way; there’s huge variability in how you do that and nf-core takes a very very opinionated stance and says if your pipeline is going to be part of nf-core it has to be written exactly this way we you have to use our template you have to do things our way. And the reason we do that is that then makes it possible for components to be interchangeable and for folks to be able to collaborate. So standardized tooling, best practices and a lot of documentation.

Phil Ewels: One of the things that’s quite unique about nf-core versus other pipeline registries and software registries is that one of the requirements of adding a pipeline to nf-core is acknowledgement that it’s not owned by you anymore. It’s community owned. This is another figure from that recent paper. But I really love these plots. This is for the small RNAseq pipeline which we actually started in Sweden before the origin of NF core. And you can see that top green bar is SciLife. And you can see that we were sole owners, maintainers, contributors to start with in 2017-18 and then more and more different organizations have joined in with maintaining and contributing to the pipeline and actually SciLifeLab stopped contributing really to it around 2022. But the pipeline lives on because the pipelines are community owned. They don’t suffer from this problem of a PhD student finishing a PhD and moving on to a different position and abandoned the software getting abandoned because it’s community owned. We can build updates in based on community consensus and bring in volunteer works from from groups across the world. And that’s a real kind of superpower for nf-core.

Phil Ewels: I also want to touch on the fact that nf-core is not just pipelines anymore. This modules library and the tooling that we build for nf-core is deliberately done in such a way you can use it for any pipeline, any Nextflow pipeline. And so this is the nf-core CLI I’m showing on the right and it has a TUI, a terminal interface which you can use to create new pipelines and that very rapid example there is creating a new pipeline which is not using nf-core template and you can choose which of the features from the template you want. So you can make it very very minimal or you can have everything that NF core comes with. And it’s up to you. And once you’ve got your pipeline, you can then go into that pipeline and use the tooling again to pull in these shared modules from a community repository. So here I’ve pulled in SAMtools sort and BWM and it fetches that those modules. It fetches that code with everything that comes with it and pulls it into my pipeline. And really then all that’s left is to connect those channels I mentioned. I’ve got the building blocks of my pipeline there provided for me from a community and I just need to put them together. So, nf-core tooling really provides a fantastic starting place for for anyone building their own Nextflow pipelines that you can just mix and match and you’re building on on community best practices. You’ve got all the the learnings of thousands of scientists using Nextflow over many years. And your, the modules you’re sure are well tested and being used by many other people. So you’re benefiting from a a huge pool of community knowledge.

Phil Ewels: Okay, that’s it for my introduction. So next I’m going to touch on some of the new developments in Nextflow.

Phil Ewels: Nextflow itself is is developed at Seqera and we have a team of engineers working on Nextflow and basically the last year or so we’ve had some pretty major projects based on the community survey that we do. We try and do one of these almost every year. And for as long as I can remember, people would always say that they love Nextflow, but they find it really difficult to work with. The error messages are unhelpful. The syntax is confusing. And there’s none of the kind of nice stuff that people are used to working with when they use other programming languages. I myself write a lot of Python. Um, multiqc’s written in Python, for example. So, I absolutely sympathize with these these requests. And so we really went back to the drawing board with Nextflow about a year and a half ago and said okay how do we solve these problems and basically we took on a really massive project which is we completely rewrote how Nextflow understands Nextflow code. In the past Nextflow was what’s called like a Groovy DSL like a domain specific language. So the way it worked was you wrote your Nextflow script and that was basically cross-compiled into Groovy code at runtime and then the Nextflow engine would run that Groovy code. That’s still kind of the case but now we have a new language parser which takes your language which takes your Nextflow code and is able to natively understand the syntax that you’ve written. This is really changes the game for us in terms of what we’re able to provide for developer tooling, for error messages and things like this and means we’re kind of moving away from the days of Nextflow being a Nextflow – sorry a Groovy DSL really Nextflow starts to become its own native language.

Phil Ewels: One of the first things that was possible with this was that we launched a language server, an LSP, which um, and we incorporated that into the VS Code extension, which is probably the best way to write Nextflow code. And so suddenly we were able to bring up this developer experience for writing Nextflow code to be in line with other languages that you might be used for. The simplest thing is error reporting. Just being able to see in real time as you write your code that something’s wrong rather than having to hit save, run the pipeline, and then try and figure out where the bug is when you’ve been writing code for half an hour. There’s things like quick navigation and auto formatting of your code. So, you don’t have to argue about whitespace and things like this. But, picture’s worth a thousand words. So, let’s have a quick couple of kind of examples of what I mean. This is one of the simplest things but probably the most impactful is just the little wiggly lines that you can now get when you’re writing Nextflow code. Here you can see that the red line is telling us that that variable is unknown and it’s not defined and the clue is just above it where we have defined a variable called locations with an s and we’re also getting a warning there that we’ve defined a variable and it’s not being used anywhere and these hints are being shown as you write your Nextflow code. So it’s a huge productivity boost to writing Nextflow. There’s features like this where we have tool tips over every Nextflow language item. So when you hover over in this case a channel factory, but it can be any part of the Nextflow syntax really, you get a short description about what that is and what it’s doing. And then there’s also a link underneath to read more and that takes you straight to the Nextflow docs. There’s things like this where there’s special little buttons that pop up in certain places in your workflow. So if you have valid syntax, your top level workflow now has this button saying preview DAG or D-A-G. And you click that and it will show you a mermaid diagram of your whole workflow in the sidebar right there in VS Code. And so this is a great way to get to grips with a new workflow which maybe you haven’t worked on before and you’re inheriting from someone else. I had about six of these slides in, but I thought they were a bit too much, so I pared it back down. But this is just a taste. There’s many different things like this now built into VS Code. So if you’re writing Nextflow Code now, it’s just vastly vastly better than it was a year ago or more. So if you’ve ever tried in the past, I recommend having another go now and seeing if the experience is better. Along with the language parser, other things that allows us to do is actually change and develop the syntax of Nextflow itself. Before we were kind of limited by what the Groovy Nextflow language parser could handle, but now because we have a totally separate step, we can develop the language however we want. And so we’re bringing out several improvements as a result of that. And something that’s been asked for for a long time is static types. So here we have some input parameters for a pipeline. On the left is the traditional way to do it. You save a name and you say a value, default value, then that’s that. But Nextflow just on the fly tries to typecast stuff based on the values it’s been given which sometimes leads to problems. For example, if you have a sample name as a string, but the sample name is given as something with leading zeros and it gets converted to a number. Now on the right hand side you can see we’re defining the types of each parameter whether it’s a path a boolean an integer a string so on and Nextflow will then strictly typecast those things on input and also validate that the values that’s been given are correct. So you’ll get immediate validation and errors if you try and launch a Nextflow pipeline with the wrong kind of input. So these kinds of things are small changes to a syntax but really make a huge difference to re-usability of Nextflow pipelines.

Phil Ewels: Another thing I mentioned was error messages. So here you can see one of the old style error messages where because it was compiled to Groovy before it ran, Groovy threw an error and it was really unhelpful. it was just like top level pointing at a squiggly bracket and then you had to go through hundreds of lines of code to try and work out where the error actually was. Whereas now we throw the error at the parsing, language parsing level. And here you can see it even indicates the exact character which is wrong and and you can go straight there. It’s just again way better. And we have a lint command you can run as part of your continuous integration test for example linting just to to find those validation errors before you even run the pipeline.

Phil Ewels: Okay, I need to speed up a bit. Other features that we’ve been working on, these are kind of low-level features which you might not notice right away but are really kind of foundational blocks for us being able to build a lot of cool stuff. Workflow outputs is a new way of defining where, how files are basically published at the end of of pipelines. And data lineage gives a way of saving and storing all the information that next flow has about the provenance of all of your data. And so when lineage is enabled, you can kind of find out from any given file the entire analysis path that it took through a pipeline where it came from. And and we can do start to do some really nice things such as passing inputs between pipelines and things like this.

Phil Ewels: Before I wrap up, I want to just touch on a few things that we do at Seqera, which is the company which was formed around Nextflow. So, Nextflow is all open source and of course, and that’s kind of been my focus professionally, but if you’re running Nextflow, then Seqera has a lot of extra tooling that you can build on top of Nextflow. Tthe key thing we have is something called Seqera Platform which is basically a way to manage running Nextflow. So Nextflow is a command line tool. But when you’re running a lot of Nextflow pipelines, it can be difficult to keep track of all those different runs and which ones through errors and where they are and where the data is. And so Seqera Platform kind of provides an interface to launch and to monitor different workflows. Importantly it works with your compute. So you connect it to your AWS account or your slurm cluster and all your pipelines are still running in the same places that they were before. It’s just that they’re being exposed through the Seqera Platform interface. This is kind of an example of the kinds of things you can do once Seqera Platform is aware of the great- basically the Nextflow pipeline and everything around it. So the encapsulation of configuration and execution environment and data. So you can use it then as a control plane which you basically can build on top of. And so one of my extra little projects in the past year is a plug-in for an open source tool called Node-RED. There’s a link here, but basically you can use this as a low-code platform for setting up automation. So here it might be that when a file is added to an S3 bucket, it automatically triggers a workflow and when that workflow finishes, it triggers a second workflow and when that one is finished, it triggers the creation of an analysis studio which you can then go in and do your downstream analysis in things like that. You can basically create any kind of automation and this is all done via the APIs of Seqera Platform. And so when you abstract away all the complexity of actually configuring and launching and maintaining all the infrastructure, you can start to build some really cool solutions.

Phil Ewels: We have a lot of tooling to make basically running your pipelines faster and cheaper and better. A big one is is fusion which handles all the file operations. Nextflow is traditionally very well targeted towards working with huge data files. You know, your BAM files and your fastq files and everything and fusion basically is optimized specifically for Nextflow. It knows how Nextflow works and it’s really you know it can really fine-tune it for that use case and one of the latest things that fusion can do is snapshots. So if you’re running on cloud with spot instances for example AWS might tell you that this you’ve got one minute before your instance is being reclaimed and snapshots will now freeze that, freeze that image, that running task and you can restart it. And don’t lose all the progress you’d made in that long running task. And then just this is like an everything else slide because there’s I could give another two hour long talk about all these features. There’s so much more. But if any of that sounds interesting, I’m happy to ask answer kind of questions or yeah, come back and talk about more.

Phil Ewels: So to wrap up if you’re interested in becoming more involved with Nextflow, writing your own pipelines or getting involved in the community, we’ve got kind of a smattering of links here. The top one with the Nextflow website of course has all the documentation. We have a website called training.nextflow.io which is all basically walk through tutorials and training which you can do yourself. We’ve just had a training week last week where we had over a thousand people registered for it just in that one week. And the there’s multiple different courses. The beginner one is called Hello Nextflow. I’ve done a set of video tutorials for each of those chapters. And so you can kind of follow through with me step by step as we work through all the worked examples. Which is basically the best way to to learn Nextflow. And that’s all up to date now with all the latest Nextflow syntax. We have a very active community forum. So if you ever need any help, you can drop in there and ask you a question and you can usually get a response very quickly. Another plug, I run a Nextflow podcast. So at the moment, I’m trying to do one every two weeks. We talk to all kinds of different people using Nextflow for different things or other kind of tangentially related technical topics. It tends to be very technical deep dives. So that’s kind of fun. We have a really good blog and I’ve written a community forum twice. Didn’t mean to do that. And then finally we have a bunch of events coming up. So in a few weeks time we’ve got the nf-core hackathon which is both online and then people self-organize different local sites all the way around the world. I think we’ve something like 20 or 30 local sites already from Argentina to the UK to the US to Germany all over the place. So very welcome to join. It’s a great way to get involved. And there’s all different projects so you can kind of dive in and help people with their pipelines. And then we’ve got the the two flagship summit events. One in Boston at the end of April and then we’ll have the main online one with some in person in Barcelona in October. And there’s loads of Seqera sessions and all kinds of other events if you click that link where it might well be something near you. I think there’s Seqera sessions coming up in London and a few other places soon.

Phil Ewels: With that hopefully I’m about on time and happy to sort of take any questions. I hope that was all clear and and made sense and was useful.

Grant Belgard: Thanks, Phil. Um, so what’s the easiest way for someone to get started with Nextflow?

Phil Ewels: So the training website I think is the best way to get started really it the the examples that we use with the Hello Nextflow training are kind of domain agnostic. We we use cowpy to print a little cow to the terminal saying different messages and things. So you don’t need to really know anything specifically about RNAseq or anything else. And you can do most of that course probably in an afternoon or a couple of afternoons. And that takes you from almost nothing all the way through to building your own pipeline complete with containers, Docker containers and everything. And it’s all set up to work on GitHub code spaces. So it doesn’t, yeah.

Grant Belgard: And for people who currently use Snakemake, how hard is it to migrate to Nextflow?

Phil Ewels: So yeah, so I mean I didn’t really talk about any of the other workflow managers, but Nextflow is not alone in this field. And what I generally say to anyone is that just using any workflow manager is better than not using any. And so Snakemake especially and Nextflow um and WDL and others they share many of the concepts about kind of splitting up different tools and running them sequentially and working out risk DAG. Because of that it’s not usually not too bad to convert from one to the other. Especially with AI these days like we have our own Seqera AI which is particularly good and well versed in the latest syntax of Nextflow. And so honestly with many pipelines these days you can just dump your Snakemake syntax in and say convert this to Nextflow for me and it will do a pretty good job almost from the first go. So I would that I’m kind of lazy and a bit of an AI advocate. So that’s what I would definitely do in that situation.

Grant Belgard: If someone has a pipeline that could be useful for nf-core, how do they go about adding it?

Phil Ewels: Yeah. So, nf-core is this kind of like I say, it’s kind of a unique community because we don’t just kind of list any pipeline. We, it’s specifically kind of community owned and and only one pipeline per data type. So, because of this, it’s not just a question of kind of clicking a couple of buttons. You have you have to come and forward and put in a proposal and basically then we say yes or no and then there’s a kind of a system for going through and building your pipeline and adding it to nf-core. The short answer is go to nf-core website and click on the docs and there’s a guide saying how to add your pipeline and then there’s an nf-core proposals website where you go and basically describe what it is you want to do and get a thumbs up.

Grant Belgard: Where do you see AI fitting into pipeline development in the next couple years?

Phil Ewels: A couple of years is difficult to say. I’m struggling to predict anything more than a month ahead at the time at the moment because things are changing so fast. But I mean nothing in tech is going to be the same and I’m sure that pipelines will be included in that. We’re starting to see it already like I say converting between languages. We have our Seqera AI tool and we’re trying to kind of take the rough edges off these tools and it certainly lowers the boundary. Nextflow is known for not being the easiest in terms of learning curve and AI makes it possible to get started so much easier. So right now I think the benefits are kind of a low hanging fruit is it’s just much easier to write to debug your Nextflow pipelines using AI. And as we go forward I’m expecting kind of more foundational changes with how how we just approach the whole concept of building up scientific analysis to be honest.

Grant Belgard: Is Nextflow overkill if someone’s just running a few samples on their laptop?

Phil Ewels: It depends a bit. So I mean it depends a bit on your background and how much Nextflow you’ve written. If you’ve never written Nextflow before then is it worth you learning the whole syntax and going through the whole process just so you can run a couple of samples? Maybe not. But once you have kind of got familiar with Nextflow I kind of think it’s a bit like wearing gloves when you’re pipetting in the lab. It’s difficult. You want to, you end up wanting to write Nextflow pipelines for everything because it is self-documenting. It’s automatically versioned. You can rerun it any time in the future. And you know when you try and remember what it was you did six months ago, you can just see the next pipeline and it’s there. So it ends up being quite a low lift. So then of course I’m a bit biased in this question, but I would say yes to everything in Nextflow pipelines. That’s what I find myself doing.

Grant Belgard: Mhm. Is Seqera containers free and how does it compare to biocontainers or Docker hub?

Phil Ewels: Yeah, so I didn’t touch on this so much but Seqera containers is something we do on the Seqera side. So what one of the tools we have containers are are key and fundamentals in Nextflow and the success of bioinformatics workflows that you can encapsulate the software in this kind of clean environment on a per process basis. So your versions of Python don’t conflict and this and that. And so you almost every Nextflow pipeline you will see now have these container declarations and you have you might have 50 different or 60 different steps in your pipeline and you need to come up with a Docker container for every single one and so the bioinformatics community has kind of responded to this usage of containers in a few different ways. The biocontainers project has been wildly successful and basically every conda package gets a Docker image for free and so we’ve been using biocontainers in nf-core for a long time. The limitations we found are when you want to have a process in your pipeline that has more than one tool then you have to – the whole process for generating one of those containers is quite convoluted. And so we have we built a tool at Seqera called wave which is also open source which basically builds Docker containers on the fly. So you build, you add this into your Nextflow pipeline and you say I want to run tool A and tool B in this process and it will go off and it will request it and if Wave has seen it before it will just give you the container straight away. And if not it will sit there and it will build it on the fly and then give it to you. Which is really cool because it means you basically don’t have to think about containers anymore. They just magically happen. So Seqera containers is based on this technology and it’s exactly the same thing but it’s just a public repository. So when you build your, you request your image you build it it then gets stored there for we say a minimum of 5 years and then anyone can just fetch it and download it. So we for example are now going to be using this in nf-core where every single one of those 1700 modules will have their own custom built, you have and docker and singularity you’ll have x86 you’ll have ARM CPU processing will all be built automatically on the fly and then pinned for a long time for perfect reproducibility and it’s just free and yeah it works really well.

Grant Belgard: When can one start using static types in Nextflow?

So the syntax example I showed with those params you can do that now. So that’s out as of, we do two Nextflow releases every year one major release in April and one in October. And so the 25/10 release came out with that syntax. So you can use it for parameters today. Basically, we are working on developing more syntax which will come out in the next major release, so 26/04, which will have basically strong typing through all of your pipeline code pretty much. And so that will really take that concept and kind of bring it through and then you’ll have a lot more validation because if you try and as you’re building as you’re connecting all your processes with all these different channels, excuse me, if you say that this you know this process has an output which goes into this it will tell you immediately like well you can’t do that because those are different types. So we’re going to have that very soon in a few months but already today you can do typing for just the input parameters for the pipeline.

Grant Belgard: And lastly how do you go about deciding if it’s worth updating an ancient DSL1 pipeline?

Phil Ewels: Yeah. So, so for for those who don’t know where DSL1, DSL2, this is like back when Nextflow started, it was this Groovy DSL and this term got bandied around a lot. And then around 2020, I think there was a major language update. We used to have these huge monolithic scripts of like thousands of lines of code and DSL2 changed a bunch of the syntax and one of the things it allowed us to do is break out different files, have these modules which we now, you know, like I say, rely on for this level of granularity and testing and and community. So, but that change from DSL1 to DSL 2 was was quite painful. It was quite hard work doing a lot of the rewrites which I should say we’re taking great pains to avoid with the new syntax updates we’re doing. We’re doing it much more gently and there’s also a lot of tooling to automatically update code. But so if you have an ancient pipeline in DSL1 and you want to sort of leap frog all this and bring it forward what like six years in terms of syntax it’s surprisingly common to have this question but like basically you have a couple of options probably the easiest is the same as converting from Snakemake you chuck it into an AI tool and say rewrite this pipeline for me or you start from scratch and you just kind of copy over the logic into the new syntax and you take the nf-core template or something. Or if you really want to and you’re a bit of a sadist, you can go through and try and update all the syntax line by line, which is doable. But, you’ll probably have to go DSL, you know, it’s like a software migration. You have to go DSL1 to DSL 2 and then DSL 2 to a new syntax. It’s doable.

Grant Belgard: Well, Phil, thank you so much for joining us. Thanks to all our listeners.

Phil Ewels: It’s a pleasure. Thanks very much for inviting me.

The BCRO Webinar

The Bioinformatics CRO Webinar Series

January 21, 2026: James Opzoomer – Biophysics-Informed Spatial Transcriptomics Approaches to Identify Cytokines Causally Driving Downstream Gene Programs

The BCRO Webinar

James Opzoomer is a Senior Scientist in the Innovation Lab at Relation, where he develops single-cell and spatial genomics platforms to accelerate drug discovery. His projects span high-throughput multimodal single-cell sequencing and spatial transcriptomics technology development, generating ML-ready datasets that power novel therapeutic insights.

In this live webinar, he discussed BISTR (biophysics-informed spatial transcriptomics regression) as a computational toolbox for building biologically plausible predictive models from spatial transcriptomics by combining RNA dynamics as a readout of changing gene programs, and paracrine cytokine diffusion as a physically constrained model of cell–cell communication. By linking inferred cytokine secretion, a spatial propagation diffusion model, and receptor-associated changes in mRNA maturation, BISTR aims to suggest cell-type-specific, testable causal relationships between extracellular signals and downstream transcriptional responses.

Transcript of The Bioinformatics CRO Webinar Series – Biophysics-Informed Spatial Transcriptomics Approaches to Identify Cytokines Causally Driving Downstream Gene Programs

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the next talk in The Bioinformatics CRO webinar miniseries. At The Bioinformatics CRO, we help life science teams turn complex data into clear decision ready insights, providing flexible expert bioinformatics support from study design through analysis and reporting. As part of that mission, our webinar series features practitioner focused talks with concrete takeaways you can put to work right away. Today’s talk is by James Opzoomer. James is a senior scientist in the innovation Lab at Relation where he developed single cell and spatial genomics platforms to accelerate drug discovery. His projects span high throughput multimodal single cell sequencing and spatial transcriptomics technology development generating ML ready data sets that power novel therapeutic insights. Today James will be presenting on biophysics informed spatial transcriptomics approaches to identify cytokines causally driving downstream gene programs. After the talk, we’ll host a live Q&A session. This is streaming both to YouTube and LinkedIn and on either platform you can put your questions in the chat or the comments at any point during the talk and we’ll bring them into our discussion afterwards. James, over to you.

James Opzoomer: Thank you and hello. So I’m delighted to be speaking today at this uh BCRO webinar and I’d like to thank Grant and the BCRO team for inviting me to speak with you today about relation and some of our spatial transcriptomics work within the Innovation team. So I’m going to start today by giving you an overview of Relation and our approach to data generation and then I’ll dive into a novel spatial transcriptomics an analysis method that we’re developing called BISTR and provide a worked example at the end. So, first I’d like to start with a question. What are some of the main challenges with the current model of drug development? And why is now a uniquely good moment to deploy large-scale patient genomics to solve this problem?

James Opzoomer: So shown here are four major trends that define the future of drug development and healthcare. And on the left we have sort of two negative trends. First that the cost of drugs is ever increasing. We spend more money on health care but we don’t see commensurate increases in in life expectancy. And this is also demonstrated by the the ratio of health care spend to life expectancy on the left. Now the two good trends on the right are that the cost of sequencing is is drastically decreasing. You can now do a whole genome sequencing for about $100 and the cost of compute that’s driven by titans like Nvidia has made it more accessible than than ever before. So really the problem that Relation is trying to deal with is the first one decreasing the cost of drugs. And what we want to ask is can we use these two trends on the right to solve those on the left.

James Opzoomer: Now this slide really represents a simplified overview of the drug development funnel which I’m sure you’re all well aware of. On the left we start with maybe 20 programs, 20 ideas for new medicines and we invest on the order of 1 to 3 billion across this funnel and after all that work we typically end up with just one marketed drug. So most of the attrition here is because we were wrong about the biology. So although every stage in the funnel is important, the decisions we make right at the beginning in target discovery echo all the way through this funnel to the clinic where failure is acutely expensive. And so that’s why we believe that that target discovery is really the most important problem in in drug discovery.

James Opzoomer: So at Relation our ambition is to transform target discovery into an engineering discipline. And now this means building systematic repeatable processes powered by large-scale patient data and ML models.

James Opzoomer: So the funnel that I previously showed you is another representation of this statistic on the top left that over 90% of drugs that enter clinical trials ultimately fail. So how do we transform R&D so that this number looks very different in the future? Now over the last few years several large analyses have given us an important clue. So on the right there are two examples of these. The first is is a recent Nature paper where it looked across many clinical programs from and the papers from Matt Nelson’s group. They show that when a drug target is supported by human genetic evidence the probability of success in the clinic is increased compared to targets without that evidence. In other words, genetics gives us causal anchors in human biology. The second work shows that single cell RNA sequencing of human tissue sharpens that picture. So by knowing which cells in which tissues express a genetically supported target, we can better predict efficacy.

James Opzoomer: So how do these these approaches fall into historical data collection strategies? So on the left we have large end low value highdimensional observational data. These are things like the human cell atlas um large bio bank cohorts. There’s a lot of it but it’s noisy, confounded and often only weakly connected to clear interventions that we want to make in drug discovery. And on the right we have small and high value but lowdimensional uh interventional data mechanistic experiments in model systems but in small numbers and with low dimensional readouts a few readouts and few perturbations. Now what we actually need for AI driven target discovery is bespoke multimodal perturbation data that links interventions to rich molecular and cellular readouts across diverse biological systems that are related to patient primary patient material. Now that missing data layer is what enables us to train models that actually learn the consequences of perturbing a target in a specific cell type and tissue.

James Opzoomer: And you know overall we believe that current models and data in the public domain are nowhere near sufficient to deliver meaningful impact in target discovery. So we therefore have to build the right data and the right models applied to where they most make sense.

James Opzoomer: So now that I’ve talked about why we care so much about genetics single and single cell data, I wanted to give a quick overview of how Relation is actually set up to do this in practice. And this slide represents a highlevel map of our platform. On the left you see human tissue profiling. This is where we generate deep multimodal data directly from patient samples. whole genome sequencing um single cell spatial transcriptomics single cell transcriptomics and proteomics. Now all of this is connected to the cellular modeling teams who run perturbation experiments on patient derived primary cell systems to generate bespoke data for the models and this connects to translation pharmacology who take the prioritized drug targets and turn them into to drug discovery programs. Now this is all connected to to both data science and our three main machine learning platforms. ROSALIND which identifies genetically validated drug targets, ADA which focuses on reversibility and TURING which provides drug discovery context of our targets. And I’m not going to go into these platforms in detail today because I really want to focus on the spatial genomics data that we generate in human tissue profiling and some of the new analysis methods that we’re developing to better use our spatial transcriptomics data in in drug discovery. So as an example of the type of primary patient data that we collect, I just wanted to show a case study of osteomics. This is our flagship observational clinical study focused on osteoporosis and bone disease. So in this study we partner with orthopedic surgeons across London to collect human bone waste from key surgeries. So these are total joint replacements elective surgeries associated with osteoarthritis and um hemiarthroplasty. So these are non-elective surgeries resulting from osteoporotic fracture really the end stage of osteoporosis.

James Opzoomer: So from each patient we build a genuinely multimodal data set. So that’s whole genome sequencing to identify variants and genes that causally in influence bone density, fracture risk and response to therapy. And this feeds into our genetic discovery platform at ROSALIND. We also generate single nucleus RNAseq of bone and joint tissue to map those genetically supported targets into specific bone stromal and immune cell types and states within the tissue and this sharpens our view of where these targets are expressed within the tissue. We also collect blood-based proteomics to find circulating biomarkers that report on pathway activity can be later used for for patient stratification. And in addition to this also rich clinical metadata including bone BMD or bone mineral density to anchor everything back to quantitative phenotypes. And now this lets our models learn how genetics and cell state translate into real clinical outcomes.

James Opzoomer: So in addition to the the single cell RNAseq we generate we generate spatial transcriptomics data with Xenium and the VisiumHD platforms on human bone and other tissues in associated with the other disease programs we’re working on. And this is really important because single cell data tells us what cell types and states are present within the tissue, but really we lose where they sit in the tissue and how they interact and communicate with other cells within this spatial context.

James Opzoomer: So together these genomics and single cell data sets give us a dense patient centric view of disease biology and in particular we in the Innovation Lab are interested how we can utilize this spatial transcriptomics data to disentangle the causal microenvironmental signals. So the cell communication pathways that drive cell state and cellular response to micro environment. And this has led us to develop a new analysis method called BISTR um or bioysics informed spatial transcriptomics regression that I’d like to share with you today.

James Opzoomer: So spatial technologies are key for preserving the in situ cellular context present in tissues providing a contextual perturbation system of sorts to understand some of the micro environmental signaling factors that may be driving a particular cell state within a tissue or within a particular disease. So we’re often attempting to model our disease states in less complex in vitro systems like some of the ones shown here 2D cell models and 3D organoids or organ-on-chip models. And the kind of the motivating feature of this BISTR package is to answer some of these questions. It’s can we identify cytokines responsible for cell identity and behavior in primary patient tissue and could we then stimulate cell models to mimic some of these these disease relevant or patient relevant micro environmental niches. And we hope that this can add value to the drug discovery process and to kind of our efforts in in vitro cellular modeling by using this knowledge to build experimental systems with greater disease relevance in vitro.

James Opzoomer: So a lot of this work is enabled by the advancements in the resolution of of spatial genomics technologies which is is really rapidly changing. And we recently published a review in Cell Genomics tracking these technology trends called SC trends. And this kind of summarizes the historical development in spatial omics technologies as well as some of the analysis packages available. And we also comment on these these kind of developing spatial technologies in real time since it’s such a such a fast moving field at our blog sctrends.org. So I encourage you to check it out if you’re able to.

James Opzoomer: So the work that I’m going to show you today is really focused on uh 10x Genomics VisiumHD platform and this is one of these spatial sequencing based spatial transcriptomics technologies where the increased resolution in this generation of platform now two micrometers has really enabled subcellular resolution allowing us to track several biophysical processes that are shown out here on the right. So RNA abundance, RNA localization and also RNA splicing at the subscellular level. And we can use these two micron pixels to approximately reassemble single cell data based on image segmentation tools in the imaging modality to create approximately single cell data.

James Opzoomer: So this slide sort of positions BISTR among other spatial modeling approaches. On the left are are sort of simple heuristic based approaches like using a radius around a specific cell or a k-nearest neighborhood and computing sort of some summary statistics. They’re fast. But the spatial scale is often somewhat arbitrary and the tissue is treated more like a discrete bin than a sort of a continuous space that it is. On the right, we’ve got deep learning based approaches. Now, these can be powerful, especially when they leverage analysis pipelines from the image space or are often paired with single cell data, but they’re typically more data hungry and and less sometimes less mechanistically interpretable. So, BISTR sits in the the biophysical model space in between. So we encode this process of of um intracellular signaling via ligand diffusion as a diffusion decay problem with boundary exchange to generate interpretable per cell exposure features without choosing an ad hoc neighborhood. It runs with more modest compute and also sets up a clean entry point for ML once the inverse problem is well posed.

James Opzoomer: So this is sort of a schematic representation of the BISTR computational pipeline. You have your underlying biological system and you generate subcellular spatial transcriptomics data say 10x VisiumHD data. We then use an image segmentation, vision transformer for instance, to identify nuclei and cell boundaries and infer subcellular compartments. You then quantify the transcripts on the nuclear and cellular level and then we construct the extracellular domains the space between the cells as a finite element triangulation mesh and we model paracrine signaling fields per ligand across this mesh using a finite element methods. This allows us to extract the per cell signaling features which we identify with receptor gating. So understanding the concentration of the ligand at a cell boundary and whether the cell expresses the cognate receptor to this ligand and from that we can characterize which ligands predict certain gene expression via a GLM based model.

James Opzoomer: So this is another schematic that that represents the data flow within the the BISTR Python package. You have your VisiumHD data. You identify nuclei with a vision transformer and you perform a morphological expansion of cells to create a like a cell cytoplasm boundary giving you approximately single cell data. You then build the FEM triangulation network. You use public databases to look up ligand receptor, ligand and receptor genes that are expressed within your cell types of interest and you solve the FEM network across all of your ligands within the intracellular space. Now this gives you the FE solution at the cell boundary. And we also look at ligand flux which is the relationship between the expression of a ligand within the cell and the FE solution at the cell boundary effectively identifying whether a cell is a source of a particular intercellular communication ligand or a sink, is it just receiving this signal and then we use a GLM to identify which ligands are most predictive of certain gene expression programs downstream.

James Opzoomer: So now I want to show you a kind of a worked example on a publicly available uh VisiumHD data set. So this is the BISTR package applied to this uh 10X Fenomics colorectal cancer data set. This is a a 10x VisiumHD FFPE data set that was published as part of the preprint that was released along with the VisiumHD product launch in in 2024. So here you can see a a highlevel view of the image of the tissue that has been assayed and zooming in onto a smaller subsection of the tissue. So you can see the individual cells. We use a vision transformer model to perform nuclei segmentation and then morphological nuclei expansion. So we follow this expansion to assemble the two micron spots into approximately single cell data which we annotate with its various cell types giving us a tissue representation of single cell data that looks like this. Here they’re colored by their cell type annotation.

James Opzoomer: So on the left here you can see we construct the extracellular domain and mesh. So we triangulate the extracellular space between the cells whilst using a tissue mask to limit the extracellular triangulation to the space that’s only underneath tissue. And starting with a ligand expression per cell, we formulate an FEM problem with diffusion and and decay parameters plus [] membrane coupling that allows us to solve a sparse linear system per ligand and get the FE solution across the tissue space. And here you can see the cells themselves are colored by the expression of ligand vgf-a. And you can see the FE solution in the intracellular space colored in this sort of white to red heat showing that cells express- expressing high vgf-a secrete, are predicted to secrete vgf-a into the intracellular tissue space. And we model this diffusion with decay throughout the tissue. And this ultimately gives us a FE solution across each of the communicating cells within the tissue which we gate basically binarizing them based on whether they express the receptor to a particular ligand or not. If they do express to the ligand then we calculate the FE solution across the cell membrane of each cell and also the flux. So this is the average membrane exchange signal. So effectively this is the proportion of the ligand expression within the cell and at the boundary of the cell from the extracellular space. Is this cell a source or a sink of this intercellular communication signal?

James Opzoomer: So in order to understand what ligands might be affecting certain cell types, we found that the coefficient of variation and also in a related sense looking at the mean ligand flux versus the standard deviation of the ligand flux is informative to understand the kind of most variable intercellular communication ligands across a tissue and cell type. So in this respect, in this particular example we’re looking at vgf-a here in tumor cells which is, which has a relatively high mean flux across this tissue section.

James Opzoomer: So we use then a negative binomial GLM fit to the per cell gene counts which has predictors such as receptor gated ligand exposure. So the coefficients of this model quantify how exposure shifts expected expression and we can see which exposure to which ligand are related to specific genes and then gene programs. Here we can see that our model captures the directionality of many genes known to be associated with a vgf-a exposure in tumor cells. And this indicates that we’re capturing known biological processes associated with this ligand inter- ligand receptor interaction in this tissue.

James Opzoomer: So in closing remarks I think we often find that sequencing based spatial transcriptomics technologies um have a lower UMI coverage um that’s somewhat sparer than single cell RNAseq. This has motivated us to develop novel tools to understand the relationship between intracellular ligand receptor signaling and downstream gene expression. So this tool that we developed BISTR converts spatial transcriptomic counts and in coordination with segmentation into physically constrained extracellular ligand fields and then into per cell exposure for downstream modeling of the effect of ligand exposure on gene expression. And we believe that modeling um ligand receptor interactions like this with a biophysics constraints gives more interpretability into the intercellular signaling process. And we’ve designed this BISTR method as a flexible toolbox that is deployed as a Python package which we hope to make publicly available sometime soon. The goal of this approach really is to generate more tissue contextual experimentally testable hypotheses especially where simple in vitro systems miss micro environmental signaling contexts so we can better understand the intercellular signaling processes that drive cell states in patient tissue and in particular to better understand disease. So we we will be publishing this approach hopefully as a pre-print soon and so I encourage you to to keep your eyes out for it at that time. So yeah, thank you for listening today.

Grant Belgard: James, thank you very much. Um so does the BISTR package work with spatial transcript domain technologies other than VisiumHD?

James Opzoomer: Yeah. So it’s designed from a like the core methods within designs within a spatial sort of transcriptomics method agnostic approach. I really hope that I kind of highlighted that what you need is subcellular resolution spatial transcriptomics data and from there you can reassemble sort of approximately single cell and whatever compartment you can segment with your sort of image layer into that form of data. So, VisiumHD is great for that, but we’re excited to get our hands on um hopefully the new Illumina spatial transcriptomics technology that’s coming out that appears to be sort of in this one micron resolution. But yeah, it should work across different spatial transcriptomics technologies although we have only tested it with VisiumHD but we hope to expand that outwards soon. Thanks.

Grant Belgard: Now what makes the BISTR package biopysics informed rather than just a spatial regression?

James Opzoomer: So that’s a good question. So the the kind of BISTR approach explicitly models paracrine signaling as a spatial field within the extracellular space using this diffusion with decay FEM approach solved over the effectively the finite element mesh that we build from the native tissue geometry from the spatial transcriptomics you know the sort of the imaging data and the spatial transcriptomics data itself. So this kind of we believe this builds a more representative intracellular communication space than just representing cells as nodes on a graph without understanding you know the distance but also some of the spec- tissue specific features that might exist within it. For instance, you know, a future direction that we hope to go is to to use image segmentation within tissues to create different tissue zones, right, which you can identify from H&E and other types of immunofluorescent staining where ligands might have difficulty passing through and thinking in particular, we work a lot on bone as I touched on, but you know, using that to to create more representative data.

Grant Belgard: Great. Well, thank you, James, and thanks to everyone for joining us. Join us for our next webinar on February 18th at 11:00 a.m. Eastern. Uh, Phil Ewels from Sequera will discuss reproducible bioinformatics at scale, nf-core, and Nextflow. Thanks.

James Opzoomer: Thank you.

Ania Wilczynska - The Bioinformatics CRO Webinar

The Bioinformatics CRO Webinar Series

November 11, 2025: Ania Wilczynska – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks  

Ania Wilczynska

Dr. Ania Wilczynska is Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in Bioinformatics and Data Science and over two decades in molecular, developmental and cancer biology.  Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, MRC Toxicology Unit and the CRUK Beatson Institute (now CRUK Scotland Institute).

In this live webinar, she explores how modern bioinformatics must evolve from one-off analyses toward robust, interoperable platforms capable of integrating multi-study, multi-modal data at scale. Drawing on best-practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure, she discusses evolving omics pipelines into a discovery engine.

Transcript of The Bioinformatics CRO Webinar Series – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks

Disclaimer: Transcripts may contain errors.

Grant Belgard: At the Bioinformatics CRO we help life science teams turn complex omics data into decision-ready insights providing flexible expert bioinformatics support from study design through analysis. As part of that mission our webinar series features practitioner focused talks with concrete takeaways you can put to work right away. Today’s session features Dr. Ania Wilczynska presenting “Thinking beyond the single data set: pragmatic solutions for scalable, AI-ready bioinformatics frameworks”. Ania is the Senior Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their IO cell products and employing cutting edge AI and machine learning solutions to data analysis. She has over a decade of experience in bioinformatics and data science and over two decades in molecular, developmental, and cancer biology. Prior to joining bit.bio in 2020, she held positions at the University of Cambridge MRC toxicology unit and the CRUK Beatson Institute. In this live webinar, she will explore how modern bioinformatics must evolve from one-off analyses towards robust interoperable platforms capable of integrating multi-study, multimodal data at scale, drawing on best practices in data architecture, metadata design, workflow automation, and AI ready infrastructure. She will discuss evolving omics pipelines into a discovery engine. We’re live streaming this on YouTube and LinkedIn. Please drop your questions in either chat or email them to [] and we’ll bring them into the discussion. Ania, over to you.

Ania Wilczynska: Thanks very much, Grant. It’s great to be here. Are we sharing? Oh, here we go. Okay. Right. So, welcome everybody. today I’ll be talking to you about how teams can move beyond single bioinformatics data sets toward scalable AI ready bioinformatics frameworks. the talk is grounded in our experience building such systems at bit.bio but really should be equally applicable across academia and industry. We’ll talk about principles that have guided our thinking over the years and highlight cultural changes in how we need to think about data and workflows. So everyone in bioinformatics faces scaling challenges. And this is going to be about practical ways to solve them. So just as an overview of our talk we’ll first start with stating the problem of data growth. We’ll discuss some principles of scalable design. We’ll talk about infrastructure so automation and building a platform integration of data focusing a lot on metadata and using SOMA objects as an example of data integration and then we’ll go into discussing AI workflows human in the loop and how we integrate bioinformatics data in the new AI world and really how we move into creating AI native data sets.

Ania Wilczynska: So a lot of bioinformatics still operates on single studies and ad hoc analyses and of course modern AI and ML requires scale, structure, and reproducibility. So the question is really how do we evolve bioinformatics platforms to address these questions. Data generation currently outpaces analysis and every omic imaging and metadata stream grows exponentially. So classical ML and now large language models give us tools that turn data into insight but only if our systems are consistent and reproducible. Thus, we can’t treat these data sets as isolated projects anymore. And we need to think about platforms that integrate data, automate quality control, which is obviously the first and very important step. And enable us to use all of the information throughout our organization, be it again industry or academic. And this will be the thing that will enhance discovery, precision and scalability. And this is the area where reproducibility, standardization and machine intelligence intersect. So we need to treat data systems as long lived infrastructure not one-off workflows. And by building structured and automated systems, first of all, this is very simple but very important to everybody. We reduce costs. we do accelerate discovery and then create data that as a consequence AI can actually learn from. So this is building into the future. and if analyses can’t be repeated, they can’t be automated. So platform thinking means designing for reuse. Every data set, every model, every workflow should be modular and interoperable because reproducibility means scalability.

Ania Wilczynska: So now moving on into what scalability can mean. So the first principle that we use is this simplicity scales. And we build modular pipelines where the complexity is contained within the module, but the interfaces between the modules are really clean and clear, which then as a consequence means that the complexity is localized to a module that can be easily interchanged. new platforms can be plugged in easily. Now what’s really key to highlight and we’ll be going into more detail on this shortly is consistent language and naming and shared hierarchies are really important in this because this is how we have clarity both across data sets and across teams.

Ania Wilczynska: And finally, by designing for API and cloud integration, we can future-proof our systems so that new technologies can be onboarded very quickly. So, the princ- the take-home from here is design for evolution, right?

Ania Wilczynska: So, this is how this can look like in practice. So automation, end-to-end automation, in our case connects experimental metadata compute analysis storage and dashboards in a continuous loop. It allows multiple analyses to run in parallel. This is instantly reproducible when new data sets arrive. And so the outcome is that scientists including bioinformaticians spend more time interpreting results and less time essentially babysitting pipelines. So this infrastructure supports hypothesis generation and iteration. And it creates a complete cycle from data to new insights. The automation of course doesn’t replace scientists, it amplifies them. And so by automating the flow of data ingestion to reporting to API access we iterate faster and keep quality consistent. So the schematic on the right shows you how we automate the full bioinformatics cycle from data to insight. We start with metadata capture in Benchling as well as our in-house built app. And we ensure that every sample and condition is traceable. This is really key. Next pipelines run automatically with the use of AWS Batch and Lambda and these are scalable as new data volumes arrive. The results are stored in AWS S3 then linked through APIs to feed dashboards and AI tools and LLM agents can summarize the results, flag patterns, QC issues and the scientists interact with the data through dashboards. And so the key idea is that the loop data in analysis inside out runs reproducibly at scale. And it frees our time to focus on interpretation rather than execution.

Ania Wilczynska: Right? So, I’ve mentioned metadata quite a lot already because in our view it really is the connective tissue behind all of the data sets and capturing rich technical biological provenance metadata early allows us to integrate across studies, perform batch correction, and reuse analyses efficiently. We employ fair principles, define once, reuse everywhere. This is really essential. And standardized metadata allows for not just traceability, but also creates a central data store that ensures findability and reuse and structured data feeds directly into AI and ML tools. So metadata can of course be vast. One thing that we found that’s been really important for automation of pipelines, tracking of samples has been a unified sample naming system. Again it sounds pretty trivial. It’s- it takes a little bit of engineering and a little bit of cultural change to to deploy, but it’s been extremely important for us. Again, as a relatively trivial example. So again, just to reiterate that metadata turns messy data into machine readable knowledge.

Ania Wilczynska: So, a slightly busy slide. But this is just an overview of what our platform looks like in terms of integrating research and again using metadata as a foundation as well as automation. So our in-house bioinformatics platform connects this metadata with the sample tracking analysis pipelines QC and reporting through a unified database. it provides live links, data sanitation, and API access, enabling researchers to explore, analyze, and develop AI workflows directly. And as a result of this, we have a self-service interactive research environment where data flows seamlessly from experiment to model. And the way we think about this is that we move from a data set to really a living research system. And we use this platform internally to handle everything from single cell to genotyping to plasma design. and we emphasize empowering users and bridging all these systems.

Ania Wilczynska: Okay. So now we’re, now we start moving into making AI — sorry — making data AI ready. So of course data alone isn’t enough. it needs to be transformed into structured knowledge and we do that by explicitly extracting relationships from our experimental data metadata as well as publications. A lot of our work relies on relies on external open source data and we codify these relationships into a knowledge graph and now we can start to infer new connections using AI. So this structured understanding supports predictive biology which is what we as a company do. And we hope that this will give us the ability to anticipate outcomes of experiments rather than just to measure them.

Ania Wilczynska: So AI now helps us to ask better questions about the data that we are actually generating, generating in house and the workflows that we’re developing using our data as well as again like I said external data from publications from open source data sets is — the loop consists of first defining a question then again aggregating lots of data moving through AI agent synthesis through human review and I cannot stress enough that at how important it is at this stage to have the human in the loop. We- we’ll talk about that again in a second. And finally updating the knowledge graph and iterating again. The human element is very important in terms of both ensuring quality scientific rigor and of course eliminating faulty or hallucinated information. So again to emphasize we’re not at the stage yet of replacing the scientist were augmenting their ability to work. And of course this reduces lag between experiments and insights. So every iteration enriches the knowledge base and therefore improves the outcomes for the next round.

Ania Wilczynska: So, the more we automate retrieval and summarization, the more time our scientists have to focus on creative reasoning and high complexity tasks. So, this is really what this simple schematic at the right is showing you. We are less a lot less focused on the medium and low complexity tasks thanks to both our automated modular pipelines as well as the plugging in agentic AI to be essentially better scientists. Moving a little bit away from the engineering into the more creative science. And this of course implies huge efficiency gains.

Ania Wilczynska: So once again emphasizing the human in the loop. but on the engineering side of things we implement all these principles through a retrieval augmented generation or RAG stack. It allows the AI models to query internal data safely without retraining or exposing to sensitive — exposing sensitive information. Again the architecture is modular. This is obviously a theme. And each of the agents, be it a specific bioinformatics agent or imaging agent or a developer agent, specializes in a particular domain. And this is all coordinated by the human in the loop scientist. So of course this dramatically accelerates tasks that used to take hours that can now be done in seconds. The structure is secure, modular, with replaceable components. So again this is how we think about scalable AI in kind of pragmatic production. All right. So circling back in a way to something a little more formally bioinformatics focused. I’m going to talk to you a little bit about how we scale to multi-data set and multimodal analysis. And this will be mainly focused around single cell data sets. Because this is really an area where the concept of scale is quite obvious and quite a pain point for a lot of researchers. So single cell data sets scale to millions of cells. And integration of these data sets becomes a core challenge for many reasons. But a lot of it is because it is a data engineering problem not just a problem of statistics. So internally at Bit.Bio, we routinely handle data sets from millions of cells across multiple studies and modalities. And to integrate these data sets successfully, we have to normalize technical variation while preserving biology of course and build models that scale efficiently. So data sets need to be aligned across batches, labs, modalities, be that RNA or ATAC-seq, spatial data, protein data, imaging, what have you. And new algorithms do scale to millions of cells and to multiple modalities. There are of course integration tools such as Seurat, Harmony. This is all available open source and scales to unprecedented levels. But the volume of the data is still a huge challenge even when it comes to just loading the data objects for analysis. So the way we’re currently addressing this question is with the use of SOMA. So that stands for stack of matrices annotated. It’s a new open standard from the Chan Zuckerberg Initiative designed exactly for this challenge. So, it provides an array based data format that supports multimodal data sets, again RNA, ATAC-seq, and so on, at a massive scale. It’s fully interoperable across R, Python, C++. And it enables out-of-core access to data aggregations much larger than single host main memory, ensuring distributed computation over data sets. So SOMA provides a building block for higher level API that may embody domain specific conventions or schema around annotated 2D matrices like cell atlases. So for us adopting SOMA has meant that we can first of all store huge amounts of data in one place, slice it quickly and any way we like and share and finally share the data reproducibly preparing it for AI training or in or more simply retrieval. So what this looks like in practice as an, as one example is we’ve integrated a number of perturbation screens where both the technology in terms of sequencing as well as conditions were very different and about 2 million single cells in total have been integrated from our own data. We’ve also been able to put this together with pseudo bulks for each of the screens as well as pseudo bulks from the open- source 44 million single CELLxGENE census data set. This is all integrated in a unified SOMA data layer. What this means is that for all of these data we have a consistent schema for metadata. One of the important things to highlight here is we have, we’re using a unified gene annotation which does require some data wrangling as especially external data sets can use very different annotations. Now that everything is put together, we can easily query slices of the data in seconds across different data sets instead of waiting for minutes or sometimes even hours for our data to load. And this is really the first step towards truly AI native data sets because it- the data is structured is standardized and is really ready for automated reasoning. And with that a kind of whistle stop tour of our thinking. I’ll end. So just three principles to summarize. Treat data as infrastructure not a byproduct. Make metadata-first design non-negotiable. And realize that AI readiness is an outcome and emerges naturally from reproducibility and structure. So the goal is not to automate everything but to build systems that let us scale and scale the scientific discovery. So we need to work towards these scalable interoperable systems instead of just thinking about individual scripts and creating silos.

Ania Wilczynska: And thank you very much.

Grant Belgard: Thank you Ania. As a remember — as a reminder, live viewers can submit questions to the live chat on YouTube or LinkedIn. To kick us off, how do you ensure reproducibility across so many heterogeneous pipelines?

Ania Wilczynska: Yeah so we have I think over 30 different sequencing technology pipelines as of my last counting. And so of course deliberate design is very important. In terms of particular pipelines I cannot emphasize the need for containerization enough. having versioned pipelines, fixed parameters and in it — just to sound like a broken record, very well structured and deliberate metadata capture.

Ania Wilczynska: Having a centralized way of sample submission is also, has also been really important for be, for us being able to very quickly version our pipelines as well. We have relatively rigorous testing approaches as well. But yeah, so containerization versioning and metadata first and foremost.

Grant Belgard: How do you align or normalize data sets across modalities and platforms?

Ania Wilczynska: Yeah. So batch correction is obviously a a big nightmare. And we’ve already spoke about spoken about tools like Harmony, Seurat. There are plenty of publications that talk about various pitfalls of the just single cell integration tools. But again metadata is extremely important. And for example, in our hands, thinking about integrating imaging with transcriptomics is a nontrivial problem especially in the absence of spatial data. So we do single cell but we don’t do spatial transcriptomics. And we’ve spent a lot of time thinking about how we can integrate some ML approaches to to image analysis with our transcriptomics data. And once again, unsurprisingly, metadata, excellent sample tracking, and integration of the, of systems, including, ELN’s, ELN systems like Benchling has been really key to this and sort of deliberate. So there’s also an element of deliberate experimental design again thinking beyond a single experiment. Because you know and I appreciate that academic labs will be less naturally used to thinking about consistent experimental designs because that’s not kind of the core way of thinking in academic labs. However, I still think that asking the question “does it scale” is extremely important also in academia. I mean in my academic career I found that the lack of the question of “does it scale” has meant that we often missed a lot of opportunities to integrate data sets because everything about them was just incompatible.

Grant Belgard: Yep. What’s the advantage of SOMA over existing HDF5 or anndata approaches?

Ania Wilczynska: Yeah, so it’s really the out-of-core scalability, the fact that there is multimodal support and the interoperability. So it’s, it is really from everything that that we’ve seen so far in actually implementing these huge SOMA objects. It’s the next step towards big data rather than single experiments. We’ve — you know I — this may sound like a plug, but we’re, yeah, we’re really excited about how this is enabling us to just to iterate through computational experiments if you like very quickly.

Grant Belgard: How is AI readiness different from just automation?

Ania Wilczynska: Well, so AI readiness is really it means a different way of thinking about structure and about semantics. You know automation, with automation reproducibility is sort of your main output and your main gain. Whereas AI readiness means that the data is discoverable and learnable. And so it does require cross experiment but also cross function thinking. I think that’s another you know thing that people often disregard is how important it is to think about data and about computational biology not just within the computational biology function. But also make sure that the wet lab scientists or indeed in industry other functions understand how everything fits together in a data stream.

Grant Belgard: So related to that what cultural or organizational changes are required for this to be successful?

Ania Wilczynska: Primarily cross functional collaboration. And I think a little bit of mix of evangelizing and education can really go a long way. So we work both on the, on embedding AI workflows into bioinformatics but we also work with other teams for example you know the commercial team in our company to create AI workflows and that of course it helps the business understandably but it also means that there is a lot more understanding across the company as to why such workflows are important, why data is important and why data structures are important. And again this is, this does require a little bit of outreach, a little bit of evangelizing. Structured metadata and deliberate experimental design can at first seem like a little bit of an overhead in the lab because oh it’s another thing I need to capture.

Grant Belgard: We’ve never seen that before, have we?

Ania Wilczynska: Indeed. But showing people the value of that rather than going, “Oh, ping, my whizzy whizzy machine gave you a new result,” but rather going, “Well, because of that overhead, we’ve now been able to bring three data sets together. One that we did three years ago and one that we did now, and now we have a better outcome.” And again, this is, you know, this is all kind of cultural shift, but I found that, you know, that a little goes a long way in that respect. So, so yeah, a bit of showing by example, a bit of evangelizing and also treating colleagues as partners.

Grant Belgard: What’s the next step beyond AI native data sets?

Ania Wilczynska: Well, of course, in the utopian brave new world, it is AI scientists. I don’t think we’re quite there yet. Or at least so we’re all telling ourselves or we’ll all be out of jobs. But really it’s the closed feedback loops, data, models, experiments, self-improving hypotheses. And I think that’s really where things are heading very quickly.

Ania Wilczynska: But there is also well I guess everyone’s talking about it now. There’s also a lot of hype about what AI can do. But I think a lot of what we’re all in a way promising ourselves is is really bottlenecked by data.

Grant Belgard: And so I’m hearing it’s really essential to train people to think in terms of cycle time, right?

Ania Wilczynska: Yeah.

Grant Belgard: Uh, so we have an emailed question. Uh, how can these platforms be translated into the clinic or are there regulatory requirements that need different setups of the platforms?

Ania Wilczynska: Yeah. So, everything I’ve talked about, just to be clear, is in a R&D preclinical setup. I think the important thing to remember about regulatory requirements is that everything is extremely slow for good reason. And there is very little at least as far as I’m aware existing regulation around AI tools. I think that will probably take quite some time. Which again brings me back to this to this idea that you know the AI tools are — I mean we’re all blown away by stuff every day but because these, the regulatory principles don’t really yet exist the experimentation is still necessary. And so again I — thinking about the fact that it’s not just the tool the data has to come before it but also will for some time come after it is very important. Of course on the other hand in terms of you know just building automated modular pipelines you know there are the a lot of the cloud platforms provide certain standards so it you know we’re sort of working towards it but I think we shouldn’t expect the really novel solutions to be adopted all that quickly.

Grant Belgard: Well, Ania, I think we’re at time, but thank you so much for joining us. Um, and the series will resume January 21st at 11:00 a.m. Eastern with Jake Taylor-King from Relation Therapeutics, followed by Phil Ewels from Seqera on February 18th at 11:00 am Eastern. Uh, mark your calendars and thank you everyone for joining us today.

The Bioinformatics CRO Podcast

Episode 68 with Caspar Barnes

Caspar Barnes, founder and CEO of AminoChain, tell us about his mission to make biospecimen sourcing transparent, ethical, and efficient.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Caspar Barnes

Caspar Barnes is founder and CEO of AminoChain, a decentralized biobanking protocol with a mission to make biospecimen sourcing more transparent, ethical, and efficient.

Transcript of Episode 68: Caspar Barnes

Disclaimer: Transcripts are automated and may contain errors.

Grant Belgard: Welcome to the Bioinformatics CRO Podcast. I’m your host, Grant Belgard. Today, we’re speaking with Caspar Barnes, founder and CEO of AminoChain, a startup marrying biobanking and blockchain to make biospecimen sourcing transparent, ethical, and efficient. We’ll explore what AminoChain is doing today, how Caspar’s path unfolded, and the advice he has for the next wave of biotech builders. Caspar, welcome to the show.

Caspar Barnes: Thank you so much for having me, Grant. Excited to be here.

Grant Belgard: How do you describe AminoChain to someone who you meet in an elevator?

Caspar Barnes: Yeah, absolutely. So AminoChain is a decentralized biobanking protocol. It’s an infrastructure company that connects hospitals, biobanks, pharma companies, and other users and actors in the life sciences industry. And it allows any number of decentralized healthcare applications to be built on top. The first app that we are building on this decentralized biobanking protocol is a biosample marketplace that we call the Specimen Center. And how it works is we want to turn donated specimens into non-fungible tokens. And then we let those digital assets get listed onto a marketplace. And we let life sciences companies license these biospecimens for research use. And we can encode rights into the NFTs that represent broad consent from the patients and royalty rights back to the individual donors, and perhaps even MTA and licensing conditions of the biosamples.

Grant Belgard: Which steps in today’s biospecimen procurement pipeline are most painful?

Caspar Barnes: Oh, there’s so many. Where do we begin, right? So like today, in the United States alone, there’s around 2,500 biobanks out there. And a biobank, for those that don’t know, is a really big fridge filled with donated cancer samples mostly, but all sorts of other disease tissue used for research. And across these 2,500 biobanks, there’s roughly 200 million retrospective specimens stored readily available for research use. And only around 10% of all of those samples ever see the light of day. And that is because of many different reasons. We’ve spoken with hundreds of biobanks when we started AminoChain. And the most common themes that crop up are firstly, poor financial planning. Biobanks say that they’re scientists, they’re not business people. So from their perspective, they’ll go and raise grant funding, they’ll build the infrastructure to store specimens.

Caspar Barnes: The second they start collecting samples, they’ll just go back to doing research. And they don’t think about access policies or governance on the samples or distribution policies or cost recovery models. And so effectively, poor financial planning is the main reason why these biobanks are unsustainable resource. The second thing is searching for samples is really difficult. So today, if you’re a scientist and you want to get access to a specific biospecimen from a bank, you’ll have to go individually to each bank and ask if they have what you need. They have their own bespoke access procedures where you go and try to contact the PI at the institution. They’ll go and look for the sample. And if they find it, then you’re in luck. And if they don’t, yeah, move to the next one. But there’s no real way to harmonize search across these disparate databases.

Caspar Barnes: And then the last thing, of course, is licensing the thing. Once you find what you’re looking for, you could spend on average like three months back and forth debating the conditions of a licensing agreement called material transfer agreement. And then once you’ve actually reached those types of conditions, you can sign a document and get the samples. The whole process of finding, acquiring, licensing, distributing these pre-clinical research assets from biobanks scientists is just riddled with problems. What does a three-month delay in specimen access cost to mid-sized pharma program? It can totally vary, right? But speed is everything within pharma. So your mid-sized biotech or pharma company could, especially if they raise money to build out their own library of omics data, which maybe they’re training AI on, or they’re doing target ID and validation and so on. Speed is everything.

Caspar Barnes: So three months could mean being the first person to find that insight or validate that target or not being that. And if you’re not that, then maybe that’s the entire USP of your company kind of down the drain. And so we found in many instances, people, A, want access to data straight away. And then B, if they can’t have the data, they want their retrospective specimens so that they can turn those specimens into data. And then C, if there are no specimens available, then they want actual human beings so that they can donate samples, so that they can get the specimens, so that they can turn that into data. And so it’s quite difficult to quantify, but it can quite literally be the matter of lack or death for some of these companies. And so speed is everything in the industry.

Grant Belgard: Why did you choose a permissioned blockchain rather than public chain?

Caspar Barnes: Well, we have a lot of different things to consider within the chains that we’re working with. We are actually settling on a public chain. So we most recently decided to work with a private app chain company called Syndicate, which means that we’re able to customize a little bit of the data that isn’t, isn’t visible. It could be permissioned in that aspect, but we still do settle on Arbitrum, which is an open public blockchain. And so we do leverage the security and open transparency of public chains, but we also customize to some extent the transaction data or the publicly visible metadata within a sort of private permissioned app chain infrastructure. And we straddle these two different strategies specifically so that we can be a fully decentralized protocol by settling on Arbitrum eventually. But we also tailor the app chain specific needs towards our users.

Caspar Barnes: We found before, for example, we tried to have a totally open public app on Polygon and that data being totally visible on a public blockchain made a bunch of our users nervous and the language needed to cover all of the functionalities of the blockchain and our technology and our stack in a provider agreement that we would then present towards a hospital. So a biobank was incredibly confusing to these biobankers and researchers that have never even heard of crypto before. And so for all these reasons, we ended up, you know, deciding to focus on developing our own app chain, which basically means we can customize a lot of the information that isn’t as invisible, but we still are leveraging all the benefits of being on a fully decentralized protocol, like by eventually settling on those chains as well.

Grant Belgard: Can you walk us through a typical search match compliance workflow with AminoChain?

Caspar Barnes: Yep, of course. So today, a scientist will log on to the specimen center. They can see all the different biobanks that have created the profile and they’ve listed their specimens on the search platform. We currently have a global network of over 20 biobanks, some in the European Union, some in Eastern Europe, some in Canada, and some in Africa, and some of the United States. Folks can log on and they can see the profile pages of these banks. It’s totally open and transparent. None of the specimens are blinded. None of the suppliers are blinded. Experience is meant to recreate something like Facebook for biobanks. So you can go on and connect, browse each other’s profiles and see the high level overview of the collections of each biorepository. Then you can go down to a more granular level. You can go towards a specimen level.

Caspar Barnes: And across these 20 different biorepositories, we’ve ingested all of the metadata of the collections that these biobanks have. And we’ve mapped all this metadata into a universal. So if you go into the specimen center and you’re looking for a glioblastoma from a Caucasian male, you will also get a brain cancer sample from a white man, for example. Those are the same specimen, but they just have synonyms of each other to describe it, right? So we’ve used a series of different LLMs and AI technologies to map all of these metadata against each other. And now users can come on and search for what they need and harmonize, or they can search across all these 20 different repositories. From there, they can then select the specimens and then turn inquiry. They have extra information that they want to know about the samples or about collection or about provider.

Caspar Barnes: They can add that context into a chat and send that context with the request that they get pinged directly to the biorepository. So we don’t actually get involved in licensing at the moment. We don’t involve payments. We’re just nailing the search experience for the users and for the biobanks.

Grant Belgard: How are you handling private key management for sites that aren’t crypto native?

Caspar Barnes: Right. So the specimen center in its first iteration, she doesn’t have anything on changes yet, right? So we’re slowly starting to integrate all of the app chain enabled features right now and bringing the existing transactions onto the blockchain. How we are going to do it is work closely with third party key abstraction providers, like for example, Privy.io, the company recently acquired by Stripe. And they’re fantastic. We can work with them and they can outsource all of the key management and compliance and they can provide a fantastic that abstracts away the crypto in the backend. So they make building on chain a lot easier than it used to be.

Grant Belgard: What are the best traction metrics for Amino Chain? Samples onboarded, active buyers, cycle time reduction. What do you think best encapsulates your story?

Caspar Barnes: Yeah, fantastic. It’s a good question. So the key metrics that we’re tracking is first of all, the size of the network, right? So, I mean, how many buybacks are on the platform? How many have bought into the mission of improving their visibility and improving their cost recovery? And so that the first and foremost, the main thing that we track is how many providers do we have and how many specimens do we have? And then of course, how many unique individual donors or patients do we have? That’s the main thing that we track. The next thing of course, is how many users, how many scientists are logging on, how many people are looking for biospecimens. And then the most important KPI perhaps is how many requests are actually happening on the platform. So how many channels do people log on? Do they use the full search experience? They find what they need and they send a request to the bank.

Caspar Barnes: And there’s a dual-sided approach there where you need breadth, of course, because you need to be relevant and applicable to so many different types of scientists. But often people will come and if they don’t find the specific bit of insight that they’re looking for, then they would churn and they’ll just go directly to the bank or they’ll go to another one and they’ll try to find the specimens they need elsewhere. So apart from breadth of all these different collections, you also need depth. We need highly detailed information on all of the sample donors and different collections. And at the moment, we’re only tracking perhaps, you know, 18 to 20 different fields of metadata. And some of those fields have largely unstructured data, so people can drop in clinical notes or path notes and so on.

Caspar Barnes: But the way where our search is going is into vector embeddings and into more sort of natural language processing and so on. So that means that people can come on and ask questions in natural language. And we can have, you know, agentic tools to help us find the exact specimens that they need. And we can see then if any of the insights these people are looking for lies within the data that is uploaded onto the specimen center. So all these things considered, I think the most important KPI for this would probably still be transactions or requests, because that shows that our search is providing the experience that the users want.

Grant Belgard: And how do you defend against large CROs that might try to spin up a similar platform?

Caspar Barnes: Yeah. Well, the good news is that over the last like 30 years, people have tried many times and no one has made a lasting successful platform. And the reason for this is the traditional marketplace model is to de-identify where the specimens come from and to add a markup and to force people to do payments and transactions to the platform. We are largely of the opinion that we shouldn’t be brokering retrospective biosamples. We don’t think it’s an ethical practice to add, you know, markups on top of selling diseased tissue. First thing is just the values perspective. But then the second thing as well is that if you don’t de-identify where these specimens are coming from, then there’s the risk that you have marketplace slippage. And that’s the same with any marketplace. So people would log on, they would use your platform for search.

Caspar Barnes: And then if they can see exactly where the sample is, then they’ll just go offline and buy it directly. And, you know, all CROs, all buyer sample brokers, all the big players out there, they’re forced to make money by putting the value of the transaction on the actual brokery of the tissue. So we have tried to find a way to provide value for a network without necessarily focusing on trying to extract value out of a buyer sample transaction. And I don’t think that’s actually really been done before, let alone successfully done before. So that’s our approach. If we make it totally open, and we don’t mind if you do the transaction on our platform or off platform, right? We just want to nail the search experience. Then we could end up being the platform that everybody comes back to because it is actually the thing that is more engaging for the providers.

Caspar Barnes: It does have more rare specimens on it. And there’s no reason to jump off. You actually get a better user experience finishing your transaction on the platform because there’s no, there’s no reason not to do that, right? It’s not going to be more expensive for the user. And then once we have that good retention and we have good network of both provided and procurers, we can monetize in many other ways. Firstly, with the blockchain, all the amazing things we want to do there when the specimens are protocol integrated. But secondly, even without the blockchain, just nailing the search experience is already a good, good value add for these procurers. So like, for example, when you go onto LinkedIn, you can go and scroll through everyone’s profiles in a sort of freemium way.

Caspar Barnes: But there’s these amazing, you know, added tools on top like LinkedIn Sales Navigator or LinkedIn Recruiter or whatever that people have to pay for having an extra service. But we can totally do the same thing for biobanks here, right? If you have a phenomenal search experience and you want to have automated feasibility assessments for prospective collections, you want to have a gently tools where you can drag and drop your protocol and you have an agent search the marketplace for you and so on. All of these amazing things that we want to do later, we can charge subscriptions for or other things for and we can put that onus on the actual, you know, researcher that’s looking for the specimens. We don’t have to provide any barriers towards the providers.

Caspar Barnes: And of course, the most important thing is we can take the whole emphasis on brokering tissue off of, you know, that retrospective transaction.

Grant Belgard: How do HIPAA, GDPR and other laws and regulations interact with your cross-border workflow?

Caspar Barnes: Yeah, that’s a great question. So we do have banks in the EU at the moment and what’s particularly difficult is that each country, you know, can have their own interpretation of GDPR. And so even GDPR in of itself isn’t like, you know, a standalone uniform beast that you can just address one time because each user interprets it differently. So first of all, what we do is we take in de-identified sample metadata. We take in very high level information on the collections and we make that searchable. We don’t actually get involved in the licensing and the payments of the samples as well. We highly vet all of the providers that we work with to make sure that they are GDPR compliant. It’s written into our provider agreements that they also assume the risk of being GDPR compliant and that they have the capacity to erase data if that’s what the users wish.

Caspar Barnes: And we customize the specific fields based on their interpretations of GDPR. So for example, the French banks, they think that including information on a patient that has an age above 90, for example, would be personally identifiable. So for the French banks, or under five as well, by the way. So for the French banks, we would then change the fields to say, you know, 89 plus or six under or something like that, right? So that happened, similar things happen all around the EU. And we basically meet biobank where they’re at. We customize the data fields to their stipulations and their interpretations of the regulations. And we have it baked in, in our process of vetting the providers and in the provider agreements that we assign to these different biobanks.

Grant Belgard: What mechanisms ensure donor reconsent if the intended research scope changes?

Caspar Barnes: Well, at the moment, we’re not in the process of engaging the research participants. It’s all just focusing on retrospective collection. This problem of, you know, the hundreds of millions of samples out there that are sort of languishing and they’re really expensive to keep and they’re never seen the light of day. The first thing that we can do to help the industry is just to go out to those folks and say, we’ll help you increase this sample exposure and visibility and harmonization of search. We’re going to continue to do that likely for the next six to nine months. But at the same time, what we’re currently doing right now, [?] is building out our own prospective cohorts. And that’s a really exciting pivot that AminoChain folks got, or evolution of the product. That looks slightly different.

Caspar Barnes: That basically involves working closely with clinical sites, closely with advocacy groups, designing custom interfaces and user experiences for patients connected to advocacy groups. And either ourselves sponsoring new collections at those sites or raising money on behalf of these advocacy groups to sponsor collections at those sites. And then when these patients are consented, those specimens will be banked in the specimen center. And the data, the multi- data that’s produced from these studies would be put into a database and access to the database would also be governed by smart contracts. And so if a pharma company would have paid out access to the data for discovery or any other researcher would pay to have access to it for research purposes, in that transaction, we can pay back the people that helped sponsor the collections.

Caspar Barnes: We can pay dividends and royalties to advocacy groups, to trial sites, to patients, to anybody who was involved in the curation of data set. The buyer of that data can use it exclusively for an embargo period. And after which the data would be made available within the decentralized biobank, and it can be repackaged and relicensed in another product to somebody else. And so all this considered within this new prospective collection and data management product that we are soon going to launch, the donors will always have a way of identifying how their data set is used within this decentralized biobank. We use a combination of private-public key photography. They can authenticate how their data set is being used by [?] committed on chain. And from that, they’d also be able to claim rewards if the data is used for commercial in certain ways.

Caspar Barnes: And then through this system, we hope to have a fully incentive-aligned, decentralized, community-owned biobank.

Grant Belgard: If you could only track one KPI for 12 months, what would it be and why?

Caspar Barnes: That’s a good question. So it would still be, I mean, in the context of the specimen center, the one KPI would be requests. It would be, you know, that’s our true north is how many suppliers do we have? How many users do we have? And then ultimately how many requests are we making? And then the context of this new product that we’re launching, sort of the prospective collection metric, the one KPI is licensing data for exclusive use. Like, you know, are we finding people that want to buy access to multi-omic data sets for discovery? And that probably is the most important metric, because I think if we’re able to prove out that flywheel of funneling data, aggregating data, and you know, selling it, then all the other apps on top are easy to build and they benefit from the network.

Grant Belgard: Was there a formative experience that pushed you towards decentralized solutions?

Caspar Barnes: Yes. So not so much like a decentralized solution, per se. It just turned out that crypto was a good way to fix the problem of, you know, biosample tracking and so on. But I certainly did have a formative experience, you know, biosamples in general and, you know, the bioethics of donating tissue. And so very quickly, I’ll give you an overview of that. But I grew up in South Africa, right? You might be able to hear that from my accent. And growing up in Cape Town, South Africa, my mother, she started the charity called Yabongo, which helps women with HIV and AIDS get access to antiretroviral treatments and provides homeschooling support towards kids in the townships outside of the cities. And so growing up, I would spend a lot of time in and out of these townships with my older sister.

Caspar Barnes: And so we had very regular discussions around race, equity, privilege, and especially in post-apartheid South Africa, right? There was always a big emphasis on having these conversations openly and saying, why do we live in this area of town? And why do other people live in this area of town? So trying to find ways to give back throughout your career has always been a really big familial and cultural value that’s definitely now playing into the vision of AminoChain with an aspect of health equity. And the second thing was when I was around 12 years old, I had a malignant melanoma and I was very lucky because it was caught super early. So I just needed one big operation to remove tumor. But since then I’ve been thrown into a world of healthcare, right? And I still remember listening to these doctors explain concepts of healthy cells and malignant cells and how cancer spread and so on.

Caspar Barnes: As a little kid, I just listened with wide-eyed fascination and fell in love with biology at that moment. I was like, I have to work in life sciences in some capacity. And so then since then, I, you know, at the age of 16, would already spend my summers working in [oncolytic?] biovector research, did my undergrad degree in neuroscience, did a graduate degree in biotech, another one in bioethics, all at UCL, Columbia and Harvard Medical School. And so I’ve always been in love with biology since those early days. But the key thing is I can still remember waking up from the surgery and the doctor was standing over the bed and he was holding this biopsy sample. And he was like, check it out. This is what we cut from your lower back. And I was like, dude, that’s so cool. Can I take it home? I want to show my friend. We’re not going to believe this.

Caspar Barnes: And the doctor said, no, we need to keep this biospecimen so we can research it. And I was, of course, too young to understand the connotations of what was going on. But my mom, importantly, was like, yep, sure. That’s for the benefit of science. Let’s sign this consent document and give away the sample. And we never saw it again. And then many, many years later, when I was in the lab at Columbia, I was doing research on somebody else’s donated tissue. And we’re generating all this valuable information, finding all these markers. And I went to my PI and I said, can we tell the patient about this information that we’re generating? And she said, I don’t know where that thing came from. It just came from the biobank. And that blew my mind. I was like, how is that possible? How many people around the world are doing research on biospecimens, generating so much data?

Caspar Barnes: And you’re telling me that none of them know where the samples came from. They all just came from the biobank. And the more I dug into it, it turns out consent rates are incredibly low. There’s almost 25% at some major institutions. The people that are periodically not consenting to having their samples and data used are marginalized communities and patients of color more often than not. And so I got fascinated by the problem and I just got kind of sucked down the rabbit hole where I just did everything I could to try to find an interesting new emerging technology that could fix this. Turns out crypto is great, right? It has all the benefits of immutability and prominence tracking and ownership and agency. And this happened to be around the same time as when crypto was booming in 2020. And so it just seemed like a great match. And I was just like, this is awesome.

Caspar Barnes: Let’s go and try to see what all we can build at the nexus of all these incredible fields, which is emerging tech, life sciences, health equity. And now I’ve just become so mired in this, this like interdisciplinary platform and approach. And so it’s kind of become my life’s work and I don’t think I know. So many consent needs to be disrupted. The current consent model is like as recommended by, you know, the OHRP and the HHS and so on. It’s just like informed consent. That’s it. Here’s one document, write what you want on it, get someone a sign. And then if you see a signature, great. That’s basically your waiver of liability. You know, it’s, it’s not at all a way to meaningfully engage someone in what’s happening with their, their samples or their data and so on. And it’s, that’s kind of like a problem that we find ourselves in right now.

Caspar Barnes: Since 1970s or so, we had a period of, you know, like progressive change within America, right? It was like women’s rights were coming up. There was civil rights coming up and there was all these, you know, bioethical discussions happening as well. We ended up having the Belmont report, 1978, 1979. Of the Belmont report, we came up with these principles for human subjects research, which were, you know, autonomy, justice, and beneficence, right? So of these guiding principles, how could we, you know, have a, a scaffold for involving people in research? Well, informed consent legislation seems to be the best policies. Since the late 1970s or eighties, they were like, okay, let’s try to codify this into law as much as we can or at least make it like public policy that anytime you do research, you can only do it with informed consent.

Caspar Barnes: And it was all done with the, you know, great intentions and it made a lot of sense. And so then since the 1980s, we have to ask everybody for informed consent before they’re involved in a procedure, before they donate the best ones and so on. But, you know, even though those consenting frameworks haven’t changed in the last like 45 years-ish, the world has drastically changed. You know, it doesn’t look the same as it did in the 1980s anymore. Particularly the storage of biospecimens for secondary research has become a booming practice, enormous. In the 1990s, we spent billions of dollars to sequence one human genome. Now we can do it in a matter of days or hours for a few hundred bucks. It’s an enormous progression where we now live in a world where there’s a diaspora of data. And it’s like, all these samples are stored for secondary research use.

Caspar Barnes: And things are becoming increasingly re-identifiable. Things are becoming more and more personal, especially with whole genomes even seen. And with all this considered, informed consent just doesn’t cut it anymore at all. People are asking for a one-time consent document and then they’re doing whatever they want with the tissues afterwards because they’re just getting broad clauses. So all this considered, how do we see a new world where consenting can change the biomedical research industry? Well, we’ve jumped up a new framework called Demonstrated Consent. And under Demonstrated Consent, we can basically take personalized conditions for broad use from research participants. And basically like they’re personalized terms, we take them, we put them as metadata of a specimen, as an NFT. So we have ways to automate the record keeping of the samples.

Caspar Barnes: And then we list them on a platform where anybody can acquire these specimens for research use. Their protocol upholds the consent that the patients originally gave. And if it does, then they can license it, they can use it. At all times, patients have a way to stay informed with the outcomes of research and they can stay informed with how the samples are being used. And so therefore, the blockchain would be demonstrating to you how your sample are being used, as opposed to somebody just saying that they’re using it the way that they will. And that changes the paradigm that actually makes a better experience for the research participants. And you actually have the need for flexibility research, which is like a societal benefit, right? You don’t have to compromise between asks for progressing research and the ask for promoting patient autonomy. And so that’s what we see as the future.

Caspar Barnes: And that’s what we want to embed into the AminoChain protocol. It’s actually like personalized conditions for broad use and an automated way to re-contact and re-engage participants.

Grant Belgard: What surprised you the most in your customer discovery process?

Caspar Barnes: Um, surprises? That’s a good question. Many things are surprising. I think, I think, you know, when starting this out, I thought a biobank was a biobank because, you know, it’s just like, they’re all the same. It’s like, you know, a place where you store samples and that’s it. And I didn’t really realize the complexities that go into biobanking and how many different types of users there are within biobanks and how they are all separate from each other in terms of their, their priorities and their missions and approaches and so on. And so what surprised me, you know, one of the things that surprised me was that you, you have so many different types of doing things in biobanks. Some commercial brokers go and buy remnant material from hospitals and emerging economies. And then they add enormous markups and they sell those specimens to labs in Boston and in San Diego.

Caspar Barnes: And I was like, that’s crazy. I didn’t know that was a practice. And then you try to speak with, you know, other biobanks in America and they’re the part of AMCs, academic medical centers, and they don’t really care about cost recovery at all. What they care about is publications and they care about, you know, insights and they care about all these other things that will make them more eligible for grant funding. And so they’re not brokering tissue. They’re more focused on, you know, moving knowledge forward. And so I thought that was super interesting. Others are independent and they’re part of government labs and others are part of hospital networks. And some biobanks just collect remnant materials from clinical trials, which are associated with the pharma companies. And you’ll never be able to see any of those biobanks.

Caspar Barnes: And so all these things I found really interesting, just landscaping the different customers out there, like speaking with them and hearing what their needs are. It’s been fascinating to have the same conversation with different users, but to see the differences and important factors crop up and motivations for each of them.

Grant Belgard: How did you pitch A16Z crypto differently from life science species?

Caspar Barnes: Yeah, that’s also a good question. So, um, you know, building what we’re building, you have to toe the line between the crypto language and the non crypto language quite delicately. A16Z is fantastic because they have, you know, investors across both verticals. They have a healthcare fund and they have a crypto fund. And so when we were pitching A16Z crypto, we can, you know, pitch the crypto vision and how this becomes the Ethereum of healthcare. You know, the world’s biggest composable blockchain for people to build healthcare Apps. And they get it and it makes sense. And biobanking is the wedge to get there and they love it. But then if you try to say the words that I just said to you there, it’s a, the A16Z bio and health team, they get very confused. And as a matter of fact, that’s like what happened. So we spoke with both of the funds.

Caspar Barnes: And then eventually after a few rounds, we first went through their accelerator program, and then afterwards we’re reinvested as a full portfolio company and so on. Even with an A16Z, we have the practice of pitching both the crypto side and the healthcare side. But all in all, how life sciences VCs look at this as opposed to crypto VCs is, you know, how is this an extension of what’s currently happening today? And if you don’t have to give me complex crypto jargon, but you can just explain in normal language, how already what we see in biobanking lays a precedent for distributed ledger technology to help engage, you know, and to help improve user experiences or improve outcomes or whatever, then it builds a more convincing narrative in their head. So our second biggest investor, Socano is the family office of Paul Allen. They have a lot of life sciences companies in their portfolio.

Caspar Barnes: And so when we pitched them, they were basically our life sciences investor. The language that we had to engage with them was stuff like benefit sharing, stuff like co-ownership of IP, concepts of automating provenance tracking and supply chain management and so on. And if you, if we could just, you know, convey the same technology benefits of the tech that we’re using in non crypto language, then, then eventually it made sense to them. And then it, you know, ticks across all the people that we have on the, on the cat table.

Grant Belgard: What traits do you screen for when hiring at the biology web three interface?

Caspar Barnes: Yeah. It’s a great question as well. The people that are well versed and experienced at exactly the nexus of the two are few and far in between. And so when you find them, you really got to look after them. But then all things being equal, I’d see my job as the founder of AminoChain as being the person to stimulate conversation between the either non life sciences experienced people or the non crypto experienced people, such that they learn about the industry and they become experts at both, or they become at least knowledgeable of, of both the fields in which we’re building. And so we have people just that focused on the life sciences with their PhD backgrounds. They’ve worked in bio sample procurement and, and, and in life sciences research in general.

Caspar Barnes: And then on the other side, we hire people that just have cryptography experience and they just have blockchain engineering experience and they know how to build amazing software. And across both, the main thing that I look for is proactivism. Somebody that just says, just let me take care of that. I’ll, I’ll make sure that gets done. I mean, anybody that is autodidactic, anybody that is, you know, self-starter and proactive, tries to make life easier for their teammates is just an instant green flag. We would sooner have someone, you know, that is very proactive, but maybe less experienced as opposed to someone that’s super experienced, but not very motivated. So across both of those, that’s what we look for. And then second to that, we probably do focus mostly on, you know, the experience and the network that built out within the industry.

Caspar Barnes: So we have some folks that have been doing this 30 years and they’ve got fantastic connections within the space and they can just click their fingers and make things happen. And then I think the last thing as well is people that you can just trust, right? I think that’s the most important thing. So the people that you don’t have to worry if they’re, you know, not working today or if they are working today, just trust that they’ve really bought into the mission and they think that we’re building something incredibly important. And they understand that the faster we built, the faster we could actually help human beings. And so if we have that level of trust across anybody with any level of background and experience, that’s probably the most important thing. And I’m very privileged and like grateful that we’ve managed to build that with the team we have so far.

Grant Belgard: What’s the single best piece of advice you’ve received from a board member?

Caspar Barnes: The single best piece of advice I’ve received from a board member, they give us so many pieces of advice. I think, you know, maybe they sound a little bit cliche, but I think probably the most important thing are the best advice. The only time when you are guaranteed to fail is when you give up or when you stop trying. The whole first year of AminoChain, we were picking pennies, trying to make it work. We were like five people living off of a hundred thousand dollars in New York City, like really trying to make it work. And, and we did, you know, we were super frugal, very resourceful. We incredibly proactive, went out and spoke with everybody and did everything we could to move the needle. We took 250 VC meetings before we got our first yes. And somehow that first yes happened to be Andreessen Horowitz, which was incredible, but it was a long, long, long process.

Caspar Barnes: And the one thing that particular board member I’m thinking of reminded me of the entire time was, well, the only way that it’s a hundred percent not going to work is if you stop trying right now. And that was like a real fuel of motivation that got us through the early days. And that’s, you know, kind of resonate with me for a long time.

Grant Belgard: So looking back five years, what would 2020 Caspar find most surprising about today’s AminoChain?

Caspar Barnes: He would be so mind blown that we found ourselves in the situation that we’re in right now. I think that I, old me would probably be very, I’d like to think he’d be very proud of all the things that we’ve achieved so far, but he also probably been very unsatisfied with how far we’ve come because there’s always more to do. But in 2020, we had the earliest trappings of an idea of AminoChain. And so we knew what it could be, but it was so nebulous at the time. We just knew there was potential. It was not at all clear where we should go. We’ve learned so much throughout the process. I think that old me would probably say, we would probably just be excited for the years to come because it’s like, nothing’s guaranteed. Everything’s difficult. So many people are relying on you. It’s not an easy job at all, but for some reason, you just can’t stop.

Caspar Barnes: And so I think you would be happy that we’ve gotten closer to finding something that’s worked. Honestly, I still think we have a way to go to prove the real product market fit that we need to nail the adoption. But maybe 20 year old me would already thought that we’d taken it further than it could have gone, which means now there’s only one way up to keep going and double down the direction that we’re going in. And so it would be a mix of excitement, maybe pride, but then more so above all else, like motivation to keep going. So I’d like to think that’s what 2020 capital would say where we are now.

Grant Belgard: What early mistake would you warn every tech bio entrepreneur about?

Caspar Barnes: Oh, well, don’t over dilute your capital too soon. And I think everybody says that. And I also see other people warn early stage founders about things around the capital and who to bring on and advisor shares and like over promising equity to people that don’t add any value. It’s not all these mistakes I read about, but I didn’t really know what they meant until I found myself in the situation. And so I guess now I’d pass the same advice on to other early stage founders. Be careful with your capital, do the research into what the term mean, what are drag along shares, what are rights of first refusals, what are all these things. Understand it well, model out what your capital looks like between rounds very carefully. And then if you’re going to give anybody more equity than needed, give it to your team, give it to your employees.

Caspar Barnes: Don’t give it to advisors that are just trying to shop for freebies or investors that are giving you very aggressive jabs. So I would definitely say research and be diligent and careful around how you structure your cap table. And on that note, I’ll just put a short plug for a program I did called VC University through Berkeley Law. It really teaches you the fundamentals of venture capital, which as a founder, very useful to understand the nature of the pressures that your investors are under from their own limited partners and to really have a more holistic understanding of the ecosystem.

Grant Belgard: What vanity metrics do you see startup decks overusing right now?

Caspar Barnes: Good question. Vanity metric. I think the first thing that comes to mind, I’m sure there’s many more, but the first thing that I can think of is like logos. There’s people overhype the logos, right? You know, like I think there’s a team slide and there’s like logos that pop out, but then, you know, there’s, there’s Disney, Amazon, Harvard, and MIT on there. And then you look through it and it actually turns out that, you know, I shopped at Amazon one time and I took an online course at MIT or something, like something ridiculous. And so I think people massively overinflate the use of logos, both on the team side and the customer side, that it can come across as a little bit disingenuous and maybe TAM, Sam, some metrics. I think that slide tends to be really overhyped.

Caspar Barnes: And if people say they have, you know, a trillion dollar addressable market, I always like, you know, focus more on that slide and see what they really mean and what they’re actually building.

Grant Belgard: So to wrap us up, when people talk about AminoChain in 20 years, what do you hope they say?

Caspar Barnes: I hope that they say, wow, look at this case study from Harvard business school on AminoChain. They proved that you can build an incredibly successful business by putting bioethics at the heart of your business model. And this company proved that if you really care about patient engagement, patient experience, and align incentives for human beings that make research possible, then downstream, everybody benefits. You know, it’s not like providing a better consent experience compromises pharmaceutical interests. It actually aligns with bringing drugs to market and helping people. And along the way, you know, it’s a fantastic protocol and it’s crypto enabled and it’s innovative and whatever. But like, I really would love it if people talk about AminoChain as being a company that proved you can make, you know, a lot of success by caring to people first and foremost.

Caspar Barnes: And so that’s, that’s the real mission of what we’re doing. I’ll happily hang out my hat once we, once we prove that out.

Grant Belgard: So where can our listeners go to learn more and how can they follow AminoChain’s journey?

Caspar Barnes: Amazing. Yeah. So our website is just www.aminochain.io. You can check out the specimen center. If you like, you can log on there. It’s totally open, free for anybody. Go and browse through the hundreds of thousands of biospecimens that we’ve aggregated on there. You can also find us on LinkedIn and follow us on Twitter. We’re just AminoChain. And yeah, if you’re a builder in the space on the life sciences side, or you’re a protocol crypto engineer, then please don’t hesitate to reach out to us through our website as well. We’d love to.

Grant Belgard: Caspar, thank you so much for joining us.

Caspar Barnes: Thank you so much for having me, Grant, it’s been a whole lot of fun.

Nick Wisniewski

The Bioinformatics CRO Webinar Series

October 22, 2025: Nick Wisniewski – AI-First Drug Discovery Pipelines

Nick Wisniewski

Dr. Nicholas Wisniewski is an expert on AI in drug development and regenerative medicine.

In this live webinar, he discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, thus shortening timelines from concept to clinic.

Transcript of The Bioinformatics CRO Webinar Series: AI-First Drug Discovery Pipelines

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the inaugural seminar in The Bioinformatics CRO webinar series. At The Bioinformatics CRO, we help life science teams turn complex data into clear decision-ready insights providing flexible expert bioinformatics support from study design through analysis and reporting.

With that mission in mind, we’re launching The Bioinformatics CRO Webinar Series, a practical forum for sharing tools, workflows, and real world lessons from the front lines of modern bioinformatics. Let’s kick off our first session and welcome Nick Wisniewski.

Nick is an expert in applying artificial intelligence to the life sciences. He earned his PhD in biophysics from UCLA where he later served as a faculty member developing machine learning methods for imaging and multiomics data. In 2016, he joined the founding team at Verge Genomics, pioneers in AI-driven drug discovery, and has since helped launch more than four more biotech startups spanning diagnostics, a smart pill, and a cell therapy. More recently, he served as vice president of bioinformatics and data science at Stemson Therapeutics in San Diego. In this live webinar, Nick discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, the shortening timelines from concept to clinic. Feel free to put your questions in the chat as we’ll have a Q&A afterwards.

And Nick, over to you.

Nick Wisniewski: Thanks a lot, Grant. Really excited to be on this inaugural webcast and looking forward to the rest of the series as always a big fan of The Bioinformatics CRO so I am very happy to support this development and look forward to more in the future. The talk that I’m going to deliver today as Grant mentioned is going to be on AI first drug discovery pipelines I think there’s a lot of movement happening in the space clearly a lot of excitement and discussion amongst investors and tech bio people and a lot of new algorithms and methods coming out every day that it’s hard to keep up with.

So the purpose of this talk is to kind of give an overview of all of that development that’s happening as well as an insight into how these machines are learning and where this is all going into the future.

So to start with I think it’s good to point out that the current state of drug discovery is one in which less than one in 10 drugs succeed. And the cost of developing a drug can be up to $2 billion given given the failures that occur and the portfolio approach that that happens and that money gets spent over a period of up to a decade.

So the impact that AI can have on the drug discovery process can happen in multiple ways. One is the increase in the accuracy or the success rate of the drugs and the second is in the cycle time. If you can test more drugs faster, you can kind of overcome some of this challenge to understand where we’re at and what the impact of AI is. I think it’s important to just start with a review of the traditional drug discovery pipeline as we know it.

It’s largely discussed as a waterfall type process where you have a left to right kind of movement through a number of different phases starting with target identification and target validation, compound screening to identify hits and then hit to lead, getting to the lead and optimizing the lead and then all sorts of preclinical testing to understand toxicity and other stuff of that nature.

And then it goes into the clinical development stage where you have the phase one, two, and three trials. And you know the the loss rate as you go through what what is the chances of success starting out very early when you’re still validating a target can be only 3% of molecules are are going to be successful in the clinic. So that’s you know quite a low rate.

If AI can improve that to 5% it would make a huge difference. We don’t need to necessarily get to 100% although that would be even better but I think the other important part of this pipeline is that embedded into it are a number of different design make test and analyze cycles and so we often think of these in terms of synthesizing molecules but they form the standard feedback loop in in the molecular optimization process and so it mainly happens between hit discovery and lead optimization with each iteration lasting maybe weeks to months and to get to an optimized outcome you might need three to 10 different iterations. So that can really be a bottleneck that AI can address in the drug discovery process.

There are a number of traditional computational tools that are being used and have been being used for quite some time.

So in the initial stage of target identification you know we ask the question which protein which gene what what is the target and a lot of the early tools you’ll recognize as things like ingenuity pathway analysis and WGCNA a lot of these matured you know mainly in the 2000s mid 2000s and have been used with with a fair degree of success since then once you get more into the drug development stage where you already have a target and now you’re trying to design molecules to hit that target.

This is where the rest of the tools come in.

So things like virtual screening and docking have also been concepts that have been around for quite some time. And so this is which molecules are going to bind to that target. So things like AutoDock and and the Schrodinger Glide emerged starting in the the 1980s but you know growing more popular towards the late 90s. And then another question is which molecules are bioactive? So maybe you can bind but you can’t get any sort of activity out of it.

This is where quantitative structure active activity relationship models which go back to the 1960s. They’re largely just regression models and they’ve been updated over time to integrate machine learning methods to to kind of improve some of that but they’re you know a mainstay of the process and and maybe alongside of those there’s the pharmacophore modeling and shape matching and this is kind of trying to understand what’s the geometry of the molecule that’s required for bioactivity 3D shape matching and distance metrics between molecules are all quite useful and they allow us to to filter candidates you know going back to the the 90s again more recently there’s been a lot of movement in molecular dynamics and free energy calculations and this is you know more physics based trying to understand how energetically or thermodynamically favorable the ligand- target complex is and so these are some some simulation techniques that that matured you know maybe 15 years ago or so understanding stability of these bindings and then quite importantly once once you can simulate a lot of these things and think you’ve identified a molecule of course what is extremely important is the properties of that molecule once it enters into a body will the molecule be absorbed and safe will cross the blood brain barrier. These are things that are known as ADMET properties and largely we want to think about the toxicity absorption distribution metabolism excretion and so forth.

Now I’d say the first stage of the development of AI has been to start just modularly replacing out each of those different phases with let’s say deep learning components. And so maybe the one that that we haven’t talked about so far is imaging which is very useful in in the target ID step where you can do more phenotypic level understanding and protein localization within cells and I think those models have been very powerful and very influential in in terms of of the rest in terms of target ID you know we we’ve got the the gene former class of inference algorithms out now in terms of protein structure.

Alpha fold has broken that field wide open. And then you know the rest of them often come with straightforward replacements. So DiffDock is now a big replacement in molecular docking. Maybe new ones are having to do with the denovo molecule generation like MegaMolBART which is now integrated I think in in Nvidia’s BioNeMo. We have some deep learning tox predictions and retrosynthesis planning which is good to help you find the easiest path to synthesize a molecule. But maybe some of the more exciting ways to think about things have to do with experimental planning. And I’m going to talk a little bit more about that in a few slides.

But things that aren’t represented here I would say are the most latest developments. One came out just yesterday being Claude for life sciences. And this is I think very exciting. You know if you’re a bioinformatician or a programmer, you’ve likely been using Claude now for a bit of time for programming and tasks of that nature.

So extending that now into integrations with common lab tools like Benchling and partnering with with institutes like the Broad Institute and 10X Genomics to help facilitate access to data in those platforms and algorithms in those pipelines as well as PubMed to really facilitate searching the literature and getting back good intelligence on targets you find and on drugs that design. That’s going to be highly influential. It promises right now to be able to analyze single cell RNA sequencing data, which is going to be great for democratizing access to that data source. And really interestingly, it’s promising to help prepare regulatory documents, which may be one of the the biggest bottlenecks in the real world into putting together a pipeline and and accelerating it. This stuff takes a lot of time.

Similar developments are coming out of partnerships with Nvidia now more and more every day. Again with Benchling maybe leading the way. Benchling launched their Benchling AI recently and as part of that Nvidia is integrating its NIM microservices into Benchling. So this offers access to all the optimized GPU implementations of things like openfold 2 for protein structure prediction, and I think the other models like the ADMET models and everything are coming shortly. So that’s also very exciting.

But to return to some of the other bottlenecks that are being addressed by AI, let’s go back to the DMTA cycles. So going from design to make is the the first half of the cycle where you know you present a chemist with a bunch of designs and tasked with making those molecules and it may not be immediately obvious how to make them. You may have some information on how it’s done, but along the way, what you learn is the feasibility of synthesis and any constraints that might exist for future design choices.

So in learning that you can already make the first step into thinking well what would I do with a machine learning model like that retrosynthesis model? Well, you can update kind of based on your learnings from that step and retrain your generative design models with those new constraints. And then likewise when you go to test these molecules in a series of biological assays and the ADMET profiling and other stuff like that you learn a lot about the potency, selectivity, toxicity, off-target effects, everything you can measure about these drugs that you may have had predictions for from the QSAR, docking, pharmacophore, ADMET models but now you have new data and you can go back and update all of those models in real time to improve the predictions on the next iteration of the cycle.

So you start to see how a more continuous learning framework can arise from the existing cycles that exist in the drug discovery pipeline. And this starts to hint at the next transition that’s coming in the field where we currently think of AI as more of a tool which we’ll call augmented AI where you have module assistance for each of these different steps of the pipeline and it’s still there being controlled by humans and informing humans empowering humans.

The next step that things are moving towards is this AI first regime where you have some sort of orchestrated autonomous learning cycle. And here the AI acts as the central control architecture orchestrating not just the DMTA loop but you can extend that loop into into different feedback loops and you can start then thinking about these closed loop continuous learning cycles. Combined with automation wet lab automation bioinformatics automation and everything you need to be self-contained. This is I think the crux of the idea that we hear a lot in terms of lab-in-the-loop.

This is a concept being popularized across a number of different institutions — at Genentech, Aviv Regev has put together a team that is exploring a lot of lab-in-the-loop operations. Nvidia is highly supporting lab-in-the-loop architectures and this is kind of the main goal of getting to a continuous learning closed loop architecture where the AI proposes novel molecules. It synthesizes and tests them automatically, gets the assay results and feeds them back to update the model for subsequent iterations.

I think as we’re moving into this regime, it’s important to understand some of the key machine learning paradigms. And so I’m going to talk a bit more about those in the in the next slides but I’ll introduce them here in terms of you may have heard of things like active learning, Bayesian optimization, reinforcement learning and so forth. And then the third component which I mentioned earlier is the automation component.

So right now there’s a range you don’t necessarily need automation in order to build these loops. You can have human in the loop doing the experiments.

But of course the hope is that by having automated experimentation you maybe reduce some of the variability in the experiments increase some of the reproducibility as well as the speed at which you can experiment. You can run all day and night highly parallelize things and so it’s going to scale a lot better.

So thinking about how we’re making this transition I think organizational principles are one of the big bottlenecks. There’s a big issue that we all face with adopting new technology in terms of understanding how it works and deciding to what extent we can trust the decisions that it’s making.

As we work as programmers with these AI tools like Claude, Cursor, Codex, we see and get immediate feedback on how well it solves problems. How many iterations and recorrections we need in order to keep it on track and do what it needs to do. And we can gain some sense of how much we can trust the decision-making that’s happening.

It’s a little bit harder in drug discovery because I think primarily the cycles are so long. So benchmarking these tools is very difficult if it takes five years in order to get something up running molecule created and then get it through trials before you know that it works. Of course it’s a very long feedback cycle and it can take quite a while to develop that kind of trust.

Moreover, we’re handing more and more decision-making over to the AI where traditionally humans maybe directors director level people are making these decisions and that introduces some accountability questions and other organizational problems. So I think one of the most important things that we can do in order to help facilitate trust in the decision-making is understand at a basic level how these decisions are being made. If we’re going to let AI determine what experiments to do next and where to allocate those resources, it probably helps to understand a little bit how it’s making those decisions.

So again I’ll introduce briefly the concepts of active learning, Bayesian optimization and reinforcement learning as kind of the three main techniques right now that you see in these sorts of systems where active learning is one in which the AI understands a bit about what it doesn’t know or what it’s most uncertain about and then it targets experiments in order to learn the most it can in the next set of experiments. So this is a fairly straightforward concept. I think scientists think in much the same way.

Bayesian optimization is maybe more product focused. You’re trying to optimize some property of a molecule and you kind of have to navigate a search space, do some hill climbing on a landscape that you’re inferring while you’re climbing it. And this is a method that’s used in order to find kind of the most potent drug out of a large set of molecules without having to test them all.

And then reinforcement learning, something that we read a lot about these days in terms of particularly the LLMs, is a method of learning that’s really finding trajectories through that space. It’s trying to optimize a sequence of decisions or what’s called a policy in order to optimize long-term gains. And you know that’s very computationally expensive, maybe not as efficient. But it has some strengths over the previous two particularly Bayesian optimization in terms of parallelization capabilities and ability to explore the space in a more efficient way but also has some some drawbacks in terms of inability to learn in sparse spaces and so forth.

So I’ll just show kind of a graphic example of active learning. I’ve got the other two but for the sake of time we’ll skip over it. You know, imagine you’re trying to do a classification task where you’ve got, you know, red team on the right of class one and blue team on the left of class zero, whether these are, you know, whatever you want to call them, toxic, non-toxic, and then you’ve got a bunch of unmeasured molecules.

So each dot is a molecule here. The ones in white are ones that we don’t have any data on yet. The ones in orange are also ones we don’t have data on yet. But in fitting the boundary between red and blue, we find there’s a bunch of unfitted molecules along that boundary. And we color this orange to point out that these are maybe the most uncertain in the whole model.

From the model’s point of view, learning what these are would likely have the most impact on what that decision boundary is. And so you go forth and you test those with the idea that you’re really trying to choose the next experiment in such a way that it can have the maximal impact on your prior beliefs about what that boundary should be. It maximizes your information gain.

And so I think that may help a little bit better in understanding how these lab-in-the-loop systems are working. As a result, I think that waterfall topology of the standard drug development pipeline, we’re going to start seeing change a little bit. It’s going to become possible to flatten and collapse different stages into each other where they all share maybe a group of objective functions that you can optimize simultaneously which can dramatically shorten certain stages of the pipeline.

At the same time, we can also merge and parallelize certain loops. So you can you can do all sorts of different DTMA loops at the same time, integrating all of that feedback instead into what’s called a continuous feedback mesh where you have a bunch of models all kind of conditionally dependent on each other all being updated whenever new data comes in being concurrently influencing each of their predictions for for the next cycle. And you know the one of the most important changes in this process is the shift of the human role into that of a supervisor.

So as humans shift from the decision makers and the gatekeepers to the supervisors you know they’re going to start overseeing these autonomous loops monitoring them to make sure things are on track and only intervening strategically while the AI is handling the rest of the routine iteration and optimization.

We’re starting to see real world examples and commercial platforms. So, if you want to build a startup and design one of these systems yourself, of course, you’re free to do so. Many of those tools are open-source but putting them together can require a lot of effort, a lot of engineering. There are commercial platforms that can be licensed and there are places that are are using these that we can use to benchmark success rates. So I think Insilico Medicine is maybe at the forefront of this. They have a a fully automated robot robotic lab. 31 active programs and claiming to achieve concept to phase one in under two and a half years. And there’s a platform Pharma.ai you can license from them or form partnerships. Similarly with Iktos this is another big player in the field. They have a a similar licensable software-as-a-service program. Recursion and Exscientia are our big players in the field that everybody’s watching to understand how the progress and whether we can speed stuff up is actually progressing.

Isomorphic Labs of course is deep in this and all of the commercial platforms and tools from Schrodinger to the Nvidia and AWS tools the HuggingFace models and I’ll even note new lab automation as a service like Strateos offers these cloud labs where you can control the automation.

So to conclude with a future outlook, I think we’re moving towards this period where autonomous AI scientists are going to start leading a lot of the process. They’re going to be able to design, synthesize, and test molecules in these closed loop cycles. And this is going to improve data quality and integration with every step. The data sets are going to become more unbiased and more accessible. And I think again the key component here is human trust and collaboration which is definitely going to take some time to develop. And I think that may be the most interesting part of of the of the path forward that we’re going to experience in the coming years.

So with that I’ll kind of conclude the talk. Again bringing it back to to Grant and The Bioinformatics CRO. Again, happy to kick off this this inaugural webinar. There’s going to be three more following me over the coming months and I’ll turn it over back to Grant to introduce those speakers and tell us what’s coming.

Grant Belgard: Nick, thank you so much. Yeah, so our next webinar will be broadcast at the same time, 11 am Eastern on Tuesday, November 11th. We’ll be joined by Ania Wilczynska, senior director of Bioinformatics and AI at Bit.bio. So hope to see all of you there. But Nick, questions. So, everyone watching, you can put questions, in the chat and we can kick off with, what do you think of the new foundation models for single cell analysis? Are they having an impact on drug discovery?

Nick Wisniewski: Yeah, they’re very interesting. I was very excited when I saw Geneformer and SCGPT come out and I think there’s been a lot of adoption of these at new startups. This is a big part of the new phase of AI target discovery.

So I think the things that they bring to the table that are fantastic are moving things into the transfer learning paradigm where you know you can bring in a whole bunch of knowledge and do zero-shot predictions on your data without having to to train or learn from external data sets that’s already been done for you.

It also gives you a good way of representation learning and so it gives you a bit of a new representation by which to learn stuff. I think the benchmarking of these things hasn’t shown much more than maybe moderate increases in performance in cases like predicting drug perturbations. I think the benchmarks are still showing no clear improvement over linear methods which is a bit surprising, and I think it’s it’s important to look at that and wonder whether or not that’s telling us something about the data about the algorithms and about biology. I think there’s something to learn there.

My guess is we’re probably coarse graining somewhere, whether that is in the molecule set that we’re using. I think there’s been some recent studies showing that maybe you need to know what the phosphorylation state of every intermediate molecule is what that chemistry mess actually happening in the cell looks like. And that by just measuring broad activity, you may be fine graining or coarse graining too much. And the other is maybe we’re course graining in time and that there’s dynamics that need to be learned that aren’t being captured by our snapshots. Yeah. They’re always very often focused on steady state. Yeah.

Grant Belgard: What do you think the impact of Claude for life science will be in drug discovery? Speaking of recent developments.

Nick Wisniewski: Yeah. This is you know I spent a lot of time looking at it yesterday after I saw the launch. I don’t know if you’ve had had much time to explore it.

Grant Belgard: Few minutes.

Nick Wisniewski: Yeah. I mean it the integrations it’s made I think are fantastic.

Like you know we tend to think particularly in bioinformatics in terms of some of the scientific questions you know these foundation models and stuff like that but when you actually work in in the pipeline and in the lab you notice the overhead in terms of connecting different systems particularly ELN’s like Benchling, the inconsistent metadata that you might find across experiments and the the access to data is a real bottleneck for bioinformaticians in order to to get the data, synthesize it, harmonize it and move forward.

So I think it has the capability to really have huge impact in the way that bioinformaticians work as well as biologists because it gives them access to a lot of this data. I think there, you know, probably other questions having to do with reproducibility that come out of these tools. Every time in the past where we’ve seen access to tools whether it was you know buttonclick testing of p values that you could throw models at everything you saw you know an increase in p hacking and loss of reproducibility and stuff like that so it’s going to be very interesting to see the impact on actual science that Claude has.

My impression of that largely comes maybe from experience and that you know working with Claude when you’re programming you get a lot of “you are absolutely right!”. Let me and I can say you know most of the time I’m not absolutely right. And in my 20 year career working in bioinformatics and biology, I don’t think I’ve really ever said those words aloud in the practice of doing biology. So, you know, a lot of biology comes from pushing back and creating a lot of counter scenarios and debunking ideas rather than the narrative driven science. And so we’re going to see, I think, how Claude navigates that space and whether it’s a positive contribution in that sense.

Grant Belgard: Yeah, that’s a really good point. You can certainly imagine someone running the same query a few times until they get their their favorite gene showing up in a list and running with that, right? So we have a question from the chat. What do you think the timelines are on the transition to AI scientists?

Nick Wisniewski: This is a great question. So you know, of course, predictions are always hard, especially when they’re about the future. And these timelines are, of course, maybe the most contentious part of the AI field because there’s so much hype around them and the fundraising that goes into things. There’s a number of different influences on the timelines that I think go beyond just the development of the tools.
The adoption of these tools is going to be slower than they can be developed particularly at large institutions which you know use most of the resources in the field. You know for good reason big pharma is going to be slower to adopt these systems than the startups.

So I think we’re going to see more development happening in the startup space than in the big players probably with a continued pattern of then acquisitions whenever somebody’s successful. Given the current funding environment for startups you know factoring that in there may be a delay and that delay particularly in The States may be overshadowed elsewhere. So, you know, I read a lot these days about how far ahead in automation places like China are in terms of biotech research. And so, I wouldn’t be surprised if we start seeing the first successful closed loop continuous learning labs emerging from somewhere like there rather than San Francisco.

But in terms of then guessing an overall timeline given those factors and still the need for some development in the automation robotics and the manufacturing of those so that we can get them into labs cheaply here. I think we’re still looking at like 5 to 10 years before we get to these systems even though the capabilities to do this may come a lot sooner.

Grant Belgard: And what’s the one misconception about AI first pipelines you’d like to correct before we wrap?

Nick Wisniewski: Yeah, that’s a great question. I think again the idea that they may be a magic bullet. I think there’s a lot of hope that it’s going to improve reproducibility, reduce variation, and accelerate the speed of research.

But also given the fact that we’re seeing only modest improvements in terms of performance over linear models and stuff like that, it still depends on having the right set of molecules, knowing if you need dynamic real-time data as opposed to snapshot data like we’ve been using. And so it may not be I think if we institute it right now given the the same tools that we’ve been using it it may fall flat in terms of delivering on its promises and I think we need to to also incorporate that ability to question whether or not the data that’s being posed to it is well posed. And I think from a scientist point of view, this is often the ground floor when you approach a problem is asking, is the problem well posed?

And until we build in that base level intuition into these things, it’s easy to start optimizing or overoptimizing something that shouldn’t be optimized in the first place. I think that’s a really good cautionary note.

Grant Belgard: Well, Nick, thank you so much for joining us and all our viewers, thank you for joining. We’ll see you November 11th. Bye-bye.

The Bioinformatics CRO Podcast

Episode 67 with Manos Metzakopian

Manos Metzakopian, co-founder and CEO of CellCodex, joins us to discuss CellCodex’s mission to provide high-quality, scalable cellular perturbation data, ready to train advanced AI models for biology.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Manos Metzakopian

CellCodex is a CRO that generates AI-ready perturbation data at scale. Our founder and podcast host, Grant Belgard, is also a co-founder and the CTO of CellCodex.

Transcript of Episode 67: Manos Metzakopian

Disclaimer: Transcripts are automated and may contain errors.

Grant Belgard: Welcome to the Bioinformatics CRO Podcast. I’m your host Grant Belgard and today I’m joined by Manos Metzakopian. Today’s episode is special. We’re using this conversation to introduce CellCodex to the world. Full disclosure, I’m a co-founder and the CTO of CellCodex and Manos is co-founder and CEO. We’ll explore what the company is setting out to do, the scientific and engineering choices behind it, and Manos’ path to this point and practical advice for anyone building at the intersection of wet lab and AI. Let’s dive in.

Manos Metzakopian: Wow, this is amazing. Thank you for the invite.

Grant Belgard: So what would you like listeners to know about CellCodex?

Manos Metzakopian: When we started CellCodex, we imagined a world where there’s abundance of drug targets and basically that there is a cure for every disease. And a major development that happened in the recent years was artificial intelligence gaining this capability of taking large sets of data and providing such solutions. That happened with large language models, with ChatGPT, where all text has been collected and you can now interrogate all text that has been around for use and you can gain a lot of speed in your daily tasks. So imagine if you had an AI model for biology, for discovering new drugs. And that model helps you increase drug target discovery efficiency, but also efficiencies going to the clinic and increasing your chances of success once you go to the clinic. Because at the moment, most of the drugs that reach the clinic fail. And there’s a lot of iteration that goes into drug discovery.

Manos Metzakopian: So AI has the potential of solving these problems. Now, for biology, there isn’t this counterpart of data sets that was there for ChatGPT and text. And there is a big need for data so that the right AI models are trained to realize this future. And yeah, and this is why CellCodex has been brought to the forefront as it’s been created. It’s to solve biology’s biggest bottleneck, which is data. And AI, as I said, has the power to transform drug discovery, but it needs the right kind of biological data, systematic, reproducible, and at scale. And that’s what we want to deliver. Our vision is to accelerate the arrival of the world where every disease is curable. And the first step is giving model builders, AI model builders, and drug target hunters the right fuel, which is the data.

Grant Belgard: So CellCodex is a CRO that generates AI-ready perturbation data at scale.

Manos Metzakopian: That’s correct.

Grant Belgard: So what problem in biology or drug discovery feels most urgent to address right now, and why start there?

Manos Metzakopian: So at the moment, because of the arrival of AI models that can solve these big problems, the creation of superior AI models is moving at a very fast pace, almost at the pace of weeks and months. Whereas a data generation that can feed these models and allow them to be trained, it’s still very slow. And it’s moving at a speed that is not satisfactorily reaching the speed of model creation and testing. So the most urgent gap is reproducible perturbation data we have. And we have plenty of observational data at the moment. However, these are snapshots of what cells look like. So from observational data in biology, we have almost 14 times the amount of data that was there to train ChatGPT. However, it’s the quality of the data and the kind of data that is available that is important.

Manos Metzakopian: And unfortunately, we do not have that right type of data, the perturbation data, the intervening data in cell identity, cell state, and cell function. Without that, AI can’t move from correlation to causation. We started there at CellCodex to create large-scale perturbation data to solve this problem and to allow AI, artificial intelligence, to realize its promise, speed up drug discovery.

Grant Belgard: When you imagine the ideal outcome of this effort five years from now, what does success look like to the end user?

Manos Metzakopian: The success rate for the success is very simple for the end user. It looks like there’s faster drug discovery programs, fewer dead ends, more success in drug target identification, and higher success in the clinic. And for those to be powered by our AI enabling data sets. That’s how I see our success in five years from now.

Grant Belgard: What kinds of decisions do you hope our work helps people make faster or confidently?

Manos Metzakopian: There is a lot of work that goes into drug discovery that takes many, many years. And we can speed it up at the rate of weeks and months to be able to make these decisions in weeks and months versus years. And that includes which targets to pursue, which mechanisms are causal, which disease models are worth investing in. Right now, those decisions often take years and huge budgets, and we want to make them faster, cheaper, and with higher confidence.

Grant Belgard: What milestones are you most comfortable sharing at this stage, and what should listeners watch for next?

Manos Metzakopian: So a major milestone for us is that we’ve set up and are continuing to build our platform, the CellCodex platform at the Babraham Research Campus at the moment, where we are going to launch our first collaborations, partner projects, client projects. We also want to publish benchmarking data sets that show what AI-grade data really looks like. So listeners should watch for collaborations where our data sets are powering new models or enabling novel drug targets and emerging new drug targets due to our data sets and enabling of client models, our AI models.

Grant Belgard: When building virtual models of cellular behavior, what principles guide how you define the unit of prediction or simulation?

Manos Metzakopian: So we think of units not as just a number of cells that are being evaluated as it’s being done in observational data, but we are also thinking of cell states, cell states under perturbation. So a meaningful unit for us isn’t just a cell at rest or it’s just in its normal environment. It’s a cell that is responding to a defined change. This is the building block for causal AI modeling, I would say.

Grant Belgard: Where do you draw the line between a correlational model that’s useful and a model that supports causal reasoning?

Manos Metzakopian: So a correlation model is useful for pattern recognition, but causation comes when you’ve systematically perturbed the system. So our role here is to generate that causal data so customers can build models that go beyond what co-occurs and moves to what actually drives change. For example, in disease, point mutations can lead to changes in cell state, and these are not just co-occurring mutations. They are driving the change. So we are interested in data sets that empower models that can quickly identify mechanisms that actually drive change in cells.

Grant Belgard: In your view, what types of measurements provide the most leverage for learning cell state transitions under perturbation?

Manos Metzakopian: For us, at the moment, we need single-cell multi-omics, and we have two major capabilities to sequence RNA, cells messenger RNA at scale, but also to acquire epigenetic changes, the epigenetic landscape in the cell through ETAC sequencing. So that captures which areas of the genome are open, and so you know which genes are expressed, but also correlate those to which areas of the genome are open as well. So these two data sets provide, number one, which genes are expressed, how are they changed under perturbation, and very importantly, which features of the epigenome change. So when you sequence, when you have ATAC sequencing, you can also correlate the changes in many features of the genome to the gene expression changes as well. So that adds a lot more information to interpret causation versus correlation.

Grant Belgard: How do you think about biological context, cell type, state, microenvironment, when designing a modeling target?

Manos Metzakopian: That’s a very, very important question. This is essence of what we do in CellCodex. So in CellCodex, we have the functional genomics capabilities to produce these large-scale perturbation data sets through our genetic screening approaches and gene editing technologies. However, the foundation that can lead to the right type of data are the models that we would use to generate these data sets. And so we design our experiments according, of course, to what the clients would need with the cell identity and function in mind, developmental states, co-cultures, which cells need to be together in the dish, and the microenvironment that they are supposed to be growing in. So you take all of that together, and then you have your human cellular models that you would use for your perturbation experiments.

Manos Metzakopian: And if you want to think about it, a perturbation in a neuron means something very different than a perturbation in a fibroblast. So that’s cell identity. So we co-design with customers to choose the context that matters to their question and to problems that their AI models would want to tackle.

Grant Belgard: What would you count as a falsifying result that sends you back to the drawing board?

Manos Metzakopian: A very important thing is the quality of the data, and a lot of it goes into data reproducibility. So we put measures, quality control measures, in place at every step of our platform, so our data sets are reproducible across batches. It would be very challenging if we don’t have batch-to-batch reproducibility. If you think about the cellular models that we are using, so every time we perform tissue culturing and using the cellular models to produce the data, they need to be the same and reproducible. And the data sets that are coming out of these models, the perturbation data sets, need to be reproducible. So we have very strict metrics around that. I would say that would be one of the major falsifying results that can happen in the platform. And we have very stringent mitigation strategies for that.

Grant Belgard: And when you plan data for model training, what are the first three design decisions you lock in and why those?

Manos Metzakopian: Most important thing is the context of which cell models to use, because that’s, if you think about disease, they don’t happen in isolation. They happen in a specific context with specific cell types involved and cell-cell communications happening there. So the cell models to use are one of the first decisions we need to make. Of course, they need to be applicable to our screening strategies as well. And then which perturbations to apply? Is it a gain-of-function perturbation, like using CRISPR activation, or is it a knockdown perturbation, or are we looking at completely knocking out the gene? So which perturbations, and depending on the experiments, the different scales. So we might need a few million cells for an experiment, or hundreds of millions of cells for an experiment.

Manos Metzakopian: So if you’re thinking of foundation models, for example, versus very specifically trained models that would need fewer cells. And which readouts, right? If you’re thinking of omics readouts, which of those readouts, so that you can balance resolution, cost, and downstream utility.

Grant Belgard: What’s your approach to quality control from sample prep through to process matrices?

Manos Metzakopian: Our approach is to have quality control steps embedded in every part of our platform and our process. And to have the right type of standard samples or tests in every component of our measurement. So that then we can always have a good measure of the quality of the data that’s coming through. So from tissue culturing and the cells, quality of the cells that we are using for our perturbations, the quality of the material that we are extracting from the cells, and finally, the quality of data that is being extracted from our cellular models. I would say we have very clearly defined pass-fail criteria up front for customers to know what they’re going to be getting regarding quality of experiments and data.

Grant Belgared: What’s your stance on foundation-style pre-training versus task-specific architectures?

Manos Metzakopian: So I think both are going to be important. You will have customers that are looking for models that can generalize very broadly. So they would be building foundation models, and those would require vast, diverse data sets. So there’s going to be breadth and depth required for such models. And task-specific models will need more precisely curated data sets coming from very specific contexts. And in both cases, that will decide the number of cell types and complex cultures that we would be using to deliver the data sets for both types. Foundation models will have quite broad utility, but the task-oriented ones would be more specific. And we will be producing data sets for both types of model training approaches.

Grant Belgard: What does a convincing benchmark look like to you for the model understands a cell response?

Manos Metzakopian: It can be covered by just one word, I would say, replication and validation. So if, sorry, two words, replication and validation. And that is that we are able to reproduce the same perturbations providing the same data. So that would be replication. And also validation, the outputs of the models that can be validated in turn. So I think these are going to be very important benchmarking tools that we have. That’s how I would think of it in simplistic terms.

Grant Belgard: How do you separate evaluation of biological plausibility from pure predictive accuracy?

Manos Metzakopian: So I would say that it’s very important to focus on biological plausibility. Because if the data itself isn’t biologically valid, accuracy metric will matter in the sense. So it has to be applicable to the scientific challenge that the client wants to tackle. So I would put a lot of focus on biological plausibility, initially, especially in scientific design, in the experimental design.

Grant Belgard: What forms of external validation replication blinded test challenges feel most meaningful?

Manos Metzakopian: It would be great if independent labs can replicate. If you think of it from a replication point of view, if different labs can generate the same data with the same approach, that provides a lot of confidence. But in our case, I think we would think of it as customers successfully using our data to build their models and generalize to new biology. So if they are able to use our data, generalize to new biology, and identify targets, solve their biological problems, and expedite the therapeutic discovery path and increase its effectiveness, then I would say that’s the most meaningful external validation.

Grant Belgard: Who stands to benefit first from this work, and how might they plug it into existing workflows?

Manos Metzakopian: I think at the moment, there is a race happening of different entities and institutions and consortiums and consortia that are working towards delivering a model that can solve a lot of the drug discovery problems. And that includes biopharma teams, and that includes biopharma teams, and biopharma teams, and consortia, and so on. But I think what is currently being understood that it’s not going to be a one-dataset-fits-all. It’s going to be models that are going to be trained to solve specific problems, and they’re going to be requiring their bespoke data sets to be trained with. And so I think it’s going to be less of a race towards the best model, but more of a joint effort to generate the right data for the right models and solve pressing issues in the world. And I think that that day is upon us, for sure.

Grant Belgard: What kinds of collaborations or partnerships would be most impactful at this stage?

Manos Metzakopian: Companies that have bottlenecks in their pipelines where our data can actually resolve that issue.

Grant Belgard: How do you weigh openness, sharing resources, or benchmarks against the need to build a durable business?

Manos Metzakopian: I would say it’s very important to make sure that we are leading in the space of high-quality data sets, AI-grade data sets. And we should think of best ways of sharing benchmarks and best practices openly. However, the large-scale perturbation data sets are contract-delivered, and so there needs to be a balance that ensures both impact and sustainability.

Grant Belgard: What drew you personally to this specific problem space?

Manos Metzakopian: I have always been involved in projects and challenges that require large data and perturbation data. Most recently, we’ve used this know-how in the cell programming field. So to democratize cell types for drug discovery research and cell therapies, and that never required large-scale data sets and so on. And during my time solving these problems in academia and in industry, I realized that the potential for AI to solve the drug discovery bottleneck and lead to a world where there are cures for every disease requires us to rethink the way that we produce data, the quality of the data, its reproducibility and its scale, and the context at which it is delivered.

Manos Metzakopian: And so as I was progressing in my academic and industry career, I’ve realized that setting up a platform like this, which is CellCodex in this case, to generate AI-grade data is timely and very, very important to do so now, where we are at the verge of arising to artificial intelligence-enabled solutions in therapeutics.

Grant Belgard: Looking back, what set of experiences most shaped how you approach leading a science-driven company?

Manos Metzakopian: The most important experience that I had during my academic career and my industry career is managing people effectively, making sure that we are all goal-driven, we are ambitious, and we are enjoying what we’re doing. And in my academic career, I’ve mentored PhD students, master’s students, and postdocs, research assistants, and technicians. It led to amazing work where we’ve published over 30 scientific manuscripts in the fields of genetic screening, cell engineering, and drug target discovery. And similarly, in industry, leading larger teams, the most important thing that leads to success is the team, the people that are involved in driving the work and the goals that we set ahead of us. So I think goal-setting and the people that are along for the journey are the most important pieces of the puzzle.

Grant Belgard: How do you structure your day to balance science, product, people, and operations?

Manos Metzakopian: It’s not always easy to balance between everything. It depends on the stage at which the activities are. If it’s joining a mature corporation where they’ve already set off and they’re on a journey, or in this case, CellCodex, where we are just launching, everyone in CellCodex wears multiple hats, and we try to support each other and help each other so that we can deliver the needs of the company. And I structure my day where I look at the needs of the people, if there’s any way I can help in their day-to-day activities, the needs of the company, and in designing the strategy, and what type of products we’re going to have, and offerings. And of course, now, when launching, we’re thinking of operations. How are we going to operate most effectively? And I would say, at the moment, it’s split 30% equally throughout everything.

Manos Metzakopian: So I would say it’s equally divided across strategy, products, and operations.

Grant Belgard: What advice would you offer to scientists considering a leap into company building?

Manos Metzakopian: You’re not going to feel ready. So at any time point, especially when it’s your first venture. So I would say, if you have the right ideas, and you have a very strong feeling and passion about these ideas, you have people equally passionate with you, and you can work together to make them materialize, then I would jump in, and I wouldn’t wait until you feel fully ready. You probably won’t get to that type of feeling. And it’s not a bad thing. And bottlenecks are not going to fix themselves. So if you see one clearly, then that’s your opportunity to jump in with your ideas to solving a problem in the world.

Grant Belgard: What practices help a small team avoid cargo cult, ML, or overfitting ideas type cycles?

Manos Metzakopian: I wouldn’t chase hype cycles. I would ask if the method helps explain or actually lead to a solution. So I would really think and investigate very, very, very well, very deeply, if a new direction, a new tool, a new approach is really going to make a big difference. And ask yourself if it’s worth the investment. So I wouldn’t chase. I would investigate and research what new things come out.

Grant Belgard: What advances outside your control would most accelerate your roadmap?

Manos Metzakopian: So that’s a great question. So outside, so currently, as I’ve said before, throughout this conversation, this podcast, there are a lot of companies out there that are generating their own artificial intelligence models, and they are using them for predictions that can progress drug discovery. Now, there are a lot of companies that are doing that at the moment already, and there is a big need for data. However, as soon as these models start showing the power that they have in increasing drug target discovery and driving efficiencies in therapies, there’s going to be even a larger need, and there’s going to be a larger number of models that are going to be generated to be trained, and there’s going to be a lot more data that’s going to be needed to train these models.

Manos Metzakopian: So I would say that since there’s going to be such a huge need for data advances that can increase the number of cells that we can analyze in a multi-omics context and technology development that can allow us to analyze multiple modalities from similar samples, all of these will allow for better data, larger-scale data that can provide the fuel that these new models will need in the future.

Grant Belgard: Well, Manos, thank you for sharing the CellCodex vision and the thinking behind it. It was nice having you on today.

Manos Metzakopian: Thank you very much for the invitation. It was a great, great conversation. Thank you. For listeners who want to follow along, the best place is cellcodex.bio and also our LinkedIn page. If you enjoyed this, please subscribe and share with a colleague who cares about building predictive biology. Thanks.

The Bioinformatics CRO Podcast

Episode 66 with Eva-Maria Hempe

Dr. Eva-Maria Hempe, who leads NVIDIA’s healthcare and life sciences business across Europe, the Middle East, and Africa, joins us to discuss her work at NVIDIA, the gaps that AI can fill in healthcare research, and the future of drug discovery.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Eva-Maria Hempe

Eva-Maria Hempe leads NVIDIA’s healthcare and life sciences business across Europe, the Middle East, and North Africa. 

Transcript of Episode 66: Eva-Maria Hempe

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to The Bioinformatics CRO podcast. I’m your host, Grant Belgard. Today, we’re joined by Dr. Eva-Maria Hempe, who leads NVIDIA’s healthcare and life sciences business across Europe, the Middle East, and Africa. Eva-Maria, trained as a physicist, earned a Bill and Melinda Gates funded PhD in healthcare service design at Cambridge, and has since moved through roles at the NHS, Bain & Company, VMware, and the World Economic Forum before joining NVIDIA. She now guides strategy for applying accelerated computing and generative AI, think BioNeMo, Parabricks, and DGX Cloud, to genomics, drug discovery, medical imaging, and more. Eva-Maria, welcome to the show.

Eva-Maria Hempe: Hey, great to be here.

Grant Belgard: So what do you do day-to-day at NVIDIA?

Eva-Maria Hempe: I think in general, my day-to-day oscillates between two major poles, like working in the business and working on the business, or playing the short game and the long game. So on the one hand side, I am responsible for the business. And so that means we have to deliver revenue because if you don’t deliver revenue, you’re not a business, you’re a hobby. And when, on the one hand side, I have to hit a revenue number because if you don’t have a revenue, then you’re not a business. But on the other hand, NVIDIA is all about the long game. Like we are creating markets. We are building things that haven’t been built before. And so it’s really about striking this balance. And what it means, very practical, is on the one hand side, as I said, working in the business. So I have customer meetings.

Eva-Maria Hempe: I work with my team. We’re discussing strategies and tactics, like what should be our sales place? How are we going to work with startups? How are we going to work with this customer? I check out KPI if I see like, are we on track to delivering the revenues that is expected of us? I do a lot of talks and evangelizing to spread the message that NVIDIA is so much more than just GPUs that we have all this great software out there as well, which is super helpful and super valuable to our ecosystem that people can save a lot of time by building on top of what we put out there. So that’s the operational part. And then there is the working on the business. So really the more strategizing, making decisions on, should we focus on enterprises or startups? Where within healthcare should we focus?

Eva-Maria Hempe: To whom do we talk about which kind of topics? To which degree are we focusing on the sale? But where do we see new areas emerging which maybe aren’t driving a sale or even a lot of compute initially, but where we really believe that there are, A, making an impact. And then if they make an impact, eventually it will turn into revenue, which is one of the real beauties about working at NVIDIA that the company is set up in this way to build, to disrupt, to change and to, yeah, you have this luxury almost like it’s a bit crazy to call it luxury, but in a lot of businesses, it’s a luxury you don’t have to really work on your business than just working in the business.

Grant Belgard: So BioNeMo just went open source. Can you tell us about that and what pain point it solves?

Eva-Maria Hempe: Yeah, so in general, as I said before, we’re trying to do at NVIDIA, we’re trying to lift up the field. So we’re not looking for the quick buck. So that’s why we’re not looking to, we’re not gonna change the field by collecting licensed revenues on BioNeMo, but we think BioNeMo is a super interesting, super valuable tool for the community. And by putting it out there as open source, we can just make it much more available to a lot more people. And also we can increase the number of people who are contributing to it with their ideas and making it into something that is a lot more valuable to the community and more powerful and much more in line with the community. I think around the same time that we made it open source, we actually also, we changed it.

Eva-Maria Hempe: Like we turned it into, it has two pieces these days, the one is BioNeMo Framework and the other one is NIMs. So Framework is really, it’s also a collection of microservices, but it’s a collection of microservices, which you need to train and deploy models. So it has a curator and an evaluator and a guard railing part to it. And you can use all of these, you can use any of these, whatever helps you to put out models in a better way. And then we have NIMs and so NVIDIA Inference Microservices and some of them are biology specific. So we have some on folding, we’ve got some on generation, we’ve got some on docking, and you can put this together into reference workflows, which we call blueprints.

Eva-Maria Hempe: I often say it’s a bit like, if you think of a big box of Legos, it’s like the building plan, how you build the most basic thing out of them and then you can play with it and turn it into all sorts of other things. But in general, what we’re trying to do with BioNeMo is really solving the main pain points of drug discovery. So drug discovery is slow, it’s expensive and then also quite technically challenging if you want to use computer aided drug discovery. And so here we’re giving researchers tools to handle complex data, to collaborate and just in general, we wanna have an advanced biomolecular research framework out there that people can use and that they can do their best work with.

Grant Belgard: And for our listeners who aren’t already familiar with BioNeMo, can you give a quick primer on what they can do with it?

Eva-Maria Hempe: So, as I said, it is mostly about computer aided drug discovery. So one way I usually explain it, we have another framework called NeMo and that’s not by coincidence. So NeMo is all about training, deploying models that have to do with language, but by now it’s actually also multimodal and BioNeMo is that for the language of biology. So if you think about a sentence has like words and observes grammar and the same way like a molecule has atoms and observes the laws of physics and chemistry. And so that’s a bit the analogy there. And so the same way that with our language model, you might have proprietary data and you might wanna train a model on this or you might wanna fine tune a model with new data, you can do the same thing with biological data.

Eva-Maria Hempe: If you have data coming in, you can curate it and then you can also make sure, so that’s the curator part, then you can also evaluate it against certain benchmarks. So how good is my model? And then finally you can also make sure it has certain guardrails, so it doesn’t do certain things that you don’t want it to do. And so that’s, yeah, that’s in a nutshell about it. It’s about training, deploying and serving biological models for drug discovery.

Grant Belgard: So AlphaFold has made a huge splash in the structural biology world. What do you think is the next big thing that would be GPU enabled in biology?

Eva-Maria Hempe: For me, AlphaFold is really like, I’m a physicist. So I know when I did my PhD, which in my mind hasn’t been that long ago, we locked up PhD students for three years in a basement to find out the 3D structure of a protein. And now you can just do it on a computer. You can go to build.nvidia.com where we host the NIMs, I said before, and we have a model there and you could fold a protein in like a second live on your computer. And it’s just mind blowing. It even works on my phone. I’ve done it during presentations on my phone. So I’ve folded a protein on my phone within less than a second. In general, there are certain things around AlphaFold. There are certain gaps. So it has problems with dynamics. It has problems with multiple conformations. It can’t do disordered proteins.

Eva-Maria Hempe: And 60% of human proteins have at least one intrinsically disordered region. It’s also not great with protein ligand and nucleic acid interaction. So there are a whole lot of things which it cannot do. And so these are actually also the things we see in the field where a lot of work is going on. And as NVIDIA, we’re doing some research ourselves in the spirit I said before, in trying to lift up the field and trying to show what’s possible and trying to also inspire other people to go further down that path. And so we’re doing some research ourselves. We’re doing a lot of research in collaboration with all sorts of other people. Sometimes we’re open about this. Sometimes it’s not disclosed, but yeah, we’re seeing a lot of things that are going on.

Eva-Maria Hempe: And what we’re seeing in particular in terms of frontiers, I would say, are four things. So we see how do you deal with larger complexes and assemblies? How do you deal with post-translational modifications? How do you deal with dynamics, molecular dynamics? And then also how do you deal with protein design? Like how can you turn AlphaFold around? Like with AlphaFold, you have the sequence and you want to know the 3D structure. Can you have a 3D structure and figure out what is the sequence behind it? So there’s a bunch of work going on in the space and I think it’s going to be super exciting to see what will come out of that.

Grant Belgard: How do you see DGX Cloud changing the barrier to entry for academic labs?

Eva-Maria Hempe: DGX Cloud is like an interesting way, which is part of what we offer. And maybe it’s easier to understand in the greater context of what we offer. So in general, we are very much agnostic of what GPU you’re running your workloads on or what NVIDIA GPU you’re running your workloads on. And that is a huge advantage for people who are working with our software because we don’t want to lock anybody in. The only commitment you’re making is you’re going to work on GPUs, which I think is not a bad lock-in. You’re not locked in any other way, but that you’re going to be using GPUs. And those GPUs, the answer what GPUs are the right ones for you will again very much depend on your situation. Like, do you have a data center? Is your data center big enough? Has it liquid cooling?

Eva-Maria Hempe: Does it have enough electricity? Do you even want to run a data center? Or do you have big spikes where you need really high performance computing capacity in a short amount of time? And DGX Cloud is following our reference architecture. So it’s really all the different components, the GPU, CPU, networking perfectly aligned with each other. And it’s in the cloud, it’s on demand. So what we see it used quite often for is spike. And if an academic lab has that, if a lab is trying to train a huge model, it can be the right thing for the lab. And it could be a great way as well to showcase the power of it, but it’s not always the right solution. Sometimes it’s also worthwhile to build your own on-prem capacity or to go with more conventional cloud capacity.

Eva-Maria Hempe: So I think it’s an element of a larger compute discussion, but it definitely allows academic labs if they have the funding, if it’s basically baked into the grants to really get top-notch performant GPU computing on really short timescales.

Grant Belgard: And at what stages in the process does AI assist drug discovery today?

Eva-Maria Hempe: Pretty much along all of them, I think we see different levels of activity. So we see a lot of really early discoveries. So it starts with things like finding new targets, which I think is an interesting one. I think it’s one where we don’t see, I think you could see even, I would hope for even more activity. Somebody told me the other day how many people are working, how big the overlap is between working on the same targets. It’s mind blowing. And for example, what we talked before, intrinsically disordered proteins is a super interesting area to really find new targets, to be able to address parts or proteins, which so far have been undruggable.

Eva-Maria Hempe: And we’re working with a company there, they’re called Peptone, and they actually, AI supported, have found a method to figure out the structures of disordered proteins. So I think this was super exciting. So we’re starting there. And then of course, we have all the virtual screening workflows in terms of, okay, you have a target, you fold the target. Then you have something like MolMIM or like a generative model, which starting from a particular small molecule creates all sorts of variations of that small molecule. And then you take your protein and your multiple variations of small molecules you generated, and then you use another AI model, which can calculate how well they fit together. And as I said, that’s an area of active research as well.

Eva-Maria Hempe: How well can you really calculate those bindings? And again, another company we’ve worked with, they’re called Inoform. They can actually also do a, they can create models that fit into a particular, or molecules that fit into a particular cavity. So there’s a lot of interesting things around there on the real fundamental level. But then there’s even more to it. There’s, we’re trying to figure out how can we also, or companies are figuring out how can you apply AI to pre-clinic?

Eva-Maria Hempe: And then even in clinical research, or the clinical stages of drug discovery and drug development, there is still so much that can be done because so many drugs don’t necessarily fail because the biological mechanism isn’t there, but often also because you can’t recruit patients, you can’t recruit the right patient. And again, AI can actually have a huge contribution to solving these kinds of problems. And then you can go into manufacturing and selling drugs. So I always tell my clients that AI is a topic along the entire value chain. And we are seeing applications today along the entire value chain. Like every single step, there is somebody working on something and a lot of progress is being made.

Eva-Maria Hempe: You still have the whole issue that just things take a very long time because like clinical studies just take the amount of time they take. You can have a bit of time out there by doing optimized recruiting of trial participants, which is usually a pretty of a delaying factor, or you can use AI also to speed up the data analysis and regulatory writing, clinical writing, submissions processes. So there is some speed up you can do there. But I think in terms of the speed up is more happening in the earlier phases of drug discovery. And then in development, we really have more of a trying to figure out where do they work. So a lot of work I see in that area as well is around biomarkers.

Eva-Maria Hempe: Again, figuring out what works for which patients so that it feeds back into the early stages, but then also once you’re in trials, you have the right patients in your trials and you have a better chance of actually making it through phase three, doing efficacy. I said about all those different ways, how AI can help with the preclinical part. And there is actually real good data on that by now. So, and SILCO is really famous about this and they were smashing it. They had 22 developmental candidates between 2021 and 2024. And actually they were able to get on average to a developmental candidate within 13 months. So around 70 molecules synthesized per program. And the fastest was like nine months and the longest was 18 months.

Eva-Maria Hempe: And this is just like a huge, huge speed up to what you usually see, but these kinds of processes take years. Interesting, so that’s the preclinical phase where it’s really about the speed up and you can also go from target and lead identification over lead optimization in 46 days these days. So all of this is amazing. And I said before in the clinical studies, it’s then really about being better. And there was a paper which came out last year where they looked at AI discovered drugs. And for phase one, the success, probability of success was twice as high as for regular drugs. And it was still pretty bad, but it was twice as high. And then for phase two, it was in line with the averages, but for phase two, the numbers started to become quite small.

Eva-Maria Hempe: And for phase three, there wasn’t enough data. But if we assume this holds, if you assume you’re twice as successful in phase one, which is not unrealistic because phase one is all about safety and with better models, we get better idea of target effect, and then phase two and three about efficacy and a dosing on part, then this actually means we’re going from one in 10 drugs, making it to markets to two in 10 drugs. It’s still a lot, but it’s basically, it’s halving our cost per drug. And if a drug costs these days, on average $2 billion to make it to market, saving a billion dollars per drug. So this is huge. Your potential is huge, which I think is why we’re all still working on this despite all the problems we talked about of long timelines and difficulties to get funding.

Grant Belgard: Where are the biggest talent gaps in bio AI today?

Eva-Maria Hempe: I think it’s really about speaking multiple languages. And the question is also talent where? So we have and– and what keeps things from reaching or from reaching impact. So I think if you look at a lot of the biotech, tech bio, we still have the issue that the entire pharma ecosystem is set up in a particular way. Somebody said it the way, like it’s a coin flip. And we know that the coin is unfair. We know that heads gonna come with a 10% probability. Now what these companies are doing, they’re actually trying to improve the coin minting process. So by using AI, we’re trying to mint better coins. We’re trying to mint a coin, which has a 20% chance of heading up, landing heads up. But this is really hard to prove.

Eva-Maria Hempe: And the entire system, the people in the VCs, all their mindset is like a biotech investor mindset. And they’re looking for the things around a 10% coin flip probability. And it’s really hard to evaluate this. Is this really going to get us this lift up or not? And different to other areas of AI like quant trading where you have immediate feedback, you change something, okay, you’re gonna make more money. Great, let’s do more of this. Here, it’s almost the complete opposite of quant trading. You have like 10 years until you see whether it works or not. And I think that’s actually one of the biggest gaps.

Grant Belgard: Even with the 10 years, it’s small in, right? So it trickles through after 10 years.

Eva-Maria Hempe: And so, yes, I think we need to have more people who speak multiple languages of AI and of data science and of biology. But I think we’re starting to see some of that. But I think it’s really more the system as a whole and the incentives and the structures and just the fact that we’re dealing with biology, which takes 10 years to come. But I’m still optimistic.

Grant Belgard: What are your thoughts on community standards such as OpenFold and so on? Are there areas where there are glaringly obvious missing standards or areas that you think are still being held back by a lack of standards?

Eva-Maria Hempe: At NVIDIA, we are big believers in open source. So we think it’s the one way to really harness the power of community. And we are big believers in the community. NVIDIA is all about communities, about ecosystems and us doing our part to help the ecosystem develop, which is why so much of our software is actually open source because we believe in the power of this approach. And we really wanna support it to come to full fruition.

Grant Belgard: Well, it’s essential to save biotech and pharma, right? The internal rate of return on R&D has been abysmal below the cost of capital for many years now. And at last that turns up.

Eva-Maria Hempe: It’s actually interesting because of those $2 billion per drug or one and a half billion dollars per drug, only I think it’s around 300 or so are the actual cost. All the rest is the cost of the failed drugs and the cost of capital because the capital is just locked up for such a long time and you have so many failures all around. And the other thing I think, I don’t know, you’ve probably seen it, it’s called Eroom’s Law. If you take how many drugs $1 billion in research spending buys you, it’s a logarithmic downward over the last 70 years. This is not recent. This has been going on forever, but it’s just starting to get into areas where it’s just really, you just can’t continue this way. We just need a different way of doing things.

Eva-Maria Hempe: We just can’t continue spending more and more and more and getting less and less and less.

Grant Belgard: So shifting gears, let’s talk about your own journey. What pulled you from physics to health?

Eva-Maria Hempe: It was the impact. So I was sitting there in my lab. So I was doing quantum optic, which means I’m sitting in a dark lab because I was dealing with optics and lasers. So you don’t want daylight messing up your experiments. So you go in in the morning, it’s dark. You leave in the evening, it’s dark. And during the day, it’s dark. And I was just thinking to myself, what is this going to do for the world? And back then we kept saying, oh yeah, this could be used for quantum computing. But back then I was like, well, but this is going to be at least 15 years until anything useful. And I have to say, this has been more than 15 years ago by now. So I was just like, okay, is this really it? But then as with those decisions, usually two things have to come together.

Eva-Maria Hempe: And the other part, which was for the ignition to really change tack was just meeting the right person at the right time. So I met this girl and she was an electrical engineer by training. And she studied how procurement processes at the hospital affect patient safety from with this very scientific engineering frame of mind. And I just thought that it was fascinating. Like all the way I’ve been trained to think, which like I really liked the scientific method. I really liked this way of thinking, but applying it to real world problems. And that’s how I got to study healthcare service design.

Grant Belgard: Are there any insights from your PhD that you still use?

Eva-Maria Hempe: Yeah, I think it’s really that organizations are an interplay between structure and people. And that sounds very simple and very obvious, but if you’re designing an organization, you’re not actually designing an organization. You’re designing almost a scaffolding for the organization to grow around. You’re giving some structure, but an organization isn’t the org chart. It isn’t the policies. It isn’t the trainings. It’s the people which are populating those structures, which are interacting, which are meeting each other or not meeting each other. And I think that was a really important insight which has like, it pops up everywhere. Now, one of my big challenges at work is like how do I get enterprises to adopt AI?

Eva-Maria Hempe: That’s again, an organizational question. As much as a technological question, actually technological question is like, maybe not even half of it. A lot is really about how do you get people to adopt it? How I get people to use it? What are the incentives they’re listening to? Who has power in this organization? How is this organization really structured? So yeah, I still use some of the things I learned, I studied.

Grant Belgard: And what did you learn in your time with the NHS that you think tech sector often misses?

Eva-Maria Hempe: I think in the tech sector, it’s easy to look at everything through a technological lens that, oh yeah, we can improve this, we can do this. But a lot of my research and my work was about design thinking, which is very much empathy. You start with the end user, you immerse yourself into the end user. Ideally you get to observe, you get to shadow, but you get a real idea of what are people doing and what’s the real problems and how can technology help that? I think this empathy, this user-centric view is sometimes a little bit missing in tech. I think what we also discussed before, you’re creating a great tool and maybe the people you tested it with like it, but it has to fit into the workflow. It has to fit into the real life. It’s all about minimizing friction.

Eva-Maria Hempe: I was saying the other day, just like if you wanna drive real value in organization, it’s about having something that has as little a friction as possible and as much immediate value as possible. And then you’re gonna see adoption. If it’s high friction, it has to have even higher value. If it’s low value, it has to have even lower friction, but ideally it has both.

Grant Belgard: Can you tell us about your time at the World Economic Forum and how that impacted the work you do today?

Eva-Maria Hempe: Yeah, the forum really is about multi-stakeholder and what role policy plays. And again, about what are the right incentives and how can you align the incentives of multiple different parties towards a common goal. So what I did there, it was about the future of healthy. So how do you make staying healthy a business versus having people get sick first and then making them healthy again? I mean, that’s an established business model, but why are we there? Why can’t we just keep people healthy in the first place? And there it’s really about thinking through the food industry. How can we make it a better business for the food industry to sell healthy food? How can we make it better for the doctors to be paid to keep the patients healthy?

Eva-Maria Hempe: There’s models for that where they get basically paid per patient in their catchment area, but they don’t get paid for the procedures they do, but they get like a fixed fee. It has all its pros and cons, but really think through things from a joint value and joint incentive point of view. And like I said, again, when you’re trying to change big systems, whether it is an organization or whether here it is like a multi-organizational system, it’s really important. And this is something I think I couldn’t imagine a better place to learn how you navigate these things, how you deal with politicians, how you deal with all the different lobbyists and all the different interest groups and really try to drive towards a common goal. And I think there’s no better place than the forum to learn that.

Grant Belgard: Can you tell us about your time rowing in Cambridge and did that develop you in any way that’s useful today?

Eva-Maria Hempe: Yeah, I got to Cambridge twice. The first time I went to Cambridge, it was for a summer research as part of my master’s thesis. And I knew people and they made some connections for me. And so I was at Cambridge during the summer before the freshers arrived. And then the freshers, so the first year students all came in and all the clubs started recruiting and the rowing club started recruiting and they tried to recruit me. And I was like, yeah, no, I’m only here for a few more months it doesn’t make sense, I should still do it. And I didn’t do it. And then I came back to Germany where I was finishing my studies and everybody was like, oh, you were in Cambridge, did you row? I’m like, no. And then I really regretted it. I was like, well, I really should have.

Eva-Maria Hempe: So I promised myself if I make it back in for my PhD I’ll give rowing a go. And so I did, and initially I wasn’t that good. So I was in the second novice boat. I didn’t even make the first novice boat. I was in the second boat, but then I just kept at it. And I barely made the first boat in the next term. There’s three terms in Cambridge. And then in the third term, I was still in the first boat of my college, of my part of the university I was at. And then I was around for the summer. So I thought, okay, the university team is doing a summer program. I might as well try that. So I did that. And then they try to funnel you into joining the team full time. And I was like, well, Cambridge rowing.

Eva-Maria Hempe: The year, my first year I watched the Cambridge boat races and I was like, wow, it must be so nerve wracking and whatever. And then they were like, yeah, you did the summer program. Don’t you want to trial, like just try for the university? And I was like, okay, well, what’s the worst that could happen? I’d taken that lesson of where I hadn’t rowed and regretted it. I’m like, okay, I don’t want to regret. So I just went for it. And then I found myself on the starting line of that boat race, which I just watched a year before. So I went within 18 months from never having rowed in my life to rowing and winning a boat race. And I think the lesson here, as I said, there’s the one about no regrets.

Eva-Maria Hempe: I think the second one about that you’re just capable of a lot more than you give yourself credit for. And I think the third one also just about the power of habits and the power of persistence and the power of community. So there’s nicer things than getting up every single morning at five o’clock, going to the train station, going rowing, barely making it back for nine o’clock to go and to your lab and do your work. And then at five o’clock going back to row. But it’s incredibly disciplining because you only have from nine to five. There is just no, oh, I’ll do this later. You have to be done at five because then you have to leave and go train and you have to be there for training. You can’t skip training.

Eva-Maria Hempe: And so I thought that was actually really useful to fall into this rhythm and go along with it and also shape your environment in a way that helps you do the things you want to do. Because like I said, it’s just not like, I don’t want to get up at five, but I just have to. And then once you’re back from training, you actually feel pretty good. And of course winning the race, nothing feels as good as that. But even if I would have lost the race, I still like, yeah, it was interesting because just before the race, it was about an hour or two before the start. And I remember we were in the boat bay and did like a little circle of the whole crew. And until then I had a bit of nerves, but from that moment on, I was just calm. All the nervousness, all the nerves were just gone.

Eva-Maria Hempe: And I was just like, well, I put everything into this I could, I have no regrets. So whatever happens now on the water, I can look back at this day and I’m proud because I did whatever I could to get to this point. And I think that was interesting because the year before I thought those people must be so nervous when they sit on the start line. But actually when I sat on the start line, I was just calm, I was just ready to do this. And basically put in the work.

Grant Belgard: Why NVIDIA, what sealed the decision for you to join?

Eva-Maria Hempe: It’s because we are a $4 trillion company. No, of course not. Actually, when we joined, I wasn’t. When I joined NVIDIA, it wasn’t a $4 trillion company. No, it’s just, I couldn’t imagine another place right now where you’ll have this impact on the entire ecosystem of healthcare. We work with everybody. We’re the one AI company which works with everybody else. So I get to work with startups. I get to work with established companies. I’m on the forefront of what’s possible. And at the same time on the forefront of what’s possible to do an organization like the thing we thought before. I mean, on the one hand side, we’re looking at models which can design proteins based on 3D structures.

Eva-Maria Hempe: But on the other hand, we’re also looking at rolling out procurement agents because that solves a real problem in the organization today. So it’s just a really exciting place to be at the center of the action around AI and healthcare. And so in general, it just felt like a place where a lot of the things I’ve been doing in the past sort of all came together. Like the multi-stakeholder management of the forum, the strategizing of almost 10 years in consulting, the operationally leading a team and helping people and creating strategies and tactics to make your number, which I did at VMware. And yeah, it just wrapped into sort of this one package of doing something really exciting and really exciting in a field I’m super passionate about.

Grant Belgard: For early career computational biologists who were looking at entering industry, what three skills should they cultivate now?

Eva-Maria Hempe: It’s a bit difficult to say because I’m not a computational biologist, but I think it’s also maybe not so much about the computational and the biologist. I just assume people are well-trained in those fields. I think what’s really important is for them to listen, to sort of to listen where the problems are, what’s being done, where people struggle with. I think the other thing is to really understand value. So I think there’s a lot of interesting work. If you want to do really cool and interesting work, and maybe it’s a bit controversial, but then academia is the place to be. Like if you just are in for the cool, by all means, that’s what academia is supposed to be. If you’re going into industry, then you need to have a nose for value. You need to start to understand like what’s value.

Eva-Maria Hempe: And value can be very different things. Value doesn’t necessarily mean the biggest grossing drug. It can also just be in line with the research portfolio of the organization. It can be in line with individual values of particular managers, but you need to understand value. I think the last thing it’s about teamwork, because so many of these things by now become so difficult that you just can’t solve them alone. You’re dependent on working with others who are bringing complementary skills and complementary experiences. So I would say three things are listening, understanding value, and working well in a team.

Grant Belgard: For life science founders, when is it worth building their own models versus taking existing models or platforms?

Eva-Maria Hempe: So I think you have to be smart. So do you really have an edge? And AI, in my mind, I always think about in three elements. The one is data, compute, and algorithms. So compute, there are some people who have an edge because they can just buy compute for billions of dollars, but that might not be your edge as a founder. So then it probably leaves either algorithms or data. And if you have something there, yeah, you might want to go for it. But very often, actually, you don’t necessarily need to build a model from scratch. You might not even have enough data to build a good model from scratch. And it might be much more worthwhile for what you’re trying to do and you’re coming back to the point of value. What is the value you’re creating?

Eva-Maria Hempe: It might actually be better to stand on the shoulders of giants and just taking a foundation model and retraining it. And in general, I would always advocate for using frameworks out there because they make your work easier. So BioNeMo is not a model per se, but it’s also a framework which helps you do your models better. And I think you shouldn’t write your own data loader and you shouldn’t have tried to configure guardrails from scratch. Like you have, as a founder, you’re massively resource constrained. So try to think about what are the things where you can really differentiate and focus on those and then try to use platforms, existing tools for all the rest.

Eva-Maria Hempe: And I hope that people are taking something from this podcast is we have so much things out there which we’re putting out there, usually often as open source. We have frameworks and libraries and NIMs and all of this is intended to help you and avoid reinventing the wheel. Like if you’re doing medical imaging, you don’t need to write your own segmentation tool. Like this is all out there. Take it and then build a killer application on top of it. But be smart, look at what’s out there and NVIDIA can offer so much and your favorite AI engine, if you ask it, I have this particular problem, what are the latest NVIDIA frameworks? It should give you a whole list of libraries and frameworks you can use, whether it’s for data science or data frames, et cetera. There’s just so much out there.

Eva-Maria Hempe: I think the last thing for life science founders is as well look into Inception. So Inception is NVIDIA’s free virtual accelerator. So it gets you access to NVIDIA experts, which help you even better find the right tools and right frameworks, which make your money last longer. It gets you into a community of like-minded people and there’s also some programs about cloud credits and or discounts for hardware. So join Inception, look at what NVIDIA has and other people have put out there before you build it yourself and just be really smart about what really drives value.

Grant Belgard: What’s your boldest prediction for AI and drug discovery over the next five years?

Eva-Maria Hempe: I don’t know if it’s five years. I would hope it’s five years, but I think at some point we will look back at the way we do drug discovery today and it will seem as archaic and plainly said stupid as the alchemists trying to turn lead into gold. Like today, if you tell kids, oh, back in the middle ages, you had all those alchemists and they were cooking and the idea was lead is this less noble material and you can turn it into more noble material as gold. People are like, why? And I think we look at the same way a lot of things we do today in drug discovery and we’re just like, why did everyone ever think this is going to work?

Eva-Maria Hempe: And there are like on a more practical level, there’s really smart and really interesting things going on about virtual cells and like better predicting like the link between the genome and actually how cells behave. And then also not just cells because we’re not just cells, we’re whole tissues. So I think we’ll see a lot more understanding and understanding biology, at least to some extent. And I think that will get us to this point of alchemy and how could we have been so stupid.

Grant Belgard: What’s a learning resource you would recommend for every trainee?

Eva-Maria Hempe: I think it’s not a learning resource in the conventional way, but I would really encourage to go on build.nvidia.com because it just shows you what’s possible and you have all those different models and you can play with them, you can get an idea what they can do. And then you can also go to the blueprints and basically see how these are put together. So I think that’s a great resource. And then I would maybe pair that with like, I’m a big fan of perplexity, but also any other LLM agent of choice. I think they are great teachers. They can teach you anything. And the other day I used perplexity in voice mode. And so I was like making dinner and just having this really natural conversation. And there is no stupid question. There is no judging.

Eva-Maria Hempe: You can like ask it anything like just, can you please explain to me again how this works? And I sometimes also use it for some of the NVIDIA stuff. I’m like, okay, can we go deeper on RAPIDS? Can you explain the different libraries? Like how does this work? Why does this work? So I think it’s a great tool to learn about AI, but also just anything else you wanna learn. And it can also challenge you. You can actually also ask it to quiz you and to make sure you really understand things and you explain it back to the machine. The machine actually gives you feedback whether you got it right or you need to brush up a bit more.

Grant Belgard: Yeah, I was actually doing the same thing with a bit of yard work yesterday. Also highly recommend that, voice mode is great. Eva-Maria, thank you so much for joining us. It was great.

Eva-Maria Hempe: Thank you, I really enjoyed it.