The Bioinformatics CRO Podcast

Episode 72 with Sophia George

Sophia George, professor in the Division of Gynecological Oncology at the University of Miami Miller School of Medicine, discusses her research at the Sylvester Comprehensive Cancer Center, where she investigates the genetics and biology of hereditary breast and ovarian cancer and works at the intersection of genomics, health equity, and cancer.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Sophia George

Sophia George is a professor in the Division of Gynecological Oncology at the University of Miami Miller School of Medicine and the principal investigator of the George Lab at the university’s Sylvester Comprehensive Cancer Center.

Transcript of Episode 72: Sophia George

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 71 with Christiaan Engstrom

Christiaan Engstrom, founder and CEO of BLPN, discusses his experience building a space for authentic, non-transactional business networking in the life sciences.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Christiaan Engstrom

Christiaan Engstrom is founder and CEO of BLPN, an invite-only community for life science investors and senior executives to connect.

Transcript of Episode 71: Christiaan Engstrom

Disclaimer: Transcripts may contain errors.

Coming Soon…

The Bioinformatics CRO Podcast

Episode 70 with Joanne Hackett

Dr. Joanne Hackett, VP of Health Systems Services at IQVIA and Chair of the Board at eLife, discusses her hopes for the future of healthcare.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Dr. Joanne Hackett

Dr. Joanne Hackett is VP of Health Systems Services at IQVIA and Chair of the Board at eLife.

Transcript of Episode 70: Joanne Hackett

Disclaimer: Transcripts may contain errors.

Coming Soon…

Bioinformatics Done Right, Now: Streamline your research with computational biology support

Academics, Don’t Wait on the Queue: A Faster Path from Data to Publication


The email arrives: “Your sequencing data are ready.”

It’s the kind of sentence that makes a lab buzz. But after the first rush comes a familiar pause: Who will analyze this, and how long will it take? If you recognize yourself in that moment, The Bioinformatics CRO is for you.

The Shortest Distance Between Data and Figure

Our promise is simple: fast, publication-grade bioinformatics for academics at core‑competitive pricing—without the long waitlist or the learning curve.

  • Speed without shortcuts. In-house cores do important work, but they’re often backed up. We keep our queue short and our response times tight. Projects start quickly once scope is set.
  • Expert time, not training time. Our team is staffed by senior scientists who have shipped many analyses. That experience compresses timelines and reduces rework.
  • Pricing in the same neighborhood as cores. Hourly rates are similar, but our model is built to reduce idle time and cut the “waiting cost.”

In other words: you move faster, often with lower all‑in cost once you account for delays, rework, and the hours you spend shepherding a novice through their first pipeline.

Why Not Just Use a Trainee?

Postdocs and graduate students are talented. They are also busy. Courses, journal clubs, teaching, competing projects, and grant work carve away their hours. If your timeline is tight—or the analysis is non‑standard—asking a trainee to learn on the fly can turn weeks into months. By the time they’ve written code, defended choices, and redone figures for reviewers, the “cheap” path has quietly become expensive.

Working with us is different. We’ve already navigated the edge cases, the batch effects, the parameter cliffs, and the “looks great, but reviewers won’t accept it” traps. We deliver defensible results, clean methods text, and reproducible code.

A Diplomatic Word About Cores and Collaborators

Cores are steady partners, but queues are real, and revisions can be slow. Collaborator labs can be great, yet authorship and priorities get complicated. We’re designed to be your surge capacity and your clean handoff: fast starts, clear deliverables, and no unnecessary authorship entanglements.

How to Work With Us (and Save Money Doing It)

Two engagement styles both work well. Choose the one that matches your project and bandwidth.

1) Clear Scope → Accurate Estimate & Fast Delivery

If you know what you need—say, bulk RNA‑seq differential expression with pathway analysis and four figure-ready plots—tell us up front.

What you get: a tight statement of work, a realistic budget window, and a start date you can put on your lab calendar.
Best for: projects with defined questions, revision letters, or datasets similar to your previous work.
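To make that concrete, here is a deliberately minimal sketch of the kind of analysis such a scope describes. It is illustrative only: the column layout is hypothetical, and a real engagement would use a dedicated count-based framework (e.g., DESeq2 or limma-voom) rather than per-gene t-tests.

```python
# Illustrative sketch: differential expression on a bulk RNA-seq count
# matrix (genes x samples) via log-CPM, per-gene Welch t-tests, and
# Benjamini-Hochberg correction. Not a substitute for DESeq2/limma-voom.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def simple_de(counts: pd.DataFrame, groups: pd.Series) -> pd.DataFrame:
    """counts: genes x samples; groups: per-sample labels ('ctrl'/'treat')."""
    cpm = np.log2(counts / counts.sum(axis=0) * 1e6 + 1)  # library-size normalize
    ctrl = groups.index[groups == "ctrl"]
    treat = groups.index[groups == "treat"]
    a, b = cpm[ctrl], cpm[treat]
    _, p = stats.ttest_ind(b, a, axis=1, equal_var=False)
    res = pd.DataFrame(
        {"log2fc": b.mean(axis=1) - a.mean(axis=1), "pvalue": p},
        index=counts.index,
    ).dropna()
    res["padj"] = multipletests(res["pvalue"], method="fdr_bh")[1]
    return res.sort_values("padj")
```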

2) Engage With Us as You Go → Exploration & Iteration 

Not every dataset announces its secrets on day one. If the plan is exploratory, we’ll move in measured steps—share early readouts, discuss directions, and refine.

What we need from you: real engagement. Quick feedback keeps momentum high and scope aligned.
Best for: new modalities, mixed cohorts, or “we’ll know it when we see it” figure discovery.

A Cost‑Effective Division of Labor

To keep your budget focused on analysis (not cosmetics or copy), split the work like this:

  • Your lab handles:
    • Data and metadata hygiene (sample sheets, consistent IDs, clear conditions); a quick validation sketch follows this list.
    • Figure polish for final submission (fonts, colors, journal‑specific formatting).
    • Manuscript prose (introduction, discussion, and related literature).
  • We handle:
    • QC and rigorous analysis (e.g., DE, clustering/annotation, integration, modeling).
    • Reviewer‑proof choices and statistics.
    • Figure‑ready plots and tables.
    • Methods text and code so everything is reproducible.
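Here is the promised sketch: a minimal pre-flight check for a sample sheet, written in Python. The required columns and the ID convention (a hypothetical PROJ-001 style) are assumptions; adapt them to your own schema.

```python
# Hypothetical sample-sheet check: required columns present, sample IDs
# unique and conforming to one naming convention, no blank design cells.
import pandas as pd

REQUIRED = {"sample_id", "condition", "batch"}   # assumed column names
ID_PATTERN = r"^[A-Z]+-\d{3}$"                   # e.g. "PROJ-001" (assumed)

def check_sample_sheet(path: str) -> pd.DataFrame:
    sheet = pd.read_csv(path)
    missing = REQUIRED - set(sheet.columns)
    assert not missing, f"missing columns: {sorted(missing)}"
    assert sheet["sample_id"].is_unique, "duplicate sample IDs"
    bad = sheet.loc[~sheet["sample_id"].astype(str).str.match(ID_PATTERN)]
    assert bad.empty, f"non-conforming IDs: {bad['sample_id'].tolist()}"
    assert sheet[["condition", "batch"]].notna().all().all(), "blank cells"
    return sheet
```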

This division keeps costs lean and lets trainees contribute meaningfully without spending their semester learning an entire toolchain from scratch.

What to Expect From Us

  • Fast kickoff once scope is set. We schedule starts promptly and keep you posted.
  • PhD‑level analysis you don’t have to babysit. We make choices transparent and document them.
  • Figure‑ready outputs and clean methods. Drop them into your manuscript with minimal edits.
  • Reproducible artifacts. Notebooks, parameter files, and pipeline manifests live with your results.
  • Plain-language updates. Short check‑ins, clear next steps, and no jargon walls.

When We’re the Obvious Choice

  • Data in hand; publication clock ticking. You need figures in weeks, not semesters. 
  • Major revision lands. A reviewer asks for extra analyses or different thresholds. We execute fast and clean. 
  • Grant support. You want a credible analysis plan and methods you can defend. We can provide a letter of support too.

Common Questions

  • “Isn’t a student cheaper?”
    On paper, yes. In practice, hidden costs pile up: learning time, your guidance time, reruns after critiques, and the risk of delays. Our rate is core‑range, but our experienced team and shorter queue often make the real cost—and the stress—lower.
  • “Will you take authorship?”
    Only if you want us to and if our intellectual contribution merits it. Otherwise, we provide clean acknowledgments and thorough methods so credit remains where you intend.
  • “What about compliance and reproducibility?”
    We assume de‑identified data by default and return a reproducible package: QC summaries, parameter files, methods text, and code that stands up to reviewer scrutiny. 

A Short Field Guide to Faster Projects

Before kickoff

  • Write one paragraph that states your central claim or question.
  • List your cohorts/conditions, sample counts, and any known pitfalls.
  • Clean your metadata: consistent sample names, tidy spreadsheets, no mystery columns.

During analysis

  • Respond quickly to interim results; momentum matters.
  • If the path forks, choose one clearly—or approve a bounded exploration.

Before submission

  • Have your trainee apply journal style to figures (fonts, colors, panel letters).
  • Paste our methods text and citations; adjust voice as needed.
  • Use our code and QC notes to pre‑empt reviewer concerns.

The Quiet Luxury: Time

The most expensive thing in your lab is not the hourly rate of an analyst. It’s time—time before a scoop, before a grant deadline, before a trainee defends, before the field moves on. The Bioinformatics CRO trades in time saved without rigor lost. That is the value we offer: publishable certainty, delivered quickly, at a price you already recognize.

If the next dataset is knocking, let’s make the waiting the shortest part of your story.

Ready to move? Send us a brief description of your data and desired figures, or tell us you’d like to work iteratively. We’ll match the approach to the moment—and get you from data to done.

The Bioinformatics CRO Podcast

Episode 69 with David Scieszka

David Scieszka, founder and CEO of Vertical Longevity Pharmaceuticals, tells us about VeLo’s pioneering senolytic vaccine approach to clearing senescent cells and his quest for longer, healthier lives for everyone.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

David Scieszka

David Scieszka is founder and CEO of Vertical Longevity Pharma, which is currently pioneering a senolytic vaccine approach to targeting atherosclerosis and aging.

VeLo Pharma has just opened up a community investment round with no investor accreditation required: https://netcapital.com/companies/vertical-longevity


Transcript of Episode 69: David Scieszka

Disclaimer: Transcripts may contain errors.

Coming Soon…


The Bioinformatics CRO Webinar Series

February 18, 2026: Phil Ewels – Reproducible Bioinformatics at Scale: nf-core + Nextflow

Phil Ewels

Phil Ewels is Product Manager for Open Source at Seqera. He holds a PhD in Molecular Biology from the University of Cambridge, UK. Phil joined Seqera in 2022, previously working at the National Genomics Infrastructure (NGI) at SciLifeLab in Stockholm, Sweden, where he became involved in the Nextflow project and co-founded the nf-core community. Phil’s career has spanned many disciplines from lab work and bioinformatics research in epigenetics, through to software development and community engagement. He is passionate about open-source software and has a soft spot for tools with a focus on user-friendliness. He is the author and maintainer of tools like MultiQC and SRA-Explorer, and helps lead the nf-core and Nextflow development teams.

In this live webinar, he gives an overview of Nextflow and an introduction to some of its new and exciting features for bioinformaticians looking to scale up their pipelines.

Add this event to your calendar: 11 am ET on February 18, 2026 (what time is this in my time zone?)

 

Google Calendar

Outlook Calendar

The Bioinformatics CRO Webinar Series – Reproducible Bioinformatics at Scale: nf-core + Nextflow

Join the live webinar here.

Recording available on February 19, 2026.


The Bioinformatics CRO Webinar Series

January 21, 2026: Jake Taylor-King

Jake Taylor-King is a co-founder of Relation Therapeutics.

Add this event to your calendar: 11 am ET on January 21, 2026 (what time is this in my time zone?)

 

Google Calendar

Outlook Calendar

The Bioinformatics CRO Webinar Series – Jake Taylor-King

Join the live webinar on our YouTube page.

Recording available on January 22, 2026.


The Bioinformatics CRO Webinar Series

November 11, 2025: Ania Wilczynska – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks  

Ania Wilczynska

Dr. Ania Wilczynska is Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines the company’s ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in bioinformatics and data science and over two decades in molecular, developmental, and cancer biology. Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, the MRC Toxicology Unit, and the CRUK Beatson Institute (now the CRUK Scotland Institute).

In this live webinar, she explores how modern bioinformatics must evolve from one-off analyses toward robust, interoperable platforms capable of integrating multi-study, multi-modal data at scale. Drawing on best-practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure, she discusses evolving omics pipelines into a discovery engine.

Transcript of The Bioinformatics CRO Webinar Series – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks

Disclaimer: Transcripts may contain errors.

Grant Belgard: At The Bioinformatics CRO, we help life science teams turn complex omics data into decision-ready insights, providing flexible, expert bioinformatics support from study design through analysis. As part of that mission, our webinar series features practitioner-focused talks with concrete takeaways you can put to work right away. Today’s session features Dr. Ania Wilczynska presenting “Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks”. Ania is the Senior Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in bioinformatics and data science and over two decades in molecular, developmental, and cancer biology. Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, the MRC Toxicology Unit, and the CRUK Beatson Institute. In this live webinar, she will explore how modern bioinformatics must evolve from one-off analyses toward robust, interoperable platforms capable of integrating multi-study, multimodal data at scale, drawing on best practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure. She will discuss evolving omics pipelines into a discovery engine. We’re live-streaming this on YouTube and LinkedIn. Please drop your questions in either chat or email them to [] and we’ll bring them into the discussion. Ania, over to you.

Ania Wilczynska: Thanks very much, Grant. It’s great to be here. Are we sharing? Oh, here we go. Okay. Right. So, welcome everybody. Today I’ll be talking to you about how teams can move beyond single bioinformatics data sets toward scalable, AI-ready bioinformatics frameworks. The talk is grounded in our experience building such systems at bit.bio, but it should be equally applicable across academia and industry. We’ll talk about principles that have guided our thinking over the years and highlight cultural changes in how we need to think about data and workflows. Everyone in bioinformatics faces scaling challenges, and this talk is about practical ways to solve them. As an overview: we’ll first state the problem of data growth, then discuss some principles of scalable design. We’ll talk about infrastructure, meaning automation and building a platform; integration of data, focusing a lot on metadata and using SOMA objects as an example; and then we’ll go into AI workflows, the human in the loop, how we integrate bioinformatics data in the new AI world, and how we move toward creating AI-native data sets.

Ania Wilczynska: So a lot of bioinformatics still operates on single studies and ad hoc analyses, and of course modern AI and ML require scale, structure, and reproducibility. So the question is really how we evolve bioinformatics platforms to meet those requirements. Data generation currently outpaces analysis, and every omics, imaging, and metadata stream grows exponentially. Classical ML, and now large language models, give us tools that turn data into insight, but only if our systems are consistent and reproducible. Thus, we can’t treat these data sets as isolated projects anymore. We need to think about platforms that integrate data, automate quality control (which is obviously the first and very important step), and let us use all of that information throughout the organization, be it in industry or academia. This is what will enhance discovery, precision, and scalability, and it’s the area where reproducibility, standardization, and machine intelligence intersect. So we need to treat data systems as long-lived infrastructure, not one-off workflows. By building structured, automated systems, first of all (this is very simple but very important to everybody), we reduce costs; we accelerate discovery; and we create data that, as a consequence, AI can actually learn from. This is building for the future. And if analyses can’t be repeated, they can’t be automated. So platform thinking means designing for reuse: every data set, every model, every workflow should be modular and interoperable, because reproducibility means scalability.

Ania Wilczynska: So now, moving on to what scalability can mean. The first principle that we use is this: simplicity scales. We build modular pipelines where the complexity is contained within a module, but the interfaces between modules are really clean and clear. As a consequence, complexity is localized to a module that can be easily interchanged, and new platforms can be plugged in quickly. Now, what’s really key to highlight (and we’ll go into more detail on this shortly) is that consistent language and naming and shared hierarchies are really important here, because this is how we get clarity both across data sets and across teams.

Ania Wilczynska: And finally, by designing for API and cloud integration, we can future-proof our systems so that new technologies can be onboarded very quickly. So the take-home from here is: design for evolution.

Ania Wilczynska: So, this is what it can look like in practice. End-to-end automation, in our case, connects experimental metadata, compute, analysis, storage, and dashboards in a continuous loop. It allows multiple analyses to run in parallel and is instantly reproducible when new data sets arrive. The outcome is that scientists, including bioinformaticians, spend more time interpreting results and less time babysitting pipelines. This infrastructure supports hypothesis generation and iteration, and it creates a complete cycle from data to new insights. Automation of course doesn’t replace scientists; it amplifies them. By automating the flow from data ingestion to reporting to API access, we iterate faster and keep quality consistent. The schematic on the right shows how we automate the full bioinformatics cycle from data to insight. We start with metadata capture in Benchling as well as our in-house-built app, ensuring that every sample and condition is traceable; this is really key. Next, pipelines run automatically using AWS Batch and Lambda, and these scale as new data volumes arrive. The results are stored in AWS S3, then linked through APIs to feed dashboards and AI tools; LLM agents can summarize the results and flag patterns and QC issues, and the scientists interact with the data through dashboards. The key idea is that the loop (data in, analysis, insights out) runs reproducibly at scale, and it frees our time to focus on interpretation rather than execution.
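As a rough sketch of the kind of glue such a loop relies on (the queue and job-definition names below are invented, and bit.bio’s actual implementation is not shown in the talk), an AWS Lambda handler might submit an AWS Batch job whenever a new object lands in S3:

```python
# Hypothetical S3-to-Batch trigger: when sequencing output lands in a
# bucket, submit a containerized pipeline job pointing at that object.
import re
import boto3

batch = boto3.client("batch")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        safe = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]  # Batch-safe job name
        batch.submit_job(
            jobName=f"pipeline-{safe}",
            jobQueue="seq-analysis-queue",        # assumed name
            jobDefinition="rnaseq-pipeline:3",    # assumed name
            containerOverrides={"environment": [
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
            ]},
        )
```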

Ania Wilczynska: Right. So, I’ve mentioned metadata quite a lot already, because in our view it really is the connective tissue behind all of the data sets. Capturing rich technical, biological, and provenance metadata early allows us to integrate across studies, perform batch correction, and reuse analyses efficiently. We employ FAIR principles: define once, reuse everywhere. This is really essential. Standardized metadata allows not just traceability; it also creates a central data store that ensures findability and reuse, and structured data feeds directly into AI and ML tools. Metadata can of course be vast. One thing we’ve found really important for automation of pipelines and tracking of samples has been a unified sample naming system. It sounds pretty trivial, and it takes a little engineering and a little cultural change to deploy, but it’s been extremely important for us. So again, just to reiterate: metadata turns messy data into machine-readable knowledge.

Ania Wilczynska: So, a slightly busy slide, but this is just an overview of what our platform looks like in terms of integrating research, again using metadata as a foundation, as well as automation. Our in-house bioinformatics platform connects this metadata with sample tracking, analysis pipelines, QC, and reporting through a unified database. It provides live links, data sanitation, and API access, enabling researchers to explore, analyze, and develop AI workflows directly. As a result, we have a self-service, interactive research environment where data flows seamlessly from experiment to model. The way we think about this is that we move from a data set to really a living research system. We use this platform internally to handle everything from single cell to genotyping to plasmid design, and we emphasize empowering users and bridging all these systems.

Ania Wilczynska: Okay. So now we start moving into making data AI-ready. Of course, data alone isn’t enough; it needs to be transformed into structured knowledge. We do that by explicitly extracting relationships from our experimental data and metadata as well as publications. A lot of our work relies on external open-source data, and we codify these relationships into a knowledge graph, and now we can start to infer new connections using AI. This structured understanding supports predictive biology, which is what we as a company do, and we hope it will give us the ability to anticipate the outcomes of experiments rather than just measure them.
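As a toy illustration of that idea (the entities and edges below are invented, not bit.bio’s graph), one can codify relationships with networkx and score unobserved pairs by shared neighbors to propose candidate links for human review:

```python
# Toy knowledge graph: known entity relationships as edges, then rank
# non-edges by Jaccard similarity of neighborhoods as candidate links.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("GENE_A", "pathway_X"), ("GENE_B", "pathway_X"),
    ("GENE_B", "phenotype_Y"), ("GENE_C", "phenotype_Y"),
    ("drug_1", "GENE_A"),
])

# jaccard_coefficient yields (u, v, score) for every non-edge; high
# scores suggest relationships worth reviewing.
candidates = sorted(nx.jaccard_coefficient(g), key=lambda t: t[2], reverse=True)
for u, v, score in candidates[:5]:
    print(f"{u} -- {v}: {score:.2f}")
```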

Ania Wilczynska: So AI now helps us ask better questions about the data we’re generating in house and the workflows we’re developing, using our own data as well as, like I said, external data from publications and open-source data sets. The loop consists of first defining a question, then aggregating lots of data, moving through AI agent synthesis and human review (and I cannot stress enough how important it is at this stage to have the human in the loop; we’ll come back to that in a second), and finally updating the knowledge graph and iterating again. The human element is very important both for ensuring quality and scientific rigor and, of course, for eliminating faulty or hallucinated information. So again, to emphasize: we’re not yet at the stage of replacing the scientist; we’re augmenting their ability to work. And of course this reduces the lag between experiments and insights. Every iteration enriches the knowledge base and therefore improves the outcomes for the next round.

Ania Wilczynska: So, the more we automate retrieval and summarization, the more time our scientists have to focus on creative reasoning and high-complexity tasks. This is really what the simple schematic on the right is showing: we are a lot less focused on the medium- and low-complexity tasks, thanks both to our automated modular pipelines and to plugging in agentic AI, which lets us be essentially better scientists, moving a little away from the engineering and into the more creative science. And this of course implies huge efficiency gains.

Ania Wilczynska: So, once again emphasizing the human in the loop. On the engineering side of things, we implement all these principles through a retrieval-augmented generation, or RAG, stack. It allows the AI models to query internal data safely, without retraining or exposing sensitive information. The architecture is modular (this is obviously a theme), and each of the agents, be it a specific bioinformatics agent, an imaging agent, or a developer agent, specializes in a particular domain. This is all coordinated by the human-in-the-loop scientist. It dramatically accelerates tasks: things that used to take hours can now be done in seconds. The structure is secure and modular, with replaceable components. So again, this is how we think about scalable AI in pragmatic production.

Ania Wilczynska: All right. So, circling back to something a little more formally bioinformatics-focused, I’m going to talk a little about how we scale to multi-data set and multimodal analysis. This will be mainly focused on single cell data sets, because this is an area where the concept of scale is quite obvious and quite a pain point for a lot of researchers. Single cell data sets scale to millions of cells, and integration of these data sets becomes a core challenge for many reasons, but a lot of it is because it’s a data engineering problem, not just a problem of statistics. Internally at bit.bio, we routinely handle data sets of millions of cells across multiple studies and modalities. To integrate these data sets successfully, we have to normalize technical variation while preserving biology, of course, and build models that scale efficiently. Data sets need to be aligned across batches, labs, and modalities, be that RNA or ATAC-seq, spatial data, protein data, imaging, what have you. New algorithms do scale to millions of cells and to multiple modalities; there are of course integration tools such as Seurat and Harmony, all available open source and scaling to unprecedented levels. But the volume of the data is still a huge challenge, even when it comes to just loading the data objects for analysis.

Ania Wilczynska: So the way we’re currently addressing this is with SOMA, which stands for “stack of matrices, annotated”. It’s a new open standard from the Chan Zuckerberg Initiative designed exactly for this challenge. It provides an array-based data format that supports multimodal data sets (again, RNA, ATAC-seq, and so on) at massive scale. It’s fully interoperable across R, Python, and C++, and it enables out-of-core access to data aggregations much larger than a single host’s main memory, supporting distributed computation over data sets. SOMA provides a building block for higher-level APIs that may embody domain-specific conventions or schemas around annotated 2D matrices, like cell atlases. For us, adopting SOMA has meant that we can store huge amounts of data in one place, slice it quickly any way we like, and share the data reproducibly, preparing it for AI training or, more simply, retrieval. What this looks like in practice, as one example: we’ve integrated a number of perturbation screens where both the sequencing technology and the conditions were very different, and about 2 million single cells in total have been integrated from our own data.

Ania Wilczynska: We’ve also been able to put this together with pseudobulks for each of the screens, as well as pseudobulks from the open-source, 44-million-cell CELLxGENE Census data set. This is all integrated in a unified SOMA data layer, which means that for all of these data we have a consistent schema for metadata. One important thing to highlight is that we’re using a unified gene annotation, which does require some data wrangling, as external data sets especially can use very different annotations. Now that everything is put together, we can easily query slices of the data in seconds across different data sets, instead of waiting minutes or sometimes even hours for our data to load. And this is really the first step toward truly AI-native data sets, because the data is structured, standardized, and ready for automated reasoning. And with that whistle-stop tour of our thinking, I’ll end with three principles to summarize: treat data as infrastructure, not a byproduct. Make metadata-first design non-negotiable. And realize that AI readiness is an outcome that emerges naturally from reproducibility and structure. The goal is not to automate everything, but to build systems that let us scale scientific discovery. So we need to work toward these scalable, interoperable systems instead of just thinking about individual scripts and creating silos.
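For readers who want to try the slicing pattern described above, here is a minimal sketch against the open-source tiledbsoma Python API. The URI, measurement name, and filter values are hypothetical; the point is that only the matching cells are materialized, not the full matrix.

```python
# Out-of-core slicing of a SOMA experiment: filter cells by metadata,
# then materialize just that slice as AnnData for downstream tools.
import tiledbsoma

URI = "s3://my-bucket/soma/perturbation_atlas"  # hypothetical location

with tiledbsoma.Experiment.open(URI) as exp:
    query = exp.axis_query(
        measurement_name="RNA",
        obs_query=tiledbsoma.AxisQuery(value_filter='cell_type == "hepatocyte"'),
    )
    adata = query.to_anndata(X_name="raw")  # layer name depends on the store
print(adata.shape)
```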

Ania Wilczynska: And thank you very much.

Grant Belgard: Thank you, Ania. As a reminder, live viewers can submit questions in the live chat on YouTube or LinkedIn. To kick us off: how do you ensure reproducibility across so many heterogeneous pipelines?

Ania Wilczynska: Yeah, so we have, I think, over 30 different sequencing technology pipelines as of my last count, so of course deliberate design is very important. For individual pipelines, I cannot emphasize the need for containerization enough: versioned pipelines, fixed parameters, and (just to sound like a broken record) very well structured and deliberate metadata capture.

Ania Wilczynska: Having a centralized way of submitting samples has also been really important for us in being able to version our pipelines very quickly. We have relatively rigorous testing approaches as well. But yeah: containerization, versioning, and metadata, first and foremost.
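One lightweight way to picture the “versioned pipelines, fixed parameters” point (an illustration, not bit.bio’s actual tooling) is a run manifest written next to every result set:

```python
# Sketch: record exactly what produced a result set, i.e., code commit,
# container image, and a hash of the fixed parameters.
import hashlib, json, subprocess, sys
from datetime import datetime, timezone

def write_manifest(params: dict, container: str, out: str = "run_manifest.json"):
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "git_commit": subprocess.check_output(       # assumes a git checkout
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "container_image": container,                # e.g. image@sha256 digest
        "parameters": params,
        "parameters_sha256": hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()).hexdigest(),
    }
    with open(out, "w") as fh:
        json.dump(manifest, fh, indent=2)

write_manifest({"aligner": "STAR", "min_count": 10}, "ghcr.io/org/rnaseq:1.4.2")
```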

Grant Belgard: How do you align or normalize data sets across modalities and platforms?

Ania Wilczynska: Yeah. So batch correction is obviously a big nightmare, and we’ve already spoken about tools like Harmony and Seurat; there are plenty of publications on the various pitfalls of single cell integration tools. But again, metadata is extremely important. For example, in our hands, integrating imaging with transcriptomics is a nontrivial problem, especially in the absence of spatial data: we do single cell, but we don’t do spatial transcriptomics, and we’ve spent a lot of time thinking about how to integrate ML approaches to image analysis with our transcriptomics data. Once again, unsurprisingly, metadata, excellent sample tracking, and integration of systems, including ELN systems like Benchling, have been really key to this. There’s also an element of deliberate experimental design, again thinking beyond a single experiment. I appreciate that academic labs will be less used to thinking about consistent experimental designs, because that’s not the core way of thinking there. However, I still think that asking the question “does it scale?” is extremely important in academia too. In my own academic career, I found that not asking “does it scale?” meant we often missed opportunities to integrate data sets, because everything about them was just incompatible.

Grant Belgard: Yep. What’s the advantage of SOMA over existing HDF5 or anndata approaches?

Ania Wilczynska: Yeah, it’s really the out-of-core scalability, the multimodal support, and the interoperability. From everything we’ve seen so far in actually implementing these huge SOMA objects, it’s the next step toward big data rather than single experiments. This may sound like a plug, but we’re really excited about how it’s enabling us to iterate through computational experiments, if you like, very quickly.

Grant Belgard: How is AI readiness different from just automation?

Ania Wilczynska: Well, AI readiness really means a different way of thinking about structure and about semantics. With automation, reproducibility is your main output and your main gain, whereas AI readiness means that the data is discoverable and learnable. So it requires cross-experiment but also cross-function thinking. That’s another thing people often disregard: how important it is to think about data and computational biology not just within the computational biology function, but to make sure that wet lab scientists, or indeed other functions in industry, understand how everything fits together in a data stream.

Grant Belgard: So, related to that, what cultural or organizational changes are required for this to be successful?

Ania Wilczynska: Primarily cross-functional collaboration, and I think a mix of evangelizing and education can really go a long way. We work on embedding AI workflows into bioinformatics, but we also work with other teams, for example the commercial team in our company, to create AI workflows. That of course helps the business, understandably, but it also means there is a lot more understanding across the company of why such workflows are important, why data is important, and why data structures are important. Again, this does require a little outreach, a little evangelizing. Structured metadata and deliberate experimental design can at first seem like a bit of an overhead in the lab, because “oh, it’s another thing I need to capture.”

Grant Belgard: We’ve never seen that before, have we?

Ania Wilczynska: Indeed. But it’s about showing people the value of that: not going, “Oh, ping, my whizzy machine gave you a new result,” but rather, “Well, because of that overhead, we’ve now been able to bring three data sets together, one that we did three years ago and one that we did now, and now we have a better outcome.” Again, this is all a cultural shift, but I’ve found that a little goes a long way in that respect. So yeah: a bit of showing by example, a bit of evangelizing, and also treating colleagues as partners.

Grant Belgard: What’s the next step beyond AI native data sets?

Ania Wilczynska: Well, of course, in the utopian brave new world, it’s AI scientists. I don’t think we’re quite there yet, or at least that’s what we’re all telling ourselves, or we’ll all be out of jobs. But really it’s closed feedback loops: data, models, experiments, self-improving hypotheses. And I think that’s where things are heading very quickly.

Ania Wilczynska: But there’s also, well, I guess everyone’s talking about it now, a lot of hype about what AI can do. And I think a lot of what we’re all in a way promising ourselves is really bottlenecked by data.

Grant Belgard: And so I’m hearing it’s really essential to train people to think in terms of cycle time, right?

Ania Wilczynska: Yeah.

Grant Belgard: So we have an emailed question: how can these platforms be translated into the clinic, or are there regulatory requirements that need different setups of the platforms?

Ania Wilczynska: Yeah. So everything I’ve talked about, just to be clear, is in an R&D, preclinical setup. The important thing to remember about regulatory requirements is that everything is extremely slow, for good reason, and there is very little existing regulation around AI tools, at least as far as I’m aware. I think that will take quite some time, which brings me back to this idea that, even though we’re all blown away by these tools every day, because the regulatory principles don’t really exist yet, experimentation is still necessary. So thinking about the fact that it’s not just the tool, that the data has to come before it and will for some time come after it, is very important. On the other hand, in terms of building automated modular pipelines, a lot of the cloud platforms do provide certain standards, so we’re sort of working toward it, but I think we shouldn’t expect the really novel solutions to be adopted all that quickly.

Grant Belgard: Well, Ania, I think we’re at time, but thank you so much for joining us. The series will resume January 21st at 11:00 a.m. Eastern with Jake Taylor-King from Relation Therapeutics, followed by Phil Ewels from Seqera on February 18th at 11:00 a.m. Eastern. Mark your calendars, and thank you everyone for joining us today.

The Bioinformatics CRO Podcast

Episode 68 with Caspar Barnes

Caspar Barnes, founder and CEO of AminoChain, tells us about his mission to make biospecimen sourcing transparent, ethical, and efficient.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Caspar Barnes

Caspar Barnes is founder and CEO of AminoChain, a decentralized biobanking protocol with a mission to make biospecimen sourcing more transparent, ethical, and efficient.

Transcript of Episode 68: Caspar Barnes

Disclaimer: Transcripts may contain errors.

Coming Soon…


The Bioinformatics CRO Webinar Series

October 22, 2025: Nick Wisniewski – AI-First Drug Discovery Pipelines

Nick Wisniewski

Dr. Nicholas Wisniewski is an expert on AI in drug development and regenerative medicine.

In this live webinar, he discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, thus shortening timelines from concept to clinic.

Transcript of The Bioinformatics CRO Webinar Series: AI-First Drug Discovery Pipelines

Disclaimer: Transcripts may contain errors.

Grant Belgard: Welcome to the inaugural seminar in The Bioinformatics CRO Webinar Series. At The Bioinformatics CRO, we help life science teams turn complex data into clear, decision-ready insights, providing flexible, expert bioinformatics support from study design through analysis and reporting.

With that mission in mind, we’re launching The Bioinformatics CRO Webinar Series, a practical forum for sharing tools, workflows, and real world lessons from the front lines of modern bioinformatics. Let’s kick off our first session and welcome Nick Wisniewski.

Nick is an expert in applying artificial intelligence to the life sciences. He earned his PhD in biophysics from UCLA, where he later served as a faculty member developing machine learning methods for imaging and multiomics data. In 2016, he joined the founding team at Verge Genomics, pioneers in AI-driven drug discovery, and has since helped launch several more biotech startups spanning diagnostics, a smart pill, and a cell therapy. More recently, he served as vice president of bioinformatics and data science at Stemson Therapeutics in San Diego. In this live webinar, Nick discusses how fully AI-driven platforms are moving beyond target identification to generate, validate, and optimize novel compounds, thus shortening timelines from concept to clinic. Feel free to put your questions in the chat, as we’ll have a Q&A afterwards.

And Nick, over to you.

Nick Wisniewski: Thanks a lot, Grant. Really excited to be on this inaugural webcast and looking forward to the rest of the series. As always, a big fan of The Bioinformatics CRO, so I’m very happy to support this and look forward to more in the future. The talk I’m going to deliver today, as Grant mentioned, is on AI-first drug discovery pipelines. I think there’s a lot of movement happening in the space, clearly a lot of excitement and discussion among investors and techbio people, and so many new algorithms and methods coming out every day that it’s hard to keep up.

So the purpose of this talk is to give an overview of all of that development, as well as an insight into how these machines are learning and where this is all going in the future.

So to start with, I think it’s good to point out that the current state of drug discovery is one in which fewer than one in 10 drugs succeed. The cost of developing a drug can be up to $2 billion, given the failures that occur and the portfolio approach that’s taken, and that money gets spent over a period of up to a decade.

So the impact that AI can have on the drug discovery process can happen in multiple ways. One is an increase in the accuracy, or the success rate, of the drugs, and the second is in the cycle time: if you can test more drugs faster, you can overcome some of this challenge. To understand where we’re at and what the impact of AI is, I think it’s important to start with a review of the traditional drug discovery pipeline as we know it.

It’s largely discussed as a waterfall-type process, with a left-to-right movement through a number of phases: target identification and validation, compound screening to identify hits, then hit-to-lead and lead optimization, and then all sorts of preclinical testing to understand toxicity and other properties of that nature.

And then it goes into the clinical development stage, where you have the phase one, two, and three trials. And the loss rate as you go through is steep: starting very early, when you’re still validating a target, only about 3% of molecules are going to be successful in the clinic. So that’s quite a low rate.

If AI can improve that to 5%, it would make a huge difference. We don’t need to get to 100%, although that would be even better. But I think the other important part of this pipeline is that embedded into it are a number of design-make-test-analyze (DMTA) cycles. We often think of these in terms of synthesizing molecules, but they form the standard feedback loop in the molecular optimization process. It mainly happens between hit discovery and lead optimization, with each iteration lasting maybe weeks to months, and to get to an optimized outcome you might need three to 10 iterations. So that can really be a bottleneck that AI can address in the drug discovery process.

There are a number of traditional computational tools in use, and they have been in use for quite some time.

So in the initial stage of target identification, we ask: which protein, which gene, what is the target? A lot of the early tools you’ll recognize, things like Ingenuity Pathway Analysis and WGCNA, matured mainly in the mid-2000s and have been used with a fair degree of success since then. Once you get more into the drug development stage, you already have a target and now you’re trying to design molecules to hit that target.

This is where the rest of the tools come in.

So things like virtual screening and docking have also been around for quite some time. This addresses which molecules are going to bind to that target: tools like AutoDock and Schrodinger’s Glide emerged starting in the 1980s, growing more popular toward the late 90s. And then another question is which molecules are bioactive: maybe a molecule can bind, but you can’t get any sort of activity out of it.

This is where quantitative structure-activity relationship (QSAR) models come in, which go back to the 1960s. They’re largely just regression models, updated over time to integrate machine learning methods, and they remain a mainstay of the process. Alongside those, there’s pharmacophore modeling and shape matching, which try to capture the geometry of the molecule required for bioactivity; 3D shape matching and distance metrics between molecules are quite useful and allow us to filter candidates, going back to the 90s again. More recently, there’s been a lot of movement in molecular dynamics and free energy calculations. This is more physics-based, trying to understand how energetically or thermodynamically favorable the ligand-target complex is; these simulation techniques for understanding the stability of those bindings matured maybe 15 years ago or so. And then, quite importantly, once you can simulate a lot of this and think you’ve identified a molecule, what matters is the properties of that molecule once it enters a body: will it be absorbed and safe, will it cross the blood-brain barrier? These are known as ADMET properties: absorption, distribution, metabolism, excretion, and toxicity.
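As a rough, modern rendering of the QSAR idea (the molecules and activities below are invented, and a production model would need far more data and validation), one can featurize molecules as Morgan fingerprints with RDKit and fit a standard regressor:

```python
# Toy QSAR: SMILES -> Morgan fingerprint bits -> random-forest regression
# of a made-up activity value (e.g., a stand-in for pIC50).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
activity = np.array([5.1, 4.8, 6.2, 7.0])  # invented labels

def featurize(smi: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smi)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

X = np.vstack([featurize(s) for s in smiles])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, activity)
print(model.predict(featurize("CCOC(=O)c1ccccc1").reshape(1, -1)))
```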

Now, I’d say the first stage of the development of AI has been to start modularly replacing each of those phases with, let’s say, deep learning components. Maybe the one we haven’t talked about so far is imaging, which is very useful in the target ID step, where you can do more phenotypic-level understanding and protein localization within cells; I think those models have been very powerful and very influential. In terms of target ID, we’ve got the Geneformer class of inference algorithms out now, and in terms of protein structure,

AlphaFold has broken that field wide open. And then the rest often come with straightforward replacements. So DiffDock is now a big replacement in molecular docking. Maybe the newer ones have to do with de novo molecule generation, like MegaMolBART, which is now integrated, I think, in Nvidia’s BioNeMo. We have deep learning tox predictions, and retrosynthesis planning, which helps you find the easiest path to synthesize a molecule. But maybe some of the more exciting ways to think about things have to do with experimental planning, and I’m going to talk a little more about that in a few slides.

But things that aren’t represented here are, I would say, the latest developments. One came out just yesterday: Claude for Life Sciences. And this is, I think, very exciting. If you’re a bioinformatician or a programmer, you’ve likely been using Claude for some time for programming and tasks of that nature.

So extending that now into integrations with common lab tools like Benchling, and partnering with institutes like the Broad Institute and companies like 10x Genomics to help facilitate access to data in those platforms and algorithms in those pipelines, as well as PubMed to facilitate searching the literature and getting back good intelligence on the targets you find and the drugs you design. That’s going to be highly influential. It promises right now to be able to analyze single cell RNA sequencing data, which is going to be great for democratizing access to that data source. And really interestingly, it promises to help prepare regulatory documents, which may be one of the biggest real-world bottlenecks in putting a pipeline together and accelerating it. This stuff takes a lot of time.

Similar developments are coming out of partnerships with Nvidia, more and more every day, again with Benchling maybe leading the way. Benchling launched Benchling AI recently, and as part of that, Nvidia is integrating its NIM microservices into Benchling. This offers access to optimized GPU implementations of things like OpenFold2 for protein structure prediction, and I think other models, like the ADMET models, are coming shortly. So that’s also very exciting.

But to return to some of the other bottlenecks being addressed by AI, let’s go back to the DMTA cycles. Going from design to make is the first half of the cycle: you present a chemist with a bunch of designs, and they’re tasked with making those molecules, and it may not be immediately obvious how to make them. You may have some information on how it’s done, but along the way, what you learn is the feasibility of synthesis and any constraints that might exist for future design choices.

So, having learned that, you can already take the first step toward thinking: what would I do with a machine learning model like that retrosynthesis model? Well, you can update it based on your learnings from that step and retrain your generative design models with those new constraints. Likewise, when you go to test these molecules in a series of biological assays and ADMET profiling, you learn a lot about potency, selectivity, toxicity, off-target effects: everything you can measure about these drugs that you may have had predictions for from the QSAR, docking, pharmacophore, and ADMET models. But now you have new data, and you can go back and update all of those models in real time to improve the predictions on the next iteration of the cycle.
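A toy rendering of that update-as-you-measure loop (synthetic numbers, not any particular platform) might use scikit-learn’s partial_fit to refine a surrogate model each DMTA cycle instead of refitting from scratch:

```python
# Each "cycle" contributes fresh assay measurements; the model is
# updated incrementally so the next round of designs sees the new data.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = rng.random(64)                 # hidden structure-activity relation
model = SGDRegressor(random_state=0)

for cycle in range(3):
    X_new = rng.random((16, 64))        # stand-in molecular features
    y_new = X_new @ true_w              # stand-in measured potency
    model.partial_fit(X_new, y_new)     # incremental update, no full refit
    print(f"cycle {cycle}: coefficient norm = {np.linalg.norm(model.coef_):.2f}")
```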

So you start to see how a more continuous learning framework can arise from the cycles that already exist in the drug discovery pipeline. And this starts to hint at the next transition coming in the field. We currently think of AI as more of a tool, which we’ll call augmented AI, where you have modular assistance for each of the different steps of the pipeline, and it’s still controlled by humans, informing and empowering them.

The next step that things are moving toward is an AI-first regime, where you have some sort of orchestrated, autonomous learning cycle. Here the AI acts as the central control architecture, orchestrating not just the DMTA loop; you can extend that into different feedback loops and start thinking about closed-loop continuous learning cycles, combined with automation: wet lab automation, bioinformatics automation, and everything you need to be self-contained. This is, I think, the crux of the idea we hear a lot about in terms of lab-in-the-loop.

This is a concept being popularized across a number of institutions. At Genentech, Aviv Regev has put together a team exploring a lot of lab-in-the-loop operations, and Nvidia is strongly supporting lab-in-the-loop architectures. The main goal is a continuous-learning, closed-loop architecture where the AI proposes novel molecules, synthesizes and tests them automatically, gets the assay results, and feeds them back to update the model for subsequent iterations.

I think as we’re moving into this regime, it’s important to understand some of the key machine learning paradigms. I’m going to talk a bit more about those in the next slides, but I’ll introduce them here: you may have heard of things like active learning, Bayesian optimization, reinforcement learning, and so forth. And then the third component, which I mentioned earlier, is automation.

So right now there’s a range; you don’t necessarily need automation in order to build these loops. You can have a human in the loop doing the experiments.

But of course the hope is that with automated experimentation you reduce some of the variability in the experiments, increase reproducibility, and increase the speed at which you can experiment. You can run all day and night and highly parallelize things, so it’s going to scale a lot better.

So, thinking about how we make this transition, I think organizational principles are one of the big bottlenecks. There’s a big issue we all face with adopting new technology: understanding how it works and deciding to what extent we can trust the decisions it’s making.

As we work as programmers with AI tools like Claude, Cursor, and Codex, we see and get immediate feedback on how well they solve problems: how many iterations and corrections we need to keep them on track and doing what they need to do. And we gain some sense of how much we can trust the decision-making that’s happening.

It’s a little harder in drug discovery, primarily because the cycles are so long. Benchmarking these tools is very difficult if it takes five years to get a molecule created and through trials before you know that it works. It’s a very long feedback cycle, and it can take quite a while to develop that kind of trust.

Moreover, we’re handing more and more decision-making over to the AI, where traditionally humans, maybe director-level people, are making these decisions, and that introduces accountability questions and other organizational problems. So I think one of the most important things we can do to facilitate trust in the decision-making is to understand, at a basic level, how these decisions are being made. If we’re going to let AI determine what experiments to do next and where to allocate resources, it probably helps to understand a little about how it’s making those decisions.

So again, I’ll briefly introduce active learning, Bayesian optimization, and reinforcement learning as the three main techniques you see in these sorts of systems right now. Active learning is one in which the AI understands a bit about what it doesn’t know, or what it’s most uncertain about, and then targets experiments to learn the most it can from the next set of experiments. This is a fairly straightforward concept; I think scientists think in much the same way.

Bayesian optimization is maybe more product-focused. You’re trying to optimize some property of a molecule, and you have to navigate a search space, doing some hill climbing on a landscape that you’re inferring while you’re climbing it. This is a method used to find the most potent drug out of a large set of molecules without having to test them all.
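Here is a compact, synthetic-data sketch of that surrogate-plus-acquisition loop (illustrative, not any vendor’s implementation): fit a Gaussian process to the molecules measured so far, then rank untested candidates by expected improvement.

```python
# Bayesian optimization step: GP surrogate + expected improvement (EI)
# picks the next molecule to assay without testing the whole library.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
X_all = rng.random((200, 8))                    # stand-in descriptors
tested = rng.choice(200, size=20, replace=False)
y = X_all[tested].sum(axis=1)                   # stand-in measured potency

gp = GaussianProcessRegressor(normalize_y=True).fit(X_all[tested], y)
untested = np.setdiff1d(np.arange(200), tested)
mu, sigma = gp.predict(X_all[untested], return_std=True)

best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
print("next molecule to assay:", untested[np.argmax(ei)])
```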

And then reinforcement learning, something we read a lot about these days, particularly with LLMs, is a method of learning that finds trajectories through that space. It’s trying to optimize a sequence of decisions, what’s called a policy, in order to maximize long-term gains. That’s very computationally expensive and maybe not as sample-efficient, but it has some strengths over the previous two, particularly Bayesian optimization, in terms of parallelization and the ability to explore the space more efficiently. It also has some drawbacks, such as difficulty learning in sparse spaces.

So I’ll just show a graphic example of active learning; I’ve got examples for the other two, but for the sake of time we’ll skip over them. Imagine you’re trying to do a classification task where you’ve got red team on the right as class one and blue team on the left as class zero (whatever you want to call them: toxic, non-toxic), and then you’ve got a bunch of unmeasured molecules.

So each dot here is a molecule. The ones in white are ones we don’t have any data on yet. The ones in orange are also ones we don’t have data on yet, but in fitting the boundary between red and blue, we find a bunch of unmeasured molecules along that boundary, and we color them orange to point out that these are maybe the most uncertain in the whole model.

From the model’s point of view, learning what these are would likely have the most impact on where that decision boundary sits. So you go forth and test those, with the idea that you’re choosing the next experiment in such a way that it has maximal impact on your prior beliefs about where the boundary should be. It maximizes your information gain.
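That “orange dots” picture translates directly to code. The sketch below (synthetic data, invented toxic/non-toxic labels) trains a classifier on the measured molecules and queues the unmeasured ones whose predicted probability sits closest to the 0.5 boundary:

```python
# Uncertainty sampling: assay next the pool molecules the model is
# least sure about, i.e., predicted probability nearest 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_labeled = rng.random((40, 16))
y_labeled = (X_labeled[:, 0] > 0.5).astype(int)  # stand-in toxicity labels
X_pool = rng.random((500, 16))                   # unmeasured molecules

clf = LogisticRegression().fit(X_labeled, y_labeled)
uncertainty = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)
next_batch = np.argsort(uncertainty)[:10]        # most boundary-adjacent
print("assay these next:", next_batch)
```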

And so I think that helps a little in understanding how these lab-in-the-loop systems work. As a result, I think the waterfall topology of the standard drug development pipeline is going to start to change. It’s going to become possible to flatten and collapse different stages into each other, where they all share a group of objective functions that you can optimize simultaneously, which can dramatically shorten certain stages of the pipeline.
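
As a toy illustration of what sharing objective functions across stages could mean, here is a sketch that scores every candidate on invented discovery, safety, and manufacturing objectives at once and ranks by a weighted combination; the objectives, values, and weights are all made up.

```python
# Toy sketch of optimizing several stage objectives simultaneously instead of
# sequentially: score candidates on all objectives at once, then rank.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
potency    = rng.random(n)   # discovery-stage objective (higher is better)
tox_margin = rng.random(n)   # safety-stage objective (higher is better)
synth_ease = rng.random(n)   # manufacturing objective (higher is better)

weights = np.array([0.5, 0.3, 0.2])   # hypothetical stage weights
scores = np.column_stack([potency, tox_margin, synth_ease]) @ weights
print("top candidates by combined score:", np.argsort(scores)[-5:][::-1])
```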

At the same time, we can also merge and parallelize certain loops. You can run all sorts of different design-make-test-analyze (DMTA) loops at the same time, integrating all of that feedback into what’s called a continuous feedback mesh, where you have a bunch of models, all conditionally dependent on each other, all being updated whenever new data comes in, and all concurrently influencing each other’s predictions for the next cycle. And one of the most important changes in this process is the shift of the human role into that of a supervisor.

So as humans shift from decision makers and gatekeepers to supervisors, they’re going to start overseeing these autonomous loops, monitoring them to make sure things are on track and intervening only strategically, while the AI handles the rest of the routine iteration and optimization.

We’re starting to see real-world examples and commercial platforms. If you want to build a startup and design one of these systems yourself, of course, you’re free to do so. Many of those tools are open source, but putting them together can require a lot of effort and a lot of engineering. There are commercial platforms that can be licensed, and there are places already using these that we can use to benchmark success rates. I think Insilico Medicine is maybe at the forefront of this. They have a fully automated robotic lab and 31 active programs, and they claim to achieve concept to phase one in under two and a half years. And there’s a platform, Pharma.ai, that you can license from them, or you can form partnerships. Similarly with Iktos, another big player in the field; they have a similar licensable software-as-a-service offering. Recursion and Exscientia are also big players in the field that everybody’s watching to understand whether progress in speeding things up is actually being made.

Isomorphic Labs, of course, is deep in this, and there are all of the commercial platforms and tools, from Schrödinger to the Nvidia and AWS tools and the Hugging Face models. I’ll even note new lab automation as a service: Strateos, for example, offers cloud labs where you can control the automation.

So to conclude with a future outlook, I think we’re moving towards a period where autonomous AI scientists are going to start leading a lot of the process. They’re going to be able to design, synthesize, and test molecules in these closed-loop cycles. This is going to improve data quality and integration with every step; the data sets are going to become less biased and more accessible. And I think again the key component here is human trust and collaboration, which is definitely going to take some time to develop. That may be the most interesting part of the path forward that we’re going to experience in the coming years.

So with that, I’ll conclude the talk. Again, bringing it back to Grant and The Bioinformatics CRO: I’m happy to have kicked off this inaugural webinar. There are going to be three more following me over the coming months, and I’ll turn it back over to Grant to introduce those speakers and tell us what’s coming.

Grant Belgard: Nick, thank you so much. Yeah, so our next webinar will be broadcast at the same time, 11 am Eastern, on Tuesday, November 11th. We’ll be joined by Ania Wilczynska, Senior Director of Bioinformatics and AI at bit.bio. So hope to see all of you there. But Nick, questions. Everyone watching, you can put questions in the chat, and we can kick off with: what do you think of the new foundation models for single-cell analysis? Are they having an impact on drug discovery?

Nick Wisniewski: Yeah, they’re very interesting. I was very excited when I saw Geneformer and scGPT come out, and I think there’s been a lot of adoption of these at new startups. This is a big part of the new phase of AI target discovery.

So I think the things they bring to the table that are fantastic are moving things into the transfer learning paradigm, where you can bring in a whole bunch of knowledge and do zero-shot predictions on your data without having to train on or learn from external data sets yourself; that’s already been done for you.
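
Here is a generic sketch of that pattern, with a made-up `pretrained_embed` function standing in for a frozen foundation-model encoder; real models like Geneformer and scGPT expose their own APIs, which this does not attempt to reproduce.

```python
# Transfer-learning pattern sketch: use embeddings from a frozen, pretrained
# encoder and train only a small downstream head on your own labels.
# `pretrained_embed` is a made-up stand-in, not any real model's API.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pretrained_embed(counts: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen foundation-model encoder (cells -> 64-d)."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(counts.shape[1], 64))   # pretend frozen, pretrained weights
    return np.log1p(counts) @ W

rng = np.random.default_rng(1)
X_counts = rng.poisson(2.0, size=(500, 100)).astype(float)   # fake cells x genes
y = (X_counts[:, 0] > X_counts[:, 1]).astype(int)            # fake cell-state labels

Z = pretrained_embed(X_counts)           # representation comes "for free"
head = LogisticRegression(max_iter=1000).fit(Z[:400], y[:400])
print("held-out accuracy:", round(head.score(Z[400:], y[400:]), 2))
```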

It also gives you a new representation to learn from, a form of representation learning. But I think the benchmarking of these things hasn’t shown much more than maybe moderate increases in performance in cases like predicting drug perturbations. The benchmarks are still showing no clear improvement over linear methods, which is a bit surprising, and I think it’s important to look at that and wonder whether that’s telling us something about the data, about the algorithms, and about the biology. I think there’s something to learn there.

My guess is we’re probably coarse-graining somewhere, whether that’s in the molecule set we’re using. There have been some recent studies suggesting that maybe you need to know the phosphorylation state of every intermediate molecule, what that chemical mess actually happening in the cell looks like, and that by just measuring broad activity, you may be coarse-graining too much. The other possibility is that we’re coarse-graining in time, and there are dynamics that need to be learned that aren’t being captured by our snapshots, which are very often focused on steady state.

Grant Belgard: What do you think the impact of Claude for Life Sciences will be on drug discovery? Speaking of recent developments.

Nick Wisniewski: Yeah. You know, I spent a lot of time looking at it yesterday after I saw the launch. I don’t know if you’ve had much time to explore it.

Grant Belgard: Few minutes.

Nick Wisniewski: Yeah. I mean, the integrations it’s made, I think, are fantastic.

You know, we tend to think, particularly in bioinformatics, in terms of the scientific questions, these foundation models and things like that. But when you actually work in the pipeline and in the lab, you notice the overhead of connecting different systems, particularly ELNs like Benchling, and the inconsistent metadata you might find across experiments. Access to data is a real bottleneck for bioinformaticians trying to get the data, synthesize it, harmonize it, and move forward.

So I think it has the capability to have a huge impact on the way bioinformaticians work, as well as biologists, because it gives them access to a lot of this data. There are probably other questions having to do with reproducibility that come out of these tools. Every time in the past that we’ve seen easier access to tools, whether it was button-click testing of p-values or being able to throw models at everything, you saw an increase in p-hacking and a loss of reproducibility. So it’s going to be very interesting to see the impact on actual science that Claude has.

My impression of that largely comes from experience: working with Claude when you’re programming, you get a lot of “You are absolutely right!” And I can say that most of the time I’m not absolutely right. In my 20-year career working in bioinformatics and biology, I don’t think I’ve ever said those words aloud in the practice of doing biology. A lot of biology comes from pushing back, creating counter-scenarios, and debunking ideas, rather than from narrative-driven science. So we’re going to see, I think, how Claude navigates that space and whether it’s a positive contribution in that sense.

Grant Belgard: Yeah, that’s a really good point. You can certainly imagine someone running the same query a few times until they get their favorite gene showing up in a list and running with that, right? So we have a question from the chat: what do you think the timelines are on the transition to AI scientists?

Nick Wisniewski: This is a great question. Of course, predictions are always hard, especially when they’re about the future. And these timelines are maybe the most contentious part of the AI field, because there’s so much hype around them and around the fundraising that goes into these things. There are a number of different influences on the timelines that go beyond just the development of the tools.
The adoption of these tools is going to be slower than they can be developed, particularly at large institutions, which use most of the resources in the field. For good reason, big pharma is going to be slower to adopt these systems than the startups.

So I think we’re going to see more development happening in the startup space than among the big players, probably with a continued pattern of acquisitions whenever somebody’s successful. Given the current funding environment for startups, factoring that in, there may be a delay, and that delay, particularly in the States, may be overshadowed elsewhere. I read a lot these days about how far ahead in automation places like China are in terms of biotech research. So I wouldn’t be surprised if we start seeing the first successful closed-loop continuous-learning labs emerging from somewhere like that rather than from San Francisco.

But in terms of guessing an overall timeline given those factors, and given the development still needed in automation, robotics, and their manufacturing so that we can get them into labs cheaply here, I think we’re still looking at 5 to 10 years before we get to these systems, even though the capabilities to do this may come a lot sooner.

Grant Belgard: And what’s the one misconception about AI-first pipelines you’d like to correct before we wrap?

Nick Wisniewski: Yeah, that’s a great question. I think, again, it’s the idea that they may be a magic bullet. There’s a lot of hope that they’re going to improve reproducibility, reduce variation, and accelerate the speed of research.

But given that we’re seeing only modest improvements in performance over linear models and the like, success still depends on having the right set of molecules and knowing whether you need dynamic real-time data as opposed to the snapshot data we’ve been using. If we institute it right now, with the same tools we’ve been using, it may fall flat in terms of delivering on its promises. I think we also need to incorporate the ability to question whether the problem being posed is well posed. From a scientist’s point of view, that’s often the ground floor when you approach a problem: asking, is the problem well posed?

And until we build that base-level intuition into these systems, it’s easy to start optimizing, or over-optimizing, something that shouldn’t be optimized in the first place. I think that’s a really good cautionary note.

Grant Belgard: Well, Nick, thank you so much for joining us and all our viewers, thank you for joining. We’ll see you November 11th. Bye-bye.