
The Bioinformatics CRO Webinar Series

November 11, 2025: Ania Wilczynska – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks  

Ania Wilczynska

Dr. Ania Wilczynska is Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in Bioinformatics and Data Science and over two decades in molecular, developmental and cancer biology.  Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, MRC Toxicology Unit and the CRUK Beatson Institute (now CRUK Scotland Institute).

In this live webinar, she explores how modern bioinformatics must evolve from one-off analyses toward robust, interoperable platforms capable of integrating multi-study, multimodal data at scale. Drawing on best practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure, she discusses evolving omics pipelines into a discovery engine.

Transcript of The Bioinformatics CRO Webinar Series – Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks

Disclaimer: Transcripts may contain errors.

Grant Belgard: At the Bioinformatics CRO we help life science teams turn complex omics data into decision-ready insights, providing flexible, expert bioinformatics support from study design through analysis. As part of that mission, our webinar series features practitioner-focused talks with concrete takeaways you can put to work right away. Today’s session features Dr. Ania Wilczynska presenting “Thinking beyond the single dataset: pragmatic solutions for scalable, AI-ready bioinformatics frameworks”. Ania is the Senior Director of Bioinformatics and AI at bit.bio. Her team is focused on understanding the gene regulatory code that defines their ioCell products and employing cutting-edge AI and machine learning solutions to data analysis. She has over a decade of experience in bioinformatics and data science and over two decades in molecular, developmental, and cancer biology. Prior to joining bit.bio in 2020, she held positions at the University of Cambridge, MRC Toxicology Unit, and the CRUK Beatson Institute. In this live webinar, she will explore how modern bioinformatics must evolve from one-off analyses towards robust, interoperable platforms capable of integrating multi-study, multimodal data at scale, drawing on best practices in data architecture, metadata design, workflow automation, and AI-ready infrastructure. She will discuss evolving omics pipelines into a discovery engine. We’re live streaming this on YouTube and LinkedIn. Please drop your questions in either chat or email them to [] and we’ll bring them into the discussion. Ania, over to you.

Ania Wilczynska: Thanks very much, Grant. It’s great to be here. Are we sharing? Oh, here we go. Okay. Right. So, welcome everybody. Today I’ll be talking to you about how teams can move beyond single bioinformatics data sets toward scalable, AI-ready bioinformatics frameworks. The talk is grounded in our experience building such systems at bit.bio, but it really should be equally applicable across academia and industry. We’ll talk about principles that have guided our thinking over the years and highlight cultural changes in how we need to think about data and workflows. Everyone in bioinformatics faces scaling challenges, and this is going to be about practical ways to solve them. As an overview of the talk, we’ll first start by stating the problem of data growth. We’ll discuss some principles of scalable design. We’ll talk about infrastructure, so automation and building a platform; integration of data, focusing a lot on metadata and using SOMA objects as an example of data integration; and then we’ll go into AI workflows, human-in-the-loop, how we integrate bioinformatics data in the new AI world, and really how we move toward creating AI-native data sets.

Ania Wilczynska: So a lot of bioinformatics still operates on single studies and ad hoc analyses, and of course modern AI and ML require scale, structure, and reproducibility. So the question is really: how do we evolve bioinformatics platforms to address this? Data generation currently outpaces analysis, and every omic, imaging, and metadata stream grows exponentially. Classical ML and now large language models give us tools that turn data into insight, but only if our systems are consistent and reproducible. Thus, we can’t treat these data sets as isolated projects anymore. We need to think about platforms that integrate data, automate quality control, which is obviously the first and a very important step, and enable us to use all of the information throughout our organization, be it, again, industry or academia. This is what will enhance discovery, precision, and scalability, and this is the area where reproducibility, standardization, and machine intelligence intersect. So we need to treat data systems as long-lived infrastructure, not one-off workflows. By building structured and automated systems, first of all, and this is very simple but very important to everybody, we reduce costs, we accelerate discovery, and we create data that, as a consequence, AI can actually learn from. This is building into the future. And if analyses can’t be repeated, they can’t be automated. So platform thinking means designing for reuse. Every data set, every model, every workflow should be modular and interoperable, because reproducibility means scalability.

Ania Wilczynska: So now, moving on to what scalability can mean. The first principle that we use is that simplicity scales. We build modular pipelines where the complexity is contained within the module, but the interfaces between the modules are really clean and clear, which as a consequence means that the complexity is localized to a module that can be easily interchanged, and new platforms can be plugged in easily. Now, what’s really key to highlight, and we’ll be going into more detail on this shortly, is that consistent language and naming and shared hierarchies are really important here, because this is how we get clarity both across data sets and across teams.
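A minimal sketch of that modular pattern, with entirely hypothetical module and field names (the real pipeline interfaces at bit.bio aren’t public): each step’s complexity stays inside its module, and the only contract between steps is one small, stable interface, so modules can be swapped without touching the rest of the pipeline.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SampleBatch:
    """The one shared interface between modules (hypothetical fields)."""
    sample_ids: list[str]
    data_uri: str               # where raw or intermediate data lives
    metadata: dict[str, str]


class PipelineModule(Protocol):
    name: str
    def run(self, batch: SampleBatch) -> SampleBatch: ...


class AlignmentModule:
    """Stand-in for a complex step; its internals are swappable."""
    name = "alignment"

    def run(self, batch: SampleBatch) -> SampleBatch:
        # ...the heavy lifting would happen here...
        return SampleBatch(batch.sample_ids, batch.data_uri + "/aligned",
                           {**batch.metadata, "step": self.name})


class QCModule:
    name = "qc"

    def run(self, batch: SampleBatch) -> SampleBatch:
        return SampleBatch(batch.sample_ids, batch.data_uri + "/qc",
                           {**batch.metadata, "step": self.name})


def run_pipeline(modules: list[PipelineModule], batch: SampleBatch) -> SampleBatch:
    # Modules are interchangeable as long as they honor the interface.
    for module in modules:
        batch = module.run(batch)
    return batch
```

Swapping in a new aligner then means writing one new class that honors `run`, and nothing downstream changes.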

Ania Wilczynska: And finally, by designing for API and cloud integration, we can future-proof our systems so that new technologies can be onboarded very quickly. So, the take-home from here is: design for evolution, right?

Ania Wilczynska: So, this is what it can look like in practice. End-to-end automation, in our case, connects experimental metadata, compute, analysis, storage, and dashboards in a continuous loop. It allows multiple analyses to run in parallel and is instantly reproducible when new data sets arrive. The outcome is that scientists, including bioinformaticians, spend more time interpreting results and less time essentially babysitting pipelines. This infrastructure supports hypothesis generation and iteration, and it creates a complete cycle from data to new insights. The automation of course doesn’t replace scientists, it amplifies them. By automating the flow from data ingestion to reporting to API access, we iterate faster and keep quality consistent. The schematic on the right shows how we automate the full bioinformatics cycle from data to insight. We start with metadata capture in Benchling as well as our in-house-built app, and we ensure that every sample and condition is traceable. This is really key. Next, pipelines run automatically with the use of AWS Batch and Lambda, and these scale as new data volumes arrive. The results are stored in AWS S3, then linked through APIs to feed dashboards and AI tools; LLM agents can summarize the results and flag patterns and QC issues, and the scientists interact with the data through dashboards. So the key idea is that the loop, data in, analysis, insight out, runs reproducibly at scale, and it frees our time to focus on interpretation rather than execution.
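As an illustration of that event-driven pattern (not bit.bio’s actual code), a Lambda-style handler might route new S3 objects to the matching pipeline. The bucket names, key prefixes, and pipeline names below are invented, and a real deployment would submit the job to AWS Batch via boto3 rather than just returning the job spec as this sketch does.

```python
# Hypothetical routing table: S3 key prefix -> pipeline to launch.
PIPELINES = {
    "rnaseq/": "rnaseq-quantification",
    "atacseq/": "atac-peak-calling",
}


def handler(event: dict, context=None) -> dict:
    """Lambda-style entry point: a new object landing in S3 triggers
    the matching pipeline. For illustration we return the job spec;
    a real deployment would call AWS Batch's submit_job here."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    for prefix, pipeline in PIPELINES.items():
        if key.startswith(prefix):
            return {"pipeline": pipeline, "input": f"s3://{bucket}/{key}"}
    # Unrecognized data type: nothing to launch, but keep it traceable.
    return {"pipeline": None, "input": f"s3://{bucket}/{key}"}
```

The event shape follows the standard S3 notification structure, which is what makes the loop hands-off: data arrival itself is the trigger.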

Ania Wilczynska: Right. So, I’ve mentioned metadata quite a lot already, because in our view it really is the connective tissue behind all of the data sets. Capturing rich technical and biological provenance metadata early allows us to integrate across studies, perform batch correction, and reuse analyses efficiently. We employ FAIR principles: define once, reuse everywhere. This is really essential. Standardized metadata allows not just traceability; it also creates a central data store that ensures findability and reuse, and structured data feeds directly into AI and ML tools. Metadata can of course be vast. One thing that we found really important for automation of pipelines and tracking of samples has been a unified sample naming system. It sounds pretty trivial, and it takes a little bit of engineering and a little bit of cultural change to deploy, but it’s been extremely important for us, even as a relatively trivial example. So again, just to reiterate: metadata turns messy data into machine-readable knowledge.
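To make the idea concrete, here is a toy validator for a unified naming system. The scheme itself (`<project>_<cell line>_<condition>_R<replicate>`) is invented for the example, not bit.bio’s real convention; the point is that once names are machine-checkable, every pipeline can parse the same fields automatically.

```python
import re

# Hypothetical scheme: "PRJ042_iNeuron_D14_R1". A real deployment would
# pin this in one shared library so every pipeline validates identically.
SAMPLE_NAME = re.compile(
    r"^(?P<project>PRJ\d{3})"
    r"_(?P<cell_line>[A-Za-z0-9]+)"
    r"_(?P<condition>[A-Za-z0-9]+)"
    r"_R(?P<replicate>\d+)$"
)


def parse_sample_name(name: str) -> dict:
    """Validate a sample name and return its metadata fields, or raise."""
    m = SAMPLE_NAME.match(name)
    if m is None:
        raise ValueError(f"non-conforming sample name: {name!r}")
    fields = m.groupdict()
    fields["replicate"] = int(fields["replicate"])
    return fields
```

Rejecting non-conforming names at submission time, rather than discovering them mid-analysis, is where most of the value sits.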

Ania Wilczynska: So, a slightly busy slide, but this is just an overview of what our platform looks like in terms of integrating research, again using metadata as a foundation, as well as automation. Our in-house bioinformatics platform connects this metadata with sample tracking, analysis pipelines, QC, and reporting through a unified database. It provides live links, data sanitation, and API access, enabling researchers to explore, analyze, and develop AI workflows directly. As a result, we have a self-service, interactive research environment where data flows seamlessly from experiment to model. The way we think about this is that we move from a data set to really a living research system. We use this platform internally to handle everything from single cell to genotyping to plasmid design, and we emphasize empowering users and bridging all these systems.

Ania Wilczynska: Okay. So now we start moving into making data AI-ready. Of course, data alone isn’t enough. It needs to be transformed into structured knowledge, and we do that by explicitly extracting relationships from our experimental data and metadata as well as from publications. A lot of our work relies on external open-source data, and we codify these relationships into a knowledge graph, and now we can start to infer new connections using AI. This structured understanding supports predictive biology, which is what we as a company do. And we hope that this will give us the ability to anticipate outcomes of experiments rather than just to measure them.
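A toy sketch of the idea, with illustrative triples only (the genes and relations below are placeholders, not claims from the talk): relationships are codified explicitly as (subject, relation, object) edges, and even a trivial inference rule over those edges can then surface candidate connections nobody recorded directly.

```python
from collections import defaultdict

# Toy knowledge graph: triples extracted from experiments and literature.
# Gene names and relations are purely illustrative.
triples = [
    ("NGN2", "upregulates", "NEUROD1"),
    ("NEUROD1", "upregulates", "MAP2"),
    ("ASCL1", "upregulates", "DLL1"),
]


def infer_indirect(triples, relation="upregulates"):
    """Trivial inference rule: A -> B and B -> C suggests A ~> C.
    Real systems use far richer reasoning; the shape is the point."""
    direct = defaultdict(set)
    for s, r, o in triples:
        if r == relation:
            direct[s].add(o)
    inferred = set()
    for a, bs in direct.items():
        for b in bs:
            for c in direct.get(b, ()):
                inferred.add((a, "indirectly_upregulates", c))
    return inferred
```

Inferred edges like these are candidates for review, not facts, which is exactly where the human-in-the-loop step discussed next comes in.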

Ania Wilczynska: So AI now helps us to ask better questions about the data that we are actually generating in-house and the workflows that we’re developing, using our data as well as, again, like I said, external data from publications and open-source data sets. The loop consists of first defining a question, then aggregating lots of data, moving through AI agent synthesis, then human review, and I cannot stress enough how important it is at this stage to have the human in the loop. We’ll talk about that again in a second. And finally, updating the knowledge graph and iterating again. The human element is very important both for ensuring quality and scientific rigor and, of course, for eliminating faulty or hallucinated information. So again, to emphasize: we’re not at the stage yet of replacing the scientist, we’re augmenting their ability to work. And of course this reduces lag between experiments and insights. Every iteration enriches the knowledge base and therefore improves the outcomes for the next round.
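A toy version of that loop, with stand-in functions for both the agent and the reviewer (nothing here reflects the actual implementation): the structural point is simply that no candidate fact enters the knowledge graph without passing the human review gate.

```python
def aggregate(question: str, sources: dict[str, str]) -> list[str]:
    """Pull every source snippet that shares a term with the question."""
    terms = set(question.lower().split())
    return [text for text in sources.values()
            if terms & set(text.lower().split())]


def agent_synthesize(snippets: list[str]) -> list[tuple[str, str, str]]:
    """Stand-in for an LLM agent: emit candidate (subject, relation, object)
    facts. A real agent would read the snippets; this toy just pattern-matches."""
    return [("NGN2", "induces", "neurons")
            for s in snippets if "ngn2" in s.lower()]


def run_iteration(question, sources, knowledge_graph, human_review):
    """One turn of the loop: question -> aggregate -> synthesis -> review -> update."""
    candidates = agent_synthesize(aggregate(question, sources))
    # Human in the loop: only reviewed facts enter the knowledge graph.
    knowledge_graph.update(fact for fact in candidates if human_review(fact))
    return knowledge_graph
```

Passing a rejecting reviewer leaves the graph untouched, which is the whole safeguard against hallucinated edges.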

Ania Wilczynska: So, the more we automate retrieval and summarization, the more time our scientists have to focus on creative reasoning and high-complexity tasks. This is really what the simple schematic on the right is showing you. We are a lot less focused on the medium- and low-complexity tasks, thanks both to our automated, modular pipelines and to plugging in agentic AI, so we can be essentially better scientists, moving a little bit away from the engineering into the more creative science. And this of course implies huge efficiency gains.

Ania Wilczynska: So, once again, emphasizing the human in the loop. On the engineering side of things, we implement all these principles through a retrieval-augmented generation, or RAG, stack. It allows the AI models to query internal data safely, without retraining or exposing sensitive information. Again, the architecture is modular; this is obviously a theme. Each of the agents, be it a specific bioinformatics agent, an imaging agent, or a developer agent, specializes in a particular domain, and this is all coordinated by the human-in-the-loop scientist. This dramatically accelerates things: tasks that used to take hours can now be done in seconds. The structure is secure and modular, with replaceable components. So again, this is how we think about scalable AI in pragmatic production.

Ania Wilczynska: All right. So, circling back, in a way, to something a little more formally bioinformatics-focused, I’m going to talk to you a little bit about how we scale to multi-data-set and multimodal analysis. This will be mainly focused on single cell data sets, because this is really an area where the concept of scale is quite obvious and quite a pain point for a lot of researchers. Single cell data sets scale to millions of cells, and integration of these data sets becomes a core challenge for many reasons, but a lot of it is because it is a data engineering problem, not just a problem of statistics. Internally at bit.bio, we routinely handle data sets of millions of cells across multiple studies and modalities. To integrate these data sets successfully, we have to normalize technical variation while preserving biology, of course, and build models that scale efficiently. Data sets need to be aligned across batches, labs, and modalities, be that RNA, ATAC-seq, spatial data, protein data, imaging, what have you. And new algorithms do scale to millions of cells and to multiple modalities.
There are of course integration tools such as Seurat and Harmony. These are all available open source and scale to unprecedented levels. But the volume of the data is still a huge challenge, even when it comes to just loading the data objects for analysis. The way we’re currently addressing this is with the use of SOMA, which stands for “stack of matrices, annotated”. It’s a new open standard from the Chan Zuckerberg Initiative designed exactly for this challenge. It provides an array-based data format that supports multimodal data sets, again RNA, ATAC-seq, and so on, at massive scale. It’s fully interoperable across R, Python, and C++, and it enables out-of-core access to data aggregations much larger than a single host’s main memory, enabling distributed computation over data sets. SOMA provides a building block for higher-level APIs that may embody domain-specific conventions or schemas around annotated 2D matrices, like cell atlases. For us, adopting SOMA has meant that we can, first of all, store huge amounts of data in one place, slice it quickly any way we like, and finally share the data reproducibly, preparing it for AI training or, more simply, retrieval. What this looks like in practice, as one example: we’ve integrated a number of perturbation screens where both the technology, in terms of sequencing, and the conditions were very different, and about 2 million single cells in total have been integrated from our own data. We’ve also been able to put this together with pseudobulks for each of the screens, as well as pseudobulks from the open-source, 44-million-cell CELLxGENE Census data set. This is all integrated in a unified SOMA data layer, and what this means is that for all of these data we have a consistent schema for metadata.
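One practical piece of keeping such a schema consistent across in-house and external data is harmonizing gene identifiers onto a single reference before integration. A toy sketch, with made-up mappings rather than real annotation data, of what that wrangling step looks like:

```python
# Toy harmonization table: identifiers from different annotation sources
# mapped onto one unified symbol space. All entries are illustrative only;
# real pipelines would build this from a pinned annotation release.
ALIAS_TO_SYMBOL = {
    "ENSG00000000001": "NEUROG2",   # fake Ensembl-style ID
    "NGN2": "NEUROG2",              # legacy alias
    "NEUROG2": "NEUROG2",           # already canonical
}


def harmonize(genes: list[str]) -> tuple[list[str], list[str]]:
    """Map identifiers to unified symbols; return (mapped, unmapped)."""
    mapped, unmapped = [], []
    for g in genes:
        symbol = ALIAS_TO_SYMBOL.get(g)
        (mapped if symbol else unmapped).append(symbol or g)
    return mapped, unmapped
```

Tracking the unmapped remainder explicitly, rather than silently dropping it, is what keeps the unified layer trustworthy.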
One of the important things to highlight here is that we’re using a unified gene annotation, which does require some data wrangling, as external data sets especially can use very different annotations. Now that everything is put together, we can easily query slices of the data in seconds, across different data sets, instead of waiting minutes or sometimes even hours for our data to load. And this is really the first step towards truly AI-native data sets, because the data is structured, standardized, and really ready for automated reasoning. And with that kind of whistle-stop tour of our thinking, I’ll end. Just three principles to summarize. Treat data as infrastructure, not a byproduct. Make metadata-first design non-negotiable. And realize that AI readiness is an outcome that emerges naturally from reproducibility and structure. So the goal is not to automate everything, but to build systems that let us scale scientific discovery. We need to work towards these scalable, interoperable systems instead of just thinking about individual scripts and creating silos.

Ania Wilczynska: And thank you very much.

Grant Belgard: Thank you, Ania. As a reminder, live viewers can submit questions in the live chat on YouTube or LinkedIn. To kick us off, how do you ensure reproducibility across so many heterogeneous pipelines?

Ania Wilczynska: Yeah, so we have, I think, over 30 different sequencing technology pipelines as of my last count, so of course deliberate design is very important. For the pipelines themselves, I cannot emphasize the need for containerization enough: having versioned pipelines, fixed parameters, and, just to sound like a broken record, very well structured and deliberate metadata capture.

Ania Wilczynska: Having a centralized way of sample submission has also been really important for us being able to version our pipelines very quickly as well. We have relatively rigorous testing approaches too. But yeah, containerization, versioning, and metadata, first and foremost.

Grant Belgard: How do you align or normalize data sets across modalities and platforms?

Ania Wilczynska: Yeah. So batch correction is obviously a big nightmare, and we’ve already spoken about tools like Harmony and Seurat. There are plenty of publications that talk about the various pitfalls of single cell integration tools on their own. But again, metadata is extremely important. For example, in our hands, thinking about integrating imaging with transcriptomics is a nontrivial problem, especially in the absence of spatial data. We do single cell, but we don’t do spatial transcriptomics, and we’ve spent a lot of time thinking about how we can integrate ML approaches to image analysis with our transcriptomics data. Once again, unsurprisingly, metadata, excellent sample tracking, and integration of systems, including ELN systems like Benchling, have been really key to this. There’s also an element of deliberate experimental design, again thinking beyond a single experiment. I appreciate that academic labs will be less naturally used to thinking about consistent experimental designs, because that’s not the core way of thinking in academic labs. However, I still think that asking the question “does it scale?” is extremely important in academia too. In my academic career, I found that the absence of that question meant that we often missed a lot of opportunities to integrate data sets, because everything about them was just incompatible.

Grant Belgard: Yep. What’s the advantage of SOMA over existing HDF5 or AnnData approaches?

Ania Wilczynska: Yeah, so it’s really the out-of-core scalability, the multimodal support, and the interoperability. From everything that we’ve seen so far in actually implementing these huge SOMA objects, it’s the next step towards big data rather than single experiments. This may sound like a plug, but we’re really excited about how this is enabling us to iterate through computational experiments, if you like, very quickly.

Grant Belgard: How is AI readiness different from just automation?

Ania Wilczynska: Well, AI readiness really means a different way of thinking about structure and about semantics. With automation, reproducibility is sort of your main output and your main gain, whereas AI readiness means that the data is discoverable and learnable. So it does require cross-experiment, but also cross-function, thinking. I think that’s another thing people often disregard: how important it is to think about data and about computational biology not just within the computational biology function, but also to make sure that the wet lab scientists, or indeed, in industry, other functions, understand how everything fits together in a data stream.

Grant Belgard: So related to that what cultural or organizational changes are required for this to be successful?

Ania Wilczynska: Primarily cross-functional collaboration. And I think a mix of evangelizing and education can really go a long way. We work both on embedding AI workflows into bioinformatics, but we also work with other teams, for example the commercial team in our company, to create AI workflows. That of course helps the business, understandably, but it also means that there is a lot more understanding across the company as to why such workflows are important, why data is important, and why data structures are important. Again, this does require a little bit of outreach, a little bit of evangelizing. Structured metadata and deliberate experimental design can at first seem like a bit of an overhead in the lab, because, oh, it’s another thing I need to capture.

Grant Belgard: We’ve never seen that before, have we?

Ania Wilczynska: Indeed. But you show people the value of that, rather than going, “Oh, ping, my whizzy machine gave you a new result,” by going, “Well, because of that overhead, we’ve now been able to bring three data sets together, one that we did three years ago and one that we did now, and now we have a better outcome.” Again, this is all a kind of cultural shift, but I’ve found that a little goes a long way in that respect. So yeah, a bit of showing by example, a bit of evangelizing, and also treating colleagues as partners.

Grant Belgard: What’s the next step beyond AI native data sets?

Ania Wilczynska: Well, of course, in the utopian brave new world, it is AI scientists. I don’t think we’re quite there yet. Or at least so we’re all telling ourselves or we’ll all be out of jobs. But really it’s the closed feedback loops, data, models, experiments, self-improving hypotheses. And I think that’s really where things are heading very quickly.

Ania Wilczynska: Well, I guess everyone’s talking about it now: there is also a lot of hype about what AI can do. But I think a lot of what we’re all, in a way, promising ourselves is really bottlenecked by data.

Grant Belgard: And so I’m hearing it’s really essential to train people to think in terms of cycle time, right?

Ania Wilczynska: Yeah.

Grant Belgard: So we have an emailed question: how can these platforms be translated into the clinic, or are there regulatory requirements that need different setups of the platforms?

Ania Wilczynska: Yeah. So, everything I’ve talked about, just to be clear, is in an R&D, preclinical setup. I think the important thing to remember about regulatory requirements is that everything is extremely slow, for good reason, and there is very little existing regulation around AI tools, at least as far as I’m aware. I think that will probably take quite some time. Which again brings me back to this idea that, I mean, we’re all blown away by these tools every day, but because the regulatory principles don’t really exist yet, the experimentation is still necessary. So thinking about the fact that it’s not just the tool, the data has to come before it, but also will for some time come after it, is very important. Of course, on the other hand, in terms of just building automated, modular pipelines, a lot of the cloud platforms provide certain standards, so we’re sort of working towards it, but I think we shouldn’t expect the really novel solutions to be adopted all that quickly.

Grant Belgard: Well, Ania, I think we’re at time, but thank you so much for joining us. The series will resume January 21st at 11:00 a.m. Eastern with Jake Taylor-King from Relation Therapeutics, followed by Phil Ewels from Seqera on February 18th at 11:00 a.m. Eastern. Mark your calendars, and thank you everyone for joining us today.