The Bioinformatics CRO Podcast

Episode 67 with Manos Metzakopian

Manos Metzakopian, co-founder and CEO of CellCodex, joins us to discuss CellCodex’s mission to provide high-quality, scalable cellular perturbation data, ready to train advanced AI models for biology.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Amazon, YouTube, Pandora, and wherever you get your podcasts.

Manos Metzakopian

CellCodex is a CRO that generates AI-ready perturbation data at scale. Our founder and podcast host, Grant Belgard, is also a co-founder and the CTO of CellCodex.

Transcript of Episode 67: Manos Metzakopian

Disclaimer: Transcripts are automated and may contain errors.

Grant Belgard: Welcome to the Bioinformatics CRO Podcast. I’m your host, Grant Belgard, and today I’m joined by Manos Metzakopian. Today’s episode is special. We’re using this conversation to introduce CellCodex to the world. Full disclosure, I’m a co-founder and the CTO of CellCodex and Manos is co-founder and CEO. We’ll explore what the company is setting out to do, the scientific and engineering choices behind it, Manos’ path to this point, and practical advice for anyone building at the intersection of wet lab and AI. Let’s dive in.

Manos Metzakopian: Wow, this is amazing. Thank you for the invite.

Grant Belgard: So what would you like listeners to know about CellCodex?

Manos Metzakopian: When we started CellCodex, we imagined a world where there’s an abundance of drug targets and basically that there is a cure for every disease. And a major development that happened in recent years was artificial intelligence gaining this capability of taking large sets of data and providing such solutions. That happened with large language models, with ChatGPT, where all text has been collected and you can now interrogate all text that has been around for use and you can gain a lot of speed in your daily tasks. So imagine if you had an AI model for biology, for discovering new drugs. And that model helps you increase drug target discovery efficiency, but also efficiencies going to the clinic and increasing your chances of success once you go to the clinic. Because at the moment, most of the drugs that reach the clinic fail. And there’s a lot of iteration that goes into drug discovery.

Manos Metzakopian: So AI has the potential of solving these problems. Now, for biology, there isn’t this counterpart of data sets that was there for ChatGPT and text. And there is a big need for data so that the right AI models are trained to realize this future. And this is why CellCodex has been created: to solve biology’s biggest bottleneck, which is data. And AI, as I said, has the power to transform drug discovery, but it needs the right kind of biological data: systematic, reproducible, and at scale. And that’s what we want to deliver. Our vision is to accelerate the arrival of the world where every disease is curable. And the first step is giving AI model builders and drug target hunters the right fuel, which is the data.

Grant Belgard: So CellCodex is a CRO that generates AI-ready perturbation data at scale.

Manos Metzakopian: That’s correct.

Grant Belgard: So what problem in biology or drug discovery feels most urgent to address right now, and why start there?

Manos Metzakopian: So at the moment, because of the arrival of AI models that can solve these big problems, the creation of superior AI models is moving at a very fast pace, almost at the pace of weeks and months. Whereas the data generation that can feed these models and allow them to be trained is still very slow. It’s moving at a speed that does not keep up with the speed of model creation and testing. So the most urgent gap is reproducible perturbation data. We have plenty of observational data at the moment. However, these are snapshots of what cells look like. So from observational data in biology, we have almost 14 times the amount of data that was there to train ChatGPT. However, it’s the quality of the data and the kind of data that is available that is important.

Manos Metzakopian: And unfortunately, we do not have that right type of data, the perturbation data, the intervening data in cell identity, cell state, and cell function. Without that, AI can’t move from correlation to causation. We started there at CellCodex to create large-scale perturbation data to solve this problem and to allow AI, artificial intelligence, to realize its promise, speed up drug discovery.

Grant Belgard: When you imagine the ideal outcome of this effort five years from now, what does success look like to the end user?

Manos Metzakopian: Success is very simple for the end user. It looks like faster drug discovery programs, fewer dead ends, more success in drug target identification, and higher success in the clinic. And for those to be powered by our AI-enabling data sets. That’s how I see our success five years from now.

Grant Belgard: What kinds of decisions do you hope our work helps people make faster or more confidently?

Manos Metzakopian: There is a lot of work that goes into drug discovery that takes many, many years. And we can speed it up at the rate of weeks and months to be able to make these decisions in weeks and months versus years. And that includes which targets to pursue, which mechanisms are causal, which disease models are worth investing in. Right now, those decisions often take years and huge budgets, and we want to make them faster, cheaper, and with higher confidence.

Grant Belgard: What milestones are you most comfortable sharing at this stage, and what should listeners watch for next?

Manos Metzakopian: So a major milestone for us is that we’ve set up and are continuing to build our platform, the CellCodex platform, at the Babraham Research Campus at the moment, where we are going to launch our first collaborations, partner projects, client projects. We also want to publish benchmarking data sets that show what AI-grade data really looks like. So listeners should watch for collaborations where our data sets are powering new models or enabling novel drug targets, and for new drug targets emerging from our data sets and the client AI models they enable.

Grant Belgard: When building virtual models of cellular behavior, what principles guide how you define the unit of prediction or simulation?

Manos Metzakopian: So we think of units not just as a number of cells that are being evaluated, as is done with observational data, but we are also thinking of cell states, cell states under perturbation. So a meaningful unit for us isn’t just a cell at rest or in its normal environment. It’s a cell that is responding to a defined change. This is the building block for causal AI modeling, I would say.

Grant Belgard: Where do you draw the line between a correlational model that’s useful and a model that supports causal reasoning?

Manos Metzakopian: So a correlation model is useful for pattern recognition, but causation comes when you’ve systematically perturbed the system. So our role here is to generate that causal data so customers can build models that go beyond what co-occurs and move to what actually drives change. For example, in disease, point mutations can lead to changes in cell state, and these are not just co-occurring mutations. They are driving the change. So we are interested in data sets that empower models that can quickly identify mechanisms that actually drive change in cells.

Grant Belgard: In your view, what types of measurements provide the most leverage for learning cell state transitions under perturbation?

Manos Metzakopian: For us, at the moment, we need single-cell multi-omics, and we have two major capabilities: to sequence RNA, the cell’s messenger RNA, at scale, but also to acquire epigenetic changes, the epigenetic landscape in the cell, through ATAC sequencing. So that captures which areas of the genome are open, and so you know which genes are expressed, but also correlate those to which areas of the genome are open as well. So these two data sets provide, number one, which genes are expressed, how they are changed under perturbation, and very importantly, which features of the epigenome change. So when you have ATAC sequencing, you can also correlate the changes in many features of the genome to the gene expression changes as well. So that adds a lot more information to interpret causation versus correlation.

Grant Belgard: How do you think about biological context, cell type, state, microenvironment, when designing a modeling target?

Manos Metzakopian: That’s a very, very important question. This is the essence of what we do at CellCodex. So at CellCodex, we have the functional genomics capabilities to produce these large-scale perturbation data sets through our genetic screening approaches and gene editing technologies. However, the foundation that can lead to the right type of data is the models that we would use to generate these data sets. And so we design our experiments according, of course, to what the clients would need, with the cell identity and function in mind, developmental states, co-cultures, which cells need to be together in the dish, and the microenvironment that they are supposed to be growing in. So you take all of that together, and then you have your human cellular models that you would use for your perturbation experiments.

Manos Metzakopian: And if you want to think about it, a perturbation in a neuron means something very different than a perturbation in a fibroblast. So that’s cell identity. So we co-design with customers to choose the context that matters to their question and to problems that their AI models would want to tackle.

Grant Belgard: What would you count as a falsifying result that sends you back to the drawing board?

Manos Metzakopian: A very important thing is the quality of the data, and a lot of it goes into data reproducibility. So we put measures, quality control measures, in place at every step of our platform, so our data sets are reproducible across batches. It would be very challenging if we didn’t have batch-to-batch reproducibility. If you think about the cellular models that we are using, every time we perform tissue culturing and use the cellular models to produce the data, they need to be the same and reproducible. And the data sets that are coming out of these models, the perturbation data sets, need to be reproducible. So we have very strict metrics around that. I would say that would be one of the major falsifying results that can happen in the platform. And we have very stringent mitigation strategies for that.

Grant Belgard: And when you plan data for model training, what are the first three design decisions you lock in and why those?

Manos Metzakopian: The most important thing is the context, which cell models to use, because if you think about diseases, they don’t happen in isolation. They happen in a specific context with specific cell types involved and cell-cell communications happening there. So the cell models to use are one of the first decisions we need to make. Of course, they need to be applicable to our screening strategies as well. And then which perturbations to apply? Is it a gain-of-function perturbation, like using CRISPR activation, or is it a knockdown perturbation, or are we looking at completely knocking out the gene? So which perturbations, and depending on the experiments, the different scales. So we might need a few million cells for an experiment, or hundreds of millions of cells for an experiment.

Manos Metzakopian: So if you’re thinking of foundation models, for example, versus very specifically trained models that would need fewer cells. And which readouts, right? If you’re thinking of omics readouts, which of those readouts, so that you can balance resolution, cost, and downstream utility.

Grant Belgard: What’s your approach to quality control from sample prep through to processed matrices?

Manos Metzakopian: Our approach is to have quality control steps embedded in every part of our platform and our process. And to have the right type of standard samples or tests in every component of our measurement. So that then we can always have a good measure of the quality of the data that’s coming through. So from tissue culturing and the cells, quality of the cells that we are using for our perturbations, the quality of the material that we are extracting from the cells, and finally, the quality of data that is being extracted from our cellular models. I would say we have very clearly defined pass-fail criteria up front for customers to know what they’re going to be getting regarding quality of experiments and data.

Grant Belgard: What’s your stance on foundation-style pre-training versus task-specific architectures?

Manos Metzakopian: So I think both are going to be important. You will have customers that are looking for models that can generalize very broadly. So they would be building foundation models, and those would require vast, diverse data sets. So there’s going to be breadth and depth required for such models. And task-specific models will need more precisely curated data sets coming from very specific contexts. And in both cases, that will decide the number of cell types and complex cultures that we would be using to deliver the data sets for both types. Foundation models will have quite broad utility, but the task-oriented ones would be more specific. And we will be producing data sets for both types of model training approaches.

Grant Belgard: What does a convincing benchmark look like to you for “the model understands a cell response”?

Manos Metzakopian: It can be covered by just one word, I would say, replication and validation. So if, sorry, two words, replication and validation. And that is that we are able to reproduce the same perturbations providing the same data. So that would be replication. And also validation, the outputs of the models that can be validated in turn. So I think these are going to be very important benchmarking tools that we have. That’s how I would think of it in simplistic terms.

Grant Belgard: How do you separate evaluation of biological plausibility from pure predictive accuracy?

Manos Metzakopian: So I would say that it’s very important to focus on biological plausibility. Because if the data itself isn’t biologically valid, accuracy metrics won’t matter in any sense. So it has to be applicable to the scientific challenge that the client wants to tackle. So I would put a lot of focus on biological plausibility initially, especially in the scientific design, in the experimental design.

Grant Belgard: What forms of external validation (replication, blinded tests, challenges) feel most meaningful?

Manos Metzakopian: It would be great if independent labs can replicate. If you think of it from a replication point of view, if different labs can generate the same data with the same approach, that provides a lot of confidence. But in our case, I think we would think of it as customers successfully using our data to build their models and generalize to new biology. So if they are able to use our data, generalize to new biology, and identify targets, solve their biological problems, and expedite the therapeutic discovery path and increase its effectiveness, then I would say that’s the most meaningful external validation.

Grant Belgard: Who stands to benefit first from this work, and how might they plug it into existing workflows?

Manos Metzakopian: I think at the moment, there is a race happening among different entities, institutions, and consortia that are working towards delivering a model that can solve a lot of the drug discovery problems. And that includes biopharma teams, consortia, and so on. But I think what is currently being understood is that it’s not going to be a one-dataset-fits-all. It’s going to be models that are going to be trained to solve specific problems, and they’re going to be requiring their bespoke data sets to be trained with. And so I think it’s going to be less of a race towards the best model, but more of a joint effort to generate the right data for the right models and solve pressing issues in the world. And I think that day is upon us, for sure.

Grant Belgard: What kinds of collaborations or partnerships would be most impactful at this stage?

Manos Metzakopian: Companies that have bottlenecks in their pipelines where our data can actually resolve that issue.

Grant Belgard: How do you weigh openness, sharing resources, or benchmarks against the need to build a durable business?

Manos Metzakopian: I would say it’s very important to make sure that we are leading in the space of high-quality data sets, AI-grade data sets. And we should think of best ways of sharing benchmarks and best practices openly. However, the large-scale perturbation data sets are contract-delivered, and so there needs to be a balance that ensures both impact and sustainability.

Grant Belgard: What drew you personally to this specific problem space?

Manos Metzakopian: I have always been involved in projects and challenges that require large data and perturbation data. Most recently, we’ve used this know-how in the cell programming field, to democratize cell types for drug discovery research and cell therapies, and that also required large-scale data sets and so on. And during my time solving these problems in academia and in industry, I realized that the potential for AI to solve the drug discovery bottleneck and lead to a world where there are cures for every disease requires us to rethink the way that we produce data, the quality of the data, its reproducibility and its scale, and the context in which it is delivered.

Manos Metzakopian: And so as I was progressing in my academic and industry career, I’ve realized that setting up a platform like this, which is CellCodex in this case, to generate AI-grade data is timely, and it is very, very important to do so now, when we are on the verge of artificial intelligence-enabled solutions in therapeutics.

Grant Belgard: Looking back, what set of experiences most shaped how you approach leading a science-driven company?

Manos Metzakopian: The most important experience that I had during my academic career and my industry career is managing people effectively, making sure that we are all goal-driven, we are ambitious, and we are enjoying what we’re doing. And in my academic career, I’ve mentored PhD students, master’s students, and postdocs, research assistants, and technicians. It led to amazing work where we’ve published over 30 scientific manuscripts in the fields of genetic screening, cell engineering, and drug target discovery. And similarly, in industry, leading larger teams, the most important thing that leads to success is the team, the people that are involved in driving the work and the goals that we set ahead of us. So I think goal-setting and the people that are along for the journey are the most important pieces of the puzzle.

Grant Belgard: How do you structure your day to balance science, product, people, and operations?

Manos Metzakopian: It’s not always easy to balance between everything. It depends on the stage at which the activities are. If it’s joining a mature corporation where they’ve already set off and they’re on a journey, or in this case, CellCodex, where we are just launching, everyone in CellCodex wears multiple hats, and we try to support each other and help each other so that we can deliver the needs of the company. And I structure my day where I look at the needs of the people, if there’s any way I can help in their day-to-day activities, the needs of the company, and in designing the strategy, and what type of products we’re going to have, and offerings. And of course, now, when launching, we’re thinking of operations. How are we going to operate most effectively? And I would say, at the moment, it’s split 30% equally throughout everything.

Manos Metzakopian: So I would say it’s equally divided across strategy, products, and operations.

Grant Belgard: What advice would you offer to scientists considering a leap into company building?

Manos Metzakopian: You’re not going to feel ready at any time point, especially when it’s your first venture. So I would say, if you have the right ideas, and you have a very strong feeling and passion about these ideas, you have people equally passionate with you, and you can work together to make them materialize, then I would jump in, and I wouldn’t wait until you feel fully ready. You probably won’t get to that type of feeling. And it’s not a bad thing. And bottlenecks are not going to fix themselves. So if you see one clearly, then that’s your opportunity to jump in with your ideas to solve a problem in the world.

Grant Belgard: What practices help a small team avoid cargo-cult ML or overfitting to hype cycles?

Manos Metzakopian: I wouldn’t chase hype cycles. I would ask if the method helps explain or actually lead to a solution. So I would really think and investigate very, very, very well, very deeply, if a new direction, a new tool, a new approach is really going to make a big difference. And ask yourself if it’s worth the investment. So I wouldn’t chase. I would investigate and research what new things come out.

Grant Belgard: What advances outside your control would most accelerate your roadmap?

Manos Metzakopian: So that’s a great question. Currently, as I’ve said throughout this conversation, there are a lot of companies out there that are generating their own artificial intelligence models, and they are using them for predictions that can progress drug discovery. There are a lot of companies that are doing that at the moment already, and there is a big need for data. However, as soon as these models start showing the power that they have in increasing drug target discovery and driving efficiencies in therapies, there’s going to be an even larger need, and there’s going to be a larger number of models that are going to be generated to be trained, and there’s going to be a lot more data that’s going to be needed to train these models.

Manos Metzakopian: So I would say that since there’s going to be such a huge need for data, advances that can increase the number of cells that we can analyze in a multi-omics context, and technology development that can allow us to analyze multiple modalities from similar samples, all of these will allow for better data, larger-scale data that can provide the fuel that these new models will need in the future.

Grant Belgard: Well, Manos, thank you for sharing the CellCodex vision and the thinking behind it. It was nice having you on today.

Manos Metzakopian: Thank you very much for the invitation. It was a great, great conversation. Thank you. For listeners who want to follow along, the best place is cellcodex.bio and also our LinkedIn page. If you enjoyed this, please subscribe and share with a colleague who cares about building predictive biology. Thanks.