The Bioinformatics CRO Podcast

Episode 55 with Mo Jain

Mo Jain, Founder and CEO of Sapient, discusses the importance of small molecule biomarkers and his approach to biomarker discovery research.

On The Bioinformatics CRO Podcast, we sit down with scientists to discuss interesting topics across biomedical research and to explore what made them who they are today.

You can listen on Spotify, Apple Podcasts, Google Podcasts, Amazon, and Pandora.

Mo Jain

Mo Jain is the Founder and CEO of Sapient, a biomarker discovery CRO using next-generation mass spectrometry technology.

Transcript of Episode 55: Mo Jain

Disclaimer: Transcripts may contain some errors.

Grant Belgard: [00:00:00] Welcome to The Bioinformatics CRO Podcast. I’m Grant Belgard and joining me today is Mo Jain of Sapient. Welcome.

Mo Jain: [00:00:07] Thank you so much, Grant. Pleasure to be here today.

Grant Belgard: [00:00:10] So tell us about Sapient.

Mo Jain: [00:00:12] Absolutely, Grant. So Sapient is a discovery CRO organization which is really focused on biomarker discovery. And the way we operate is through leveraging novel technologies, particularly in the mass spectrometry sector in order to enhance human discovery. And we primarily serve as a partner for large pharma, early biotech, and even some foundations to help them in their biomarker discovery efforts as part of their drug discovery work.

Grant Belgard: [00:00:39] Can you tell us about the history of the company?

Mo Jain: [00:00:41] Yeah, absolutely. The concept of Sapient admittedly dates back almost two decades now. I’m trained as a physician, and one of the common questions that one receives when treating patients is, why did I get this disease? How do I know next time if this is going to happen to me? How can I protect my loved ones and family members? What are the diagnostic tests that I can use to know if I’m going to respond to this drug? And one of the really humbling aspects of medicine is, despite the massive amount of knowledge that’s been gained over the last several hundred years, really what we still understand and know represents a very, very small fraction of all the knowledge there is to know. And for most of these very insightful questions, the answer typically is I really don’t know the answer. And at this time, when I was in the midst of training, the human genome was really coming to fruition in the early 2000s, when the initial draft of the human genome was reported, and genomics was going to revolutionize the world as we know it. And the basic idea behind this was that by understanding the basic blueprints of human life, we could leverage that information to understand how healthy or not healthy you may be over the course of your existence, what diseases you were going to develop or be predisposed to, what drugs you were going to respond to, and essentially we would be able to transform the way we think about diagnosing and treating human disease.

Mo Jain: [00:02:06] The challenge, however, has to do with the fact that the amount of information and the type of information that’s encoded in the genome doesn’t actually enable that to happen in most cases. And I apologize for the long answer, you’ll see where I’m going in a moment. But at the time when I was in the midst of training, we were doing a thought exercise, and that is, well, if we could parallelize sequencing to the nth degree, and if we essentially could line up every single human on the planet, and we knew everything about everyone and we sequenced everyone’s genome, how much of human disease could we explain? And the hope would be 80, 90, 95, 98%. In actuality, when you look at the numbers, and there’s many ways you can calculate this, as a heritability index, population attributable risk index, etcetera, the true answer is probably somewhere in the 10 to 15% range. And that’s a theoretical upper limit. If you really look in actuality, the number’s probably in the single digits for how much of human disease we can truly explain through sequence. And perhaps that’s not surprising. Your genome is set from the moment of conception, and the way you live your life, everything you eat, drink, smell, smoke, where you live, how you live, we know is massively important in how healthy or not healthy you’re going to be over the course of your existence.

[00:03:25] And none of that information is captured in your underlying genome. And so I became very interested in that 85% of population attributable risk that’s not encoded in genetic sequence, understanding once again, diet, lifestyle, environmental factors, how one organ system may communicate with another organ system, how the microbiome that’s part of our gut and skin and saliva influences human disease. Again, none of that information is encoded in genetic sequence. But it turns out that that’s encoded in small molecule chemistry. So when you ate something for breakfast or lunch or dinner, depending upon where in the world you are and what time it is, that gets broken down in your gut into small molecules. Those small molecules enter into your bloodstream. And because we all eat only healthy things, they do good things in our organ systems and allow us to be healthy over time. And the basic premise is that, well, if we could capture that information, if we could take human blood and probe the thousands of markers that are floating around in human blood, we can begin to understand how humans interact with their environment both internally as well as externally, and leverage that information now in the way the genome was supposed to, in understanding and predicting who’s going to develop what diseases over time, how long someone may live, whether or not I’m going to respond to a particular drug A versus drug B, et cetera.

[00:03:25] So that was the basic premise. Now this is not a new idea. Every year you go to the doctor. They draw two tubes of blood, typically about 20ml of blood. And in that, we measure somewhere in the order of 12 things to 20 things depending upon the test you get. Half of those are small molecule biomarkers, creatinine and cholesterol levels, glucose etcetera. The challenge is that there’s tens of thousands of things floating around in your blood, and we’re literally capturing less than 0.1% of them. And so how do we develop technologies that allow us to very rapidly measure these thousands of things in blood, and to do this at scale across tens of thousands of people in a manner that allows us to discover, well, what are the next 12 most important tests? What are the next 12 after that? And how do we leverage this information at scale to really predict and understand the human condition at its earliest disease points? And so that was the basic premise of Sapient. It was born out of academia, where we spent the last decade prototyping and developing these bioanalytical technologies. And as these were coming to fruition, we spun them out to form Sapient and that’s how we came to be today.
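
Editor's note: the "less than 0.1%" figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only. It assumes "tens of thousands" means 20,000 circulating molecules (a figure mentioned later in the episode), and the function name and constant are hypothetical.

```python
def percent_measured(n_measured: int, n_circulating: int) -> float:
    """Share of circulating molecules covered by a test panel, in percent."""
    if n_circulating <= 0:
        raise ValueError("n_circulating must be positive")
    return 100.0 * n_measured / n_circulating


# Assumption for illustration: "tens of thousands" taken as 20,000 molecules.
ASSUMED_CIRCULATING = 20_000

# The transcript cites roughly 12 to 20 measurements per routine blood draw.
for panel_size in (12, 20):
    share = percent_measured(panel_size, ASSUMED_CIRCULATING)
    print(f"{panel_size} analytes out of {ASSUMED_CIRCULATING}: {share:.2f}%")
```

Under this assumption a routine panel covers about a tenth of a percent or less of what is circulating, and any larger molecule count pushes the share lower still.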

Grant Belgard: [00:05:47] In the work you’ve done at Sapient, have you seen a large number of complex, non-linear, non-additive interactions among factors, or are you finding the major signals are things that can be reduced to more simple and straightforward guidelines looking at LDL and HDL, new markers along those lines?

Mo Jain: [00:06:11] Yeah. The simple answer is both, which I recognize is not all that helpful. But it comes down to what type of predictive analytic you require, what’s the threshold that you require for actual diagnosis. Now often times for virtually all cases, you can reduce down information to a single marker or at least what I would say is a practical number of markers, somewhere below half a dozen that we can measure under the most stringent of laboratory methods clinically in hospitals around the world, and provides us the information we need to know. That works well. You can imagine in the same way, cholesterol is highly predictive of those individuals who are at risk for heart disease. Developing simple tests like that for cancer, Alzheimer’s disease, liver disease, lung disease, GI illnesses, pregnancy related complications, etcetera, etcetera is oftentimes quite functional. And that’s where we spend most of our time at Sapient. At the same time though, as you’re suggesting Grant, much of human disease is non-linear in its etiology. It’s rarely a single case or an additive case of two events that cause disease, but rather it’s a much more complex interaction of many, many different [inciting] etiologies. It may be a genetic predisposition, which increases risk somewhere in the order of several percentile. Added on with an environmental exposure, together with a particular initial acute insult that collectively results in a disease process cascading and starting. And so this is where we’ve become much more interested in taking these very complex data sources, where we’re measuring tens of thousands of things in human blood, and using much more advanced AI based statistic modeling now to be able to much more holistically predict and understand these complex interactions.

Grant Belgard: [00:07:48] How much added power do you see from that?

Mo Jain: [00:07:50] Yeah, quite a bit, which is both an incredible opportunity and is incredibly challenging. As you can imagine, in the same way if you look at a picture, a painting, oftentimes with a very small reductionist view of that painting where you’re looking at only several pixels, you can oftentimes tell something about the painting. This is a blue painting. It’s of the ocean or something to that extent, but with a very, very small snapshot. But as you took a much more global view of the underlying image, that’s where the real granularity begins to emerge. And as we go from simply saying, well, your risk of disease X is Y percentile or it’s increased in this manner to a much, much more holistic view of across these 100 different diseases, this is your sort of combinatorial risk. This is how you want to optimize life and diet. This is how you want to optimize medications specifically for you. This is how we want to develop new drugs. This is where allowing for that complexity is absolutely critical.

Grant Belgard: [00:08:52] Would you say this approach is more powerful for risk of onset of a disease that hasn’t yet occurred or for prognosis?

Mo Jain: [00:09:04] Yeah, it very much depends upon the disease and the biological question. And you can break these down into diagnosis, meaning early diagnosis prior to disease onset or at the earliest stages. Prognosis meaning once disease has become clinically apparent, understanding long term outcomes and then prediction regarding response to therapeutics, which is really the third component of this. And as you can imagine, the added value of more complex modeling versus reductionist testing of single molecules partially depends upon which of those three question baskets you’re in. And then also the specific disease and the complexity and heterogeneity of the underlying disease state. Now the good and bad is we do a relatively poor job of this today. And when you think about a complex disease, whether it be heart disease or diabetes, this probably represents half a dozen or more diseases, all of which have a common end phenotypic variable of metabolic insufficiency or hardening of your coronary arteries that we’re lumping all together, even though they have very different mechanisms of action that allowed someone to go from a normal stage to an abnormal stage. So even for these very heterogeneous, complex multi-organ systemic diseases, even being able to break it down into those broad categories, what are the four types of subgroups, what are the five types of subgroups, can be extremely valuable. But now being able to take that even further and using more complex modeling, these AI based nonlinear approaches where you can say, well, I’m not interested in the five subgroups, I’m interested in the 100 subgroups and understanding which one of these specific subgroups is going to be optimal for a particular therapeutic. This is where adding complexity and nuance becomes critical.

Grant Belgard: [00:10:43] What does the path to clinic look like for what you do?

Mo Jain: [00:10:45] So this is where it becomes really, really important. And this is absolutely an evolving area that’s changing literally week over week. It used to very much be five years ago that clinical translation had to be dependent upon a single test, a single molecule that was well measured that we could enter into what we call a CLIA-accredited laboratory. And that was a one test, one diagnosis. That modality and that way of operating has completely changed over the last half decade. And we’ve seen this now. There’s something on the order of over 100 different tests that are at the FDA that use much more complex ML based algorithms or AI based algorithms for diagnostic purposes. We’ve already seen this in early pathology and histopathology and in radiology, and I wholly expect that the inflection is only starting now. So I suspect that over the near future here, I’m literally talking about the next several years, much more complex, nuanced, blood based testing is going to become the norm. We already see this in a number of conditions, whether it be diagnostic tests, whether it be Cologuard, for instance for colon cancer, whether it be genetic testing for particular chemotherapeutics in the setting of cancer. We’re already seeing this evolution happening in real time, and I suspect that’s going to not spill over, but extend to virtually every single human disease.

Grant Belgard: [00:12:07] How much of the work you do is brought to you by clients or sponsored by partners versus internal R&D to develop these tests?

Mo Jain: [00:12:17] Yeah, it’s a great question Grant. We’re somewhat multi-personality if you will, let’s say, in that we’re a front facing CRO. So a good portion of what we do, over 80% of our time and attention, is really based upon servicing large pharma and early biotech and their drug development efforts. And simply put, there we’re engaging these sponsors. They’re bringing biological samples to us. We’re analyzing them on our proprietary mass spec technologies, generating that data, doing the statistical analysis, making the discovery, and returning that discovery to them for commercialization as part of their drug development efforts. The other 20%, as you suggested, is really based upon our internal R&D efforts. And so at the same time, because we have such ultra high throughput mass spectrometry systems that are capable of generating data faster than any other technology worldwide, we’ve also been able to go around the world and collect hundreds of thousands of biological specimens internally as part of our R&D efforts, analyze those samples, generate now what is the world’s largest human chemical database, and amalgamate that information in a centralized repository internally here at Sapient that we now are subsequently mining for novel diagnostic purposes.

Grant Belgard: [00:13:28] What are the biggest challenges you face doing that?

Mo Jain: [00:13:30] Yeah, it’s a really good question. Up until several years ago, this would have been a simple technology issue. How do we actually generate the data? And I’m very glad to say that the efforts of Sapient have enabled us to generate now and handle data very, very quickly, meaning handling 100,000 to several million biological specimens for mass spectrometry analysis now is no longer a dream effort, but is very practical. It’s a daily ongoing here. So you can imagine that bucket now has been or that can has been kicked down a little bit of the road where it’s no longer a data generation issue. It comes down to a data understanding and interpretation issue, whereby how do we take this complex data now and really commercialize it in a way that is for the betterment of society? How do we develop the diagnostic tests that are going to be most meaningful? In many ways Grant, it becomes the kid in the candy store problem. If you have massive amounts of data, you can theoretically answer hundreds of questions simultaneously. And so what are the most high yield, high impact questions for different populations that we want to answer first and bring to clinical testing as quickly as possible? And that’s very much a personal sort of question.

[00:14:42] Obviously there’s a business use case behind it. But you can imagine, if you ask a particular foundation that operates in the rare disease space, they may have a particular preference. If you look across prevalence of disease across large populations of adults in the developed world, it’s a very different answer, may be heart disease and cancer, may be basic aging. And if you ask foundations that are operating in low to medium income countries, whereby there’s arguably the greatest need for human health and development, it’s obviously a very different set of questions around early childhood development, pregnancy nutrition and optimisation of in-field testing. So that’s one of the largest challenges that we face on a day to day basis. Now, certainly that’s a good problem to have. It’s very much a “first world problem” as to where you want to go first, and how do you want to operationalize and commercialize a very large data? But it’s a very real problem that I think many organizations that operate in this space are facing every day.

Grant Belgard: [00:15:40] Is there a system you use to make those decisions?

Mo Jain: [00:15:44] Yes, there is. And like the best of systems, you can imagine Grant that it oftentimes goes out of the window within the first three sentences of a discussion. So there’s certainly a lot of business use cases that we think through, understanding what are the addressable markets, what does reimbursement look like, these things that point us in particular directions. But at the same time, we’re fortunate enough just given how we operate, to have a little bit of leeway in the other questions that may be of equal if not greater importance, but may be of slightly less commercial value. And thinking through some programs that we have in understanding maternal nutrition in the developing world, programs that can have massive impact in large numbers of people that can move needles but may not have the same commercial relevance as coming up with a diagnostic test that tells us if we’re going to develop cancer in the next several years. Equally important, just slightly different commercial market.

Grant Belgard: [00:16:37] How do you think about causality or do you think about causality? Are you just really focused on strong associations? What’s the most predictive regardless of causal relationships?

Mo Jain: [00:16:46] It’s a fantastic question, and I’m happy to provide an answer. But ask me tomorrow and I’m sure I’ll give you a different answer. And as you can imagine, this is one of those things that fluctuates quite a bit. In the end, it depends upon what you’re using that information for. So let me give you an example, HDL is an extremely strong predictor, stronger than any other predictor for heart disease over time. But it’s still very questionable whether HDL itself is causal for coronary disease. We call it the good cholesterol, but likely what the evidence really points to is that HDL is reading out some other phenomenon that’s actually the causal agent. Now if I want to understand what my heart disease risk is over time, I just need a valid correlation that we know is specific and is statistically rigorous over time. And so HDL serves that purpose for me. Now, if I want to develop a drug and I want to use as a marker of drug efficacy HDL, well then having a causal association becomes much, much more important. And we spend quite a bit of time thinking through this process, because our goal is not only to come up with diagnostic markers, but to develop new drug targets and to validate those targets to develop new nutraceutical and natural product based therapeutics et cetera, et cetera.

[00:18:01] There’s a lot we can do with this type of data. And part of this has to do with understanding that causal question. And this is where we do quite a bit of multidimensional data integration, particularly with genomics information together with these small molecule biomarkers. In essence, doing a Mendelian randomization type of approach from which we may be able to infer causal relationships. And as you’re well aware having worked for many years in this space, MR-based analysis, particularly bidirectional MR, is very useful when it works, but the absence of information doesn’t necessarily negate causality. And so this is the way we certainly think about it. Again, it all has to do with how you want to use that data and what’s the objective function. In the end for a diagnostic test, it just matters if it accurately diagnoses and predicts people who are at risk of a disease state.
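
Editor's note: for readers unfamiliar with Mendelian randomization, the sketch below shows two standard textbook estimators, the single-variant Wald ratio and the fixed-effect inverse-variance weighted (IVW) combination across variants. It is a minimal illustration, not a description of Sapient's pipeline. The class name, function names and all effect sizes are hypothetical, and it assumes the variants are valid, independent instruments.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class VariantEffect:
    """Summary statistics for one genetic variant (hypothetical values)."""
    beta_exposure: float  # per-allele effect on the biomarker level
    beta_outcome: float   # per-allele effect on the disease (e.g. log odds)
    se_outcome: float     # standard error of beta_outcome


def wald_ratio(v: VariantEffect) -> float:
    """Causal effect of the biomarker on the outcome implied by one variant."""
    if v.beta_exposure == 0:
        raise ValueError("variant has no effect on the exposure")
    return v.beta_outcome / v.beta_exposure


def ivw_estimate(variants: list[VariantEffect]) -> tuple[float, float]:
    """Fixed-effect IVW estimate and its standard error across variants."""
    num = sum(v.beta_exposure * v.beta_outcome / v.se_outcome ** 2 for v in variants)
    den = sum(v.beta_exposure ** 2 / v.se_outcome ** 2 for v in variants)
    if den == 0:
        raise ValueError("no variant is associated with the exposure")
    return num / den, sqrt(1.0 / den)


# Made-up summary statistics, for illustration only.
instruments = [
    VariantEffect(beta_exposure=0.2, beta_outcome=0.1, se_outcome=0.02),
    VariantEffect(beta_exposure=0.4, beta_outcome=0.2, se_outcome=0.05),
]
estimate, std_err = ivw_estimate(instruments)
print(f"IVW causal estimate: {estimate:.3f} (SE {std_err:.3f})")
```

A bidirectional analysis repeats the same calculation with the roles of biomarker and disease swapped, using instruments for the disease. As noted in the conversation, a null estimate here does not by itself rule out a causal relationship.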

Grant Belgard: [00:18:51] How important is longitudinal data for what you do?

Mo Jain: [00:18:54] Very, very important. And I think one of the lessons that we learned from the genomic revolution, well there’s several things that we learned. One, the genome as I suggested earlier on, really imparts a minority risk of human disease. Who our parents are, what occurs at that moment of conception when a human is formed, that really provides only a very small amount of predictive capacity for what’s going to happen to us over the next 100 years of our existence. That’s the first lesson. The other lesson that we’ve learned is that human disease is a very dynamic process. Health is ever fluctuating, on different time scales, certainly on a decade long time scale, but even on a day to day basis, on an hour to hour basis. When you really dive into the nuance, if I slept last night versus if I slept one hour last night, I probably have a different health state today. Now, the impact of that may not be relevant over many, many years, but certainly you could argue that I’m healthier because I slept or didn’t sleep, or if I ate correctly versus didn’t eat correctly. And anyone who’s ever gone out and had an interesting night and woke up with a hangover can agree with that. And so being able to understand that dynamic nature is critical. Now your genome for the most part, your somatic genome is fixed from the time of conception and doesn’t change over life. And this is where diving into dynamic markers, particularly these small molecule dynamic markers that read out communication channels between our internal organ systems, between the external world and our internal sort of physiology, between things like diet, lifestyle, microbiome, toxicants, etcetera, etcetera. This is where small molecule biomarkers are particularly important. And because of their dynamic nature, they have the ability to change quite quickly, which can read out almost in a real time fashion particular health and disease states.

Grant Belgard: [00:20:41] What challenges have you run into collecting longitudinal data and integrating that with clinical data? And I guess I’m going to ask a multi-pronged question here. Have you done this in health systems outside the US? And do you have any experience comparing and contrasting the data you get from different systems?

Mo Jain: [00:21:04] Yeah, as you can imagine, there’s a couple of different parts to the question, all of which are really important. I’m going to answer the final part first. I firmly believe that humans are all equal, but not identical. And one of the core components is where geographically in the world we live. And you know the famous quote that best summarizes this is that it’s not your genetic code, but it’s your zip code that’s a better predictor of disease. And statistically, that’s absolutely true. Based upon your geographic zip code, we typically have a better handle on your underlying long term risk of disease than anything else. And so certainly geography plays a huge role in this. It’s one of the core aspects of our interaction with the world. And you can imagine geography feeds into everything from the degree of sunlight, the type of water, the type of diet, toxicants that are local to that environment, socioeconomic state and access to health care. There are so many aspects that are fed into underlying geography. And so this is where our ability to broadly sample biological specimens as well as individuals from around the world has been critical. And so as you can imagine, there’s value in identification of universal diagnostic tests that work independently of where you are. If you’re in Sub-Saharan Africa, if you’re in sunny San Diego, if you’re in Western Europe, it doesn’t matter. The test reads out what it should read out.

[00:22:19] And there’s also secondarily value in having localized or population specific tests, not something that traditionally in medicine we’ve done. But if it’s the case that there’s particular exposures that are unique to a given environment, those are a key determinant of a given disease in a particular location. To not sample that information and use it is silly to me. And so we oftentimes are looking for both of these. What you can find is there’s certainly universal realities and universal markers that denote health and disease states over time or drug response. But there’s also geographically localized markers that may be unique in specific populations, owing once again to diet or whatnot that may be unique to that environment. And I think both of them have the value in understanding the human condition. There’s certainly some practical issues and considerations. We’ve been very fortunate to have a number of relationships with top academic medical centers around the world. And the simple answer is that there’s more biological specimens and there’s more data available in the world than people are using. So it’s there if you’re willing to work within the constraints of the legal constraints of accessing it and whatnot. The real challenge is one that I suspect you’re alluding to, that everyone who works in the large data analytics space has learned one way or the other over the last decade, and that is garbage in, garbage out. I don’t care how good your metrics are. If the data is fundamentally not clean, and if you’re not conditioning on high quality data then you’re just leading yourself astray.

[00:23:53] Now, that doesn’t mean data has to be pristine. I’m actually quite a bit of a fan of using real world data because you want there to be noise. When you see signal emerging from that noise, you have much more confidence that that signal is real, as opposed to pristine data that may be present only in a phase three clinical trial. And then when you extend those same markers to the real world, you see that they have less of an effect than they should, simply because there’s now other confounding factors. But in the end, as much as I’m a fan of our technologies and the type of data we generate, having clean phenotype data is absolutely essential. And so we spend quite a bit of time internally here at Sapient, thinking through ways in which we can clean human data. We can QC that data, we can sanity check it, and ensure it’s of the highest quality, otherwise you’re just leading yourself astray. And certainly there’s very, very large data assets out there and data sets out there that are of less than stellar quality. And oftentimes those don’t result in any real meaningful discovery.

Grant Belgard: [00:24:52] Are there any go to external datasets that you’ll look to for validation of what you’re seeing at Sapient?

Mo Jain: [00:25:01] Yeah, it’s a really good question Grant. And this has been one of the challenges from an R&D perspective for us personally, in that when you think about the molecules that are floating around in your blood right now, there’s tens of thousands of these molecules floating around in you, Grant. Somewhere in the order of less than 5% of these have ever been measured, analyzed, structurally elucidated, or understood in any meaningful way, which means more than 95% of what’s in your blood right now is a black box. And so this is where I have challenges sometimes using external sources for validation, because they’re very much couched within that 5%. This is the same as the light pole effect, if you will, where everyone is looking under that same light shade or lamppost at the same several dozen to several hundred molecules, whether it be genetic factors or small molecule biomarkers or protein biomarkers, when the real signal lies outside of that initial light. And I’m a big fan of jumping into the dark, even if it’s sometimes a little bit challenging. And so what this ultimately ends up meaning is that we end up doing quite a bit of validation ourselves, simply because the current publicly available data assets, or even proprietary private data assets, are really not of a nature that allows us to adequately validate, or are simply not powered for true discovery in this space.

Grant Belgard: [00:26:22] What does the future hold for Sapient?

Mo Jain: [00:26:24] Yeah, it’s a great question. The simple answer is I frankly don’t know. There’s obviously many things that we’re hoping to do. I very much believe in our service orientation and really accelerating the drug development process and pipeline together with our sponsors, whether they be large pharma organizations, small and medium biotech, foundations or governmental organizations. There’s tremendous value in that work that we see. And if we can help bring drugs to fruition, well, we’ve had a good day. At the same time as I mentioned earlier Grant, we’ve already generated the world’s largest human chemical databases, and they’re growing exponentially month over month. That provides some very, very unique opportunities that we have an obligation to bring to clinical translation, whether it be around new diagnostic tests, whether it be around better development and designing of clinical trials, whether it be in understanding and bringing means forward whereby we can predict who’s going to respond to particular therapeutics, whether it be developing natural pharmaceuticals, natural product pharmaceuticals themselves de novo. There’s a tremendous amount that we can do with these data assets. And that’s where I think Sapient is certainly going to continue to grow into the future. Now I hope as someone who’s aging hourly, today is actually my birthday Grant. So I’m aging more than [overlap]

Grant Belgard: [00:27:43] Happy birthday.

Mo Jain: [00:27:44] Well thank you, sir. Thank you. I just was alerted to that this morning. I forgot so I’m actually aging faster than I care to admit. But you know, I hope within the next several years to decade, diagnostic testing is completely different than where it is today. We’re no longer measuring two dozen molecules in human blood, but we’re measuring 20,000 molecules in human blood. And from that, being able to provide much, much more nuanced diagnostic information, prognostic information and therapeutic information regarding what’s the ideal way that I need to live my life, what’s the ideal diet, lifestyle and drug regimen that maximizes my personal health over time. And I’m hoping we’re moving to that position very quickly.

Grant Belgard: [00:28:27] So changing topics a bit, can you tell us about your own history, what in your background ultimately led to what you’re doing now? What prepared you for this? What maybe didn’t prepare you for this?

Mo Jain: [00:28:40] I wish I could say it was all very well planned out and deliberate, but you and I know that’s absolutely not the case. And so I trained as an MD-PhD. I was dual trained in medicine and science. My PhD degree is in molecular physiology. And I absolutely loved, loved, loved clinical medicine. It was a privilege to take care of patients. I very much enjoyed my patients. Those personal interactions and being able to help people in their own personal journey was something I was very passionate about as a cardiologist. I was also very frustrated by it, simply because much of clinical medicine is really about regressing to a common denominator or a common mean, whereby everyone with a given disease is treated with a given drug, even though we know that’s just not the way human medicine works, but it’s the best we can do. I was very frustrated by not being able to answer simple questions. When someone asks, well, why did I develop a heart attack at age 40? And what’s going to happen to my kids? And how do I test them for this? And there’s a lot of hemming and hawing that happens from the physician, simply because the real answer is, I have no idea. And there’s simply no testing we have for you. That, to me is very unacceptable.

[00:29:45] As I mentioned, I was training at the dawn of the genomic revolution and I was very much excited by this idea that parallelized sequencing and genomics were going to transform the universe in a meaningful way, that this was all going to change. And at the same time, I was frustrated when I saw that that wasn’t going to actually happen. When we really got down to brass tacks and did the calculations, it didn’t make sense how this was going to work. And so I trained initially in clinical medicine. I spent quite a bit of time in Boston at the Broad Institute, at MIT, at Harvard Medical School, and at Brigham and Women’s. In my scientific pursuits, that’s where I started working in mass spectrometry and large data handling. I spent the better part of the last decade as a professor in the University of California system here in San Diego, where I was privileged to work with really, really bright students and postdocs and faculty members to develop some of these technologies. And now I’ve lost track. I don’t know if it’s the third or fourth career, but this is the next phase or next adventure, whereby we spun out the technology and now I have the privilege of leading this organization and thinking through how we begin to commercialize these technologies and these types of data. So it’s a very wandering path. This is oftentimes the case. I’m excited by big questions. I’m excited by solutions that bring about real change. And I’m still charting that path, if you will.

Grant Belgard: [00:31:08] And what have been the biggest surprises to you personally on your founder’s journey?

Mo Jain: [00:31:14] Oh, boy. How much time do you have Grant? I think there’s some universal surprises that I think anyone who goes through this process learns. You learn how hard it is. You learn that no matter how great your technology is, no matter how unique your data is, this is a people business in the end. And having the right people around the table is really the key to everything. Virtually all questions can be answered if you have the right people. You learn that it’s absolutely an emotional roller coaster. This is something that a number of founders had warned me about, but never really made sense to me that this is something that you’re going to have days where you’re flying high, and then literally an hour later, you’re on the ground in the fetal position, and that high frequency fluctuation is maddening. This is a hard business, if you will. There’s a lot less risky pursuits in life than being a founder and being an entrepreneur. But in the end, there’s frankly nothing more rewarding in my mind. And oftentimes these two things go together. I’m not necessarily a risk taker by nature, but I feel this is what I was meant to do so it makes sense to me in some crazy way that I can’t quite explain.

Grant Belgard: [00:32:27] If you could go back to early 2021 around the founding of Sapient and give yourself some advice, maybe three bits of advice, what would they be?

Mo Jain: [00:32:37] Yeah. Wow. That’s tough. I hope it’s not don’t do this, but I was incredibly naive when we founded Sapient. And I think that’s a good thing. I think sometimes knowing too much prevents people from taking a leap, and leaps require faith, and they require oftentimes blinders. You can’t see the pool if you’re going to jump into it.

Grant Belgard: [00:32:57] You need a bit of irrational optimism, right?

Mo Jain: [00:32:59] That’s exactly right. Anyone who knows me knows I suffer from massive doses of optimism and so I’m not sure if I knew everything I know today. Certainly we would do things differently and whatnot, but I like the fact that I was naive. I think that was really an important aspect of our development, certainly of my own personal growth, but also for the company. Oftentimes coming into an enterprise with bias means you’re just going to do the same thing that the person before you did by the very nature of bias. And not having that experience forces you to question every problem from a very first principle basis and come up with solutions that oftentimes may not be traditional but are in many ways more efficient. I think I would warn myself, if you will, just to answer the question, about how difficult this is emotionally and psychologically, which is not something I appreciated. My job was to take care of people who are dying in the ICU. So I said, well, how hard can this be? And I was incredibly naive and ignorant to just how hard it is. But again, that’s not a bad thing. That’s a good thing. And lots of people had given me the advice of make sure you surround yourself with other founders, other CEOs, people going through this. You’ll need more “emotional support” than you’ve ever needed at any point in your life. And I’ve always liked to believe I was highly resilient and had a strong emotional backbone, but that advice absolutely turned out to be true and in many ways has been the difference maker. And so I’ve certainly sought that out over the last year and have some incredible friends who are in this space, who are in very, very different fields, who help me every day. I wish I had done that a little bit earlier. I think that would have saved a little bit of sanity and probably some gray hairs.

Grant Belgard: [00:34:43] I noticed just before recording that you’re a YPO member.

Mo Jain: [00:34:47] Yeah, that’s exactly right. I’m a true convert. And when I first heard about YPO, I said this sounds nuts. I don’t need another networking event. And it was honestly a very dear friend of mine who we were having dinner with in the early days of Sapient, who was a biotech entrepreneur himself and very successful, who looked over at me and said, hey, I’ve got something you need. And I said, oh man. He explained it to me and it sounded like a mix of a cult and a networking event, neither of which I have the time or energy for. And my wife certainly at the time was like, wow, do you really want to get involved in something else? But I would honestly say, and again I know I sound like a true believer, but it’s been one of the most important things I’ve done for myself personally over the 40 years of my existence. And so it’s been incredibly helpful to learn from people who are very talented and successful at different phases of their life in a very open and honest way. And so it’s had massive impact on me, not only professionally, but personally.

Grant Belgard: [00:35:43] Yeah, I just went to a Vistage event this morning, so I know exactly what you’re talking about. Fantastic. Maybe third piece of advice. I think we’re on three.

Mo Jain: [00:35:53] Yeah, a third piece of advice would be to ground yourself. And what I mean by that, it’s really, really important. As life becomes more chaotic and crazy and I never thought it could get any crazier than it was previously, but somehow we’ve been able to pack more in. It’s really important to understand what your North Star is. It’s important to have those people in your life that you’re present for, whether it be family and children, whether it be spouses and friends, parents, whatever the case may be. It’s really, really important selfishly to have those grounding mechanisms. And again, I always understood how important it was, but not something I appreciated how important it is professionally. I’m a better professional individual and I think I’m a better founder and a better CEO because I take the time now for those individuals in my life and I ground myself, and that’s really important for me personally.

Grant Belgard: [00:36:46] I think it’s fantastic advice. Thank you so much for your time. I really enjoyed speaking with you. I think it’s been a fun discussion.

Mo Jain: [00:36:54] Thank you so much, Grant. I really appreciate the invitation and I equally had a lot of fun today. So looking forward to this again at some point in the future.

Grant Belgard: [00:37:01] Thanks.

The Bioinformatics CRO Podcast

Episode 54 with Evan Floden

Evan Floden, CEO and Co-founder of Seqera Labs, discusses Nextflow, the push for reproducibility in scientific workflows, and his experience as a scientist with a start-up. 

Evan Floden

Evan Floden is the CEO and co-founder of Seqera Labs, the developer of Nextflow.

Transcript of Episode 54: Evan Floden

Disclaimer: Transcripts may contain some errors.

Grant Belgard: [00:00:00] Welcome to The Bioinformatics CRO Podcast. I’m Grant Belgard and joining me today is Evan Floden. Evan, would you like to introduce yourself?

Evan Floden: [00:00:07] Yeah, awesome. Thanks a lot for having me Grant. I’m Evan Floden. I’m the CEO, co-founder of Seqera Labs, previously been building the Nextflow project for the last ten years or so. So I’ve been very interested in following the developments in bioinformatics over that time. It’s great to be on the show.

Grant Belgard: [00:00:23] Thanks for joining us. And I’m sure most of our listeners have heard of Nextflow, but maybe not everyone’s heard of Seqera. Can you tell us about the company and its origins and pulling the strings behind Nextflow?

Evan Floden: [00:00:36] Yeah, absolutely. Nextflow was started by myself and co-founder Paolo Di Tommaso. And really the idea around Seqera was a continuation of the project, but really bringing it to fruition in terms of a commercial sense. So whilst we focused originally on a lot of the work that Nextflow was doing on pipelines, now we’ve expanded out a fair bit from that. So Nextflow we began ten years ago. Seqera has been around for about five years now. We’re really focusing on taking some of the principles that Nextflow has. The idea of empowering scientists with modern software engineering came about from the use of things like containers, the adoption of cloud, really enabling scientists to use those tools and to focus on that. And Seqera is just a continuation of that, but now in a broader sense. So really making the whole bioinformatics pipelines accessible, but going beyond the pipelines as well.

Grant Belgard: [00:01:25] And what’s your business model?

Evan Floden: [00:01:27] Very much focused on bottom up adoption from the open source. So in terms of Nextflow usage, we’re looking at around 100,000 people in total who use Nextflow, and that gives us obviously a really cool base. In terms of business model, it’s mostly focused on selling to enterprises, to organizations, to folks who are scaling up from single bioinformaticians to running things in production and really providing them the infrastructure, the tools that they need to build the pipelines out. And increasingly so the aspects as well.

Grant Belgard: [00:01:58] Have you seen adoption beyond bioinformatics?

Evan Floden: [00:02:00] Interestingly, in Nextflow, yes. Nextflow doesn’t have anything too specific with regards to bioinformatics in the way that it’s written. However, obviously its application is very much being focused and being used in bioinformatics. So we’ve started to see use cases and things. For example, image analysis, you start to see it. For example, satellite image analysis, also radio astronomy. Anywhere there is scientific workloads that have particularly batch component to them. I think an element of that, the user base has developed a lot of content in Nextflow through things like nf-core, and that obviously lends itself to people picking up Nextflow itself and using it for life sciences. But it’s not to say it’s not being used in other areas and obviously we’re happy to support that and see where the community takes that.

Grant Belgard: [00:02:45] How did Nextflow and Seqera evolve? Can you take us back to the beginning and what your thoughts were then and how that’s played out over time?

Evan Floden: [00:02:53] Absolutely. So Paolo and myself were working in a lab in CRG in Barcelona, and our lab was looking at multiple sequence alignment. Folks in bioinformatics may be familiar with some software called T-Coffee. It’s a very commonly used multiple sequence alignment tool that was developed by our former supervisor, Cedric Notredame. And as part of that, the job of Paolo in the lab was to enable us to run those analyses. We were particularly interested in high throughput, so tens of thousands of sequences, and looking at how small variations in those sequences can have an effect on the multiple sequence alignment and the resulting outputs. That was the topic of my PhD and that was what I intended to go and study. Obviously as I got there, I started to spend more and more time on Nextflow and that evolved from there. It was a very small project to begin with. We just published it onto GitHub. I remember, after a year I think we had a list of the ten people who were using it or 30 people who were using it, and it was a very small start. Over time we were able to just continually evolve and adapt it.

[00:03:57] It’s one of the great things about open source: you’re able to get that feedback and people are able to contribute ideas back, issues back, and it allows us to really evolve from there. It’s been a fantastic journey over that time. We got to probably about five years into the project and realized that, first, there was a commercial opportunity, but secondly, it’s something that we both love doing. I was getting towards the end of my PhD and I just really wanted to keep working on the technology. I saw a huge potential. Paolo and myself were traveling around Europe and doing training courses and just really saw the opportunity to take that to the next level. And that’s the spark for creating Seqera and seeing all the opportunity that there was from that, I should say. So since starting Seqera, Nextflow has increased its usage at least tenfold. So I guess there was a slight risk at that time in doing that, but we were pretty convinced on the project and it’s really been the foundation for everything we’ve built so far.

Grant Belgard: [00:04:50] Yeah, it’s gotten very widespread adoption in biotech for sure. I think it’s one of those situations where people will want to use a tool that is nice and robust that a lot of the potential hires they would be looking at have experience with. And I think Nextflow has gotten to that critical mass where it’s not this really niche thing. It’s certainly for people who have been in biotech for a few years, a lot of bioinformaticians have experience with it.

Evan Floden: [00:05:24] Yeah, I think that’s an interesting point on how does something like Nextflow essentially become a de facto standard. It’s an interesting one in that there’s been many groups or many times that folks have tried to create standards, whether this is in academia or in industry bodies and the like. And if we look into parallels in other areas, things like the Docker container is almost the standard for containerization. But that was started by a few folks who had an idea and created a company, and now it has really revolutionized the world of modern software. I think that Nextflow has similar ideas in that we were trying to do something a little bit against the grain, not necessarily sanctioned by anybody. And that almost spurred us on in some sense. But then once that critical mass has taken off, I agree, it’s fantastic that folks can come in, they’ve already got the skills in Nextflow, and then there’s that other whole piece to it, which is what I call the content, but it’s really the pipelines and all of that material which enables folks to take those off the shelf. There’s now things like nf-core, there’s modules. We’re getting up to over a thousand modules there which you can really mix and match as the components of your pipeline and obviously use the framework and the tooling to build that there. And that really can save organizations so much time just even getting started with that analysis. For example, they can add their own module in which is specific maybe for their chemistry on some sequencing, but they can use the rest of the pipeline. Those kind of examples are prevalent and it’s something which I think is possible from having this open science approach to things.

Grant Belgard: [00:07:03] And what’s your vision for the company?

Evan Floden: [00:07:05] At the start we’ve really been focused on the workflow execution piece and I think this is going to continue to be our bread and butter. We still see the challenges exist with regards to scaling generally across bioinformatics, but also across life sciences as well. The volume of data is not decreasing. If anything, it’s increasing. The use cases for sequencing as well as imaging analysis are increasing. The multi-modality of the work which is coming in is requiring almost different approaches. So we focused a lot on that high throughput piece. There’s an element where we have been building up a collection of open products, things like Nextflow. We have MultiQC, which is the most widely used analytic and reporting tool. We also have Fusion and Wave, which are two infrastructure tools which allow folks to run these pipelines at scale. And that’s like a core layer of infrastructure, with the Seqera Platform building on top of that, which is essentially the main product which our customers purchase. And as part of that, that’s the piece that we’re scaling up beyond the pipelines into things like data management, into things like interactive environments and going from there. There’s a lot of platforms which claim to do the same thing. I think we have a slightly different approach to that and that’s I think one of the key differentiators here as well.

Grant Belgard: [00:08:19] And who are your competitors and how are you different from them?

Evan Floden: [00:08:24] There’s been genomics in the cloud. It has been around for a while and there’s obviously been some big players there for a fair amount of time. There’s obviously a whole bunch of newer ones as well who have received funding recently. We still see the biggest competitor in at least the majority of deals is folks building it themselves. They are typically building these systems. You often have people who are, say, familiar with a certain way of doing things and they try and basically do the same thing in the cloud or they want to scale up beyond single users. And we see a lot of customers who purchased the platform have already tried to build their own thing first. So that’s the core competitor that we see in terms of building that out. The other competitors there are the ones I think about as generic genomics in the cloud. They’re really focused primarily on a lot of simplification. And I think that there is certainly a subset of users who do need that simplification. But one of the things that we think about a lot is when we think about the value that we provide, we’re not necessarily just helping people sort of simplify.

[00:09:25] We also are making the science easier and also making it possible to do harder things in some sense. So it’s really about taking modern software engineering, providing those tools to scientists. And it’s a little bit like treating scientists like they are developers and giving them the tools to do harder things rather than to specifically run things in a more simple way. The other aspect of that is that we have our open source roots, and that really means that when customers run a Nextflow pipeline, they run in their environment, they run in their cloud. If you connect up our platform, you connect it up to your cluster. It could be running in Europe and you could connect it up to your Azure instance, which is running on the West Coast. You are moving the workload to where the data is in this case, as opposed to the other way around. So it’s very much more like an open framework and open platform that allows you to connect that, as opposed to more of a walled garden, which you see in the other approaches.

Grant Belgard: [00:10:20] What challenges have you encountered with the dramatic growth you’ve had in your user base?

Evan Floden: [00:10:25] I think the challenge often from an organization side of things is really scaling up: how do you go from a small group of people really building something to being able to replicate that across an org. It’s a lot about investing in folks. Not everyone you hire is going to have a PhD in bioinformatics, and being able to translate those skills and to be able to have that customer empathy and that customer understanding and almost like scientific understanding of the problem is a challenge. And I think that kind of applies a lot. You see in some other organizations where bringing folks in maybe without any life sciences background or ability or willingness to learn in that doesn’t necessarily translate so well. So from an organization perspective, it’s a lot about building that context and building that organizational knowledge and memory to be able to do that. On the user base side, I think we haven’t really had too many challenges, I would say, on that community growth. We’ve been very fortunate that projects like nf-core really came out of the community. They were organic in the sense that there are folks building their own training courses. There are people who have just built so much content around Nextflow, the plug in systems, the AD pipelines on nf-core. That’s really, I would say, happened organically and therefore it hasn’t really involved too much in terms of necessity of resources or work from our side other than really just trying to foster that community and enable those people to solve the problems for themselves.

Grant Belgard: [00:11:55] This is your first company, right? So I guess there have been a lot of new things to learn. What’s surprised you the most?

Evan Floden: [00:12:01] I previously had worked at a startup for 4 or 5 years, which was a very interesting experience. It was very small at the time. The company ended up going public, so I spent some time there doing product development. It was at the bench though, so it was very much a scientific role. I saw there that it was very interesting. However, it was just very slow to do things at the bench compared to what you could do in tech, and my inkling for tech really got the better of me and I went into the bioinformatics field. When I think about how that journey has progressed, and I think particularly the last three years as you start to hire and work, I was surprised at how important the personal relationships have been. I think as a scientist you often think of the world of business or you think of the world of creating an organization. You think it’s very transactional. And when I think about the folks that we’ve partnered with on the investment side or the people that we’ve hired or the partners that we’ve brought on, the customers, those relationships are now, in some cases, ten years old. And I’ve just been so surprised at how important and how deep they have been, just given my maybe slightly naive view coming from a purely academic perspective. So I think that is the key one I always go back to when I think about that.

Grant Belgard: [00:13:17] That’s an interesting observation. I think it aligns with what I’ve maybe seen as a broader perception in academia where actually many things are more transactional in academia than they often are in biotech, although in both contexts those personal relationships are crucial and very, very long lived. Because it’s a very small world. And you often work with the same people for many, many years in many different contexts. I know we very often work with the same people, but at different companies because there’s so much churn, they’ll leave one company and go somewhere else. We work with them there and then someone else from that new company leaves and goes to another, so those personal relationships can actually play a much larger role than the formal relationships with the companies in some cases.

Evan Floden: [00:14:12] Yeah, absolutely. We’ve had one customer who’s on company number three, and he’s buyer number three as well in terms of spreading the word there. So those are the relationships which you think about. They grow over time. I think the community aspect of Nextflow helps a lot with that. We really think that there’s a lot of value you can add through that community, through knowledge sharing to solve those problems with those folks and hopefully bringing some software to them which adds that value. And then obviously as part of that, that can help them broaden and strengthen the relationship. On the academia side, it’s definitely very important. I think particularly around some of the relationships you form with folks like your supervisors across that time. I think those are very special relationships. They can last a long time. I’m not going to go too controversial and try to think about first author ordering, as often happens in some academic papers. Thankfully, I haven’t had too many situations like that, but yeah, I definitely don’t envy that.

Grant Belgard: [00:15:11] For sure. On a completely different topic, how do you think about on site versus remote versus hybrid at Seqera?

Evan Floden: [00:15:19] Yeah. It’s an interesting one for us. We started, and really got our pre-seed funding, in March of 2020. I quit my job at the CRG and I was like, we’re going to do this. In February, I started working at home because we didn’t have an office for about 2 or 3 weeks, and then the rest of the world joined me on that. So that was an interesting transition. It’s like we hired our first people during that. We raised our first money in March of 2020. So that was like being forced into it, particularly in Spain. It was a particularly long and strict lockdown. As part of that, it forced us to be essentially a distributed team from the beginning. And given the focus of the customer bases, which is primarily in life sciences hubs. So you can think of Boston, Massachusetts area, California, so US East coast and offices, some stuff in Cambridge, UK, that was going to always be the central hub for our customers. And we had to deal with that from the beginning. So that was a reality for us and saying, now we try and build ourselves out from hubs themselves.

[00:16:22] So we believe that it’s great for people to be able to get together, if anything, for the social aspect of it and to get to know each other and to build those relationships more than for the actual work, because most folks are going to be in Zoom calls for a decent chunk of the day anyway. So that’s our take on it: we believe in building those relationships. And it’s not easy, though. And I think it’s particularly not easy if you’re a young company and you start like that in a, I would say, non-intentional way. It was definitely not our intention to do things in that sense. It kind of happened and we’ve tried to do the best that we can in terms of managing that, but it’s something that we have learned a lot from. And I guess like a lot of the tools and like a lot of folks, that’s become the new norm for many things.

Grant Belgard: [00:17:10] If you hadn’t gone down the Seqera route, what do you think you would be doing now?

Evan Floden: [00:17:15] Interesting. I don’t really think about that stuff too much. I think that I still see myself as a scientist at heart. I really do enjoy the scientific process. I enjoy discovering things and learning in this way. I could definitely see myself tinkering a lot and I would continue to do that whether that’s in more product development roles or scientific method development. So very much like what we were doing in the PhD. That’s the thing that I really enjoy doing. As part of this role though, I’ve been learning a whole lot of new stuff which also excites me as well. It’s many different things that I didn’t think that I would be doing. So it’s hard to say. I’m very glad that I’ve gone down this path. In terms of what I would be doing in another sense, I struggle a little bit to think about it.

Grant Belgard: [00:18:00] If you could go back in time to give yourself advice in 2018 as you started the company, what would that advice be?

Evan Floden: [00:18:09] The best piece of advice that I’d give myself is, in some sense, really to think about the bigger picture sometimes, because it’s very easy to get drawn into the day to day and the small things. And I think particularly as a company scales, you can often find yourself thinking of those little things. And it’s really only when you step back that you see the growth or the success or the things that really matter. So being able to zoom in, and yes, the small things do matter, like getting those things right is important, but also being able to scale out sometimes. And I guess just getting that balance right is difficult. It’s a very intense job. It’s a lot of hours and it’s a lot of time. I think that trying to get that balance right, I wouldn’t even call it balance. There’s harmony in your life. And by having those different perspectives and also different perspectives on the different elements of your life, that’s the advice I would give myself to try and work on. And I think you can tell from my description it’s something I’m still trying to work on now.

Grant Belgard: [00:19:06] Are there any practices you’ve adopted over time? Having a protected half day a week or something to focus on that? Or has it been a moving target?

Evan Floden: [00:19:15] Very, very much. For me, it’s about routine. It’s the way that I’m able to structure my life. That typically starts in the morning, spending an hour or so with my son before I have to go to work. And then really trying to fit in all the things that I need to do to feel good around the work. So for me, that involves a cycle to work. I need to get to work. So I cycle there. It takes 45 minutes or so and then I do the work and then maybe at the end of the evening I’ll be able to cycle back and then try and fit in those times just to try and make it work in a way where I don’t feel like I’m going too much in one direction. So being able to pull those things together is the way that it works. I do find this is very difficult with travel though. And obviously it makes it very difficult to fit in the routine in that. So I’m trying to be a little bit more structured about that. It’s one of the things I’m working on to improve as well. I guess a lot of folks have similar challenges there.

Grant Belgard: [00:20:10] Do you have a travel system down now? Checklists and things that feel like you’ve optimized?

Evan Floden: [00:20:17] What I’ve been working on now is more around basically having Monday to Friday where I’m trying not to travel during that period. So I will travel on the weekends to different places and I’ll be in a location for a week, even things like staying in an Airbnb if possible, because then you have got a relatively normal house where you can get into those routines, just trying to do that more. That helps me a bit. I still wouldn’t call myself as having a system down or having particularly a way of doing things. I like to have a set up, a structure. So where I’ve got my laptop with a keyboard and mouse, having that set up and structure just helps me a lot as well.

Grant Belgard: [00:20:51] And what things do you find yourself traveling for Seqera these days?

Evan Floden: [00:20:55] Yeah, we’ve got a lot of events that we’re running. And so given the focus in a lot of North America, we’re spending a fair bit of time there. So we have, for example, the Nextflow Summit, which is going to be in Barcelona, but also in Boston this year. So we’ll be spending some time there. We also do Seqera Sessions, which are great events for the community to get together. We’ll often have some talks on technology, things that are coming, product updates, roundtables, this kind of thing for 3 or 4 hours in an afternoon. We’ve previously done those in San Francisco and in San Diego, Boston as well, the hubs, and we’re continuing to build out that. We’ve been doing a few shows around. We’re going to be at ASHG this year and going to be traveling a fair bit around that. And those are most of the areas. We also have, as I said, a distributed team. So being able to spend time with them is really important as well.

Grant Belgard: [00:21:45] Great. We’ll have to have our operations director come say hi. I won’t be going to ASHG, but we will have a booth there.

Evan Floden: [00:21:52] Yeah, folks are absolutely welcome to come and say hi. We’ll get you some Nextflow swag and we’re always happy to give folks a demo.

Grant Belgard: [00:21:58] Nice. What message would you have for our listeners about Nextflow and Seqera? As I said, I’m sure most of our listeners have heard of Nextflow and probably many are users, but for those who haven’t used it before, how would you recommend they get started?

Evan Floden: [00:22:15] Yeah, absolutely. If you’re thinking about running pipelines in a way where you want to run them in your own infrastructure, but you don’t want to deal with the complexity of setting that infrastructure up, then the Seqera Platform is a great way to start out. We have a community showcase where there are collections of pipelines available, where you can log in, select those pipelines, run them, and get a feel for how it works. We also continue to add more options around that on the data management side. So by the time this podcast comes out, we’ll have a data explorer which enables you to browse and search across different buckets, across different object storage that you may have. And we’re also looking to bring out more functionality in the interactive space. So that’s a great place to get started. If you go to tower.nf or if you go to seqera.io, you’ll be able to log in there and find that out. It’s absolutely free to go work with that and give it a go.

Grant Belgard: [00:23:07] Great. And for people who are already casual Nextflow users, how would they best further build their skills?

Evan Floden: [00:23:17] Yeah, I think there are some interesting courses which have come out recently, which have been developed with the community as well as by folks at Seqera, around advanced Nextflow usage. That’s been a really useful set of resources which have been built out. I think being around the nf-core Slack and the Nextflow Slack is always a great place. There are a lot of people doing very innovative things there, and you’re able to connect with them there. And then of course attending the events is always a great way to see that. We have 50 speakers, I believe, across the Nextflow summit events in Barcelona and Boston this year. That includes sequencing companies. Obviously the large cloud providers are all going to be there presenting the latest things. We have customers developing kits. We have customers working in population genomics sequencing projects as well as obviously a whole bunch in biopharma. So that range of use cases can give people a really nice understanding of what other folks are doing. And I think that format as well, where you can really interact with people and go a little bit deeper into the specifics of how they’re solving those problems, is a great way to learn.

Grant Belgard: [00:24:23] So in theory, something like Nextflow would be fantastic for scientific reproducibility, right? Which has obviously been a major issue in the life sciences. But what do you think are the major barriers to adoption of Nextflow for those purposes? Because you usually hear about Nextflow in the context of people trying to do analysis on their own data for their own projects and so on. And it still seems pretty uncommon to see papers published where they have single-button reproducibility anyway.

Evan Floden: [00:25:00] Yeah. And I would point folks to the Nextflow paper from 2017 that we published, which is a little bit of inception here, but we published that paper, obviously using Nextflow, and it really describes a lot of that. And from that git repository you can reproduce everything, calling Nextflow from notebooks. So the ideas of open science, I think they’re worth exploring because they go a little bit beyond just what people consider open source. And open science really is a key part of that. So if you think about open source, it’s almost like it’s a license. It’s like, okay, you put the Nextflow software out there, people can use it, people can do what they want with it, Apache 2.0, etcetera. Open science goes beyond that, and it goes to that point where, as you say, people are for the most part still just publishing papers. But we’re starting to see more and more adoption from folks who are not just publishing papers; they want to publish the paper and the analysis, or even just the analysis in itself. When you want to run that analysis or even reproduce that result, if it’s not going to run on your laptop, it’s going to be very difficult to do. So you’re 100% right that Nextflow enables that piece. It does it in a couple of ways. One is obviously containerization, so that integration of containers means that the environment that the task runs in is essentially the same, byte for byte. The other piece of it is that you can then run those containers in any infrastructure, so you can run them in [] or you can run them in your cluster or you can run them on your laptop.

[00:26:21] That piece then enables people to reproducibly do that and almost validate the result, and that then has a little bit of a flywheel effect. Because if I publish my analysis in that way or my tool in that way, you can then take it and put your data into it as well. And that’s the really important piece I think that Nextflow has enabled there. If we think about that going further, one thing that we’ve really stressed is this idea of empowering scientists with modern software engineering. So you can reproduce the workflow, but how are you going to reproduce the environment that you used to set that up, or how are you going to reproduce the data set that you used? And that’s really what we’ve been working on with Seqera: the whole thing is defined, or can be defined, from the API. There’s a CLI as well. So you can say import this pipeline or define this compute environment in this way, and import and export from that. And it’s treating the whole research environment in a reproducible sense, not just the individual component. This is very much in the vein of infrastructure-as-code setups, where folks have been using things like Terraform for building those environments, and just taking it to the next step specifically for bioinformatics.

Grant Belgard: [00:27:32] What do you think it will take to get that to become standard practice? I mean, there are some individuals and a few groups that routinely will do that. But the majority of the time, it seems these are done by custom scripts that are available upon reasonable request and nobody ever gets them.

Evan Floden: [00:27:55] It definitely is changing, depending on where you are. So if you are developing a new tool, it’s kind of the default. It has to be there. If it was going to be in a paper, the reviewers would essentially have to run the tool and try it out. I think the more you go into different areas, then I agree you’ll see less and less of that compliance. I think it’s probably very much carrot and stick in this sense. Carrot in the way that when you write something in Nextflow, or you write a pipeline or an analysis in a reproducible way, you’re really just doing it for yourself in three months’ time. Because if you’re anything like me, three months after you’ve done an analysis, you come back to it and you have to rerun it because you’ve got a new sample or you’ve got some new parameter. It’s just absolutely impossible to remember how you did it and exactly what you did. So that reproducibility piece is almost for yourself, in a very selfish way. That’s the carrot. The stick bit is coming from the publishers. So our former supervisor, Cedric Notredame, has one of the journals, and as part of that, it’s really about publishing pipelines, publishing things in this way. And it is using standards like nf-core to do that. So you have to publish in a completely reproducible way, you can define exactly what you are publishing, and I can really see us moving towards a situation where the paper is just one artifact of the actual output. However, it’s not the main output. The actual main output, as is often the case, is the actual analysis and the tool. I’d say this is particularly relevant for tool development, which is obviously very, very widely used in bioinformatics.

Grant Belgard: [00:29:31] Nice. So maybe changing gears a bit. Can you take us back to your childhood? What got you interested in science?

Evan Floden: [00:29:40] Yeah. So I was originally born in New Zealand. I spent probably the first nine years there and then, with my family, got the opportunity to live in Malaysia and Sweden growing up for some years. I think in New Zealand it was a very natural environment in some senses. There are obviously a lot fewer people and a lot more nature, which got me interested in bio. And I vividly remember thinking about biology in that sense during high school. I got a little bit obsessed with scientific nonfiction and saw myself really wanting to go into biotech. Bioinformatics at that time was much less prevalent. I guess it was very early for bioinformatics. So that’s what led me to study biotech and then to spend time going into molecular biology. I had a really interesting opportunity for a couple of years as an undergrad working in a yeast laboratory, and what we essentially had was a knockout set of yeast. So you can imagine very large agar plates. Each one of those plates has got a couple thousand samples on it and each sample has got one different gene removed. And you can treat these with different chemicals or you can mate these yeast together, and you can look at chemical interactions or genetic interactions and understand what’s happening there at a genetic level and how it integrates with those pieces. There was obviously a bit of robotics, obviously a lot of yeast culturing and a touch of bioinformatics as well. And I think that’s one of the things that sparked my interest in bioinformatics later on. Although to be fair, I didn’t do that until I went to Italy to study for a master’s there. Bioinformatics wasn’t available in New Zealand at the time, so it was my opportunity to jump into the field.

Grant Belgard: [00:31:20] So you finished your degree in New Zealand in 2010 and what did you do then?

Evan Floden: [00:31:27] I joined a startup. It was a very interesting startup. It was about five people at the time, developing a medical device, which sounds nice and clean, but the medical device itself came from the fourth stomach of sheep. I’m not sure if the listeners are familiar with haggis, which is essentially one of the stomachs of the sheep. It’s a very interesting material. We were trying out lots of different materials and the idea was to see if we could create a bio scaffold, so essentially a tissue which could be used for soft tissue repair in surgery. You would remove the different layers on the top, decellularize it, freeze dry it and essentially end up with a shelf-stable product which could then be used in different applications. So the first few years there I did a lot of product development. We got FDA approval for the basics of the platform and really ended up developing several other products, for example, creating multiple layers of this for breast reconstruction or hernia repair. And I was really just involved in that whole startup phase. It was really exciting. It was really interesting. I saw the determination which was required to create a startup, but I also saw how interesting it could be to work on many different topics and many different things. And that change I really liked, and I was just enthralled by that. And I got my, I guess if I [] place the seed, let’s say, for what was happening with Seqera later on.

Grant Belgard: [00:32:48] And what brought you to Italy then?

Evan Floden: [00:32:50] I really wanted to get into bioinformatics. I think it was something I’d been pushing for and that’s where I got an opportunity. I got a scholarship to do a master’s there. It was a very interesting time. I got to fully focus on that and I knew some basics of programming, but I really got to fully hunker down and spend a good 18 months or two years just purely focused on that. The bioinformatics program in Bologna is quite widely known. We got to do fantastic things. For example, we would build Markov models from scratch, from the individual components, and really got exposed to how machine learning was working in sequence analysis itself. It’s quite a mathematical program, but it really gave me the basis for many of the things that came later on. It’s actually where I met my supervisor, Cedric, and that’s what started the journey into Seqera. I had a little bit of time in Cambridge in the UK in between, working at Rfam, but that was what got me started in that.

Grant Belgard: [00:33:47] Nice. And then after your stint at Cambridge, you went back to Italy, right?

Evan Floden: [00:33:52] It is Barcelona. Yes. Sorry, it’s Barcelona. That’s where I started my PhD, and that’s where I met Paolo. And the story kicked off.

Grant Belgard: [00:33:59] Nice. And then afterwards you stuck around in Barcelona at the CRG. Were you working at all with the CRG during your PhD?

Evan Floden: [00:34:07] Yeah, so my PhD was at the CRG. It’s a research organization, but technically you’re part of a university as well. Although you spend the whole time in the research organization, it’s more of an affiliation so that they can provide you with an academic degree. Yeah, really interesting place. It’s one of the leading biomedical research centers in Southern Europe, in a fantastic location as well, and very international. And it provided a fantastic opportunity to learn there and be surrounded by smart people. And obviously it’s what we’re doing.

Grant Belgard: [00:34:39] And does Seqera have a formal relationship with CRG, or is it kind of just another institute where there are a lot of Nextflow users?

Evan Floden: [00:34:49] Obviously, CRG being home of Nextflow, let’s say the original home of Nextflow, there’s always a special relationship there. The usage of Nextflow is obviously very wide in the organization there. We consider ourselves like a spinoff of the organization. And so the relationship stays special in that way.

Grant Belgard: [00:35:09] That’s great. And do you have any advice for our listeners who might be scientists who are considering the entrepreneurship journey?

Evan Floden: [00:35:19] Yeah, it’s a hard one in the sense that you don’t know until you really jump off the diving board. I found it personally to be very rewarding and very fulfilling. As I said in relation to your question before, I can’t really imagine myself having not done this or doing something else. At the same time, I fully admit it’s not for everybody. There are a lot of sacrifices you make in other aspects which are difficult. It’s a way that you can have a very fulfilling role, a very fulfilling job. And for me, being driven by the impact of it, I think it’s just the way that I felt that I would best be able to build something that would scale and that would have the most impact. I think that one of the reasons behind Seqera at the beginning was really just to spread that. I was one of the first couple of users of Nextflow. It really changed how I was working and I wanted to put that into the hands of as many people as possible. I feel the same way about what we’re building at Seqera. There’s great technology which we just want to put into the hands of scientists to help them work. That entrepreneurial journey, for me, is really just the way to get that done. It’s the way that it can manifest, I would say.

Grant Belgard: [00:36:26] So if you look forward ten years from now, what would you consider a success for Seqera?

Evan Floden: [00:36:33] We really want to see ourselves as, first and foremost, having helped a thousand biotech and biopharma organizations really reach their own goals. And for them, that’s usually outcomes in patients. We want to see biotech continue to grow. We want to see the adoption of those technologies. We want to see things like personalized medicine become available to people. We want to see the promise of genomics technology become a reality. That’s the first. I think that we can play a really important role in making the data analysis part of this accessible, available, and open, and build the bioinformatics tool framework in the world that we want to see. From an organization perspective, one of the things I really would love to see is that from Seqera, we almost create our own ecosystem as well. So whether that means employees who create their own things or really new projects which sprout from the Nextflow ecosystem, really seeing that gives me a lot of satisfaction because it shows that you can start one thing and it can really flower into a whole bunch of other areas. Myself personally, in ten years I would just love to be obviously healthy, still enjoying the job and really hopefully having made as much impact as possible in those areas.

Grant Belgard: [00:37:49] Well Evan, thank you so much for joining us today. It’s been a nice conversation.

Evan Floden: [00:37:53] Awesome. Thanks a lot, Grant. [See anytime] and folks, if you do want to join us at Nextflow summit, both Barcelona and Boston are still open. We’d love to see you there and thanks so much for the time.

Genome sequencing analyses of “alien” Nazca mummies do not support an extraterrestrial origin

Context

On September 12, researchers presented 2 Nazca mummies to the General Congress of the United Mexican States. They report that these mummies are definitively non-human (but humanoid) biological entities and showcased evidence from genomic studies and anatomical analyses. If their conclusions are correct, this indicates an important change in our understanding of biology and taxonomy – to say nothing of the more sensationalist claims that the mummies are extraterrestrial in origin.

José de la Cruz Ríos, Jaime Maussan, and José de Jesus Zalce Benitez presented their findings in Mexico City. Genomic reads from 3 samples have been submitted to the NIH Sequence Read Archive (SRA) by a researcher affiliated with the Universidad Nacional Autónoma de México who performed some of the genetic analysis presented in the hearing. The researchers say they sent the samples to multiple groups for genetic analysis, none of whom appear to have been directly involved in the hearing before the Congress.

Previously known information on the mummies

All 3 of the presenters previously presented evidence on non-human humanoid Nazca mummies before the Congress of the Republic of Peru in 2018. Based on the testimony provided before the Mexican Congress, these seem to be a subset of the same mummies previously shown. The mummies have been extensively examined by a team that documents its results and sells DVDs with further information on www.the-alien-project.com. The mummies have 3-fingered hands and feet, elongated skulls, and spinal connections in the center of the skull.

An overview of a thorough debunking of these specimens can be accessed at https://antropogenez.ru/review/1119/. In non-exhaustive summary: X-ray experts independent of the mummy discovery group say that the specimens consist of human bones intentionally placed into an arrangement which makes no biological sense. The elongated skulls on the ‘humanoid reptile’ type, presented to the Mexican Congress, are hypothesized to be partial skulls from other mummified mammals contemporaneous to true ancient Nazca mummies.

During the presentation before the General Congress of the United Mexican States, the researchers indicated that these specimens were discovered in 2015, at the same time as specimens which have already been presented in Peru. The Mexican Congress specimens are identifiable as the “humanoid reptile” type presented on the Alien Project website.

An object described as a humanoid reptilian mummy. It is covered in white paste or powder, with 3 fingers on each hand and 3 toes on each foot. It is very elongated.

Here is “Alberto”, a specimen previously described as a humanoid reptile.

A screenshot of a YouTube video titled Mexican Congress UAP hearing - September 12 2023. It shows an object identified as a humanoid mummy inside a box. It is covered in white paste or powder and is very elongated.

Here is one of the non-human humanoids presented before the Mexican Congress on September 12. The resemblance to “Alberto” is clear.

Preponderance of unaligned reads consistent with low yield extractions from ancient samples

The SRA samples provided have the same base count, GC content, and sample identifiers as samples discussed in an Abraxas Biosystems consulting report from 2018, uploaded by the Alien Project on their website. These data indicate that the Abraxas samples and SRA samples are the same – particularly the identical base count. The Abraxas Biosystems report describes sample Ancient002 (“sample 2”) and sample Ancient004 (“sample 4”) as being from different locations (bone and tissue) on the same mummy, called “Victoria”. “Victoria” is a headless humanoid mummy, and not one of the ones presented to the General Congress of the United Mexican States. Sample Ancient003 (“sample 3”) is described as a separate hand.

Each sample in the SRA has a BioSample accession, and all 3 samples were identified by the submitter as human. Samples Ancient002 (“sample 2”) and Ancient003 (“sample 3”) are identified as bone, and sample Ancient004 (“sample 4”) is identified as muscle tissue. GC content of the samples ranges between 39.7-46.4%, which is not inconsistent with the range of GC content in human DNA. Native SRA taxonomy analysis is available for each of the 3 samples. Sample 2’s 39.7% GC content is relatively low for human DNA, but is more typical of legumes, for example – more on this below.
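For readers unfamiliar with the metric, GC content is the share of G and C bases among the called bases in a set of reads. The sketch below is only an illustration of that definition, not the method used by SRA or in the Abraxas report. The function names and toy reads are hypothetical, and skipping ambiguous bases such as N is an assumption.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G/C bases among unambiguous A/C/G/T bases in one sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases (e.g. N) are ignored: an assumption
    return gc / total if total else 0.0


def pooled_gc_content(reads) -> float:
    """GC fraction over all reads pooled together, weighted by read length."""
    gc = total = 0
    for read in reads:
        seq = read.upper()
        gc_bases = seq.count("G") + seq.count("C")
        at_bases = seq.count("A") + seq.count("T")
        gc += gc_bases
        total += gc_bases + at_bases
    return gc / total if total else 0.0


# Hypothetical toy reads, not taken from the SRA samples
toy_reads = ["GATTACA", "GGCCNN", "ATAT"]
print(f"{pooled_gc_content(toy_reads):.1%}")  # 40.0%
```

On real data the same calculation would be run over every read in a FASTQ file, which is how a per-sample figure such as 39.7% or 46.4% arises.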

An image of SRA taxonomy analysis titled WGS Ancient0002 (SRR21031366). 72.07% of reads are identified, 27.93% unidentified. 58.98% are Eukaryota, with 42.89% Phaseolus vulgaris and 3.18% Homo sapiens.

42.89% of reads in sample 2 are confidently assigned to Phaseolus vulgaris, the common bean. This is most easily explained by sample contamination or construction of the putative bone fragment from a bean derivative.

An image of SRA taxonomy analysis titled WGS Ancient0003 (SRR20755928). 97.38% of reads are identified, 2.62% unidentified. 91.89% are Eukaryota, with 82.03% Hominoidea and 30.22% Homo sapiens.

SRA taxonomy analysis confidently assigns 97.38% of the reads in sample 3 to known taxonomic categories. Only 30.22% of reads can be confidently assigned to Homo sapiens, which can initially seem like an indication of some DNA of non-human origin. However, let’s compare this to an SRA taxonomy analysis of a known high-quality human sample.

An image of SRA taxonomy analysis titled Target-capture of bone marrow or peripheral blood samples in AML patients (SRR24975192). 93.15% of reads are identified, 6.85% unidentified. 92.66% are Eukaryota, with 51.99% Hominoidea and 12.04% Homo sapiens.

Here, we see that only 93.15% of reads can be confidently identified – this is actually lower than the percentage of identified reads in sample 3. And only 12.04% of reads are confidently assigned to Homo sapiens – much lower than the 30.22% which can be assigned in sample 3. In this context, sample 3 is almost certainly human DNA. The Abraxas report, discussed earlier, also identifies sample 3 as containing human DNA, and further specifically as a human male.

An image of SRA taxonomy analysis titled WGS-ancient 004 (SRR20458000). 36.28% of reads are identified, and 63.72% are unidentified.

63.72% of reads in sample 4 are unidentified. This is most easily interpreted as a quality control issue of some kind – potentially caused by sample contamination, or very low-quality data.

The Abraxas report discusses the bioinformatics work that was done to match sample 4 reads to known genomes. Of note, 304,785,398 overlapped reads – a further processing step which the reads uploaded to SRA have not undergone – did not match to any of the tested genomes. However, after removing duplicate reads, this number was reduced by roughly a factor of 10 to 30,823,217.
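To make the effect of duplicate removal concrete, here is a minimal sketch. It collapses exact-duplicate sequences in a hypothetical toy read set and then computes the fold reduction from the two counts quoted above. Treating duplicates as exact sequence matches is an assumption; the report's actual deduplication method is not described here.

```python
def deduplicate(reads):
    """Collapse exact-duplicate reads, keeping the first occurrence of each."""
    return list(dict.fromkeys(reads))


def fold_reduction(before: int, after: int) -> float:
    """How many times smaller the read set is after deduplication."""
    return before / after


# Hypothetical toy reads, only to show the mechanics
toy_reads = ["ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "CCAT"]
unique_reads = deduplicate(toy_reads)
print(len(toy_reads), "->", len(unique_reads))  # 6 -> 3

# Read counts quoted from the Abraxas report
print(f"{fold_reduction(304_785_398, 30_823_217):.2f}")  # 9.89
```

A reduction of this size means that, on average, each distinct unmatched sequence appeared about ten times in the raw data.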

Continuing this analysis, they assembled the unique unknown reads for sample 4 into contigs. 65.69% of the unmapped reads were successfully assembled and re-matched to known organisms in the NCBI nt database. 97% of the assembled contigs were successfully matched to sequences in the nt database.

To summarize, the reads in sample 4 which could not be matched to tested species are on average highly duplicated reads. When duplicates were removed and the remaining unknown reads assembled into contigs, it resulted in the ability to match 64% of these remaining unknown reads to a database of known organism sequences.
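One reading of where the 64% figure comes from is that it combines the two percentages quoted from the report. The sketch below works through that arithmetic. Multiplying a fraction of reads by a fraction of contigs assumes reads are spread evenly across contigs, which is a simplification made here for illustration and not a claim from the report.

```python
def matched_read_fraction(assembled_fraction: float, matched_contig_fraction: float) -> float:
    """Approximate share of unknown reads that end up in a matched contig.

    Assumes reads are evenly distributed across contigs (a simplification).
    """
    return assembled_fraction * matched_contig_fraction


# Percentages quoted from the Abraxas report
estimate = matched_read_fraction(0.6569, 0.97)
print(f"{estimate:.1%}")  # 63.7%
```

Under that assumption, roughly 64% of the deduplicated unknown reads trace back to sequences already present in the nt database.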

The Abraxas report concludes with an acknowledgment that the NCBI nt database does not contain all sequences for all known organisms, and it is therefore certainly possible that the unidentified DNA reads are from already known (and therefore terrestrial) organisms which are not in the database.

The SRA taxonomy analysis figures still seem evocative, though – 64% unidentified? However, we can see that this is not unusual even for unambiguous ancient human DNA.

An image of SRA taxonomy analysis titled Malta-WGS-BAM-Xaghra8 (SRR17043540). 43.61% of reads are identified, and 56.39% are unidentified.

SRR17043540 is from a study into ancient Maltese genomes, and we can see that SRA taxonomy analysis gives 56% unidentified reads for this sample.

An image of SRA taxonomy analysis titled HiSeq X Five paired end sequencing (ERR4863252). 68.73% of reads are identified, and 31.27% are unidentified. 11.04% are Homo sapiens.

ERR4863252 is a sample from a single ancient human individual from the location corresponding to present-day France. Although the majority of reads in this sample are identified, 31.27% of reads are still unidentified by the SRA taxonomy analysis. And only 11.04% are confidently assigned as human.

Conclusion

So, after a review of the context surrounding the Nazca “alien mummies” and the genetic data presented as evidence of non-humanity – what conclusions can we draw? It seems clear that the genetic data is not conclusive evidence of non-human origins. Combined with the problems with the X-ray evidence espoused as proof of alien morphology – the Nazca mummies are not convincing. They may be assembled from ancient materials, but they are not ancient alien bodies.