The Bioinformatics CRO Webinar Series
February 18, 2026: Phil Ewels – Reproducible Bioinformatics at Scale: nf-core + Nextflow

Phil Ewels is Product Manager for Open Source at Seqera. He holds a PhD in Molecular Biology from the University of Cambridge, UK. Phil joined Seqera in 2022, previously working at the National Genomics Infrastructure (NGI) at SciLifeLab in Stockholm, Sweden, where he became involved in the Nextflow project and co-founded the nf-core community. Phil’s career has spanned many disciplines from lab work and bioinformatics research in epigenetics, through to software development and community engagement. He is passionate about open-source software and has a soft spot for tools with a focus on user-friendliness. He is the author and maintainer of tools like MultiQC and SRA-Explorer, and helps lead the nf-core and Nextflow development teams.
In this live webinar, he gives an overview of Nextflow and an introduction to some of its new and exciting features for bioinformaticians looking to scale up their pipelines.
Transcript of The Bioinformatics CRO Webinar Series – Reproducible Bioinformatics at Scale: nf-core + Nextflow
Disclaimer: Transcripts may contain errors.
Grant Belgard: Welcome to the final talk in The Bioinformatics CRO webinar mini-series. At The Bioinformatics CRO, we help life science teams turn complex data into clear, decision-ready insights, providing flexible expert bioinformatics support from study design through to analysis and reporting. As part of that mission, this webinar series features practitioner-focused talks with concrete takeaways you can put to work right away. Today’s talk is by Phil Ewels. Phil is a senior product manager for open source software at Seqera, where he helps lead the nf-core and Nextflow development teams. Today Phil will be presenting on reproducible bioinformatics at scale: nf-core and Nextflow. After the talk, we’ll host a live Q&A session. This is streaming both to YouTube and LinkedIn, and on either platform you can put your questions in the chat or the comments at any point during the talk and we’ll bring them into our discussion afterwards. Phil, over to you.
Phil Ewels: Thanks very much for the introduction and thanks Grant for the invite to come and speak today. It’s a pleasure to be part of this webinar series, and it’s always nice to have the opportunity to talk a little bit about Nextflow, a topic close to my heart. My talk today is in two parts. I’m going to give a bit of an introduction to what Nextflow is and what nf-core is, why I think they’re good and useful for you and why I think you should care, and then I’ll talk a little bit about some of the new features which have come out for Nextflow in the past year or so. This is particularly good for anyone in the audience who has maybe dabbled in Nextflow a little while ago, because things are changing quite a lot, and for the better. So I hope I convince you to really pick up Nextflow and see if it could help you in your work. My name is Phil, and I worked originally in the lab, then became a self-taught bioinformatician and slowly moved from research into core labs. I worked at the National Genomics Infrastructure in Sweden, developing new lab techniques and analyses, and started getting into software design almost accidentally. I started writing pipelines and had my own pipeline tool; it was all the rage 10 years ago. I wrote software like MultiQC, which I imagine many people will be familiar with, and got into Nextflow probably about eight years ago or so while I was in Sweden at the NGI. We were running huge numbers of samples. It was a real step up from my previous work in Cambridge: now we were running hundreds of projects and thousands of samples, and the software I’d used previously wasn’t really up to the task.
So I looked around and found Nextflow, and we started building lots of different pipelines. Because we were a team of about eight people building pipelines, we started to standardize, and nf-core was born out of that standardization of our pipelines. I’ll talk a little bit about what made that possible.
Phil Ewels: The background to the whole picture of why Nextflow exists is these classic statistics from a Nature paper, quite old now, about 10 years ago. The statistics resonate with many of us working in bioinformatics: there is a reproducibility crisis, where it’s famously difficult to reproduce the experiments that you find in published materials. Even reproducing your own experiments, for what I refer to as one of your most important colleagues, future you, is notoriously difficult to do, and reproducibility is the foundation of the scientific method. So we were in a bit of a bad place 10 years ago: data was really starting to scale, NGS was taking off, we had more data than we knew what to do with, and we couldn’t really reproduce the analyses that we were doing, and certainly we couldn’t transfer those analyses to other people. And it’s not surprising, because it’s a really difficult problem. We’re running many different tools, each of which might have numerous complex dependencies. Everyone’s running on a different system, and often, even if it works on your machine, it might not work at a collaborator’s. Everyone is doing things in their own way, and there was very little in the way of provenance, of knowing where data came from when your supervisor sends you an Excel spreadsheet with some results in it. So Nextflow set out to basically try and provide an answer for this. It’s a workflow orchestration tool: it takes your analysis pipeline of multiple different steps and puts it together into a language. It’s quite a unique syntax, flow-based programming, which makes sense for what it’s doing. It’s got some real key features which make it very popular. Something that’s really important is built-in support for software packaging. Docker was very new when Nextflow first launched, and Nextflow supported it almost right away.
And so you can package individual tools at the level of single processes within your pipeline, so the software effectively comes built in with the pipeline. End users don’t have to worry about installing 20 or 50 different tools every time they run a new pipeline, and all those versions are pinned, so you know you’re always running the same version of the software when you run that version of the pipeline. It’s multi-platform: Nextflow supports lots of different what it calls compute environments, and it can submit jobs basically anywhere you can run computing. One of its most popular features is the ability to resume. Nextflow keeps quite a clever cache of completed tasks, so if you lose power halfway through a run that’s been going for three or four days, you haven’t lost everything. With the -resume flag, Nextflow is able to look back, understand which tasks already completed successfully, and pick up where it left off. It’s massively scalable. I’ve put thousands here, but really it goes up to millions of jobs; we’ve seen truly enormous workloads passed through Nextflow, and it’s able to scale to really massive volumes of data. And in the last 10 years it’s grown an extremely active ecosystem and community, which is one of the most attractive things about the system really: there are lots of other people building with it. Okay, so for those unfamiliar with Nextflow, how does it work? What does it do? There are basically a few different steps to building a pipeline and running it. Firstly, you define processes within your Nextflow code, which are the building blocks. A single process usually corresponds to a single tool. You say what the data inputs are, what the expected outputs are, and then you have a script, which could be a bash command.
It could be a Python script, an R script, it can be anything really, but it’s resolved on the fly and then submitted to your compute environment to be run as a single task. So you describe all the different processes in your pipeline, and then you link them all together with what Nextflow calls channels, which is the data-flow aspect of Nextflow. Nextflow handles all the data flow automatically when you run the pipeline, along with all the dependencies and all the parallelization, so that when you describe this flow, Nextflow automatically figures out how your pipeline should be run. Then once you have the pipeline logic and the code written, you have a separate step, which is to write configuration. The configuration is, importantly, separate from the pipeline code, and this is where you describe your specific setup: your HPC, your cloud compute credentials, your laptop, whatever. Once it’s configured, you’re ready to go, and you can execute it wherever you want to, basically.
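To make that concrete, here is a minimal sketch of what a process and workflow look like in Nextflow. The tool, file patterns, and container tag are purely illustrative, not taken from the talk:

```nextflow
// A process wraps a single tool: declared inputs, expected outputs,
// and a script that is resolved on the fly and run as one task.
process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'  // software pinned per process

    input:
    path reads

    output:
    path '*_fastqc.zip'

    script:
    """
    fastqc ${reads}
    """
}

// Channels carry the data flow; Nextflow works out the dependencies
// and the parallelization (here, one task per matching file) itself.
workflow {
    reads_ch = Channel.fromPath(params.reads)
    FASTQC(reads_ch)
}
```

Note that nothing in this file says where it should run; that lives entirely in the configuration.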
Phil Ewels: The really key point, if you remember nothing else, is that Nextflow is not just one thing, it’s several things. It’s a language: a code syntax which is designed for describing workflows, both the steps and the data flow within them. It has configuration separate from that syntax, so that you can separate the logic of the pipeline from how the pipeline should run. And it’s also an orchestrator: the actual job that you run, which parses that code, understands it, and runs the pipeline for you. So it’s both a language and an executor. The two things that Nextflow brings are reproducibility, that you can run the same workflow years apart, and as long as you run the same git-versioned pipeline code, which has pinned the exact same software for every step, and you’re using the same version of Nextflow, you’re almost guaranteed to get exactly the same results out, which is really fantastic. And the other thing is this idea of it being portable: I can write one pipeline and share it with different people in different places running on different systems, and they can write their own config files, but the pipeline code stays unmodified. So for the first time really, when Nextflow came out, it was possible to write one pipeline and run it anywhere, which now seems kind of obvious 10 years in, but at the time these two facets of Nextflow were really revolutionary and groundbreaking.
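As a sketch of that separation, site-specific settings go in a `nextflow.config` that sits alongside (or overrides) the pipeline; the profile and queue names here are illustrative assumptions:

```nextflow
// nextflow.config: describes where and how to run, never what to run.
profiles {
    laptop {
        process.executor = 'local'
        docker.enabled   = true
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'long'   // hypothetical queue name
        singularity.enabled = true
    }
}
```

Running `nextflow run <pipeline> -profile cluster` then executes the unmodified pipeline code on the cluster, and adding `-resume` picks a cached run up where it left off.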
Phil Ewels: And so what this means, and this touches on the concept of scalability which was in my talk title, is that you can write a single Nextflow pipeline and test it out on your laptop with one small test sample, and once you’re happy that it’s working properly, you can scale that same pipeline up, without touching the pipeline code, to thousands or millions of samples. You can also scale up the compute it’s running on, from just your laptop to maybe a Slurm cluster somewhere, or basically any kind of cloud computing: AWS, Azure, Google, Oracle. Because of the way that Nextflow is structured and architected, it’s able to handle that scaling and basically grow with your needs.
Phil Ewels: Nextflow has become massively popular because of this. The figure on the left is from a recent paper that we did for the nf-core community and shows the number of citations for different workflow managers. Citations are a bit of a lagging metric, but you can see that Nextflow has become more and more popular over recent years. Then on the right we have the number of runs, and you can see there are hundreds of thousands of runs of Nextflow pipelines every day, and this is probably undercounting it quite a lot as well. So Nextflow is arguably one of the most run workflow managers, certainly in life sciences.
Phil Ewels: So that’s Nextflow: a quick introduction for those who are unfamiliar, covering how it works and why it was built the way it is. For the first time, Nextflow was able to give us workflows which were portable between different systems. Back in 2017-18 we had this kind of light-bulb moment. Up until then, everyone wrote their own RNA-seq pipeline, wherever you were, in your core facilities, in your labs. You had to, because other people’s pipelines didn’t work on your system. They had hard-coded paths, or maybe the environment module system used different names for the software. All these different things made it very difficult to collaborate. But Nextflow suddenly removed those blocks: we could now share code for running pipelines, and we didn’t all need to write our own. So back in around 2017-18 I started nf-core with some collaborators and friends. We took the standards that we’d built in Stockholm and opened them up to the wider world, and based on those principles we founded this Nextflow community called nf-core. nf-core has exploded in popularity alongside Nextflow; the two have formed a very symbiotic relationship. We now have over 140 different pipelines, which is astonishing when you bear in mind that one of our key guidelines is that we only have one pipeline per data type or analysis type. We have only one RNA-seq pipeline, so that’s 140 different types of data analysis that we have pipelines for. In recent years we’ve also grown to be more than just pipelines. I’ll touch on this in a second, but we also now have shared modules, which are basically individual processes within a pipeline. These are themselves shared and can be reused across different pipelines, including pipelines outside of nf-core. Every one of those is a different tool, and it comes bundled with its commands, its usage, its software containers, and everything.
And there are now over 1,700 of those, and that number is growing really, really fast. Then we have a community Slack, where we have channels for every different pipeline for discussions. We have a core team and a maintainers team and some level of governance within that, and we have going on 14,000 community members in Slack now. So it’s an extremely active community, and of course it’s built on this concept of best practice. Nextflow is a programming language and you can write your Nextflow pipeline in basically any way; there’s huge variability in how you do that. nf-core takes a very, very opinionated stance and says: if your pipeline is going to be part of nf-core, it has to be written exactly this way; you have to use our template; you have to do things our way. The reason we do that is that it makes it possible for components to be interchangeable and for folks to be able to collaborate. So: standardized tooling, best practices, and a lot of documentation.
Phil Ewels: One of the things that’s quite unique about nf-core versus other pipeline and software registries is that one of the requirements of adding a pipeline to nf-core is acknowledging that it’s not owned by you anymore; it’s community owned. This is another figure from that recent paper, and I really love these plots. This is for the small RNA-seq pipeline, which we actually started in Sweden before the origin of nf-core, and the top green bar is SciLifeLab. You can see that we were the sole owners, maintainers, and contributors to start with in 2017-18, and then more and more different organizations joined in with maintaining and contributing to the pipeline. SciLifeLab actually stopped contributing to it around 2022, but the pipeline lives on, because the pipelines are community owned. They don’t suffer from this problem of a PhD student finishing a PhD, moving on to a different position, and the software getting abandoned. Because it’s community owned, we can build updates in based on community consensus and bring in volunteer work from groups across the world. That’s a real superpower for nf-core.
Phil Ewels: I also want to touch on the fact that nf-core is not just pipelines anymore. This modules library and the tooling that we build for nf-core are deliberately done in such a way that you can use them for any Nextflow pipeline. This is the nf-core CLI I’m showing on the right, and it has a TUI, a terminal interface, which you can use to create new pipelines. That very rapid example there is creating a new pipeline which is not part of nf-core, and you can choose which of the features from the template you want. So you can make it very, very minimal, or you can have everything that nf-core comes with; it’s up to you. Once you’ve got your pipeline, you can go into it and use the tooling again to pull in these shared modules from the community repository. Here I’ve pulled in samtools sort and BWA-MEM, and it fetches those modules, that code with everything that comes with it, and pulls it into my pipeline. Really, all that’s left then is to connect those channels I mentioned: I’ve got the building blocks of my pipeline provided for me by the community, and I just need to put them together. So nf-core tooling provides a fantastic starting place for anyone building their own Nextflow pipelines. You can mix and match, and you’re building on community best practices. You’ve got all the learnings of thousands of scientists using Nextflow over many years, and you can be sure the modules are well tested and being used by many other people. So you’re benefiting from a huge pool of community knowledge.
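For a flavor of what using a shared module looks like in your own pipeline code: once the nf-core tooling has fetched a module (for example with `nf-core modules install samtools/sort`), it lands in a standard layout and wiring it in is just an include. The exact input channels each module expects are documented per module and per version, so treat this as a sketch:

```nextflow
// Shared nf-core modules land under modules/nf-core/ in your pipeline,
// each bundled with its container, declared inputs/outputs, and tests.
include { SAMTOOLS_SORT } from './modules/nf-core/samtools/sort/main'

// All that's left is connecting channels to the module's declared
// inputs inside your workflow block, as described above.
```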
Phil Ewels: Okay, that’s it for my introduction. So next I’m going to touch on some of the new developments in Nextflow.
Phil Ewels: Nextflow itself is developed at Seqera, and we have a team of engineers working on it. Over basically the last year or so we’ve had some pretty major projects, based on the community survey that we do; we try to do one of these almost every year. For as long as I can remember, people would always say that they love Nextflow, but they find it really difficult to work with. The error messages are unhelpful, the syntax is confusing, and there’s none of the nice stuff that people are used to when they work with other programming languages. I myself write a lot of Python; MultiQC is written in Python, for example. So I absolutely sympathize with these requests. We really went back to the drawing board with Nextflow about a year and a half ago and asked how we could solve these problems, and basically we took on a really massive project: we completely rewrote how Nextflow understands Nextflow code. In the past, Nextflow was what’s called a Groovy DSL, a domain-specific language. The way it worked was that you wrote your Nextflow script, that was basically cross-compiled into Groovy code at runtime, and then the Nextflow engine would run that Groovy code. That’s still kind of the case, but now we have a new language parser which takes your Nextflow code and is able to natively understand the syntax that you’ve written. This really changes the game for us in terms of what we’re able to provide for developer tooling, for error messages, and things like this, and means we’re moving away from the days of Nextflow being a Groovy DSL; Nextflow starts to become its own native language.
Phil Ewels: One of the first things that was possible with this was that we launched a language server, an LSP, and we incorporated that into the VS Code extension, which is probably the best way to write Nextflow code. Suddenly we were able to bring the developer experience for writing Nextflow code in line with the other languages that you might be used to. The simplest thing is error reporting: just being able to see in real time, as you write your code, that something’s wrong, rather than having to hit save, run the pipeline, and then try to figure out where the bug is after you’ve been writing code for half an hour. There’s quick navigation and auto-formatting of your code, so you don’t have to argue about whitespace and things like this. But a picture’s worth a thousand words, so let’s have a quick couple of examples of what I mean. This is one of the simplest things, but probably the most impactful: the little wiggly lines that you now get when you’re writing Nextflow code. Here you can see that the red line is telling us that the variable is unknown and not defined, and the clue is just above it, where we have defined a variable called locations, with an s. We’re also getting a warning there that we’ve defined a variable and it’s not being used anywhere, and these hints are shown as you write your Nextflow code, so it’s a huge productivity boost. There are features like tooltips over every Nextflow language item: when you hover over, in this case, a channel factory, but it can be any part of the Nextflow syntax really, you get a short description of what that is and what it’s doing, and there’s a link underneath to read more, which takes you straight to the Nextflow docs. There are also special little buttons that pop up in certain places in your workflow.
So if you have valid syntax, your top-level workflow now has this button saying preview DAG, or D-A-G. You click that and it will show you a Mermaid diagram of your whole workflow in the sidebar, right there in VS Code. This is a great way to get to grips with a new workflow which maybe you haven’t worked on before and are inheriting from someone else. I had about six of these slides in, but I thought they were a bit too much, so I pared it back down. This is just a taste; there are many different things like this now built into VS Code. So if you’re writing Nextflow code now, it’s just vastly better than it was a year or more ago, and if you’ve tried in the past, I recommend having another go now and seeing if the experience is better. Along with the language parser, another thing this allows us to do is actually change and develop the syntax of Nextflow itself. Before, we were limited by what the Groovy-based Nextflow language parser could handle, but now, because we have a totally separate step, we can develop the language however we want, and we’re bringing out several improvements as a result. Something that’s been asked for for a long time is static types. Here we have some input parameters for a pipeline. On the left is the traditional way to do it: you say a name and a default value, and that’s that. Nextflow just tries to typecast things on the fly based on the values it’s been given, which sometimes leads to problems. For example, if you have a sample name as a string, but the sample name is given as something with leading zeros, it gets converted to a number. On the right-hand side, you can see we’re defining the types of each parameter, whether it’s a path, a boolean, an integer, a string, and so on, and Nextflow will then strictly typecast those things on input and also validate that the values given are correct.
So you’ll get immediate validation and errors if you try to launch a Nextflow pipeline with the wrong kind of input. These kinds of things are small changes to the syntax, but they really make a huge difference to the reusability of Nextflow pipelines.
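As an illustrative sketch of the newer typed style (this syntax is still evolving, so check the current Nextflow docs for the exact form; the parameter names here are made up):

```nextflow
// Typed, declared parameters: values are validated and strictly
// typecast at launch, instead of being guessed from their values.
params {
    input: Path                          // validated as a path on input
    outdir: String = 'results'
    min_reads: Integer = 1000
    save_intermediates: Boolean = false
}
```

With a declared String, a value like a sample name with leading zeros stays a string rather than silently becoming a number.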
Phil Ewels: Another thing I mentioned was error messages. Here you can see one of the old-style error messages: because the code was compiled to Groovy before it ran, Groovy threw the error, and it was really unhelpful. It was just top-level, pointing at a curly bracket, and then you had to go through hundreds of lines of code to try to work out where the error actually was. Now we throw the error at the language-parsing level, and here you can see it even indicates the exact character which is wrong, so you can go straight there. It’s just, again, way better. And we have a lint command you can run, as part of your continuous integration tests for example, to find those validation errors before you even run the pipeline.
Phil Ewels: Okay, I need to speed up a bit. Other features that we’ve been working on are kind of low-level features, which you might not notice right away but which are really foundational blocks for us being able to build a lot of cool stuff. Workflow outputs is a new way of defining how files are published at the end of a pipeline. And data lineage gives a way of saving and storing all the information that Nextflow has about the provenance of all of your data. When lineage is enabled, you can find out, from any given file, the entire analysis path it took through a pipeline and where it came from, and we can start to do some really nice things such as passing inputs between pipelines.
Phil Ewels: Before I wrap up, I want to touch on a few things that we do at Seqera, which is the company that was formed around Nextflow. Nextflow is all open source, of course, and that’s been my focus professionally, but if you’re running Nextflow, then Seqera has a lot of extra tooling that you can build on top of it. The key thing we have is something called Seqera Platform, which is basically a way to manage running Nextflow. Nextflow is a command-line tool, and when you’re running a lot of Nextflow pipelines, it can be difficult to keep track of all those different runs: which ones threw errors, where they are, and where the data is. So Seqera Platform provides an interface to launch and monitor different workflows. Importantly, it works with your compute: you connect it to your AWS account or your Slurm cluster, and all your pipelines are still running in the same places they were before; they’re just exposed through the Seqera Platform interface. This is an example of the kinds of things you can do once Seqera Platform is aware of basically the Nextflow pipeline and everything around it, the encapsulation of configuration and execution environment and data. You can use it as a control plane which you can build on top of. One of my extra little projects in the past year is a plug-in for an open-source tool called Node-RED. There’s a link here, but basically you can use this as a low-code platform for setting up automation. So here, it might be that when a file is added to an S3 bucket, it automatically triggers a workflow; when that workflow finishes, it triggers a second workflow; and when that one is finished, it triggers the creation of an analysis studio, which you can then go into and do your downstream analysis in. You can basically create any kind of automation, and this is all done via the APIs of Seqera Platform.
And so when you abstract away all the complexity of actually configuring and launching and maintaining all the infrastructure, you can start to build some really cool solutions.
Phil Ewels: We have a lot of tooling to make running your pipelines faster, cheaper, and better. A big one is Fusion, which handles all the file operations. Nextflow is traditionally very well targeted towards working with huge data files, your BAM files and your FASTQ files and everything, and Fusion is optimized specifically for Nextflow: it knows how Nextflow works, so it can really fine-tune for that use case. One of the latest things that Fusion can do is snapshots. If you’re running on cloud with spot instances, for example, AWS might tell you that you’ve got one minute before your instance is reclaimed, and snapshots will now freeze that running task so that you can restart it and not lose all the progress you’d made in a long-running task. And then this is like an everything-else slide, because I could give another two-hour-long talk about all these features; there’s so much more. But if any of that sounds interesting, I’m happy to answer questions or come back and talk about it more.
Phil Ewels: So to wrap up: if you’re interested in becoming more involved with Nextflow, writing your own pipelines, or getting involved in the community, we’ve got a smattering of links here. The top one, the Nextflow website, of course has all the documentation. We have a website called training.nextflow.io, which is all walk-through tutorials and training which you can do yourself. We’ve just had a training week last week, where we had over a thousand people registered in just that one week. There are multiple different courses; the beginner one is called Hello Nextflow. I’ve done a set of video tutorials for each of those chapters, so you can follow through with me step by step as we go through all the worked examples, which is basically the best way to learn Nextflow, and that’s all up to date now with the latest Nextflow syntax. We have a very active community forum, so if you ever need any help, you can drop in there and ask a question, and you usually get a response very quickly. Another plug: I run a Nextflow podcast. At the moment I’m trying to do one every two weeks. We talk to all kinds of different people using Nextflow for different things, or about tangentially related technical topics. It tends to be very technical deep dives, so that’s kind of fun. We have a really good blog, and I’ve written community forum twice on this slide; didn’t mean to do that. And then finally, we have a bunch of events coming up. In a few weeks’ time we’ve got the nf-core hackathon, which is both online and at self-organized local sites all the way around the world. I think we’ve got something like 20 or 30 local sites already, from Argentina to the UK to the US to Germany, all over the place. You’re very welcome to join; it’s a great way to get involved, and there are all different projects, so you can dive in and help people with their pipelines. And then we’ve got the two flagship Summit events.
One in Boston at the end of April, and then the main one, online with some in person, in Barcelona in October. And there are loads of Seqera Sessions and all kinds of other events if you click that link; there might well be something near you. I think there are Seqera Sessions coming up in London and a few other places soon.
Phil Ewels: With that, hopefully I’m about on time, and I’m happy to take any questions. I hope that was all clear and made sense and was useful.
Grant Belgard: Thanks, Phil. Um, so what’s the easiest way for someone to get started with Nextflow?
Phil Ewels: So the training website, I think, is the best way to get started. The examples that we use in the Hello Nextflow training are domain agnostic; we use cowpy to print a little cow to the terminal saying different messages. So you don’t need to know anything specifically about RNA-seq or anything else. You can do most of that course in an afternoon or a couple of afternoons, and it takes you from almost nothing all the way through to building your own pipeline, complete with Docker containers and everything. And it’s all set up to work on GitHub Codespaces, so you don’t even need to install anything locally.
Grant Belgard: And for people who currently use Snakemake, how hard is it to migrate to Nextflow?
Phil Ewels: So yeah, I didn't really talk about any of the other workflow managers, but Nextflow is not alone in this field. And what I generally say to anyone is that using any workflow manager is better than using none. Snakemake especially, and Nextflow, and WDL and others share many of the same concepts: splitting up different tools, running them in sequence, working out the DAG of tasks. Because of that, it's usually not too bad to convert from one to the other. Especially with AI these days: we have our own Seqera AI, which is particularly good and well versed in the latest Nextflow syntax. Honestly, with many pipelines these days you can just dump your Snakemake syntax in and say, convert this to Nextflow for me, and it will do a pretty good job almost on the first go. I'll admit I'm kind of lazy and a bit of an AI advocate, so that's definitely what I would do in that situation.
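To make the point about shared concepts concrete, here is a toy example of my own (not from the talk) showing how a simple Snakemake rule maps onto a Nextflow process plus a channel:

```nextflow
// A Snakemake rule like:
//   rule gzip:
//       input:  "{sample}.txt"
//       output: "{sample}.txt.gz"
//       shell:  "gzip -c {input} > {output}"
//
// maps fairly directly onto a Nextflow process:
process GZIP {
    input:
    path txt

    output:
    path "${txt}.gz"

    script:
    """
    gzip -c ${txt} > ${txt}.gz
    """
}

workflow {
    // Where Snakemake infers work from wildcards in filenames,
    // Nextflow feeds files through a channel into the process
    Channel.fromPath('data/*.txt') | GZIP
}
```

The main mental shift is that Snakemake works backwards from requested outputs, whereas Nextflow pushes data forwards through channels; the per-step tool wrapping is much the same.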
Grant Belgard: If someone has a pipeline that could be useful for nf-core, how do they go about adding it?
Phil Ewels: Yeah. So nf-core is, like I say, kind of a unique community, because we don't just list any pipeline. It's specifically community owned, and there's only one pipeline per data type. Because of this, it's not just a question of clicking a couple of buttons. You have to come forward and put in a proposal, and then we say yes or no, and there's a system for going through building your pipeline and adding it to nf-core. The short answer is: go to the nf-core website, click on the docs, and there's a guide on how to add your pipeline; then there's an nf-core proposals site where you describe what it is you want to do and get a thumbs up.
Grant Belgard: Where do you see AI fitting into pipeline development in the next couple years?
Phil Ewels: A couple of years is difficult to say. I'm struggling to predict anything more than a month ahead at the moment because things are changing so fast. But nothing in tech is going to stay the same, and I'm sure pipelines will be included in that. We're starting to see it already, like I say, with converting between languages. We have our Seqera AI tool, and we're trying to take the rough edges off these tools, and it certainly lowers the barrier to entry. Nextflow is known for not having the easiest learning curve, and AI makes it so much easier to get started. So right now I think the low-hanging fruit is that it's just much easier to write and debug your Nextflow pipelines using AI. And as we go forward, I'm expecting more foundational changes in how we approach the whole concept of building up a scientific analysis, to be honest.
Grant Belgard: Is Nextflow overkill if someone’s just running a few samples on their laptop?
Phil Ewels: It depends a bit on your background and how much Nextflow you've written. If you've never written Nextflow before, then is it worth learning the whole syntax and going through the whole process just so you can run a couple of samples? Maybe not. But once you've got familiar with Nextflow, I think it's a bit like wearing gloves when you're pipetting in the lab: you end up wanting to write Nextflow pipelines for everything, because it's self-documenting, it's automatically versioned, and you can rerun it any time in the future. When you try to remember what it was you did six months ago, you can just look at the Nextflow pipeline and it's there. So it ends up being quite a low lift. Of course I'm a bit biased on this question, but I would say yes, put everything in Nextflow pipelines; that's what I find myself doing.
Grant Belgard: Is Seqera Containers free, and how does it compare to BioContainers or Docker Hub?
Phil Ewels: Yeah, so I didn't touch on this so much, but Seqera Containers is something we do on the Seqera side. Containers are key and fundamental to Nextflow and to the success of bioinformatics workflows: you can encapsulate software in a clean environment on a per-process basis, so your versions of Python don't conflict, and so on. Almost every Nextflow pipeline you see now has these container declarations, and you might have 50 or 60 different steps in your pipeline, and you need to come up with a Docker container for every single one. The bioinformatics community has responded to this usage of containers in a few different ways. The BioContainers project has been wildly successful: basically every Conda package gets a Docker image for free, and we've been using BioContainers in nf-core for a long time. The limitation we found is when you want a process in your pipeline that uses more than one tool; the process for generating one of those combined containers is quite convoluted. So we built a tool at Seqera called Wave, which is also open source, which basically builds Docker containers on the fly. You add this to your Nextflow pipeline and say, I want to run tool A and tool B in this process, and it will go off and request it. If Wave has seen it before, it will just give you the container straight away; if not, it will build it on the fly and then give it to you. Which is really cool, because it means you basically don't have to think about containers anymore; they just magically happen. Seqera Containers is based on this technology, exactly the same thing, but it's a public repository.
So when you request an image, it gets built and then stored there for, we say, a minimum of five years, and then anyone can fetch and download it. We, for example, are now going to use this in nf-core, where every single one of those 1,700 modules will have its own custom-built image: Docker and Singularity, x86 and ARM CPU builds, all built automatically on the fly and then pinned for a long time for perfect reproducibility. And it's free, and yeah, it works really well.
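As a sketch of what this looks like in practice: with Wave enabled, a process can declare its software as Conda packages and have a multi-tool container built for it on the fly, rather than anyone hand-crafting an image. The package names and versions here are illustrative, and the config keys should be checked against the Wave documentation:

```nextflow
// nextflow.config (sketch):
//   wave { enabled = true }    // build containers on the fly
//   wave.freeze = true         // optionally pin the built image

// Two tools in one process, no hand-built Docker image needed.
// Package names and versions below are illustrative.
process STATS_AND_VERSIONS {
    conda 'bioconda::samtools=1.19 bioconda::bcftools=1.19'

    input:
    path bam

    output:
    path 'stats.txt'

    script:
    """
    samtools flagstat ${bam} > stats.txt
    bcftools --version >> stats.txt
    """
}
```

With the `conda` directive plus Wave, the container for exactly this tool combination is requested, built if needed, and cached for reuse.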
Grant Belgard: When can one start using static types in Nextflow?
Phil Ewels: So the syntax example I showed with those params, you can do that now. We do two Nextflow releases every year: one major release in April and one in October. The 25.10 release came out with that syntax, so you can use it for parameters today. We are working on more syntax that will come out in the next major release, 26.04, which will have strong typing through basically all of your pipeline code. That will really carry the concept through, and you'll get a lot more validation, because as you're connecting all your processes with these different channels, if you say this process's output goes into that one, it will tell you immediately: well, you can't do that, because those are different types. So we'll have that very soon, in a few months, but already today you can do typing for the pipeline's input parameters.
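For anyone curious what the typed parameters look like, here is a rough sketch of the 25.10-era syntax as I understand it; the exact form (type names, declaration details) may differ, so check the current Nextflow documentation:

```nextflow
// Typed pipeline parameters (sketch; exact syntax per the Nextflow
// docs). Each param gets a declared type and an optional default,
// validated before the workflow runs.
params {
    input: Path                       // e.g. a samplesheet; no default
    outdir: String = 'results'
    max_reads: Integer = 10000
    save_intermediates: Boolean = false
}

workflow {
    println "Reading ${params.input}, writing to ${params.outdir}"
}
```

The win is that a mistyped `--max_reads lots` fails immediately with a clear error, instead of surfacing as a confusing failure deep inside the pipeline.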
Grant Belgard: And lastly how do you go about deciding if it’s worth updating an ancient DSL1 pipeline?
Phil Ewels: Yeah. So, for those who don't know what DSL1 and DSL2 are: back when Nextflow started, it was this Groovy DSL, and that term got bandied around a lot. Then around 2020, I think, there was a major language update. We used to have these huge monolithic scripts of thousands of lines of code; DSL2 changed a bunch of the syntax, and one of the things it allowed us to do was break code out into different files, these modules which we now, like I say, rely on for that level of granularity and testing and community. But that change from DSL1 to DSL2 was quite painful. It was quite hard work doing a lot of the rewrites, which I should say we're taking great pains to avoid with the new syntax updates we're doing now: we're doing it much more gently, and there's also a lot of tooling to automatically update code. So if you have an ancient pipeline in DSL1 and you want to leapfrog all this and bring it forward, what, six years in terms of syntax (it's surprisingly common to have this question), you basically have a couple of options. Probably the easiest is the same as converting from Snakemake: you chuck it into an AI tool and say, rewrite this pipeline for me. Or you start from scratch, take the nf-core template or something, and copy the logic over into the new syntax. Or, if you really want to and you're a bit of a sadist, you can go through and update all the syntax line by line, which is doable. But it's like a software migration: you'll probably have to go DSL1 to DSL2, and then DSL2 to the new syntax. It's doable.
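For context on the modularity DSL2 brought, here is a generic sketch of my own (not from any particular pipeline): a process lives in its own module file and is pulled into the main workflow with an include statement:

```nextflow
// --- modules/fastqc.nf ---
// One process per module file: reusable across pipelines
// and testable on its own
process FASTQC {
    input:
    path reads

    output:
    path '*_fastqc.*'

    script:
    """
    fastqc ${reads}
    """
}

// --- main.nf (separate file) ---
// include { FASTQC } from './modules/fastqc'
//
// workflow {
//     Channel.fromPath(params.reads) | FASTQC
// }
```

This file-per-process layout is what makes things like nf-core's shared module library, and per-module tests, possible, compared to the single monolithic script of DSL1.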
Grant Belgard: Well, Phil, thank you so much for joining us. Thanks to all our listeners.
Phil Ewels: It’s a pleasure. Thanks very much for inviting me.