Welcome to the Nextflow Training Workshop

We are excited to have you on the path to writing reproducible and scalable scientific workflows using Nextflow. This guide complements the full Nextflow documentation - if you ever have any doubts, head over to the docs located here.

By the end of this course you should:

  1. Be proficient in writing Nextflow pipelines

  2. Know the basic Nextflow concepts of Channels, Processes and Operators

  3. Have an understanding of containerised workflows

  4. Understand the different execution platforms supported by Nextflow

  5. Be introduced to the Nextflow community and ecosystem

Overview

To get you started with Nextflow as quickly as possible, we will walk through the following steps:

  1. Set up a development environment to run Nextflow.

  2. Explore Nextflow concepts using some basic workflows, including a multistep RNA-Seq analysis.

  3. Build and use Docker containers to encapsulate all workflow dependencies.

  4. Dive deeper into the core Nextflow syntax including Channels, Processes and Operators.

  5. Cover cluster and cloud deployment scenarios and explore Nextflow Tower capabilities.

This will give you a broad understanding of Nextflow so that you can start writing your own pipelines. We hope you enjoy the course! This is an ever-evolving document and feedback is always welcome.

Summary

1. Environment setup

1.1. Using a terminal

Nextflow can be used on any POSIX-compatible system (Linux, OS X, Windows Subsystem for Linux, etc.).

Requirements:

Optional requirements:

1.1.1. Training material

Download the training material by copy & pasting the following command in the terminal:

curl https://s3-eu-west-1.amazonaws.com/seqeralabs.com/public/nf-training.tar.gz | tar xvz

Or alternatively if you have wget instead of curl:

wget -q -O- https://s3-eu-west-1.amazonaws.com/seqeralabs.com/public/nf-training.tar.gz | tar xvz

1.1.2. Setup AWS Cloud9 environment

If you are running this tutorial in an AWS Cloud9 environment, run the following one-liner to set up the virtual environment:

source setup/all.sh

1.1.3. Nextflow installation

Install the latest version of Nextflow by copy & pasting the following snippet in a terminal window:

curl get.nextflow.io | bash
mv nextflow ~/bin

Check that it was installed correctly by running the following command:

nextflow info

2. Introduction

2.1. Basic concepts

Nextflow is a workflow orchestration engine and a domain specific language (DSL) that makes it easy to write data-intensive computational pipelines.

It is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.

Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment based on the dataflow programming model. Nextflow’s core features are:

  • Workflow portability & reproducibility

  • Scalability of parallelization and deployment

  • Integration of existing tools, systems & industry standards

2.1.1. Processes and Channels

In practice, a Nextflow pipeline is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous first-in, first-out (FIFO) queues, called channels in Nextflow.

Any process can define one or more channels as input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

(Figure: channels and processes)

2.1.2. Execution abstraction

While a process defines what command or script has to be executed, the executor determines how that script is actually run in the target platform.

If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and testing purposes, but for real world computational pipelines, an HPC or cloud platform is often required.

In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system (or runtime). Thus it is possible to write a pipeline once and to seamlessly run it on your computer, a cluster, or the cloud, without modifying it, by simply defining the target execution platform in the configuration file.
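As a hedged sketch of what this looks like in practice, the target platform is usually selected with a single setting in nextflow.config (the executor names below are standard Nextflow executors; which one applies depends on your infrastructure):

// nextflow.config — minimal sketch of switching execution platform
process.executor = 'local'      // default: run tasks on the machine where Nextflow is launched

// the same pipeline can target a cluster or the cloud just by changing this value, e.g.:
// process.executor = 'slurm'
// process.executor = 'awsbatch'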

(Figure: execution abstraction)

2.1.3. Scripting language

Nextflow implements a declarative domain specific language (DSL) that simplifies the writing of complex data analysis workflows as an extension of a general purpose programming language.

This approach makes Nextflow very flexible: in the same computing environment it provides the benefits of a concise DSL, which handles recurrent use cases with ease, together with the flexibility and power of a general purpose programming language to handle corner cases that may be difficult to implement using a purely declarative approach.

In practical terms, Nextflow scripting is an extension of the Groovy programming language, which in turn is a super-set of the Java programming language. Groovy can be considered as Python for Java in that it simplifies the writing of code and is more approachable.

2.2. Your first script

Here you will execute your first Nextflow script (hello.nf), which we will go through line by line.

2.2.1. Nextflow code (hello.nf)

 1    #!/usr/bin/env nextflow
 2
 3    params.greeting  = 'Hello world!'
 4    greeting_ch = Channel.from(params.greeting)
 5
 6    process splitLetters {
 7
 8        input:
 9        val x from greeting_ch
10
11        output:
12        file 'chunk_*' into letters_ch
13
14        """
15        printf '$x' | split -b 6 - chunk_
16        """
17    }
18
19    process convertToUpper {
20
21        input:
22        file y from letters_ch.flatten()
23
24        output:
25        stdout into result_ch
26
27        """
28        cat $y | tr '[a-z]' '[A-Z]'
29        """
30    }
31
32    result_ch.view{ it }

2.2.2. Line-by-line description

  • 1: The code begins with a shebang, which declares Nextflow as the interpreter.

  • 3: Declares a parameter greeting that is initialized with the value 'Hello world!'.

  • 4: Initialises a channel labelled greeting_ch, which contains the value from params.greeting.

  • 6: Begins the first process defined as splitLetters.

  • 8: Input declaration for the splitLetters process.

  • 9: Tells the process to expect an input value (val) from the channel greeting_ch, that we assign to the variable 'x'.

  • 11: Output declaration for the splitLetters process.

  • 12: Tells the process to expect output file/s (file) matching the pattern 'chunk_*' from the script, and to send them to the channel letters_ch.

  • 14: Triple quote marks initiate the code block to execute in this process.

  • 15: Code to execute, printing the input value x (referenced using the dollar [$] symbol prefix), splitting the string into chunks 6 characters long ("Hello " and "world!") and saving them to the files chunk_aa and chunk_ab.

  • 16: Triple quote marks the end of the code block.

  • 17: End of first process block.

  • 19: Begin second process defined as convertToUpper.

  • 21: Input declaration for the convertToUpper process.

  • 22: Tells the process to expect input file/s (file; e.g. chunk_aa and chunk_ab) from the letters_ch channel, which we assign to the variable 'y'.

The .flatten() operator is used here to split the two files into two separate items so that each is sent through the next process independently (otherwise they would be treated as a single element).
  • 24: Output declaration for the convertToUpper process.

  • 25: Tells the process to expect output as standard output (stdout) and direct this into the result_ch channel.

  • 27: Triple quote marks initiate the code block to execute in this process.

  • 28: Script to read files (cat) using the '$y' input variable, then pipe to uppercase conversion, outputting to standard output.

  • 29: Triple quote marks the end of the code block.

  • 30: End of the second process block.

  • 32: The final output (in the result_ch) is printed to screen using the view operator (appended onto the channel name).

2.2.3. In practice

Now copy the above example into your favourite text editor and save it to a file named hello.nf.

Execute the script by entering the following command in your terminal:

nextflow run hello.nf

The output will look similar to the text shown below:

1    N E X T F L O W  ~  version 21.04.3
2    Launching `hello.nf` [confident_kowalevski] - revision: a3f5a9e46a
3    executor >  local (3)
4    [0d/59d203] process > splitLetters (1)   [100%] 1 of 1 ✔
5    [9f/1dd42a] process > convertToUpper (2) [100%] 2 of 2 ✔
6    HELLO
7    WORLD!

Where the standard output shows (line by line):

  • 1: The Nextflow version executed.

  • 2: The script name, run name and revision ID.

  • 3: The executor used (in the above case: local).

  • 4: The first process, executed once (1), starting with a unique hexadecimal identifier (see TIP below) and ending with the percentage and job completion information.

  • 5: The second process, executed twice (2).

  • 6-7: Followed by the printed result string from stdout.

The hexadecimal numbers, like 0d/59d203, identify the unique process execution. These numbers are also the prefix of the directories where each process is executed. You can inspect the files produced by changing to the directory $PWD/work and using these numbers to find the process-specific execution path.
The second process runs twice, executing in two different work directories, one for each input file. Therefore, in the previous example the work directory [9f/1dd42a] represents just one of the two directories that were used. To print all the relevant paths to the screen, disable the ANSI log (e.g. nextflow run hello.nf -ansi-log false).

It’s worth noting that the process convertToUpper is executed in parallel, so there’s no guarantee that the instance processing the first split (the chunk 'Hello ') will be executed before the one processing the second split (the chunk 'world!').

Thus, it is perfectly possible that you will get the final result printed out in a different order:

WORLD!
HELLO

2.3. Modify and resume

Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are actually changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result used instead.

This helps when testing or modifying part of your pipeline without having to re-execute it from scratch.

For the sake of this tutorial, modify the convertToUpper process in the previous example, replacing the process script with the string rev $y, so that the process looks like this:

process convertToUpper {

    input:
    file y from letters_ch.flatten()

    output:
    stdout into result_ch

    """
    rev $y
    """
}

Then save the file with the same name, and execute it by adding the -resume option to the command line:

nextflow run hello.nf -resume

It will print output similar to this:

N E X T F L O W  ~  version 21.04.3
Launching `hello.nf` [admiring_venter] - revision: aed50861e0
executor >  local (2)
[74/d48321] process > splitLetters (1)   [100%] 1 of 1, cached: 1 ✔
[59/136e00] process > convertToUpper (1) [100%] 2 of 2 ✔
!dlrow
 olleH

You will see that the execution of the process splitLetters is actually skipped (note the cached: 1 in the log), and its results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.

The pipeline results are cached by default in the directory $PWD/work. Depending on your script, this folder can take up a lot of disk space. If you are sure you won’t resume your pipeline execution, clean this folder periodically.

2.4. Pipeline parameters

Pipeline parameters are declared by prefixing a variable name with params, separated by a dot character (e.g. params.greeting). Their value can be specified on the command line by prefixing the parameter name with a double dash, i.e. --paramName.
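For example, a minimal sketch (the parameter name mirrors the hello.nf example above):

params.greeting = 'Hello world!'             // default value, declared in the script
println "greeting is: ${params.greeting}"    // running with --greeting 'Bonjour!' overrides the default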

Now, let’s try to execute the previous example specifying a different input string parameter, as shown below:

nextflow run hello.nf --greeting 'Bonjour le monde!'

The string specified on the command line will override the default value of the parameter. The output will look like this:

N E X T F L O W  ~  version 21.10.0
Launching `hello.nf` [angry_mcclintock] - revision: 073d8111fc
executor >  local (4)
[6c/e6edf5] process > splitLetters (1)   [100%] 1 of 1 ✔
[bc/6845ce] process > convertToUpper (2) [100%] 3 of 3 ✔
uojnoB

!edno

m el r

2.4.1. In DAG-like format

To better understand how Nextflow handles the data in this pipeline, the DAG-like figure below visualises all the inputs, outputs, channels and processes.

(Figure: hello world pipeline DAG)

3. Simple RNA-Seq pipeline

To demonstrate a real-world biomedical scenario, we will implement a proof of concept RNA-Seq pipeline which:

  1. Indexes a transcriptome file.

  2. Performs quality controls.

  3. Performs quantification.

  4. Creates a MultiQC report.

This will be done using a series of seven scripts, each of which builds on the previous one, in order to have a complete workflow. You can find these in the tutorial folder (script1.nf to script7.nf).

3.1. Define the pipeline parameters

Parameters are inputs and options that can be changed when the pipeline is run.

The script script1.nf defines the pipeline input parameters.

params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"

println "reads: $params.reads"

Run it by using the following command:

nextflow run script1.nf

Try to specify a different input parameter, for example:

nextflow run script1.nf --reads this/or/that

Exercise

Modify the script1.nf adding a fourth parameter named outdir and set it to a default path that will be used as the pipeline output directory.

Click here for the answer:
params.reads = "$baseDir/data/ggal/*_{1,2}.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
params.multiqc = "$baseDir/multiqc"
params.outdir = "$baseDir/myresults"

Exercise

Modify the script1.nf to print all the pipeline parameters by using a single log.info command and a multiline string statement.

See an example here.
Click here for the answer:

Add the following to your script file:

log.info """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcriptome_file}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

Recap

In this step you have learned:

  1. How to define parameters in your pipeline script

  2. How to pass parameters by using the command line

  3. The use of $var and ${var} variable placeholders

  4. How to use multiline strings

  5. How to use log.info to print information and save it in the log execution file

3.2. Create a transcriptome index file

Nextflow allows the execution of any command or script by using a process definition.

A process is defined by providing three main declarations: the process inputs, outputs and command script.

To add a transcriptome index process, try adding the following code block to your script (or continue from script2.nf).

/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process index {

    input:
    path transcriptome from params.transcriptome_file

    output:
    path 'index' into index_ch

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}

This process takes the transcriptome params file as input and creates the transcriptome index by using the salmon tool. The output is specified as a path called index, which is passed to the index_ch channel.

The input declaration defines a transcriptome path variable which is referenced in the script (using the dollar sign) in the Salmon command line.
Resource requirements such as CPU and memory can change with different workflow executions and platforms. Nextflow can use $task.cpus as a variable for the number of CPUs. See the process directives documentation for more details.
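As an illustration, such resources can also be declared in the nextflow.config file instead of being hard-coded in the script. The snippet below is only a sketch: the values are illustrative, and the withName selector assumes the index process from this tutorial.

process {
    withName: index {
        cpus = 2
        memory = '4 GB'
    }
}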

Run it by using the command:

nextflow run script2.nf

The execution will fail because salmon is not installed in your environment.

Add the command line option -with-docker to launch the execution through a Docker container as shown below:

nextflow run script2.nf -with-docker

This time it works because it uses the Docker container nextflow/rnaseq-nf defined in the nextflow.config file in your current directory. If you are running this script locally, you will need to install Docker on your machine, log in, and allow the script to download the container with the required tools - you will learn more about Docker later in this course.

In order to avoid adding -with-docker each time, add the following line in the nextflow.config file:

docker.enabled = true

Exercise

Enable the Docker execution by default adding the above setting in the nextflow.config file.

Exercise

Print the output of the index_ch channel by using the view operator.

Click here for the answer:

Add the following to your script file, at the end:

index_ch.view()

Exercise

If you have more CPUs available, try changing your script to request more resources for this process. For an example, see the directive docs. $task.cpus is already used in this script, so adding a cpus directive at the top of the process will tell Nextflow to run this job with the specified number of CPUs.

Click here for the answer:

Add cpus 2 to the top of the index process:

process index {
    cpus 2
    input:
    ...

Then check that it worked by looking at the script executed in the work directory. Look for the hexadecimal directory (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), then cat the .command.sh file.

Bonus Exercise

Use the command tree work to see how Nextflow organizes the process work directory. Check here, if you need to download tree.

Click here for the answer:

It should look something like this:

work
├── 17
│   └── 263d3517b457de4525513ae5e34ea8
│       ├── index
│       │   ├── complete_ref_lens.bin
│       │   ├── ctable.bin
│       │   ├── ctg_offsets.bin
│       │   ├── duplicate_clusters.tsv
│       │   ├── eqtable.bin
│       │   ├── info.json
│       │   ├── mphf.bin
│       │   ├── pos.bin
│       │   ├── pre_indexing.log
│       │   ├── rank.bin
│       │   ├── refAccumLengths.bin
│       │   ├── ref_indexing.log
│       │   ├── reflengths.bin
│       │   ├── refseq.bin
│       │   ├── seq.bin
│       │   └── versionInfo.json
│       └── transcriptome.fa -> /workspace/Gitpod_test/data/ggal/transcriptome.fa
├── 7f

Recap

In this step you have learned:

  1. How to define a process executing a custom command

  2. How process inputs are declared

  3. How process outputs are declared

  4. How to print the content of a channel

  5. How to access the number of available CPUs

3.3. Collect read files by pairs

This step shows how to match read files into pairs, so they can be mapped by Salmon.

Edit the script script3.nf and add the following statement as the last line:

read_pairs_ch.view()

Save it and execute it with the following command:

nextflow run script3.nf

It will print something similar to this:

[ggal_gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]

The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is the read pair prefix and the second is a list representing the actual files.
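To see how such a tuple is typically consumed, here is a sketch of a process input declaration (the process name and script body are illustrative, not part of the tutorial scripts; the same pattern appears later in the fromSRA example):

process inspectPairs {
    input:
    tuple val(sample_id), file(sample_files) from read_pairs_ch

    script:
    """
    echo "sample: $sample_id -> ${sample_files[0]} ${sample_files[1]}"
    """
}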

Try it again specifying different read files by using a glob pattern:

nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'
File paths including one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to avoid Bash expanding the glob on the command line.

Exercise

Use the set operator in place of = assignment to define the read_pairs_ch channel.

Click here for the answer:
Channel
    .fromFilePairs( params.reads )
    .set { read_pairs_ch }

Exercise

Use the checkIfExists option for the fromFilePairs method to check if the specified path contains at least one file pair.

Click here for the answer:
Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .set { read_pairs_ch }

Recap

In this step you have learned:

  1. How to use fromFilePairs to handle read pair files

  2. How to use the checkIfExists option to check input file existence

  3. How to use the set operator to define a new channel variable

3.4. Perform expression quantification

script4.nf adds the quantification process.

In this script, note how the index_ch channel, declared as output in the index process, is now used as a channel in the input section.

Also, note how the second input is declared as a tuple composed of two elements, the pair_id and the reads, in order to match the structure of the items emitted by the read_pairs_ch channel.

Execute it by using the following command:

nextflow run script4.nf -resume

You will see the execution of the quantification process.

When using the -resume option, any step that has already been processed is skipped.

Try to execute the same script with more read files as shown below:

nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'

You will notice that the quantification process is executed more than once.

Nextflow parallelizes the execution of your pipeline simply by providing multiple input data to your script.

It may be useful to apply optional settings to a specific process using directives which are provided in the process body.

Exercise

Add a tag directive to the quantification process to provide a more readable execution log.

Click here for the answer:

Add the following before the input declaration:

  tag "Salmon on $pair_id"

Exercise

Add a publishDir directive to the quantification process to store the process results into a directory of your choice.

Click here for the answer:

Add the following before the input declaration in the quantification process:

  publishDir params.outdir, mode:'copy'

Recap

In this step you have learned:

  1. How to connect two processes by using the channel declarations

  2. How to resume the script execution skipping cached steps

  3. How to use the tag directive to provide a more readable execution output

  4. How to use the publishDir directive to store a process's results in a path of your choice

3.5. Quality control

This step implements a quality control of your input reads. The inputs are the same read pairs which are provided to the quantification steps.

You can run it by using the following command:

nextflow run script5.nf -resume

The script will report the following error message:

Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`

Exercise

Modify the creation of the read_pairs_ch channel by using an into operator in place of a set.

see an example here.
Click here for the answer:

Change the creation of the read_pairs_ch channel to the following:

  Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }

Then ensure that the second process uses read_pairs2_ch. See script6.nf.

Recap

In this step you have learned:

  1. How to use the into operator to create multiple copies of the same channel.

In Nextflow DSL2, it is no longer a requirement to duplicate channels.

3.6. MultiQC report

This step collects the outputs from the quantification and fastqc steps to create a final report using the MultiQC tool.

Execute the next script with the following command:

nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'

It creates the final report in the results folder of the current working directory.

In this script, note the use of the mix and collect operators chained together to get all the outputs of the quantification and fastqc processes as a single input. Operators can be used to combine and transform channels.

We want only one MultiQC task to be executed, producing a single report. Therefore, we use the mix channel operator to combine the two channels, followed by the collect operator, to return the complete channel contents as a single element.
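As a toy illustration of this pattern (the values are made up; only the operators matter):

fastqc_demo_ch = Channel.from( 'fastqc_a', 'fastqc_b' )
quant_demo_ch  = Channel.from( 'quant_a', 'quant_b' )

fastqc_demo_ch
    .mix( quant_demo_ch )
    .collect()
    .view()
// prints one list containing all four items, e.g. [fastqc_a, fastqc_b, quant_a, quant_b]
// (the order of the mixed items is not guaranteed)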

Recap

In this step you have learned:

  1. How to collect many outputs to a single input with the collect operator

  2. How to mix two channels in a single channel

  3. How to chain two or more operators together

3.7. Handle completion event

This step shows how to execute an action when the pipeline completes the execution.

Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another in the order they are written in the pipeline script, as would happen in a common imperative programming language.

The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
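A minimal sketch of such a handler (the message text is illustrative):

workflow.onComplete {
    println ( workflow.success
        ? "\nDone! The results can be found in --> $params.outdir\n"
        : "Oops .. something went wrong" )
}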

Try to run it by using the following command:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

3.8. Bonus!

Send a notification email when the workflow execution completes using the -N <email address> command line option. Note: this requires the configuration of an SMTP server in the Nextflow config file. For the sake of this tutorial, add the following setting to your nextflow.config file:

mail {
  from = 'info@nextflow.io'
  smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
  smtp.port = 587
  smtp.user = "xxxxx"
  smtp.password = "yyyyy"
  smtp.auth = true
  smtp.starttls.enable = true
  smtp.starttls.required = true
}

Then execute the previous example again, specifying your email address:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq' -c mail.config -N <your email>

See mail documentation for details.

3.9. Custom scripts

Real world pipelines use a lot of custom user scripts (BASH, R, Python, etc). Nextflow allows you to use and manage all these scripts in a consistent manner. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.

For example, create a file named fastqc.sh with the following content:

#!/bin/bash
set -e
set -u

sample_id=${1}
reads=${2}

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Save it, give it execute permission, and move it to the bin directory as shown below:

chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin

Then, open the script7.nf file and replace the fastqc process' script with the following code:

  script:
  """
  fastqc.sh "$sample_id" "$reads"
  """

Run it as before:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

Recap

In this step you have learned:

  1. How to write or use existing custom scripts in your Nextflow pipeline.

  2. How to avoid the use of absolute paths by having your scripts in the bin/ folder.

3.10. Metrics and reports

Nextflow is able to produce multiple reports and charts providing several runtime metrics and execution information.

Run the rnaseq-nf pipeline previously introduced as shown below:

nextflow run rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png

The -with-docker option launches each task of the execution as a Docker container run command.

The -with-report option enables the creation of the workflow execution report. Open the file report.html with a browser to see the report created with the above command.

The -with-trace option enables the creation of a tab separated file containing runtime information for each executed task. Check the content of the file trace.txt for an example.

The -with-timeline option enables the creation of the workflow timeline report, showing how processes were executed over time. This may be useful to identify the most time-consuming tasks and bottlenecks. See an example at this link.

Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph (DAG) representation. Note: this feature requires the installation of Graphviz on your computer. See here for details. Then try running:

dot -Tpng dag.dot > graph.png
open graph.png

Note: runtime metrics may be incomplete for short-running tasks, as in the case of this tutorial.

You can view the HTML files by right-clicking on the file name in the left side-bar and choosing the Preview menu item.

3.11. Run a project from GitHub

Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services, e.g. BitBucket and GitLab).

This simplifies the sharing and deployment of complex projects and tracking changes in a consistent manner.

The nextflow-io/rnaseq-nf GitHub repository hosts a complete version of the workflow introduced in this tutorial.

You can run it by specifying the project name as shown below, launching each task of the execution as a Docker container run command:

nextflow run nextflow-io/rnaseq-nf -with-docker

Nextflow automatically downloads the pipeline project and stores it in the $HOME/.nextflow folder.

Use the command info to show the project information, e.g.:

nextflow info nextflow-io/rnaseq-nf

Nextflow allows the execution of a specific revision of your project by using the -r command line option. For example:

nextflow run nextflow-io/rnaseq-nf -r dev

Revisions are defined by using Git tags or branches defined in the project repository.

This allows a precise control of the changes in your project files and dependencies over time.

3.12. More resources

4. Manage dependencies & containers

Computational workflows are rarely composed of a single script or tool. Most of the time they require dozens of different software components or libraries.

Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in scientific applications.

To overcome these issues, we use containers, which allow the encapsulation of software dependencies, i.e. tools and libraries required by a data analysis application in one or more self-contained, ready-to-run, immutable Linux container images, that can be easily deployed in any platform supporting the container runtime.

Containers can be executed in isolation from the host system, having their own copy of the file system, process space, memory management, etc.

Containers were first introduced with kernel 2.6 as a Linux feature known as Control Groups or Cgroups.

4.1. Docker

Docker is a handy management tool to build, run and share container images.

These images can be uploaded and published in a centralised repository known as Docker Hub, or hosted by other parties such as Quay.

4.1.1. Run a container

A container can be run using the following command:

docker run <container-name>

Try, for example, the following publicly available container (if you have Docker installed):

docker run hello-world

4.1.2. Pull a container

The pull command allows you to download a Docker image without running it. For example:

docker pull debian:stretch-slim

The above command downloads a Debian Linux image. You can check that it exists by using:

docker images

4.1.3. Run a container in interactive mode

Launching a BASH shell in the container allows you to operate in an interactive mode in the containerised operating system. For example:

docker run -it debian:stretch-slim bash

Once launched, you will notice that it is running as root (!). Use the usual commands to navigate in the file system. This is useful to check if the expected programs are present within a container.

To exit from the container, stop the BASH session with the exit command.

4.1.4. Your first Dockerfile

Docker images are created by using a so-called Dockerfile, which is a simple text file containing a list of commands to assemble and configure the image with the software packages required.

In this step you will create a Docker image containing the Salmon tool.

The Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big or many files exist. For this reason it’s important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use a .dockerignore file to specify the paths to exclude from the build.

Then use your favourite editor (e.g. vim or nano) to create a file named Dockerfile and copy the following content:

FROM debian:stretch-slim

MAINTAINER <your name>

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/

4.1.5. Build the image

Build the Dockerfile image by using the following command:

docker build -t my-image .

Where "my-image" is the user specified name tag for the Dockerfile, present in the current directory.

Don’t miss the dot in the above command. When it completes, verify that the image has been created by listing all available images:
docker images

You can try your new container by running this command:

docker run my-image cowsay Hello Docker!

4.1.6. Add a software package to the image

Add the Salmon package to the Docker image by adding the following snippet to the Dockerfile:

RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/

Save the file and build the image again with the same command as before:

docker build -t my-image .

You will notice that it creates a new Docker image with the same name but with a different image ID.

4.1.7. Run Salmon in the container

Check that Salmon is running correctly in the container as shown below:

docker run my-image salmon --version

You can even launch a container in an interactive mode by using the following command:

docker run -it my-image bash

Use the exit command to terminate the interactive session.

4.1.8. File system mounts

Create a genome index file by running Salmon in the container.

Try to run Salmon in the container with the following command:

docker run my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

The above command fails because Salmon cannot access the input file.

This happens because the container runs in a completely separate file system and it cannot access the hosting file system by default.

You will need to use the --volume command line option to mount the input file(s) e.g.

docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
  salmon index -t /transcriptome.fa -i transcript-index
Note that the generated transcript-index directory is still not accessible in the host file system.
An easier way is to mount a parent directory to an identical path in the container; this allows you to use the same path when running it in the container, e.g.
docker run --volume $HOME:$HOME --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

Check the content of the transcript-index folder entering the command:

ls -la transcript-index
Note that the files created by the Docker execution are owned by root.

Exercise

Use the option -u $(id -u):$(id -g) to allow Docker to create files with the right permissions.

4.1.9. Upload the container in the Docker Hub (bonus)

Publish your container in the Docker Hub to share it with other people.

Create an account on the hub.docker.com web site. Then from your shell terminal run the following command, entering the user name and password you specified when registering with the Hub:

docker login

Tag the image with your Docker user name account:

docker tag my-image <user-name>/my-image

Finally push it to the Docker Hub:

docker push <user-name>/my-image

After that anyone will be able to download it by using the command:

docker pull <user-name>/my-image

Note how, after a pull and push operation, Docker prints the container digest, e.g.:

Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest

This is a unique and immutable identifier that can be used to reference a container image unambiguously. For example:

docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266

4.1.10. Run a Nextflow script using a Docker container

The simplest way to run a Nextflow script with a Docker image is using the -with-docker command line option:

nextflow run script2.nf -with-docker my-image

We’ll see later how to configure which container to use in the Nextflow config file, instead of having to specify it every time as a command-line argument.
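As a preview, this is roughly what that configuration looks like (a sketch; the container name is the one already used in this tutorial):

// nextflow.config
process.container = 'nextflow/rnaseq-nf'
docker.enabled    = true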

4.2. Singularity

Singularity is a container runtime designed to work in HPC data centers, where the use of Docker is generally not allowed due to security constraints.

Singularity implements a container execution model similar to Docker, however it uses a completely different implementation design.

A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by many computing nodes managed by a batch scheduler.

4.2.1. Create a Singularity image

Singularity images are created using a Singularity file, in a similar manner to Docker though using a different syntax.

Bootstrap: docker
From: debian:stretch-slim

%environment
export PATH=$PATH:/usr/games/

%labels
AUTHOR <your name>

%post

apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
 && mv /salmon-*/bin/* /usr/bin/ \
 && mv /salmon-*/lib/* /usr/lib/

Once you have saved the Singularity file, create the image with this command:

sudo singularity build my-image.sif Singularity

Note: the build command requires sudo permissions. A common workaround is to build the image on a local workstation and then deploy it to the cluster by copying the image file.

4.2.2. Running a container

Once done, you can run your container with the following command

singularity exec my-image.sif cowsay 'Hello Singularity'

By using the shell command you can enter the container in interactive mode. For example:

singularity shell my-image.sif

Once in the container instance run the following commands:

touch hello.txt
ls -la
Note how the files on the host environment are shown. Singularity automatically mounts the host $HOME directory and uses the current work directory.

4.2.3. Import a Docker image

An easier way to create a Singularity container, without requiring sudo permission and boosting container interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:

singularity pull docker://debian:stretch-slim

The above command automatically downloads the Debian Docker image and converts it to a Singularity image, stored in the current directory with a name derived from the image tag (e.g. debian_stretch-slim.sif).

4.2.4. Run a Nextflow script using a Singularity container

Nextflow allows the transparent usage of Singularity containers as easily as Docker ones.

It only requires you to enable the Singularity engine in place of Docker, using the -with-singularity command line option:

nextflow run script7.nf -with-singularity nextflow/rnaseq-nf

As before, the Singularity container can also be provided in the Nextflow config file. We’ll see later how to do it.
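As a hedged preview, the equivalent Nextflow configuration usually looks like this (standard Nextflow settings; autoMounts is optional):

// nextflow.config
process.container      = 'nextflow/rnaseq-nf'
singularity.enabled    = true
singularity.autoMounts = true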

4.2.5. The Singularity Container Library

The authors of Singularity, SyLabs, have their own repository of Singularity containers.

In the same way that we can push docker images to Docker Hub, we can upload Singularity images to the Singularity Library.

4.3. Conda/Bioconda packages

Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow pipelines to automatically create and activate the Conda environment(s), given the dependencies specified by each process.
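For example, a process can declare its own Conda dependencies with the conda directive; the sketch below is illustrative (the process name and package spec are assumptions, not from the tutorial scripts):

process salmonVersion {
    conda 'bioconda::salmon=1.5.2'

    script:
    """
    salmon --version
    """
}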

4.3.1. Using conda

A Conda environment is defined using a YAML file, which lists the required software packages. For example:

name: nf-tutorial
channels:
  - defaults
  - bioconda
  - conda-forge
dependencies:
  - salmon=1.0.0
  - fastqc=0.11.5
  - multiqc=1.5
  - tbb=2020.2

Given the recipe file, the environment is created using the command shown below:

conda env create --file env.yml

You can check the environment was created successfully with the command shown below:

conda env list

This should look something like this:

# conda environments:
#
base                     /opt/conda
nf-tutorial           *  /opt/conda/envs/nf-tutorial

To enable the environment you can use the activate command:

conda activate nf-tutorial

Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-conda option (using the same path shown by the conda env list command). For example:

nextflow run script7.nf -with-conda /opt/conda/envs/nf-tutorial
When a YAML recipe file is specified as the Conda environment, Nextflow automatically downloads the required dependencies, builds the environment and activates it.

This makes it easier to manage different environments for the processes in the workflow script.

See the docs for details.

4.3.2. Create and use conda-like environments using micromamba

Another way to build conda-like environments is through a Dockerfile and micromamba.

micromamba is a fast and robust package manager for building small conda-based environments.

This saves having to build a conda environment each time you want to use it (as outlined in previous sections).

To do this, you simply require a Dockerfile, such as the one in this public repo:

FROM mambaorg/micromamba
MAINTAINER Your name <your_email>

RUN \
   micromamba install -y -n base -c defaults -c bioconda -c conda-forge \
      salmon=1.5.1 \
      fastqc=0.11.9 \
      multiqc=1.10.1 \
   && micromamba clean -a -y

The above Dockerfile takes the parent image 'mambaorg/micromamba', then installs a conda environment using micromamba, and installs salmon, fastqc and multiqc.

Exercise

Try executing the RNA-Seq pipeline from earlier (script7.nf), but start by building your own micromamba Dockerfile (from above), push it to your Docker Hub repo and direct Nextflow to run from this container (by changing your nextflow.config).

Building a Docker container and pushing to your personal repo can take >10 minutes.
For an overview of steps to take, click here:
  1. Make a file called Dockerfile in the current directory (with above code).

  2. Build the image: docker build -t my-image . (don’t forget the '.').

  3. Publish the docker image to your online docker account.

    Use something similar to the following, with <myrepo> replaced with your own Docker ID (without the '<' and '>' characters).

    "my-image" could be any name you choose. Just choose something memorable and ensure the name matches the name you used in the previous command.
    docker login
    docker tag my-image <myrepo>/my-image
    docker push <myrepo>/my-image
  4. Add the image file name to the nextflow.config file.

    e.g. remove the following from the nextflow.config:

    process.container = 'nextflow/rnaseq-nf'

    and replace with:

    process.container = '<myrepo>/my-image'

    replacing the above with the correct Docker Hub link.

  5. Try running Nextflow, e.g.:

    nextflow run script7.nf -with-docker

It should now work, and find salmon to run the process.

4.4. BioContainers

Another useful resource linking together Bioconda and containers is the BioContainers project. BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.

5. Channels

Channels are a key data structure of Nextflow that allow the implementation of reactive-functional oriented computational workflows based on the Dataflow programming paradigm.

They are used to logically connect tasks to each other or to implement functional style data transformations.

(Figure: channels and files)

5.1. Channel types

Nextflow distinguishes two different kinds of channels: queue channels and value channels.

5.1.1. Queue channel

A queue channel is an asynchronous unidirectional FIFO queue which connects two processes or operators.

  • asynchronous means that operations are non-blocking.

  • unidirectional means that data flows from a producer to a consumer.

  • FIFO means that the data is guaranteed to be delivered in the same order as it is produced. First In, First Out.

A queue channel is implicitly created by process output definitions or using channel factories such as Channel.from or Channel.fromPath.

Try the following snippets:

ch = Channel.from(1,2,3)
println(ch)     (1)
ch.view()       (2)
(1) Use the built-in print line function println to print the ch channel object.
(2) Apply the view method to the ch channel, which prints each item emitted by the channel.

Exercise

Try to execute this snippet; it will produce an error message.

ch = Channel.from(1,2,3)
ch.view()
ch.view()
A queue channel can have exactly one producer and exactly one consumer.

5.1.2. Value channels

A value channel (a.k.a. singleton channel) by definition is bound to a single value and it can be read unlimited times without consuming its contents.

ch = Channel.value('Hello')
ch.view()
ch.view()
ch.view()

It prints:

Hello
Hello
Hello

A value channel is created using the value factory method or by operators returning a single value, such as first, last, collect, count, min, max, reduce and sum.
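For example, a small sketch using the first operator:

ch = Channel.from( 1, 2, 3 ).first()   // first() returns a value channel bound to 1
ch.view()
ch.view()   // no error: a value channel can be read any number of times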

5.2. Channel factories

These are Nextflow commands for creating channels that have implicit expected inputs and functions.

5.2.1. value

The value factory method is used to create a value channel. An optional non-null argument can be specified to bind the channel to a specific value. For example:

ch1 = Channel.value()                 (1)
ch2 = Channel.value( 'Hello there' )  (2)
ch3 = Channel.value( [1,2,3,4,5] )    (3)
(1) Creates an empty value channel.
(2) Creates a value channel and binds a string to it.
(3) Creates a value channel and binds a list object to it that will be emitted as a sole emission.

5.2.2. from

The factory Channel.from allows the creation of a queue channel with the values specified as argument.

ch = Channel.from( 1, 3, 5, 7 )
ch.view{ "value: $it" }

The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as a parameter in the from method. Thus the second line will print the following:

value: 1
value: 3
value: 5
value: 7
Method Channel.from will be deprecated and replaced by Channel.of (see below).

5.2.3. of

The method Channel.of works in a similar manner to Channel.from, though it fixes some inconsistent behaviour of the latter and provides better handling of ranges of values. For example:

Channel
    .of(1..23, 'X', 'Y')
    .view()

5.2.4. fromList

The method Channel.fromList creates a channel emitting the elements provided by a list object specified as an argument:

list = ['hello', 'world']

Channel
    .fromList(list)
    .view()

5.2.5. fromPath

The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.

Channel.fromPath( './data/meta/*.csv' )

This example creates a channel and emits as many items as there are files with a csv extension in the ./data/meta folder. Each element is a file object implementing the Path interface.

Two asterisks (**) work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.

Table 1. Available options

  glob           When true, interprets characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal characters (default: true)
  type           Type of path returned, either file, dir or any (default: file)
  hidden         When true, includes hidden files in the resulting paths (default: false)
  maxDepth       Maximum number of directory levels to visit (default: no limit)
  followLinks    When true, follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)
  relative       When true, returned paths are relative to the top-most common directory (default: false)
  checkIfExists  When true, throws an exception if the specified path does not exist in the file system (default: false)

Learn more about the glob patterns syntax at this link.

Exercise

Use the Channel.fromPath method to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory, including hidden files. Then print the file names.

Click here for the answer:
Channel.fromPath( './data/ggal/**.fq' , hidden:true)
       .view()

5.2.6. fromFilePairs

The fromFilePairs method creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).

Channel
    .fromFilePairs('./data/ggal/*_{1,2}.fq')
    .view()

It will produce an output similar to the following:

[liver, [/user/nf-training/data/ggal/liver_1.fq, /user/nf-training/data/ggal/liver_2.fq]]
[gut, [/user/nf-training/data/ggal/gut_1.fq, /user/nf-training/data/ggal/gut_2.fq]]
[lung, [/user/nf-training/data/ggal/lung_1.fq, /user/nf-training/data/ggal/lung_2.fq]]
The glob pattern must contain at least a star wildcard character.
Table 2. Available options

  type           Type of path returned, either file, dir or any (default: file)
  hidden         When true, includes hidden files in the resulting paths (default: false)
  maxDepth       Maximum number of directory levels to visit (default: no limit)
  followLinks    When true, follows symbolic links during directory tree traversal, otherwise they are managed as files (default: true)
  size           Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any.
  flat           When true, the matching files are produced as sole elements in the emitted tuples (default: false)
  checkIfExists  When true, throws an exception if the specified path does not exist in the file system (default: false)

Exercise

Use the fromFilePairs method to create a channel emitting all pairs of fastq reads in the data/ggal/ directory and print them. Then use the flat:true option and compare the output with the previous execution.

Click here for the answer:

Use the following, with or without 'flat:true':

Channel.fromFilePairs( './data/ggal/*_{1,2}.fq', flat:true)
       .view()

Then check the square brackets around the file names, to see the difference with flat.

5.2.7. fromSRA

The Channel.fromSRA method makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.

The query can be a project ID or accession number(s) supported by the NCBI ESearch API.

This function now requires an API key you can only get by logging into your NCBI account.
For help with NCBI login and key acquisition, click here:
  1. Go to : www.ncbi.nlm.nih.gov/

  2. Click top right button to "Sign into NCBI". Follow their instructions.

  3. Once into your account, click the button at top right, left of My NCBI, usually your ID.

  4. Scroll down to API key section. Copy your key.

You also need to use the latest edge version of Nextflow. Check nextflow -version; it should say -edge. If not, download the newest Nextflow version, following the instructions linked here.

For example the following snippet will print the contents of an NCBI project ID:

params.ncbi_api_key = '<Your API key here>'

Channel
  .fromSRA(['SRP073307'], apiKey: params.ncbi_api_key)
  .view()
Replace <Your API key here> with your API key.

This should print:

[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]]
[SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]]
[SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]]
[SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]]
(remaining omitted)

Multiple accession IDs can be specified using a list object:

ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
    .fromSRA(ids, apiKey: params.ncbi_api_key)
    .view()
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]]
[ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]]
[ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Read pairs are implicitly managed and are returned as a list of files.

It’s straightforward to use this channel as an input using the usual Nextflow syntax. The code below creates a channel containing two samples from a public SRA study and runs FASTQC on the resulting files:

params.ncbi_api_key = '<Your API key here>'

params.accession = ['ERR908507', 'ERR908506']
reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)

process fastqc {
    input:
    tuple sample_id, file(reads_file) from reads

    output:
    file("fastqc_${sample_id}_logs") into fastqc_ch

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
    """
}

6. Processes

In Nextflow, a process is the basic computing primitive to execute foreign functions (i.e. custom scripts or tools).

The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets.

A basic process, only using the script definition block, looks like the following:

process sayHello {
  script:
  """
  echo 'Hello world!'
  """
}

In more complex examples, the process body can contain up to five definition blocks:

  1. Directives are initial declarations that define optional settings.

  2. Input defines the expected input file/s and the channel from where to find them.

  3. Output defines the expected output file/s and the channel to send the data to.

  4. When is an optional clause to allow conditional execution of the process.

  5. Script is a string statement that defines the command to be executed by the process

The full process syntax is defined as follows:

process < name > {

  [ directives ]        (1)

  input:                (2)
  < process inputs >

  output:               (3)
  < process outputs >

  when:                 (4)
  < condition >

  [script|shell|exec]:  (5)
  """
  < user script to be executed >
  """
}
(1) Zero, one or more process directives.
(2) Zero, one or more process inputs.
(3) Zero, one or more process outputs.
(4) An optional boolean conditional to trigger the process execution.
(5) The command to be executed.

6.1. Script

The script block is a string statement that defines the command to be executed by the process.

A process contains one and only one script block, and it must be the last statement when the process contains input and output declarations.

The script block can be a single or multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning multiple lines. For example:

process example {
  script:
  """
  echo 'Hello world!\nHola mundo!\nCiao mondo!\nHallo Welt!' > file
  cat file | head -n 1 | head -c 5 > chunk_1.txt
  gzip -c chunk_1.txt  > chunk_archive.gz
  """
}

By default the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding shebang declaration. For example:

process pyStuff {
  script:
  """
  #!/usr/bin/env python

  x = 'Hello'
  y = 'world!'
  print ("%s - %s" % (x,y))
  """
}
Multiple programming languages can be used within the same workflow script. However, for large chunks of code it is suggested to save them into separate files and invoke them from the process script. These scripts can be stored in the ./bin folder.

6.1.1. Script parameters

Script parameters (params) can be defined dynamically using variable values, like this:

1
2
3
4
5
6
7
8
params.data = 'World'

process foo {
  script:
  """
  echo Hello $params.data
  """
}
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information.
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character.
1
2
3
4
5
6
process foo {
  script:
  """
  echo "The current directory is \$PWD"
  """
}
Try to modify the above script using $PWD instead of \$PWD and check the difference.

This can be tricky when the script uses many Bash variables. A possible alternative is to use a script string delimited by single-quote characters:

1
2
3
4
5
6
process bar {
  script:
  '''
  echo $PATH | tr : '\\n'
  '''
}

However, this blocks the usage of Nextflow variables in the command script.

Another alternative is to use a shell statement instead of script which uses a different syntax for Nextflow variables: !{..}. This allows the use of both Nextflow and Bash variables in the same script.

1
2
3
4
5
6
7
8
9
params.data = 'le monde'

process baz {
  shell:
  '''
  X='Bonjour'
  echo $X !{params.data}
  '''
}

6.1.2. Conditional script

The process script can also be defined in a completely dynamic manner using an if statement or any other expression evaluating to a string value. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
params.compress = 'gzip'
params.file2compress = "$baseDir/data/ggal/transcriptome.fa"

process foo {

  input:
  path file from params.file2compress

  script:
  if( params.compress == 'gzip' )
    """
    gzip -c $file > ${file}.gz
    """
  else if( params.compress == 'bzip2' )
    """
    bzip2 -c $file > ${file}.bz2
    """
  else
    throw new IllegalArgumentException("Unknown compressor $params.compress")
}

Exercise

Write a custom function that, given the compressor name as a parameter, returns the command string to be executed. Then use this function as the process script body.
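One possible solution is sketched below; the function name compressCommand is arbitrary:

params.compress = 'gzip'
params.file2compress = "$baseDir/data/ggal/transcriptome.fa"

def compressCommand( compressor, file ) {
    if( compressor == 'gzip' )
        "gzip -c $file > ${file}.gz"
    else if( compressor == 'bzip2' )
        "bzip2 -c $file > ${file}.bz2"
    else
        throw new IllegalArgumentException("Unknown compressor $compressor")
}

process foo {
  input:
  path file from params.file2compress

  script:
  compressCommand( params.compress, file )
}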

6.2. Inputs

Nextflow processes are isolated from each other but can communicate between themselves by sending values through channels.

Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel.

The input block defines which channels the process is expecting to receive data from. You can only define one input block at a time, and it must contain one or more input declarations.

The input block follows the syntax shown below:

input:
  <input qualifier> <input name> from <source channel>

6.2.1. Input values

The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:

1
2
3
4
5
6
7
8
9
10
num = Channel.from( 1, 2, 3 )

process basicExample {
  input:
  val x from num

  """
  echo process job $x
  """
}

In the above example the process is executed three times; each time a value is received from the channel num it is used to run the script. Thus, it results in an output similar to the one shown below:

process job 3
process job 1
process job 2
The channel guarantees that items are delivered in the same order as they have been sent, but, since the process is executed in a parallel manner, there is no guarantee that they are processed in the same order as they are received.

6.2.2. Input files

The file qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage the file in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration.

1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file 'sample.fastq' from reads
    script:
    """
    echo your_command --reads sample.fastq
    """
}

The input file name can also be defined using a variable reference as shown below:

1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads
    script:
    """
    echo your_command --reads $sample
    """
}

The same syntax can also handle more than one input file in the same execution; it only requires a change in the channel composition.

1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )

process foo {
    input:
    file sample from reads.collect()
    script:
    """
    echo your_command --reads $sample
    """
}
When a process declares an input file, the corresponding channel elements must be file objects, i.e. created with the file helper function or by a file-specific channel factory, e.g. Channel.fromPath or Channel.fromFilePairs.

Consider the following snippet:

1
2
3
4
5
6
7
8
9
10
params.genome = 'data/ggal/transcriptome.fa'

process foo {
    input:
    file genome from params.genome
    script:
    """
    echo your_command --reads $genome
    """
}

The above code creates a temporary file named input.1 with the string data/ggal/transcriptome.fa as its content. That is likely not what you want.

6.2.3. Input path

As of version 19.10.0, Nextflow introduced a new path input qualifier that simplifies the handling of cases such as the one shown above. In a nutshell, the input path automatically handles string values as file objects. The following example works as expected:

1
2
3
4
5
6
7
8
9
10
params.genome = "$baseDir/data/ggal/transcriptome.fa"

process foo {
    input:
    path genome from params.genome
    script:
    """
    echo your_command --reads $genome
    """
}
The path qualifier should be preferred over file to handle process input files when using Nextflow 19.10.0 or later.

Exercise

Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq followed by a process that concatenates them into a single file and prints the first 20 lines.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
params.reads = "$baseDir/data/ggal/*_1.fq"

Channel
    .fromPath( params.reads )
    .set { read_ch }

process concat {
  tag "Concat all files"

  input:
  path '*' from read_ch.collect()

  output:
  path 'top_20_lines' into concat_ch

  script:
  """
  cat * > concatenated.txt
  head -n 20 concatenated.txt > top_20_lines
  """
  }

concat_ch.view()

6.2.4. Combine input channels

A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to understand how channel contents and their semantics affect the execution of a process.

Consider the following example:

1
2
3
4
5
6
7
8
9
10
process foo {
  echo true
  input:
  val x from Channel.from(1,2,3)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

Both channels emit three values, therefore the process is executed three times, each time with a different pair:

  • (1, a)

  • (2, b)

  • (3, c)

What is happening is that the process waits until there’s a complete input configuration i.e. it receives an input value from all the channels declared as input.

When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.

This means channel values are consumed serially one after another, and the first empty channel causes the process execution to stop, even if there are other values in other channels.

So what happens when channels do not have the same cardinality (i.e. they emit a different number of elements)?

For example:

1
2
3
4
5
6
7
8
9
10
process foo {
  echo true
  input:
  val x from Channel.from(1,2)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}

In the above example the process is executed only two times, because when a channel has no more data to be processed it stops the process execution.

However, what happens if you replace value x with a value channel?

Compare the previous example with the following one:

1
2
3
4
5
6
7
8
9
10
process bar {
  echo true
  input:
  val x from Channel.value(1)
  val y from Channel.from('a','b','c')
  script:
   """
   echo $x and $y
   """
}
The output should look something like:
1
2
3
1 and b
1 and a
1 and c

This is because value channels can be consumed multiple times, so it doesn’t affect process termination.

Exercise

Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same data/ggal/transcriptome.fa in each execution.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"

Channel
    .fromPath( params.reads )
    .set { read_ch }

process command {
  tag "Run_command"

  input:
  path reads from read_ch
  path transcriptome from params.transcriptome_file

  output:
  path result into concat_ch

  script:
  """
  echo your_command $reads $transcriptome > result
  """
  }

concat_ch.view()

6.2.5. Input repeaters

The each qualifier allows you to repeat the execution of a process for each item in a collection, every time new data is received. For example:

1
2
3
4
5
6
7
8
9
10
11
12
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'expresso', 'psicoffee']

process alignSequences {
  input:
  path seq from sequences
  each mode from methods

  """
  echo t_coffee -in $seq -mode $mode
  """
}

In the above example every time a file of sequences is received as input by the process, it executes three tasks, each running a different alignment method, set as a mode variable. This is useful when you need to repeat the same task for a given set of parameters.

Exercise

Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq and repeat the same task both with salmon and kallisto.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
methods= ['salmon', 'kallisto']

Channel
    .fromPath( params.reads )
    .set { read_ch }

process command {
  tag "Run_command"

  input:
  path reads from read_ch
  path transcriptome from params.transcriptome_file
  each mode from methods

  output:
  path result into concat_ch

  script:
  """
  echo $mode $reads $transcriptome > result
  """
  }

concat_ch
    .view { "To run : ${it.text}" }

6.3. Outputs

The output declaration block defines the channels used by the process to send out the results produced.

Only one output block can be defined containing one or more output declarations. The output block follows the syntax shown below:

output:
  <output qualifier> <output name> into <target channel>[,channel,..]

6.3.1. Output values

The val qualifier specifies a value to be output, defined in the script context. In a common usage scenario, this is a value which has been defined in the input declaration block, as shown in the following example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
methods = ['prot','dna', 'rna']

process foo {
  input:
  val x from methods

  output:
  val x into receiver

  """
  echo $x > file
  """
}

receiver.view { "Received: $it" }

6.3.2. Output files

The file qualifier specifies one or more files as output, produced by the process, into the specified channel.

1
2
3
4
5
6
7
8
9
10
11
process randomNum {

    output:
    file 'result.txt' into numbers

    '''
    echo $RANDOM > result.txt
    '''
}

numbers.view { "Received: " + it.text }

In the above example the process randomNum creates a file named result.txt containing a random number.

Since a file parameter using the same name is declared in the output block, when the task is completed that file is sent over the numbers channel. A downstream process declaring the same channel as input will be able to receive it.
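For instance, a hypothetical downstream process could receive the file as shown in the sketch below. Note that in DSL1 a queue channel can be consumed only once, so this consumer would take the place of the view statement above:

process printNumber {
    input:
    file num from numbers

    script:
    """
    cat $num
    """
}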

6.3.3. Multiple output files

When an output file name contains a * or ? wildcard character it is interpreted as a glob path matcher. This allows us to capture multiple files into a list object and output them as a sole emission. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
process splitLetters {

    output:
    file 'chunk_*' into letters

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

letters
    .flatMap()
    .view { "File: ${it.name} => ${it.text}" }

It prints:

File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a

Some caveats on glob pattern behavior:

  • Input files are not included in the list of possible matches.

  • Glob patterns match against both files and directory paths.

  • When a two stars pattern ** is used to recurse across directories, only file paths are matched, i.e. directories are not included in the result list.

Exercise

Remove the flatMap operator and see how the output changes. The documentation for the flatMap operator is available at this link.

Click here for the answer:
1
File: [chunk_aa, chunk_ab, chunk_ac, chunk_ad] => [H, o, l, a]

6.3.4. Dynamic output file names

When an output file name needs to be expressed dynamically, it is possible to define it using a dynamically evaluated string, which references values defined in the input declaration block or in the script global context. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
species = ['cat','dog', 'sloth']
sequences = ['AGATAG','ATGCTCT', 'ATCCCAA']

process align {
  input:
  val x from species
  val seq from sequences

  output:
  file "${x}.aln" into genomes

  """
  echo align -in $seq > ${x}.aln
  """
}

genomes.view()

In the above example, each time the process is executed an alignment file is produced whose name depends on the actual value of the x input.

6.3.5. Composite inputs and outputs

So far we have seen how to declare multiple input and output channels, but each channel was handling only one value at a time. However, Nextflow can handle tuples of values.

When using a channel emitting a tuple of values, the corresponding input declaration must be declared with a tuple qualifier followed by definition of each element in the tuple.

In the same manner, output channels emitting a tuple of values can be declared using the tuple qualifier followed by the definition of each tuple element.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
  input:
    tuple val(sample_id), file(sample_files) from reads_ch
  output:
    tuple val(sample_id), file('sample.bam') into bam_ch
  script:
  """
    echo your_command_here --reads $sample_id > sample.bam
  """
}

bam_ch.view()
In previous versions of Nextflow tuple was called set, but it was used with exactly the same semantics. It can still be used for backward compatibility.

Exercise

Modify the script of the previous exercise so that the bam file is named as the given sample_id.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process foo {
  input:
    tuple val(sample_id), file(sample_files) from reads_ch
  output:
    tuple val(sample_id), file("${sample_id}.bam") into bam_ch
  script:
  """
    echo your_command_here --reads $sample_id > ${sample_id}.bam
  """
}

bam_ch.view()

6.4. When

The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.

It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)

process find {
  input:
  file fasta from proteins
  val type from params.dbtype

  when:
  fasta.name =~ /^BB11.*/ && type == 'nr'

  script:
  """
  echo blastp -query $fasta -db nr
  """
}

6.5. Directives

Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.

They must be entered at the top of the process body, before any other declaration blocks (i.e. input, output, etc.).

Directives are commonly used to define the amount of computing resources to be used, or other meta directives that allow the definition of extra information for configuration or logging purposes. For example:

1
2
3
4
5
6
7
8
9
10
process foo {
  cpus 2
  memory 1.GB
  container 'image/name'

  script:
  """
  echo your_command --this --that
  """
}

The complete list of directives is available at this link.

6.6. Organise outputs

6.6.1. PublishDir directive

Given that each process is executed in its own temporary folder under work/ (e.g. work/f1/850698…; work/g3/239712…; etc.), we may want to save important, non-intermediate, final files into a results folder.

Remember to delete the work folder from time to time, else all your intermediate files will fill up your computer!

To store our workflow result files, we need to explicitly mark them using the publishDir directive in the process that creates these files. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
params.outdir = 'my-results'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)


process blastSeq {
    publishDir "$params.outdir/bam_files", mode: 'copy'

    input:
    file fasta from proteins

    output:
    file ('*.txt') into blast_ch

    """
    echo blastp $fasta > ${fasta}_result.txt
    """
}

blast_ch.view()

The above example will copy all the result text files created by the blastSeq task into the directory path my-results/bam_files.

The publish directory can be local or remote. For example, output files could be stored in an AWS S3 bucket simply by using the s3:// prefix in the target path.
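For example, a minimal sketch using a hypothetical bucket name:

publishDir 's3://my-bucket/results', mode: 'copy'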

6.6.2. Manage semantic sub-directories

You can use more than one publishDir directive to keep different outputs in separate directories. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

Channel
    .fromFilePairs(params.reads, flat: true)
    .set{ samples_ch }

process foo {
  publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
  publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
  publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

  input:
    tuple val(sampleId), file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch
  output:
    file "*"
  script:
  """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
  """
}

The above example will create an output structure in the directory my-results, which contains a separate sub-directory for each given sample ID, each of which contains the folders counts and outlooks.

7. Operators

Operators are methods that allow you to connect channels to each other or to transform values emitted by a channel by applying user-provided rules.

There are seven main groups of operators, which are detailed in the documentation.

7.1. Basic example

1
2
3
nums = Channel.from(1,2,3,4)         (1)
square = nums.map { it -> it * it }  (2)
square.view()                        (3)
1 Create a queue channel emitting four values.
2 Create a new channel, transforming each number to its square.
3 Print the channel content.

Operators can also be chained to implement custom behaviors, so the previous snippet can be written as:

1
2
3
Channel.from(1,2,3,4)
        .map { it -> it * it }
        .view()

7.2. Basic operators

Here we explore some of the most commonly used operators.

7.2.1. view

The view operator prints the items emitted by a channel to the console standard output, appending a new line character to each item. For example:

1
2
3
Channel
      .from('foo', 'bar', 'baz')
      .view()

It prints:

foo
bar
baz

An optional closure parameter can be specified to customize how items are printed. For example:

1
2
3
Channel
      .from('foo', 'bar', 'baz')
      .view { "- $it" }

It prints:

- foo
- bar
- baz

7.2.2. map

The map operator applies a function of your choosing to every item emitted by a channel, and returns the items obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:

1
2
3
4
Channel
    .from( 'hello', 'world' )
    .map { it -> it.reverse() }
    .view()

A map can associate each element with a generic tuple containing any data, as needed.

1
2
3
4
Channel
    .from( 'hello', 'world' )
    .map { word -> [word, word.size()] }
    .view { word, len -> "$word contains $len letters" }

Exercise

Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq, then chain with a map to return a pair containing the file name and the path itself. Finally print the resulting channel.

Click here for the answer:
1
2
3
Channel.fromPath('data/ggal/*.fq')
        .map { file -> [ file.name, file ] }
        .view { name, file -> "> $name : $file" }

7.2.3. into

The into operator connects a source channel to two or more target channels in such a way that the values emitted by the source channel are copied to the target channels. For example:

1
2
3
4
5
6
Channel
     .from( 'a', 'b', 'c' )
     .into{ foo; bar }

foo.view{ "Foo emits: " + it }
bar.view{ "Bar emits: " + it }
Note the use in this example of curly brackets and the ; as the channel name separator. This is needed because the actual parameter of into is a closure which defines the target channels to which the source one is connected.

7.2.4. mix

The mix operator combines the items emitted by two (or more) channels into a single channel.

1
2
3
4
5
c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b' )
c3 = Channel.from( 'z' )

c1 .mix(c2,c3).view()
1
2
a
3
b
z
The items in the resulting channel have the same order as in the respective original channels; however, there’s no guarantee that the elements of the second channel are appended after the elements of the first. Indeed, in the above example the element a has been printed before 3.

7.2.5. flatten

The flatten operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.

1
2
3
4
5
6
7
foo = [1,2,3]
bar = [4,5,6]

Channel
    .from(foo, bar)
    .flatten()
    .view()

The above snippet prints:

1
2
3
4
5
6

7.2.6. collect

The collect operator collects all the items emitted by a channel to a list and returns the resulting object as a sole emission.

1
2
3
4
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .view()

It prints a single value:

[1,2,3,4]
The result of the collect operator is a value channel.

7.2.7. groupTuple

The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping together the elements that share the same key. Finally it emits a new tuple object for each distinct key collected.

Try the following example:

1
2
3
4
Channel
     .from( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
     .groupTuple()
     .view()

It shows:

[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]

This operator is useful for processing together all the elements that share a common property or grouping key.

Exercise

Use fromPath to create a channel emitting all the files in the folder data/meta/, then use a map to associate each file with its baseName prefix. Finally, group together all the files having the same common prefix.

Click here for the answer:
1
2
3
4
Channel.fromPath('data/meta/*')
        .map { file -> tuple(file.baseName, file) }
        .groupTuple()
        .view { baseName, file -> "> $baseName : $file" }

7.2.8. join

The join operator creates a channel that joins together the items emitted by two channels for which there exists a matching key. The key is defined, by default, as the first element in each item emitted.

1
2
3
left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right= Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()

The resulting channel emits:

[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]
Notice that 'P' is missing from the final result.
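If the unmatched items also need to be emitted, the join operator accepts a remainder option; a sketch of the same example:

left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right= Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right, remainder: true).view()

In this case the tuple [P, 7, null] is also emitted, with null filling in for the missing value.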

7.2.9. branch

The branch operator allows you to forward the items emitted by a source channel to one or more output channels, choosing one of them at a time.

The selection criteria are defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. The current item is bound to the channel named after the first expression that evaluates to a true value. For example:

1
2
3
4
5
6
7
8
9
10
Channel
    .from(1,2,3,40,50)
    .branch {
        small: it < 10
        large: it > 10
    }
    .set { result }

 result.small.view { "$it is small" }
 result.large.view { "$it is large" }
The branch operator returns a multi-channel object i.e. a variable that holds more than one channel object.

7.3. More resources

Check the operators documentation on the Nextflow web site.

8. Groovy basic structures and idioms

Nextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.

Here are some important Groovy syntax constructs that are commonly used in Nextflow.

8.1. Printing values

To print something is as easy as using one of the print or println methods.

1
println("Hello, World!")

The only difference between the two is that the println method implicitly appends a new line character to the printed string.

Parentheses for function invocations are optional. Therefore, the following is also valid syntax:
1
println "Hello, World!"

8.2. Comments

Comments use the same syntax as C-family programming languages:

1
2
3
4
5
6
// comment a single line

/*
   a comment spanning
   multiple lines
 */

8.3. Variables

To define a variable, simply assign a value to it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
x = 1
println x

x = new java.util.Date()
println x

x = -3.1499392
println x

x = false
println x

x = "Hi"
println x

Local variables are defined using the def keyword:

1
def x = 'foo'

It should always be used when defining variables local to a function or a closure.

8.4. Lists

A List object can be defined by placing the list items in square brackets:

1
list = [10,20,30,40]

You can access a given item in the list with square-bracket notation (indexes start at 0) or using the get method:

1
2
println list[0]
println list.get(0)

In order to get the length of the list use the size method:

1
println list.size()

We use the assert keyword to test if a condition is true (similar to an if statement). Here, Groovy will print nothing if it is correct, otherwise it will raise an AssertionError message.

1
assert list[0] == 10
This assertion should be correct, but try changing the value to an incorrect one.

Lists can also be indexed with negative indexes and reversed ranges.

1
2
3
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()

List objects implement all methods provided by the java.util.List interface, plus the extension methods provided by Groovy API.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]

8.5. Maps

Maps are like lists that have an arbitrary type of key instead of an integer. Therefore, the syntax is very much aligned.

1
map = [a:0, b:1, c:2]

Maps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.

1
2
3
assert map['a'] == 0        (1)
assert map.b == 1           (2)
assert map.get('c') == 2    (3)
1 Use of the square brackets.
2 Use a dot notation.
3 Use of get method.

To add data to or modify a map, the syntax is similar to adding values to a list:

1
2
3
4
map['a'] = 'x'           (1)
map.b = 'y'              (2)
map.put('c', 'z')        (3)
assert map == [a:'x', b:'y', c:'z']
1 Use of the square brackets.
2 Use a dot notation.
3 Use of put method.

Map objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy API.

8.6. String interpolation

String literals can be defined by enclosing them with either single or double quotation marks.

Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:

1
2
3
4
5
6
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"

x = 'Hello'
println '$x + $y'

This code prints:

1
2
The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.

Finally, string literals can also be defined using the / character as a delimiter. They are known as slashy strings and are useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted strings, they allow the interpolation of variables prefixed with a $ character.

Try the following to see the difference:

1
2
3
4
5
x = /tic\tac\toe/
y = 'tic\tac\toe'

println x
println y

It prints:

tic\tac\toe
tic    ac    oe

8.7. Multi-line strings

A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:

1
2
3
4
text = """
    Hello there James.
    How are you today?
    """

Finally multi-line strings can also be defined with slashy strings. For example:

1
2
3
4
5
text = /
    This is a multi-line
    slashy string!
    It's cool, isn't it?!
    /
Like before, multi-line strings inside double quotes and slash characters support variable interpolation, while single-quoted multi-line strings do not.

8.8. If statement

The if statement uses the same syntax common to other programming languages, such as Java, C, JavaScript, etc.

1
2
3
4
5
6
if( < boolean expression > ) {
    // true branch
}
else {
    // false branch
}

The else branch is optional. Also curly brackets are optional when the branch defines just a single statement.

1
2
3
x = 1
if( x > 10 )
    println 'Hello'
null, empty strings and empty collections are evaluated to false.

Therefore a statement like:

1
2
3
4
5
6
7
list = [1,2,3]
if( list != null && list.size() > 0 ) {
  println list
}
else {
  println 'The list is empty'
}

Can be written as:

1
2
3
4
if( list )
    println list
else
    println 'The list is empty'

See the Groovy-Truth for details.

In some cases it can be useful to replace the if statement with a ternary expression (aka conditional expression). For example:
1
println list ? list : 'The list is empty'

The previous statement can be further simplified using the Elvis operator as shown below:

1
println list ?: 'The list is empty'

8.9. For statement

The classical for loop syntax is supported as shown here:

1
2
3
for (int i = 0; i < 3; i++) {
   println("Hello World $i")
}

Iteration over list objects is also possible using the syntax below:

1
2
3
4
5
list = ['a','b','c']

for( String elem : list ) {
  println elem
}

8.10. Functions

It is possible to define a custom function in a script, as shown here:

1
2
3
4
5
int fib(int n) {
    return n < 2 ? 1 : fib(n-1) + fib(n-2)
}

assert fib(10)==89

A function can take multiple arguments, separated by commas. The return keyword can be omitted and the function implicitly returns the value of the last evaluated expression. Also, explicit types can be omitted, though this is not recommended:

1
2
3
4
5
def fact( n ) {
  n > 1 ? n * fact(n-1) : 1
}

assert fact(5) == 120

8.11. Closures

Closures are the Swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that can be passed as an argument to a function; it can also be thought of as an anonymous function.

More formally, a closure allows the definition of functions as first class objects.

1
square = { it * it }

The curly brackets around the expression it * it tell the script interpreter to treat this expression as code. The it identifier is an implicit variable that represents the value that is passed to the function when it is invoked.

Once compiled, the function object is assigned to the variable square like any other variable assignment shown previously. To invoke the closure execution, use the special method call or just use round parentheses to specify the closure parameter(s). For example:

1
2
assert square.call(5) == 25
assert square(9) == 81

This is not very interesting until we find that we can pass the function square as an argument to other functions or methods. Some built-in functions take a function like this as an argument. One example is the collect method on lists:

1
2
x = [ 1, 2, 3, 4 ].collect(square)
println x

It prints:

[ 1, 4, 9, 16 ]

By default, closures take a single parameter called it. To give it a different name use the -> syntax. For example:

1
square = { num -> num * num }

It’s also possible to define closures with multiple, custom-named parameters.

For example, the method each(), when applied to a map, can take a closure with two arguments, to which it passes the key-value pair for each entry in the map object:

1
2
3
printMap = { a, b -> println "$a with value $b" }
values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
values.each(printMap)

It prints:

Yue with value Wu
Mark with value Williams
Sudha with value Kumari

A closure has two other important features.

First, it can access and modify variables in the scope where it is defined.

Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.

As an example showing both these features, see the following code fragment:

1
2
3
4
result = 0                                       (1)
values = ["China": 1 , "India" : 2, "USA" : 3]   (2)
values.keySet().each { result += values[it] }    (3)
println result
1 Define a global variable.
2 Define a map object.
3 Invoke the each method passing the closure object which modifies the result variable.

Learn more about closures in the Groovy documentation.

8.12. More resources

The complete Groovy language documentation is available at this link.

A great resource to master Apache Groovy syntax is the book: Groovy in Action.

9. DSL2 and modules

9.1. Basic concepts

DSL2 is our new syntax extension, developed primarily to improve readability and allow the use of modules.

However, DSL1 is still a valid way of writing pipelines, and indeed many public pipelines are written this way. To preserve backward compatibility, DSL2 must be enabled explicitly by adding the following line at the beginning of each workflow script:

nextflow.enable.dsl=2

DSL2 core features include:

  • Separation of processes from their invocation.

  • Use of a workflow directive to execute specific processes.

  • Archiving of processes into modules.

  • Update of syntax (pipe operator, & operator and channel forking)

9.2. Process

9.2.1. Process definition

The new DSL separates the definition of a process from its invocation. The process definition follows the usual syntax as described in the process documentation. The only difference is that the from and into channel declarations have to be omitted.

Then a process can be invoked as a function in the workflow scope, passing the expected input channels as parameters as if it were a custom function. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
nextflow.enable.dsl=2

process foo {
    output:
      path 'foo.txt'
    script:
      """
      echo first_command > foo.txt
      """
}

workflow {
  foo()
}

On the first line we enable DSL2, then we define a process called foo, which sends the file foo.txt over its output channel without using "into out_ch". The process is then executed in the workflow block, where it is invoked.

A process component can be invoked only once in the same workflow context.

Next, we can add additional processes to this script and add another call from the workflow directive, thereby separating the processes into 'submodules'.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
nextflow.enable.dsl=2

process foo {
    output:
      path 'foo.txt'
    script:
      """
      echo first_command > foo.txt
      """
}

process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      head -n 1 $x > bar.txt
      echo third_command >> bar.txt
      """
}

workflow {
  foo()
  data = channel.fromPath('data/test/baz.txt')
  bar(data)
}

In this example, we have added a second process, bar, which is also called within the workflow block; its input comes from a channel created with channel.fromPath.

Exercise

Can you guess what will be in the output files foo.txt and bar.txt, given that data/test/baz.txt contains the string second_command? Run the script to find out.

Answer:

foo.txt will have the following content:

first_command

bar.txt will have the following content:

second_command
third_command

9.2.2. Process composition

Processes having matching input-output declarations can be composed so that the output of the first process is passed as input to the following process. Taking into consideration the previous process definitions, it’s possible to write the following workflow block:

1
2
3
workflow {
    bar(foo())
}

Exercise

Try to work out what will be in the output file bar.txt using the updated workflow.

Answer:

bar.txt should contain the following:

first_command
third_command

9.2.3. Process outputs

A process output can also be accessed using the out attribute for the respective process object.

For example:

1
2
3
4
5
workflow {
    foo()
    bar(foo.out)
    bar.out.view()
}

When a process defines two or more output channels, each of them can be accessed using the array element operator e.g. out[0], out[1], etc.

For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
nextflow.enable.dsl=2

process foo {
    output:
      path 'foo.txt'
      path 'extra.txt'
    script:
      """
      echo first_command > foo.txt
      echo fourth_command > extra.txt
      """
}

process bar {
    input:
      path x
    output:
      path 'bar.txt'
    script:
      """
      head -n 1 $x > bar.txt
      echo third_command >> bar.txt
      """
}

data = channel.fromPath('./baz.txt')

workflow {
  foo()
  bar(foo.out[1])
  bar.out.view()
}

Exercise

What would you expect to find in bar.txt?

Answer:

bar.txt should contain the following:

fourth_command
third_command

Another option is using named outputs (see below).

9.2.4. Process named output

The process output definition allows the use of the emit option to define a name identifier that can be used to reference the channel in the external scope. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
nextflow.enable.dsl=2

process foo {
  output:
    path '*.bam', emit: samples_bam

  '''
  echo result > output.bam
  '''
}

workflow {
    foo()
    foo.out.samples_bam.view()
}

9.2.5. Process named stdout

The process can name stdout using the emit option:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
nextflow.enable.dsl=2

process sayHello {
    input:
        val cheers
    output:
        stdout emit: verbiage
    script:
    """
    echo -n $cheers
    """
}

workflow {
    things = channel.of('Hello world!', 'Yo, dude!', 'Duck!')
    sayHello(things)
    sayHello.out.verbiage.view()
}

9.3. Workflow

9.3.1. Workflow definition

The workflow keyword allows the definition of sub-workflow components that enclose the invocation of one or more processes and operators:

1
2
3
4
workflow my_pipeline {
    foo()
    bar( foo.out.collect() )
}

For example, the above snippet defines a workflow component, named my_pipeline, that can be invoked from another workflow component definition as any other function or process i.e. my_pipeline().

9.3.2. Workflow parameters

A workflow component can access any variable and parameter defined in the outer scope:

1
2
3
4
5
6
7
8
params.data = '/some/data/file'

workflow my_pipeline {
    if( params.data )
        bar(params.data)
    else
        bar(foo())
}

9.3.3. Workflow inputs

A workflow component can declare one or more input channels using the take keyword. For example:

1
2
3
4
5
6
workflow my_pipeline {
    take: data
    main:
    foo(data)
    bar(foo.out)
}
When the take keyword is used, the beginning of the workflow body needs to be identified with the main keyword.

Then, the input can be specified as an argument in the workflow invocation statement:

1
2
3
workflow {
    my_pipeline( channel.from('/some/data') )
}
Workflow inputs are, by definition, channel data structures. If a basic data type is provided instead, i.e. a number, string, list, etc., it is implicitly converted to a value channel (i.e. non-consumable).

9.3.4. Workflow outputs

A workflow component can declare one or more output channels using the emit keyword. For example:

1
2
3
4
5
6
7
workflow my_pipeline {
    main:
      foo(data)
      bar(foo.out)
    emit:
      bar.out
}

Then, the result of the my_pipeline execution can be accessed using the out property i.e. my_pipeline.out. When there are multiple output channels declared, use the array bracket notation to access each output component as described for the Process outputs definition.

Alternatively, the output channel can be accessed using the identifier name it’s assigned to in the emit declaration:

1
2
3
4
5
6
7
workflow my_pipeline {
   main:
     foo(data)
     bar(foo.out)
   emit:
     my_data = bar.out
}

Then, the result of the above snippet can be accessed using my_pipeline.out.my_data.

9.3.5. Implicit workflow

A workflow definition which does not declare any name is assumed to be the main workflow and it’s implicitly executed. Therefore it’s the entry point of the workflow application.

Implicit workflow definition is ignored when a script is included as a module. This allows the writing of a workflow script that can be used either as a library module or as an application script.
An alternative workflow entry point can be specified using the -entry command line option.
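For example, assuming the sub-workflow my_pipeline above is defined in a script named main.nf, it could be launched directly with:

nextflow run main.nf -entry my_pipeline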

9.3.6. Workflow composition

Workflows defined in your script or imported by a module inclusion can be invoked and composed as any other process in your application.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
workflow flow1 {
    take: data
    main:
        foo(data)
        bar(foo.out)
    emit:
        bar.out
}

workflow flow2 {
    take: data
    main:
        foo(data)
        baz(foo.out)
    emit:
        baz.out
}

workflow {
    take: data
    main:
      flow1(data)
      flow2(flow1.out)
}
Nested workflow execution determines an implicit scope. Therefore the same process can be invoked in two different workflow scopes, for example foo in the above snippet, which is used in both flow1 and flow2. The workflow execution path, along with the process names, defines the fully qualified process name that is used to distinguish the two different process invocations (i.e. flow1:foo and flow2:foo in the above example).

TIP: The fully qualified process name can be used as a valid process selector in the nextflow.config file, and it has priority over the simple process name.
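For example, a nextflow.config snippet (a sketch with hypothetical resource values) could target only the foo invocation made by flow1:

process {
    withName: 'flow1:foo' {
        cpus = 4
        memory = '8 GB'
    }
}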

9.4. Modules

The new DSL allows the definition of module scripts that can be included and shared across workflow applications.

A module can contain function, process and workflow definitions, as described in the above sections.

9.4.1. Modules include

A component defined in a module script can be imported into another Nextflow script using the include keyword.

For example:

1
2
3
4
5
6
include { foo } from './path/to/modules.nf'

workflow {
    data = channel.fromPath('/some/data/*.txt')
    foo(data)
}

The above snippet includes a process named foo, defined in the module script, into the main execution context; as such, it can be invoked in the workflow scope. "modules.nf" is a file that would contain one or more process definitions (including foo).

Nextflow implicitly looks for the script file ./path/to/modules.nf, resolving the path relative to the location of the including script.

Relative paths must begin with the ./ prefix.

9.4.2. Multiple inclusions

A Nextflow script allows the inclusion of any number of modules. When multiple components need to be included from the same module script, the component names can be specified in the same inclusion using the curly brackets notation as shown below:

1
2
3
4
5
6
7
include { foo; bar } from './some/module'

workflow {
    data = channel.fromPath('/some/data/*.txt')
    foo(data)
    bar(data)
}

9.4.3. Module aliases

When including a module component it’s possible to specify a name alias. This allows the inclusion and the invocation of the same component multiple times in your script using different names. For example:

1
2
3
4
5
6
7
include { foo } from './some/module'
include { foo as bar } from './other/module'

workflow {
    foo(some_data)
    bar(other_data)
}

The same is possible when including multiple components from the same module script as shown below:

1
2
3
4
5
6
include { foo; foo as bar } from './some/module'

workflow {
    foo(some_data)
    bar(other_data)
}

9.4.4. Module parameters

A module script can define one or more parameters using the same syntax as a Nextflow workflow script, as well as workflow and function definitions:

1
2
3
4
5
6
params.foo = 'Hello'
params.bar = 'world!'

def sayHello() {
    println "$params.foo $params.bar"
}

Parameters are inherited from the including context. For example:

1
2
3
4
5
6
7
8
params.foo = 'Hola'
params.bar = 'Mundo'

include {sayHello} from './some/module'

workflow {
    sayHello()
}

The above snippet should print:

1
Hola Mundo
The module inherits the parameters defined before the include statement, therefore any further parameters set later are ignored.
Define all pipeline parameters at the beginning of the script before any include declaration.

The option addParams can be used to extend the module parameters without affecting the external scope. For example:

1
2
3
4
5
include {sayHello} from './some/module' addParams(foo: 'Ciao')

workflow {
    sayHello()
}

The above snippet should print:

1
Ciao world!

Finally, the include option params allows the specification of one or more parameters without inheriting any value from the external environment.
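A sketch of this option, assuming it takes the same form as addParams:

include {sayHello} from './some/module' params(foo: 'Ciao', bar: 'mundo')

workflow {
    sayHello()
}

Here both foo and bar come only from the params clause; nothing is inherited from the including script.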

Exercise

Try to run the above code, replacing ./some/module with the path to a module file containing a sayHello() function that uses the foo and bar parameters. Remember to use ./ for the current directory.

Answer:
  1. First save the following to ./modules/my_modules.nf:

    1
    2
    3
    4
    5
    6
    
    params.foo = 'Hello'
    params.bar = 'world!'
    
    def sayHello() {
        println "$params.foo $params.bar"
    }
  2. Then run nextflow run myscript.nf:

    Where myscript.nf is the following:

nextflow.enable.dsl=2

params.foo = 'Hola'
params.bar = 'Mundo'

include {sayHello} from './modules/my_modules.nf'

workflow {
    sayHello()
}

9.5. DSL2 migration notes

Some of the syntax has changed between DSL1 and DSL2.

These are a few of the key changes:

  • The declaration: nextflow.enable.dsl=2 is used in place of nextflow.preview.dsl=2.

  • Process inputs and outputs of type set have to be replaced with tuple.

  • Process output option mode flatten is not available anymore. Replace it using the flatten operator on the corresponding output channel.

  • Anonymous and unwrapped includes are not supported anymore. Replace them with an explicit module inclusion. For example:

1
2
3
4
5
6
7
include './some/library'
include bar from './other/library'

workflow {
  foo()
  bar()
}

Should be replaced with:

1
2
3
4
5
6
7
include { foo } from './some/library'
include { bar } from './other/library'

workflow {
  foo()
  bar()
}
  • The use of unqualified value (val) and file elements in input tuples is not allowed anymore. Replace them with the corresponding val or path qualifiers:

1
2
3
4
5
6
7
8
process foo {
input:
  tuple X, 'some-file.bam'
 script:
   '''
   your_command
   '''
}

Use:

1
2
3
4
5
6
7
8
process foo {
input:
  tuple val(X), path('some-file.bam')
 script:
   """
   your_command --in $X some-file.bam
   """
}
  • The use of unqualified value (val) and file elements in output tuples is not allowed anymore. Replace them with the corresponding val or path qualifiers:

1
2
3
4
5
6
7
8
9
10
process foo {
output:
  tuple X, 'some-file.bam'

script:
   X = 'some value'
   '''
   your_command > some-file.bam
   '''
}

Use:

1
2
3
4
5
6
7
8
9
10
process foo {
output:
  tuple val(X), path('some-file.bam')

script:
   X = 'some value'
   '''
   your_command > some-file.bam
   '''
}
  • Operator bind has been deprecated by DSL2 syntax

  • The operator << has been deprecated by DSL2 syntax.

  • Operator choice has been deprecated by DSL2 syntax. Use branch instead.

  • Operator close has been deprecated by DSL2 syntax.

  • Operator create has been deprecated by DSL2 syntax.

  • Operator countBy has been deprecated by DSL2 syntax.

  • Operator into has been deprecated by DSL2 syntax since it’s not needed anymore.

  • Operator fork has been renamed to multiMap.

  • Operator groupBy has been deprecated by DSL2 syntax. Replace it with groupTuple

  • Operator print and println have been deprecated by DSL2 syntax. Use view instead.

  • Operator merge has been deprecated by DSL2 syntax. Use join instead.

  • Operator separate has been deprecated by DSL2 syntax.

  • Operator spread has been deprecated with DSL2 syntax. Replace it with combine.

  • Operator route has been deprecated by DSL2 syntax.
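For example, where DSL1 used the into operator to copy a channel for two consumers, in DSL2 the same channel can simply be used multiple times. A minimal sketch, assuming two hypothetical processes foo and bar:

// DSL1
Channel
    .from('a', 'b', 'c')
    .into { foo_ch; bar_ch }

// DSL2
letters_ch = channel.from('a', 'b', 'c')

workflow {
    foo(letters_ch)
    bar(letters_ch)
}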

10. Get started with Nextflow Tower

10.1. Basic concepts

Nextflow Tower is the centralized command-post for the management of Nextflow data pipelines. It brings monitoring, logging & observability, to distributed workflows and simplifies the deployment of pipelines on any cloud, cluster or laptop.

Nextflow Tower core features include:

  • the launching of pre-configured pipelines with ease.

  • programmatic integration to meet the needs of organizations.

  • publishing pipelines to shared workspaces.

  • management of infrastructure required to run data analysis at scale.

Sign up to try Tower for free or request a demo for deployments in your own on-premise or cloud environment.

10.2. Usage

You can use Tower either via the -with-tower option of the nextflow run command, through the online GUI, or through the API.

10.2.1. Via the Nextflow run command

Create an account and log in to Tower.

1. Create a new token

You can access your tokens from the Settings drop-down menu.


2. Name your token


3. Save your token safely

Copy and keep your new token in a safe place.

4. Export your token

Once your token has been created, open a terminal and type:

1
2
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
export NXF_VER=20.10.0

Where eyxxxxxxxxxxxxxxxQ1ZTE= is the token you have just created.

Check your Nextflow version with nextflow -version. Bearer tokens require Nextflow version 20.10.0 or later, which is set with the second command above; change it as necessary.

To submit a pipeline to a Workspace using the Nextflow command line tool, add the workspace ID to your environment. For example:

1
export TOWER_WORKSPACE_ID=000000000000000

The workspace ID can be found on the organisation’s Workspaces overview page.

5. Run Nextflow with Tower

Run your Nextflow workflows as usual with the addition of the -with-tower option:

1
nextflow run hello.nf -with-tower

You will see and be able to monitor your Nextflow jobs in Tower.

To configure and execute Nextflow jobs in Cloud environments, visit the Compute environments section.

Exercise

Run the RNA-Seq script7.nf using the -with-tower flag, after correctly completing the token settings outlined above.

If you get stuck, click here:

Go to tower.nf, log in, then click the run tab and select the run that you just submitted. If you can’t find it, double check that your token was entered correctly.

10.2.2. Via online GUI

To run using the GUI, there are three main steps:

1. Create an account and log in to Tower, available free of charge, at tower.nf.

2. Create and configure a new compute environment.

Configuring your compute environment

Tower uses the concept of Compute Environments to define the execution platform where a pipeline will run.

It supports the launching of pipelines into a growing number of cloud and on-premise infrastructures.


Each compute environment must be pre-configured to enable Tower to submit tasks. You can read more on how to set up each environment using the links below.

The following set-up guides describe how to configure each of these compute environments.

Selecting a default compute environment

If you have more than one Compute Environment, you can select which one will be used by default when launching a pipeline.

  1. Navigate to your compute environments.

  2. Choose your default environment by selecting the Make primary button.

Congratulations!

You are now ready to launch pipelines with your primary compute environment.

Launchpad

Launchpad makes it easy for any workspace user to launch a pre-configured pipeline.


A pipeline is a repository containing a Nextflow workflow, a compute environment and pipeline parameters.

Pipeline Parameters Form

Launchpad automatically detects the presence of a nextflow_schema.json in the root of the repository and dynamically creates a form where users can easily update the parameters.

The parameter forms view will appear if the workflow has a Nextflow schema file for the parameters. Please refer to the Nextflow Schema guide to learn more about the use cases and how to create them.

This makes it trivial for users without any expertise in Nextflow to enter their pipeline parameters and launch.

Adding a new pipeline

Adding a pipeline to the pre-saved workspace launchpad is detailed in full on the tower webpage docs.

In brief, these are the steps you need to follow to set up a pipeline.

  1. Select the Launchpad button in the navigation bar. This will open the Launch Form.

  2. Select a compute environment.

  3. Enter the repository of the pipeline you want to launch. e.g. github.com/nf-core/rnaseq.git

  4. A Revision number can be used to select a different version of the pipeline. The Git default branch (main/master) or manifest.defaultBranch in the Nextflow configuration will be used by default.

  5. The Work directory specifies the location of the Nextflow work directory. The location associated with the compute environment will be selected by default.

  6. Enter the name(s) of each of the Nextflow Config profiles followed by the Enter key. See the Nextflow Config profiles documentation for more details.

  7. Enter any Pipeline parameters in YAML or JSON format. YAML example:

    1
    2
    
        reads: 's3://nf-bucket/exome-data/ERR013140_{1,2}.fastq.bz2'
        paired_end: true
  8. Select Launchpad to begin the pipeline execution.

Nextflow pipelines are simply Git repositories and the location can be any public or private Git-hosting platform. See Git Integration in the Tower docs and Pipeline Sharing in the Nextflow docs for more details.
The credentials associated with the compute environment must be able to access the work directory.
In the configuration, the full path to a bucket must be specified with single quotes around strings and no quotes around booleans or numbers.
To create your own customized Nextflow Schema for your pipeline, see the examples from the increasing number of nf-core workflows that have adopted this, for example eager and rnaseq.

For advanced setting options check out this page.

There is also community support available if you get into trouble, see here.

10.2.3. API

To learn more about using the Tower API, visit the API section in this documentation.

10.3. Workspaces and Organisations

Nextflow Tower simplifies the development and execution of workflows by providing a centralized interface for users and organisations.

Each user has a unique workspace where they can interact with and manage all resources such as workflows, compute environments and credentials. Details can be found here.

By default, each user has their own private workspace, while organisations have the ability to run and manage users through role-based access as members and collaborators.

10.3.1. Organization resources

You can create your own organisation and participant workspace by following the Tower docs.

Tower allows creation of multiple organizations, each of which can contain multiple workspaces with shared users and resources. This allows any organization to customize and organize the usage of resources while maintaining an access control layer for users associated with a workspace.

10.3.2. Organization users

Any user can be added or removed from a particular organization or a workspace and can be allocated a specific access role within that workspace.

The Teams feature provides a way for organizations to group various users and participants into teams, for example workflow-developers or analysts, and to apply access control to all the users within a team as a whole.

For further information, please refer to the User Management section.

Setting up a new organisation

Organizations are the top-level structure and contain Workspaces, Members, Teams and Collaborators.

To create a new Organization:

  1. Click on the dropdown next to your name and select New organization to open the creation dialog.

  2. On the dialog, fill in the fields as per your organization. The Name and Full name fields are compulsory.

    A valid name for the organization must follow a specific pattern. Please refer to the UI for further instructions.
  3. The rest of the fields such as Description, Location, Website URL and Logo Url are optional.

  4. Once the details are filled-in, you can access the newly created organization using the organizations page, which lists all of your organizations.

    It is possible to change the values of the optional fields either using the Edit option on the organizations page or using the Settings tab within the organization page, provided that you are the Owner of the organization.
    A list of all the included Members, Teams and Collaborators can be found at the organization page.

11. Nextflow configuration

A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by the underlying execution platform.

This enables portable deployment without the need to modify the application code.

11.1. Configuration file

When a pipeline script is launched Nextflow looks for a file named nextflow.config in the current directory and in the script base directory (if it is not the same as the current directory). Finally it checks for the file: $HOME/.nextflow/config.

When more than one of the above files exists, they are merged, so that the settings in the first override the same settings that may appear in the second one, and so on.

The default config file search mechanism can be extended by providing an extra configuration file using the command line option -c <config file>.

11.1.1. Config syntax

A Nextflow configuration file is a simple text file containing a set of properties defined using the syntax:

name = value
Please note, string values need to be wrapped in quotation characters while numbers and boolean values (true, false) do not. Also, note that values are typed, meaning, for example, that 1 is different from '1', since the first is interpreted as the number one, while the latter is interpreted as a string value.
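
As a minimal sketch of how these typing rules look in practice (the property names below are illustrative, not standard Nextflow settings):

// custom properties showing how values are typed
my_greeting = 'Hello'   // a string: quotes required
my_count = 42           // a number: no quotes
my_flag = true          // a boolean: no quotes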

11.1.2. Config variables

Configuration properties can be used as variables in the configuration file itself, by using the usual $propertyName or ${expression} syntax.

1
2
3
propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"
In the configuration file it’s possible to access any variable defined in the host environment such as $PATH, $HOME, $PWD, etc.

11.1.3. Config comments

Configuration files use the same conventions for comments used in the Nextflow script:

1
2
3
4
5
6
// comment a single line

/*
   a comment spanning
   multiple lines
 */

11.1.4. Config scopes

Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier or grouping the properties in the same scope using the curly brackets notation. This is shown in the following example:

1
2
3
4
5
6
7
alpha.x  = 1
alpha.y  = 'string value..'

beta {
    p = 2
    q = 'another string ..'
}
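
For reference, the beta block above could equivalently be written using the dot-prefixed notation:

beta.p = 2
beta.q = 'another string ..'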

11.1.5. Config params

The scope params allows the definition of workflow parameters that override the values defined in the main workflow script.

This is useful to consolidate one or more execution parameters in a separate file.

1
2
3
// config file
params.foo = 'Bonjour'
params.bar = 'le monde!'
1
2
3
4
5
6
// workflow script
params.foo = 'Hello'
params.bar = 'world!'

// print both params
println "$params.foo $params.bar"

Exercise

Save the first snippet as nextflow.config and the second one as params.nf. Then run:

nextflow run params.nf
For the result, click here:
Bonjour le monde!

Execute it again, specifying the foo parameter on the command line:

nextflow run params.nf --foo Hola
For the result, click here:
Hola le monde!

Compare the result of the two executions.

11.1.6. Config env

The env scope allows the definition of one or more variables that will be exported into the environment where the workflow tasks will be executed.

1
2
env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"

Exercise

Save the above snippet in a file named my-env.config. Then save the snippet below in a file named foo.nf:

1
2
3
4
5
6
process foo {
  echo true
  '''
  env | egrep 'ALPHA|BETA'
  '''
}

Finally, execute the following command:

nextflow run foo.nf -c my-env.config
Your result should look something like the following, click here:
BETA=/home/some/path
ALPHA=some value

11.1.7. Config process

Process directives allow the specification of settings for the task execution, such as cpus, memory, container and other resources, directly in the pipeline script.

This is especially useful when prototyping a small workflow script.

However it’s always a good practice to decouple the workflow execution logic from the process configuration settings, i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow script.

The process configuration scope allows the setting of any process directives in the Nextflow configuration file. For example:

1
2
3
4
5
process {
    cpus = 10
    memory = 8.GB
    container = 'biocontainers/bamtools:v2.4.0_cv3'
}

The above config snippet defines the cpus, memory and container directives for all processes in your workflow script.

The process selector can be used to apply the configuration to a specific process or group of processes (discussed later).

Memory and time duration units can be specified either using a string-based notation, in which the digit(s) and the unit can be separated by a blank, or using the numeric notation, in which the digit(s) and the unit are separated by a dot character and are not enclosed in quote characters.

String syntax      Numeric syntax   Value
'10 KB'            10.KB            10240 bytes
'500 MB'           500.MB           524288000 bytes
'1 min'            1.min            60 seconds
'1 hour 25 sec'    -                1 hour and 25 seconds
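
As a small sketch combining the two notations in a config file (the directive values are illustrative):

process {
    memory = '500 MB'   // string notation: quotes, blank between digits and unit
    time = 1.min        // numeric notation: no quotes, dot between digits and unit
}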

The syntax for setting process directives in the configuration file requires = (i.e. assignment operator), whereas it should not be used when setting the process directives within the workflow script.

For an example, click here:

process foo {
  cpus 4
  memory 2.GB
  time 1.hour
  maxRetries 3

  script:
  """
    your_command --cpus $task.cpus --mem $task.memory
  """
}

This is important, especially when you want to define a config setting using a dynamic expression with a closure. For example:

process foo {
    memory = { 4.GB * task.cpus }
}

Directives that require more than one value, e.g. pod, need to be expressed as a map object in the configuration file.

process {
    pod = [env: 'FOO', value: '123']
}

Finally, directives that can be repeated in the process definition need to be defined as a list object in the configuration file. For example:

process {
    pod = [ [env: 'FOO', value: '123'],
            [env: 'BAR', value: '456'] ]
}

11.1.8. Config Docker execution

The container image to be used for the process execution can be specified in the nextflow.config file:

1
2
process.container = 'nextflow/rnaseq-nf'
docker.enabled = true

The use of unique "SHA256" docker image IDs guarantees that the image content does not change over time, for example:

1
2
process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true

11.1.9. Config Singularity execution

To run a workflow execution with Singularity, a container image file path is required in the Nextflow config file using the container directive:

1
2
process.container = '/some/singularity/image.sif'
singularity.enabled = true
The container image file must be an absolute path i.e. it must start with a /.

The following protocols are supported:

  • library:// download the container image from the Singularity Library service.

  • shub:// download the container image from the Singularity Hub.

  • docker:// download the container image from the Docker Hub and convert it to the Singularity format.

  • docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image file.

Singularity Hub (shub://) is no longer available as a builder service, though images created before 19th April 2021 still work.
By specifying a plain Docker container image name, Nextflow implicitly downloads and converts it to a Singularity image when the Singularity execution is enabled. For example:
1
2
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true

The above configuration instructs Nextflow to use the Singularity engine to run your script processes. The container is pulled from the Docker registry and cached in the current directory to be used for further runs.

Alternatively, if you have a Singularity image file, its absolute path can be specified as the container name, either using the -with-singularity command line option or the process.container setting in the config file.
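
For example, assuming the image file path shown above, the command line option could be used as follows:

nextflow run script7.nf -with-singularity /some/singularity/image.sif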

Exercise

Try to run the script as shown below, after changing the nextflow.config file to the Singularity configuration above:

nextflow run script7.nf
Nextflow will pull the container image automatically; it will take a few seconds depending on the network connection speed.

11.1.10. Config Conda execution

The use of a Conda environment can also be configured by adding the following setting to the nextflow.config file:

1
process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"

You can either specify the path of an existing Conda environment directory or the path of a Conda environment YAML file.
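
For example, a minimal sketch pointing at a hypothetical environment YAML file instead of an environment directory:

// hypothetical path to a Conda environment YAML file
process.conda = "/home/ubuntu/envs/nf-tutorial.yml"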

12. Deployment scenarios

Real-world genomic applications can spawn the execution of thousands of jobs. In this scenario a batch scheduler is commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across many compute nodes.

Nextflow has built-in support for most commonly used batch schedulers, such as Univa Grid Engine, SLURM and IBM LSF. Check the Nextflow documentation for the complete list of supported execution platforms.

12.1. Cluster deployment

A key Nextflow feature is the ability to decouple the workflow implementation from the actual execution platform. The implementation of an abstraction layer allows the deployment of the resulting workflow on any execution platform supported by the framework.

nf executors

To run your pipeline with a batch scheduler, modify the nextflow.config file, specifying the target executor and, if needed, the required computing resources. For example:

1
process.executor = 'slurm'

12.2. Managing cluster resources

When using a batch scheduler, it is often necessary to specify the amount of resources (i.e. cpus, memory, execution time, etc.) required by each task.

This can be done using the following process directives:

  • queue: the cluster queue to be used for the computation

  • cpus: the number of cpus to be allocated for a task execution

  • memory: the amount of memory to be allocated for a task execution

  • time: the maximum amount of time to be allocated for a task execution

  • disk: the amount of disk storage required for a task execution

12.2.1. Workflow wide resources

Use the scope process to define the resource requirements for all processes in your workflow applications. For example:

1
2
3
4
5
6
7
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4
}

12.2.2. Configure process by name

In real-world applications, different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the selector withName: followed by the process name:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4

    withName: foo {
        cpus = 2
        memory = '20 GB'
        queue = 'short'
    }

    withName: bar {
        cpus = 4
        memory = '32 GB'
        queue = 'long'
    }
}

Exercise

Run the RNA-Seq script (script7.nf) from earlier, but specify in the nextflow.config file that the quantification process requires 2 cpus and 5 GB of memory.

Click here for the answer:
1
2
3
4
5
6
process {
    withName: quantification {
        cpus = 2
        memory = '5 GB'
    }
}

12.2.3. Configure process by labels

When a workflow application is composed of many processes, listing all the process names in the configuration file to choose resources for each of them can be difficult.

A better strategy consists of annotating the processes with a label directive, then specifying the resources in the configuration file for all processes having the same label.

The workflow script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
process task1 {
  label 'long'

  """
  first_command --here
  """
}

process task2 {
  label 'short'

  """
  second_command --here
  """
}

The configuration file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process {
    executor = 'slurm'

    withLabel: 'short' {
        cpus = 4
        memory = '20 GB'
        queue = 'alpha'
    }

    withLabel: 'long' {
        cpus = 8
        memory = '32 GB'
        queue = 'omega'
    }
}

12.2.4. Configure multiple containers

It is possible to use a different container for each process in your workflow. You can define their containers in a config file as shown below:

1
2
3
4
5
6
7
8
9
10
process {
  withName: foo {
    container = 'some/image:x'
  }
  withName: bar {
    container = 'other/image:y'
  }
}

docker.enabled = true
Should I use a single fat container or many slim containers? Both approaches have pros & cons. A single container is simpler to build and to maintain; however, when using many tools the image can become very big and tools can conflict with each other. Using a container for each process can result in many different images to build and to maintain, especially when the processes in your workflow use different tools for each task.

Read more about config process selectors at this link.

12.3. Configuration profiles

Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile command line option.

Configuration profiles are defined by using the special scope profiles which group the attributes that belong to the same profile using a common prefix. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
profiles {

    standard {
        params.genome = '/local/path/ref.fasta'
        process.executor = 'local'
    }

    cluster {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'sge'
        process.queue = 'long'
        process.memory = '10GB'
        process.conda = '/some/path/env.yml'
    }

    cloud {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'awsbatch'
        process.container = 'cbcrg/imagex'
        docker.enabled = true
    }

}

This configuration defines three different profiles: standard, cluster and cloud that set different process configuration strategies depending on the target runtime platform. By convention the standard profile is implicitly used when no other profile is specified by the user.

To enable a specific profile use -profile option followed by the profile name:

nextflow run <your script> -profile cluster
Two or more configuration profiles can be specified by separating the profile names with a comma character:
nextflow run <your script> -profile standard,cloud

12.4. Cloud deployment

AWS Batch is a managed computing service that allows the execution of containerised workloads in the Amazon cloud infrastructure.

Nextflow provides built-in support for AWS Batch, which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions as Batch jobs.

Once the Batch environment is configured, specifying the instance types to be used and the max number of cpus to be allocated, you need to create a Nextflow configuration file like the one shown below:

1
2
3
4
5
6
process.executor = 'awsbatch'                          (1)
process.queue = 'nextflow-ci'                          (2)
process.container = 'nextflow/rnaseq-nf:latest'        (3)
workDir = 's3://nextflow-ci/work/'                     (4)
aws.region = 'eu-west-1'                               (5)
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' (6)
1 Set AWS Batch as the executor to run the processes in the workflow
2 The name of the computing queue defined in the Batch environment
3 The Docker container image to be used to run each job
4 The workflow work directory must be a AWS S3 bucket
5 The AWS region to be used
6 The path of the AWS cli tool required to download/upload files to/from the container
The best practice is to keep these settings as a separate profile in your workflow config file. This allows the execution with a simple command.
nextflow run script7.nf
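
As a minimal sketch of that practice (the profile name batch and the grouping shown are assumptions, not part of the official material), the settings above could be wrapped in a profile and selected at launch with the -profile option:

profiles {
    batch {
        process.executor = 'awsbatch'
        process.queue = 'nextflow-ci'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir = 's3://nextflow-ci/work/'
        aws.region = 'eu-west-1'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}

nextflow run script7.nf -profile batch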

The complete details about AWS Batch deployment are available at this link.

12.5. Volume mounts

EBS volumes (or other supported storage) can be mounted in the job container using the following configuration snippet:

aws {
  batch {
      volumes = '/some/path'
  }
}

Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used to define complex volumes for which the container path is different from the host path or to specify a read-only option:

aws {
  region = 'eu-west-1'
  batch {
      volumes = ['/tmp', '/host/path:/mnt/path:ro']
  }
}
This is a global configuration that has to be specified in a Nextflow config file, as such it’s applied to all process executions.
Nextflow expects those paths to be available. It does not handle the provisioning of EBS volumes or other kinds of storage.

12.6. Custom job definition

Nextflow automatically creates the Batch Job definitions needed to execute your pipeline processes. Therefore it’s not required to define them before you run your workflow.

However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration settings of a specific job (e.g. to define custom mount paths or other special settings of a Batch Job).

To use your own job definition in a Nextflow workflow, use it in place of the container image name, prefixing it with the job-definition:// string. For example:

process {
    container = 'job-definition://your-job-definition-name'
}

12.7. Custom image

Since Nextflow requires the AWS CLI tool to be accessible in the computing environment, a common solution consists of creating a custom AMI and installing the AWS CLI in a self-contained manner (e.g. using the Conda package manager).

When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized Amazon Linux AMI as the base image.

The following snippet shows how to install AWS CLI with Miniconda:

sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
The aws tool will be placed in a directory named bin in the main installation folder. Modifying this directory structure after the installation will cause the tool to not work properly.

Finally, specify the full path to the aws tool in the Nextflow config file as shown below:

aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'

12.8. Launch template

An alternative to creating a custom AMI is to use a Launch template that installs the AWS CLI tool during instance boot via custom user data.

In the EC2 dashboard, create a Launch template specifying the user data field:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3

## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda

--//--

Then in the Batch dashboard create a new compute environment and specify the newly created launch template in the corresponding field.

12.9. Hybrid deployments

Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed on the local computer or local computing cluster and some jobs are offloaded to the AWS Batch service.

To enable this feature use one or more process selectors in your Nextflow configuration file.

For example, to apply the AWS Batch configuration only to a subset of processes in your workflow, you can try the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
process {
    executor = 'slurm'  (1)
    queue = 'short'     (2)

    withLabel: bigTask {          (3)
      executor = 'awsbatch'       (4)
      queue = 'my-batch-queue'    (5)
      container = 'my/image:tag'  (6)
  }
}

aws {
    region = 'eu-west-1'    (7)
}
1 Set slurm as the default executor
2 Set the queue for the SLURM cluster
3 Settings for processes annotated with the bigTask label
4 Set awsbatch as the executor for the bigTask process
5 Set the queue for the bigTask process
6 Set the container image to deploy for the bigTask process
7 Define the region for Batch execution

13. Execution cache and resume

The Nextflow caching mechanism works by assigning a unique ID to each task which is used to create a separate execution directory where the tasks are executed and the results stored.

The task unique ID is generated as a 128-bit hash number obtained by composing the task input values, input files and the command string.

The pipeline work directory is organized as shown below:

work/
├── 12
│   └── 1adacb582d2198cd32db0e6f808bce
│       ├── genome.fa -> /data/../genome.fa
│       └── index
│           ├── hash.bin
│           ├── header.json
│           ├── indexing.log
│           ├── quasi_index.log
│           ├── refInfo.json
│           ├── rsd.bin
│           ├── sa.bin
│           ├── txpInfo.bin
│           └── versionInfo.json
├── 19
│   └── 663679d1d87bfeafacf30c1deaf81b
│       ├── ggal_gut
│       │   ├── aux_info
│       │   │   ├── ambig_info.tsv
│       │   │   ├── expected_bias.gz
│       │   │   ├── fld.gz
│       │   │   ├── meta_info.json
│       │   │   ├── observed_bias.gz
│       │   │   └── observed_bias_3p.gz
│       │   ├── cmd_info.json
│       │   ├── libParams
│       │   │   └── flenDist.txt
│       │   ├── lib_format_counts.json
│       │   ├── logs
│       │   │   └── salmon_quant.log
│       │   └── quant.sf
│       ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│       ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│       └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index
You can display this directory structure using the tree command if you have it installed, e.g. sudo apt install -y tree on Debian/Ubuntu or brew install tree with Homebrew.

13.1. How resume works

The -resume command line option allows the continuation of a pipeline execution from the last step that was successfully completed:

nextflow run <script> -resume

In practical terms, the pipeline is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and whether it contains a valid command exit state with the expected output files.

If this condition is satisfied the task execution is skipped and previously computed results are used as the process results.

The first task for which a new output is computed invalidates all downstream executions in the remaining DAG.

13.2. Work directory

The task work directories are created in the folder work in the launching path by default. This is supposed to be a scratch storage area that can be cleaned up once the computation is completed.

Workflow final output(s) are supposed to be stored in a different location specified using one or more publishDir directives.
Make sure to delete your work directory occasionally, otherwise your machine/environment will fill up with unused files.

A different location for the execution work directory can be specified using the command line option -w e.g.

nextflow run <script> -w /some/scratch/dir
If you delete or move the pipeline work directory, it will prevent the use of the resume feature in following runs.

The hash code for input files is computed using:

  • The complete file path

  • The file size

  • The last modified timestamp

Therefore, just touching a file will invalidate the related task execution.

13.3. How to organize in silico experiments

It’s good practice to organize each experiment in its own folder. The main experiment input parameters should be specified using a Nextflow config file. This makes it simple to track and replicate the experiment over time.

Within the same experiment, the same pipeline can be executed multiple times; however, you should avoid launching two (or more) Nextflow instances in the same directory concurrently.

The nextflow log command lists the executions run in the current folder:

1
2
3
4
5
6
7
$ nextflow log

TIMESTAMP            DURATION  RUN NAME          STATUS  REVISION ID  SESSION ID                            COMMAND
2019-05-06 12:07:32  1.2s      focused_carson    ERR     a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello
2019-05-06 12:08:33  21.1s     mighty_boyd       OK      a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run rnaseq-nf -with-docker
2019-05-06 12:31:15  1.2s      insane_celsius    ERR     b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf
2019-05-06 12:31:24  17s       stupefied_euclid  OK      b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker

You can use either the session ID or the run name to recover a specific execution. For example:

nextflow run rnaseq-nf -resume mighty_boyd

13.4. Execution provenance

When provided with a run name or session ID, the log command can return many useful bits of information about a pipeline execution that can be used to create a provenance report.

By default, it lists the work directories used to compute each task. For example:

$ nextflow log tiny_fermat

/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba

Using the option -f (fields) it’s possible to specify which metadata should be printed by the log command. For example:

$ nextflow log tiny_fermat -f 'process,exit,hash,duration'

index    0   7b/3753ff  2.0s
fastqc   0   c1/56a36d  9.3s
fastqc   0   f7/659c65  9.1s
quant    0   82/ba67e3  2.7s
quant    0   e5/2816b9  3.2s
multiqc  0   3b/3485d0  6.3s

The complete list of available fields can be retrieved with the command:

nextflow log -l

The option -F allows the specification of a filtering criteria to print only a subset of tasks. For example:

$ nextflow log tiny_fermat -F 'process =~ /fastqc/'

/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310

This can be useful to locate specific task work directories.

Finally, the -t option allows the creation of a basic custom provenance report by providing a template file in any format of your choice. For example:

<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>

<ul>
    <li>Exit: ${exit}</li>
    <li>Status: ${status}</li>
    <li>Work dir: ${workdir}</li>
    <li>Container: ${container}</li>
</ul>
</div>

Save the above snippet in a file named template.html. Then run this command (using the correct id for your run, e.g. not tiny_fermat):

nextflow log tiny_fermat -t template.html > prov.html

Finally open the prov.html file with a browser.

13.5. Resume troubleshooting

If your workflow execution is not resumed as expected with one or more tasks re-executed each time, these may be the most likely causes:

  • Input file changed: Make sure that there’s no change in your input file(s). Don’t forget the task unique hash is computed taking into account the complete file path, the last modified timestamp and the file size. If any of this information changes, the workflow will be re-executed even if the input content is the same.

  • A process modifies an input: A process should never alter input files, otherwise the resume for future executions will be invalidated for the same reason explained in the previous point.

  • Inconsistent file attributes: Some shared file systems, such as NFS, may report an inconsistent file timestamp (i.e. a different timestamp for the same file) even if it has not been modified. To prevent this problem, use the lenient cache strategy (a minimal config sketch is shown after this list).

  • Race condition in global variable: Nextflow is designed to simplify parallel programming without having to worry about race conditions and access to shared resources. One of the few cases in which a race condition can arise is when using a global variable with two (or more) operators. For example:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    Channel
        .from(1,2,3)
        .map { it -> X=it; X+=2 }
        .view { "ch1 = $it" }
    
    Channel
        .from(1,2,3)
        .map { it -> X=it; X*=2 }
        .view { "ch2 = $it" }

    The problem in this snippet is that the X variable in the closure definition is defined in the global scope. Therefore, since operators are executed in parallel, the X value can be overwritten by the other map invocation.

    The correct implementation requires the use of the def keyword to declare the variable local.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    Channel
        .from(1,2,3)
        .map { it -> def X=it; X+=2 }
        .println { "ch1 = $it" }
    
    Channel
        .from(1,2,3)
        .map { it -> def X=it; X*=2 }
        .println { "ch2 = $it" }
  • Non-deterministic input channels: While dataflow channel ordering is guaranteed (i.e. data is read in the same order in which it is written to the channel), when a process declares as input two or more channels, each of which is the output of a different process, the overall input ordering is not consistent across different executions.

    In practical terms, consider the following snippet:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    
    process foo {
      input: set val(pair), file(reads) from ...
      output: set val(pair), file('*.bam') into bam_ch
      """
      your_command --here
      """
    }
    
    process bar {
      input: set val(pair), file(reads) from ...
      output: set val(pair), file('*.bai') into bai_ch
      """
      other_command --here
      """
    }
    
    process gather {
      input:
      set val(pair), file(bam) from bam_ch
      set val(pair), file(bai) from bai_ch
      """
      merge_command $bam $bai
      """
    }

    The inputs declared at lines 19 and 20 can be delivered in any order because the execution order of the processes foo and bar is not deterministic due to their parallel execution.

    Therefore the input of the third process needs to be synchronized using the join operator or a similar approach. The third process should be written as:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    ...
    
    process gather {
      input:
      set val(pair), file(bam), file(bai) from bam_ch.join(bai_ch)
      """
      merge_command $bam $bai
      """
    }
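
As a minimal sketch of the lenient cache strategy mentioned in the list above, applied to all processes via the configuration file:

// nextflow.config
process.cache = 'lenient'   // hash input files using path and size only, ignoring the last modified timestamp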

14. Error handling & troubleshooting

14.1. Execution error debugging

When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the failing task:

ERROR ~ Error executing process > 'index'

Caused by:           (1)
  Process `index` terminated with an error exit status (127)

Command executed:    (2)

  salmon index --threads 1 -t transcriptome.fa -i index

Command exit status: (3)
  127

Command output:      (4)
  (empty)

Command error:       (5)
  .command.sh: line 2: salmon: command not found

Work dir:            (6)
  /Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
1 A description of the error cause
2 The command executed
3 The command exit status
4 The command standard output, when available
5 The command standard error
6 The command work directory

Carefully review all of this data; it can provide valuable information about the cause of the error.

If this is not enough, cd into the task work directory. It contains all the files needed to replicate the issue in an isolated manner.

The task execution directory contains these files:

  • .command.sh: The command script.

  • .command.run: The command wrapper used to run the job.

  • .command.out: The complete job standard output.

  • .command.err: The complete job standard error.

  • .command.log: The wrapper execution output.

  • .command.begin: Sentinel file created as soon as the job is launched.

  • .exitcode: A file containing the task exit code.

  • Task input files (symlinks)

  • Task output files

Verify that the .command.sh file contains the expected command to be executed and all variables are correctly resolved.

Also verify the existence of the .exitcode and .command.begin files, which, if absent, suggest that the task was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin file exists, the job was launched but was likely killed abruptly.

You can replicate the failing execution using the command bash .command.run and verify the cause of the error.

14.2. Ignore errors

There are cases in which a process error may be expected and it should not stop the overall workflow execution.

To handle this use case, set the process errorStrategy to ignore:

1
2
3
4
5
6
7
process foo {
  errorStrategy 'ignore'
  script:
  """
    your_command --this --that
  """
}

If you want to ignore any error, set the same directive in the config file as a default setting:

1
process.errorStrategy = 'ignore'

14.3. Automatic error fail-over

In rare cases, errors may be caused by transient conditions. In this situation, an effective strategy is re-executing the failing task.

1
2
3
4
5
6
7
process foo {
  errorStrategy 'retry'
  script:
  """
    your_command --this --that
  """
}

Using the retry error strategy, the task is re-executed a second time if it returns a non-zero exit status, before stopping the complete workflow execution.

The maxRetries directive can be used to set the number of times the task can be re-executed before declaring it failed with an error condition.
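
For example, a minimal sketch combining the two directives:

process foo {
  errorStrategy 'retry'
  maxRetries 3
  script:
  """
    your_command --this --that
  """
}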

14.4. Retry with backoff

There are cases in which the required execution resources may be temporarily unavailable (e.g. network congestion). In these cases, simply re-executing the same task will likely result in an identical error. A retry with an exponential backoff delay can better recover from these error conditions.

1
2
3
4
5
6
7
8
process foo {
  errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
  maxRetries 5
  script:
  '''
  your_command --here
  '''
}

14.5. Dynamic resources allocation

It’s a very common scenario that different instances of the same process may have very different needs in terms of computing resources. In such situations, requesting, for example, an amount of memory that is too low will cause some tasks to fail. Instead, using a higher limit that fits all the tasks in your execution could significantly decrease the execution priority of your jobs.

To handle this use case, you can use a retry error strategy and increase the computing resources allocated to the job at each successive attempt.

1
2
3
4
5
6
7
8
9
10
11
12
process foo {
  cpus 4
  memory { 2.GB * task.attempt }   (1)
  time { 1.hour * task.attempt }   (2)
  errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }   (3)
  maxRetries 3   (4)

  script:
  """
    your_command --cpus $task.cpus --mem $task.memory
  """
}
1 The memory is defined in a dynamic manner, the first attempt is 2 GB, the second 4 GB, and so on.
2 The wall execution time is set dynamically as well, the first execution attempt is set to 1 hour, the second 2 hours, and so on.
3 If the task returns an exit status equal to 140 it will set the error strategy to retry otherwise it will terminate the execution.
4 It will retry the process execution up to three times.

15. Parsing datasets into Nextflow

An important skill in Nextflow is to know how to properly parse files into your processes.

15.1. Text files

The splitText operator allows you to split multi-line strings or text file items emitted by a source channel into chunks containing n lines, which will be emitted by the resulting channel. See:

Channel
     .fromPath('data/meta/random.txt') (1)
     .splitText()                      (2)
     .view()                           (3)
1 Instructs Nextflow to make a channel from the path "data/meta/random.txt".
2 The splitText operator splits each item into chunks of one line by default.
3 View contents of the channel.

You can define the number of lines in each chunk by using the parameter by, as shown in the following example:

Channel
     .fromPath('data/meta/random.txt')
     .splitText( by: 2 )
     .subscribe {
         print it;
         print "--- end of the chunk ---\n"
     }
The subscribe operator permits the execution of a user-defined function each time a new value is emitted by the source channel.

An optional closure can be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and transform them into capital letters:

Channel
   .fromPath('data/meta/random.txt')
   .splitText( by: 10 ) { it.toUpperCase() }
   .view()

You can also add a count to each line:

count=0

Channel
   .fromPath('data/meta/random.txt')
   .splitText()
   .view { "${count++}: ${it.toUpperCase().trim()}" }

Finally, you can also use the operator on plain files (outside of the channel context), like so:

  def f = file('data/meta/random.txt')
  def lines = f.splitText()
  def count=0
  for( String row : lines ) {
    log.info "${count++} ${row.toUpperCase()}"
  }

15.2. Comma-separated values (.csv)

The splitCsv operator allows you to parse text items emitted by a channel that are formatted using the CSV format.

It then splits them into records or groups them into a list of records with a specified length.

In the simplest case, just apply the splitCsv operator to a channel emitting CSV formatted text files or text entries. In the following example, only the first and fourth columns are viewed:

  Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv()
    // row is a list object
    .view { row -> "${row[0]},${row[3]}" }

When the CSV begins with a header line defining the column names, you can specify the parameter header: true which allows you to reference each value by its name, as shown in the following example:

  Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv(header: true)
    // row is a map object (keyed by the header names)
    .view { row -> "${row.patient_id},${row.num_samples}" }

Alternatively, you can provide custom header names by specifying a list of strings in the header parameter, as shown below:

  Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'] )
    // row is a map object (keyed by the custom header names)
    .view { row -> "${row.col1},${row.col4}" }

You can also process multiple csv files at the same time:

    Channel
      .fromPath("data/meta/patients_*.csv") // <-- just use a pattern
      .splitCsv(header:true)
      .view { row -> "${row.patient_id}\t${row.num_samples}" }
Notice that you can change the output format simply by using a different delimiter in the view closure.
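
For example, a sketch of the same snippet printing comma-separated instead of tab-separated output:

    Channel
      .fromPath("data/meta/patients_*.csv")
      .splitCsv(header:true)
      .view { row -> "${row.patient_id},${row.num_samples}" }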

Finally, you can also operate on csv files outside the channel context, like so:

def f = file('data/meta/patients_1.csv')
def lines = f.splitCsv()
for( List row : lines ) {
  log.info "${row[0]} -- ${row[2]}"
}

15.3. Tab separated values (.tsv)

Parsing TSV files works in a similar way; just add the sep: '\t' option in the splitCsv context:

 Channel
      .fromPath("data/meta/regions.tsv", checkIfExists:true)
      // use `sep` option to parse TAB separated files
      .splitCsv(sep:'\t')
      // row is a list object
      .view()

Exercise

Try using the tab separation technique on the file "data/meta/regions.tsv", but print just the first column, and remove the header.

Answer:
Channel
     .fromPath("data/meta/regions.tsv", checkIfExists:true)
     // use `sep` option to parse TAB separated files
     .splitCsv(sep:'\t', header:true )
     // row is a map object (keyed by the header names)
     .view { row -> "${row.patient_id}" }

15.4. More complex file formats

15.4.1. JSON

We can also easily parse the JSON file format using the following Groovy snippet:

import groovy.json.JsonSlurper

def f = file('data/meta/regions.json')
def records = new JsonSlurper().parse(f)


for( def entry : records ) {
  log.info "$entry.patient_id -- $entry.feature"
}
When using an older JSON version, you may need to replace parse(f) with parseText(f.text)

15.4.2. YAML

In a similar way, this is how to parse YAML files:

import org.yaml.snakeyaml.Yaml

def f = file('data/meta/regions.json')
def records = new Yaml().load(f.text)


for( def entry : records ) {
  log.info "$entry.patient_id -- $entry.feature"
}

15.4.3. Storage of parsers into modules

The best way to store parser scripts is to keep them in a nextflow module file.

This follows the DSL2 way of working.

See the following nextflow script:

nextflow.preview.dsl=2

include { parseJsonFile } from './modules/parsers.nf'

process foo {
  input:
  tuple val(meta), path(data_file)

  """
  echo your_command $meta.region_id $data_file
  """
}

workflow {
    Channel.fromPath('data/meta/regions*.json') \
      | flatMap { parseJsonFile(it) } \
      | map { entry -> tuple(entry,"/some/data/${entry.patient_id}.txt") } \
      | foo
}

To get this script to work, we first need to create a file called parsers.nf and store it in a folder called modules in the current directory.

This file should contain the parseJsonFile function; Nextflow will then use it as a custom function within the workflow scope.
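
To illustrate, a minimal sketch of what modules/parsers.nf might contain, reusing the JsonSlurper approach shown earlier (the exact implementation is an assumption, not the official training solution):

import groovy.json.JsonSlurper

def parseJsonFile(json_file) {
    // parse the JSON file and return the list of records it contains
    def records = new JsonSlurper().parse(json_file)
    return records
}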