Get started with Nextflow on Verily Workbench

How to use Nextflow with the Workbench CLI

Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow

Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.


Introduction

Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.

To get set up, you will:

  1. create a Workbench workspace
  2. create resources in the workspace
  3. create a cloud environment in the workspace on which to run Nextflow

The following sections walk you through that setup, then show how wb makes it easy to configure and run a Nextflow pipeline. This tutorial has two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline. The nf-core project provides a community-curated collection of analysis pipelines built using Nextflow.

Both examples use the Google Cloud Life Sciences API as the Nextflow process.executor. This allows the pipelines to run scalably, with Nextflow processes executed on separate cloud virtual machines.

1. Create a workspace

If you don’t already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or the web UI.

To create a workspace via the web UI, see the instructions here.

First, check wb status. If you are not logged in, or no workspace is set, log in and set the workspace you want to work in. Otherwise, create a new workspace.

wb status
wb auth login  # if need be
wb workspace list

To create a new workspace:

wb workspace create --name=<workspace-name>

To set the Workbench CLI to use an existing workspace:

wb workspace set --id=<workspace-id>

2. Create workspace resources: GitHub repos and a GCS bucket

If you haven’t already, you’ll need to create a GCS bucket resource, which will be used by Nextflow for staging and logging.

We’ll also create Git repo resources that point to the example repositories. Any notebook instances that you subsequently create in your workspace will automatically clone those repos for you.

To create a Cloud Storage bucket resource via the web UI, see the instructions here. Note the name of this resource, which you’ll need below. E.g., name it nf_files.

Then, create referenced resources for the example Git repositories, as described here.
The repository URLs to use are:

  - https://github.com/nextflow-io/rnaseq-nf.git
  - https://github.com/nf-core/configs.git

These repositories are public, so for this example, you don’t need to set up the Workbench SSH keys.

If you do not already have a bucket resource that you want to use, you can create one as follows. This example names the resource nf_files.

wb resource create gcs-bucket --id=nf_files  \
  --description="Bucket for Nextflow run logs and output."

Then, create referenced resources to the Git repositories we’ll use for these examples:

wb resource add-ref git-repo --id=rnaseq-nf-repo --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core-configs --repo-url=https://github.com/nf-core/configs.git

You can list your newly created resources:

wb resource list

3. Create a cloud environment to run Nextflow

Next, create a Workbench notebook environment on which to run the Nextflow examples.

Your cloud environment will have Nextflow pre-installed, and any workspace Git repo resources — such as the one we just defined — will be automatically cloned. However, if you want to run the example on your local machine, you can install nextflow yourself.
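If you go the local route, Nextflow's official installer script can be used (a minimal sketch; Nextflow requires a recent Java runtime):

curl -s https://get.nextflow.io | bash  # downloads and assembles the nextflow launcher
./nextflow -version  # verify the installation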

To create a cloud environment via the web UI, see the instructions here.

Once your environment is running, you can click on the link next to it to bring up JupyterLab on the new notebook instance. To reduce costs, you can STOP the instance from its 'three-dot' menu when you're not using it, and restart it again later.

Create a new cloud environment:

wb resource create gcp-notebook --id=<notebook_resource_id> \
  --description=<description>

After your notebook resource is created, you can see its details via:

wb resource describe --id <notebook_resource_id>

Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new cloud environment.

Tip: The info in the resource description also indicates whether your cloud environment is ACTIVE or STOPPED. You can stop your cloud environment when you’re not using it, and then restart it, via:
> wb notebook stop --id <notebook_resource_id>
> wb notebook start --id <notebook_resource_id>

Using Workbench environment variables in Nextflow config files

Workbench supports running Nextflow via a 'passthrough' command, e.g. wb nextflow .... When you use this construct, you can reference Workbench-specific environment variables in Nextflow configuration files. For example, the $WORKBENCH_<bucket_resource_name> construct will be expanded to gs://<underlying_GCS_bucket>. In addition, in a Workbench cloud environment, variables like $GOOGLE_SERVICE_ACCOUNT_EMAIL and $GOOGLE_CLOUD_PROJECT will be set.

In the examples below, we’ll leverage this capability when we create the Nextflow config files.
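For example (a hypothetical snippet, assuming a bucket resource named nf_files), a config line such as

workDir = "$WORKBENCH_nf_files/nf"

would be expanded by wb at run time to point at the underlying bucket, along the lines of

workDir = "gs://<underlying-gcs-bucket>/nf"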

Configure and run the example ‘rnaseq-nf’ Nextflow pipeline

Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new cloud environment, and you should see its directory, rnaseq-nf, under ~/repos. (If it is not there, you can run:
wb git clone --resource=rnaseq-nf-repo).

In the JupyterLab Terminal window, change to the rnaseq-nf directory and check out the tagged v2.1 version.

cd ~/repos/rnaseq-nf
git checkout v2.1

Edit the nextflow configuration file

Then, still in the rnaseq-nf directory, edit the nextflow.config file as follows: replace the gls entry with the snippet below, and edit the workDir line, replacing <your_bucket_resource_name> with the name of the GCS bucket resource you created.

As you’ll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct values for the environment variables in the config before it runs.

gls {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-lifesciences'
    process.container = 'nextflow/rnaseq-nf:latest'
    // edit the following line for your bucket resource
    workDir = "$WORKBENCH_<your_bucket_resource_name>/nf"
    google.location = 'us-central1'
    google.region  = 'us-central1'
    google.project = "$GOOGLE_CLOUD_PROJECT"
    google.lifeSciences.usePrivateAddress = true
    google.lifeSciences.network = 'network'
    google.lifeSciences.subnetwork = 'subnetwork'
    google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
}

If you’ve forgotten the name of the bucket resource you created, you can find it via:

wb resource list  # this shows the resource names

or in the “Resources” tab of your workspace.

Then, in a Terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the instructions above), and sanity-check your config changes.

cd ..
wb nextflow config rnaseq-nf/main.nf -profile gls

You should see output that shows instantiated values for your workspace project, GCS bucket, and service account email.

Run the Nextflow example workflow via wb

After you check your config, you’re ready to run the Nextflow example. In the parent directory of rnaseq-nf, run:

wb nextflow run rnaseq-nf/main.nf -profile gls

The workflow will take about 10 minutes to complete.
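Once the run completes, you can inspect the logs and outputs staged in your bucket. For example (a sketch, assuming the bucket resource is named nf_files and that gsutil is available on the machine):

wb resource resolve --id=nf_files  # prints the underlying gs:// bucket URL
gsutil ls gs://<your_underlying_bucket>/nf  # lists the pipeline work directory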

Configure and run an nf-core example

The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so if one works for you, any of them should. Once your config file is set up, you should be able to test any nf-core pipeline.

Edit the nf-core google.config configuration file

In the left-hand file navigator, navigate to the ~/repos/nf-core-configs/conf directory. Find the google.config file in the listing and double-click on it to edit it. (If you gave the repo resource a different ID in Step #2, find its folder instead under ~/repos.)

Edit the google.config file to match the following. Edit the google_bucket line to replace <your_bucket_name> with the name of the Cloud Storage bucket resource you will be using. E.g., if you used the suggested bucket name in Step #2, <your_bucket_name> would be replaced with nf_files.

As you’ll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct value for the environment variables in the config before it runs.

// Nextflow config file for running on Google Cloud Life Sciences
// Edit the 'google_bucket' param before using.
params {
    config_profile_description = 'Google Cloud Life Sciences Profile'
    config_profile_contact = 'Evan Floden, Seqera Labs (@evanfloden)'
    config_profile_url = 'https://cloud.google.com/life-sciences'

    google_zone = 'us-central1-c'
    google_bucket = "$WORKBENCH_<your_bucket_name>/nf-core"
    google_debug = true
    google_preemptible = true

    boot_disk = '100 GB'
    workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    project_id = "$GOOGLE_CLOUD_PROJECT"
}

process.executor = 'google-lifesciences'
google.zone = params.google_zone
google.project = params.project_id
google.lifeSciences.serviceAccountEmail = params.workers_service_account
google.lifeSciences.usePrivateAddress = true
google.lifeSciences.debug = params.google_debug
workDir = params.google_bucket
google.lifeSciences.preemptible = params.google_preemptible
google.lifeSciences.network = 'network'
google.lifeSciences.subnetwork = 'subnetwork'

if (google.lifeSciences.preemptible) {
    process.errorStrategy = { task.exitStatus in [8,10,14] ? 'retry' : 'terminate' }
    process.maxRetries = 5
}

process.machineType = { task.memory > task.cpus * 6.GB ? ['custom', task.cpus, task.cpus * 6656].join('-') : null }

Run an nf-core pipeline with a test profile via wb

When you choose an nf-core pipeline to run, the pipeline definition will automatically be fetched (and stored under ~/.nextflow/assets/nf-core). For every pipeline, the test profile can be used in conjunction with the google profile (or any other config) to run the pipeline with some test data.
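You can also fetch a pipeline ahead of time with Nextflow's pull command, run here through the wb passthrough; for example:

wb nextflow pull nf-core/viralrecon  # fetches the pipeline into ~/.nextflow/assets/nf-core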

For this example, you’ll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.

You can first confirm your config by running the following command in the Terminal. This will check out the given pipeline from its repo and make it available locally, if need be. Run this command from the nf-core-configs (repository checkout) directory.

cd ~/repos/nf-core-configs
wb nextflow config nf-core/viralrecon -profile test,google

Then, run the pipeline, still in the nf-core-configs repo checkout directory in the Terminal. Before you run the following command, edit it to replace <your_bucket_name> with your bucket resource name in the --outdir param. If you used the suggested bucket name in Step #2, <your_bucket_name> would be replaced with nf_files. The outdir holds run results, so each time you run the pipeline, use a different outdir path.

# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile test,google --outdir '$WORKBENCH_<your_bucket_name>'/viralrecon_outdir1
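While the pipeline runs, or after it completes, Nextflow's log command summarizes each execution; for example:

wb nextflow log  # lists cached runs with their timestamps, run names, and status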

Summary

This tutorial showed two examples of running Nextflow pipelines on Workbench. One was an example from https://github.com/nextflow-io/, and the other showed how to set up and use an nf-core config file for any nf-core pipeline.

Workbench makes it easy to set up config files that need minimal editing to work in any Workbench cloud environment, and to run pipeline tasks scalably in the cloud.

Last Modified: 22 August 2024