Get started with Nextflow on Verily Workbench

Step-by-step instructions for running a Nextflow pipeline

Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow

Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.


Introduction

Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.

To get set up, you will:

  1. Create a Workbench workspace
  2. Create resources in the workspace
  3. Create a cloud app in the workspace on which to run Nextflow

The following sections walk you through that setup, then show how wb makes it easy to configure and run a Nextflow pipeline. This tutorial has two examples: one runs an example pipeline from the nextflow-io GitHub organization, and the other runs an nf-core pipeline. The nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.

Both examples include configurations for the Google Cloud Life Sciences API and the Google Batch API as the Nextflow process.executor. This allows the pipelines to run at scale, with Nextflow processes executed on separate cloud virtual machines.

1. Create a workspace

If you don't already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or the web UI.

To create a workspace via the web UI, see the instructions here.

First, check wb status. If no workspace is set, or you are not logged in, log in and set the workspace you want to work in. Otherwise, create a new workspace.

wb status
wb auth login  # if need be
wb workspace list

To create a new workspace:

wb workspace create --name=<workspace-name>

To set the Workbench CLI to use an existing workspace:

wb workspace set --id=<workspace_id>

2. Create workspace resources: GitHub repos and a Cloud Storage bucket

If you haven't already, you'll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.

We'll also create a Git resource that points to the Nextflow example repo. Any notebook instances that you subsequently create in your workspace will automatically clone that repo for you.

To create a Cloud Storage bucket resource via the web UI, see the instructions here. Note the name of this resource, which you'll need below. E.g., name it nf_files.

Then, create referenced resources for the example Git repositories, as described here.
The repository URLs to use are:

  * https://github.com/nextflow-io/rnaseq-nf.git
  * https://github.com/nf-core/configs.git

These repositories are public, so for this example, you don't need to set up the Workbench SSH keys.

If you do not already have a bucket resource that you want to use, you can create one as follows; this example names the resource nf_files.

wb resource create gcs-bucket --id=nf_files  \
  --description="Bucket for Nextflow run logs and output."

Then, create referenced resources to the Git repositories we'll use for these examples:

wb resource add-ref git-repo --id=rnaseq-nf-repo --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core-configs --repo-url=https://github.com/nf-core/configs.git

You can list your newly created resources:

wb resource list

3. Create an app to run Nextflow

Next, create a Workbench notebook app on which to run the Nextflow examples.

Your app will have Nextflow pre-installed, and any workspace Git repo resources — such as the one we just defined — will be automatically cloned. However, if you want to run the example on your local machine, you can install nextflow yourself.

To create a cloud app via the web UI, see the instructions here.

Once your app is running, you can click the link next to it to bring up JupyterLab on the new app instance. To reduce costs, you can stop the instance from its 'three-dot' menu when you're not using it, and restart it later.

Create a new app:

wb app create gcp --app-config=<config_type> \
  --id=<notebook_resource_id> \
  --description=<description>

After your notebook resource is created, you can see its details via:

wb resource describe --id <notebook_resource_id>

Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new app.

Tip: The info in the resource description also indicates whether your app is RUNNING or TERMINATED. You can stop your app when you're not using it, and then restart it, via:

wb app stop --id <notebook_resource_id>
wb app start --id <notebook_resource_id>

Using Workbench environment variables in Nextflow config files

Workbench supports running Nextflow via a 'passthrough' command, e.g. wb nextflow .... When you use this construct, you can reference Workbench-specific environment variables in Nextflow configuration files. For example, the $WORKBENCH_<bucket_resource_name> construct will be expanded to gs://<underlying_GCS_bucket>. In addition, in a Workbench app, variables such as $GOOGLE_SERVICE_ACCOUNT_EMAIL and $GOOGLE_CLOUD_PROJECT are set automatically.

In the examples below, we'll leverage this capability when we create the Nextflow config files.
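To see how this substitution behaves, here is a minimal local sketch. The resource name nf_files and the gs:// URL are hypothetical stand-ins; in a real run, wb performs the expansion against your actual bucket resource.

```shell
# Simulate the substitution wb performs. The resource name nf_files and
# the gs:// URL below are hypothetical stand-ins for illustration only.
export WORKBENCH_nf_files="gs://example-underlying-bucket"
# A config line such as workDir = "$WORKBENCH_nf_files/nf" expands to:
echo "workDir = \"$WORKBENCH_nf_files/nf\""
# prints: workDir = "gs://example-underlying-bucket/nf"
```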

Configure and run the example 'rnaseq-nf' Nextflow pipeline

Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new app, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it is not there, you can run:
wb git clone --resource=rnaseq-nf-repo).

In the JupyterLab Terminal window, change to the rnaseq-nf directory and check out the tagged v2.1 version.

cd rnaseq-nf
git checkout v2.1

Edit the Nextflow configuration file

You determine how a workflow is run by specifying the executor. Configurations for two executors are given below.

In the rnaseq-nf directory, edit the nextflow.config file. Replace the entry corresponding to your chosen job executor with the snippet given in the following sections. Edit the workDir line to replace <your_bucket_resource_name> with the name of the Cloud Storage bucket resource you created.

As you'll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct values for the environment variables in the config before it runs.

If you’ve forgotten the name of the bucket resource you created, you can find it via:

wb resource list  # this shows the resource names

or in the Resources tab of your workspace.
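If you prefer to make the edit from the command line, here is one way to sketch the placeholder substitution, assuming the suggested resource name nf_files from Step #2:

```shell
# Replace the placeholder bucket resource name in nextflow.config.
# nf_files is the suggested resource name from Step #2; use your own.
sed -i 's/<your_bucket_resource_name>/nf_files/' nextflow.config
grep workDir nextflow.config   # confirm the edited line
```

(On macOS, the equivalent in-place flag is sed -i ''.)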

Google Cloud Life Sciences API (deprecated)

Update the gls entry in nextflow.config.

gls {
    // Workflow params
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'

    // Google Life Sciences config
    process.executor = 'google-lifesciences'
    process.container = 'nextflow/rnaseq-nf:latest'
    // Edit the following line for your bucket resource
    workDir = "$WORKBENCH_<your_bucket_resource_name>/nf"
    google.location = 'us-central1'
    google.region  = 'us-central1'
    google.project = "$GOOGLE_CLOUD_PROJECT"
    google.lifeSciences.usePrivateAddress = true
    google.lifeSciences.network = 'network'
    google.lifeSciences.subnetwork = 'subnetwork'
    google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
}

In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the instructions above), and sanity-check your config changes.

cd ..
wb nextflow config rnaseq-nf/main.nf -profile gls

You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.

Google Batch API

Update the google-batch entry in nextflow.config.

'google-batch' {
    // Workflow params
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'

    // Google Batch config
    process.executor = 'google-batch'
    process.container = 'nextflow/rnaseq-nf:latest'
    // Edit the following line for your bucket resource
    workDir = "$WORKBENCH_<your_bucket_resource_name>/scratch"

    google.region  = 'us-east1'
    google.project = "$GOOGLE_CLOUD_PROJECT"

    google.batch.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    google.batch.usePrivateAddress = true
    google.batch.network = 'global/networks/network'
    google.batch.subnetwork = 'regions/us-east1/subnetworks/subnetwork'
}

In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the instructions above), and sanity-check your config changes.

cd ..
wb nextflow config rnaseq-nf/main.nf -profile google-batch

You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.

Run the Nextflow example workflow via wb

After you check your config, you’re ready to run the Nextflow example. Select the appropriate job executor as the profile. The following command selects Google Batch API. For Cloud Life Sciences API, specify -profile gls.

In the parent directory of rnaseq-nf, run:

wb nextflow run rnaseq-nf/main.nf -profile google-batch

The workflow will take about 10 minutes to complete.

Configure and run a nf-core example

The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so if one works for you, any of them should. Once your config file is set up, you should be able to test any nf-core pipeline.

Edit the nf-core configuration file for your executor

In the left-hand File navigator, navigate to the ~/repos/nf-core-configs/conf directory. Find the config file corresponding to your chosen job executor. (If you gave the repo a different name in Step #2, find its folder instead under repos).

In the config file, edit the google_bucket line to replace <your_bucket_name> with the name of the Cloud Storage bucket resource you will be using. E.g., if you used the suggested bucket name in Step #2, <your_bucket_name> would be replaced with nf_files.

As you'll see in the following step, you will run the Nextflow job via wb, and wb will substitute the correct value for the environment variables in the config before it runs.

Google Cloud Life Sciences API (deprecated)

The configuration file is google.config.

// Nextflow config file for running on Google Cloud Life Sciences
// Edit the 'google_bucket' param before using.
params {
    config_profile_description = 'Google Cloud Life Sciences Profile'
    config_profile_contact = 'Evan Floden, Seqera Labs (@evanfloden)'
    config_profile_url = 'https://cloud.google.com/life-sciences'

    google_zone = 'us-central1-c'
    google_bucket = "$WORKBENCH_<your_bucket_name>/nf-core"
    google_debug = true
    google_preemptible = true

    boot_disk               = '100 GB'
    workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    project_id              = "$GOOGLE_CLOUD_PROJECT"
}

process.executor = 'google-lifesciences'
google.zone = params.google_zone
google.project = params.project_id
google.lifeSciences.serviceAccountEmail = params.workers_service_account
google.lifeSciences.usePrivateAddress = true
google.lifeSciences.debug = params.google_debug
workDir = params.google_bucket
google.lifeSciences.preemptible = params.google_preemptible
google.lifeSciences.network = 'network'
google.lifeSciences.subnetwork = 'subnetwork'

if (google.lifeSciences.preemptible) {
    process.errorStrategy = { task.exitStatus in [8,10,14] ? 'retry' : 'terminate' }
    process.maxRetries = 5
}

process.machineType = { task.memory > task.cpus * 6.GB ? ['custom', task.cpus, task.cpus * 6656].join('-') : null }
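As a worked example of the machineType closure above (the task sizes here are hypothetical): a task requesting 4 CPUs and 32 GB of memory exceeds the 4 x 6 GB = 24 GB threshold, so the closure requests a custom machine type with 4 vCPUs and 4 x 6656 MB of RAM. In shell arithmetic:

```shell
# Hypothetical task: 4 CPUs, 32 GB memory (> 4 * 6 GB = 24 GB), so the
# closure builds a custom machine type: custom-<cpus>-<cpus * 6656 MB>.
cpus=4
echo "custom-${cpus}-$((cpus * 6656))"   # prints: custom-4-26624
```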

Google Batch API

The configuration file is googlebatch.config.

// Nextflow config file for running on Google Batch API
// Edit the 'google_bucket' param before using.
params {
    config_profile_description = 'Google Cloud Batch API Profile'
    config_profile_contact     = 'Hatem Nawar @hnawar'
    config_profile_url         = 'https://cloud.google.com/batch'

    google_location = 'us-central1'
    google_zone = 'us-central1-c'
    google_bucket = "$WORKBENCH_<your_bucket_name>/nf-core"
    google_debug = true
    google_preemptible = true

    //networking
    use_private_ip             = true
    // Custom VPC should be in this format 'global/networks/[custom_VPC]'
    custom_vpc                 = 'global/networks/network'
    //Custom subnet should be in this format 'regions/[GCP_Region]/subnetworks/[custom_subnet]'
    custom_subnet              = 'regions/us-central1/subnetworks/subnetwork'


    boot_disk                  = '100 GB'
    workers_service_account    = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    project_id                 = "$GOOGLE_CLOUD_PROJECT"
}

workDir = params.google_bucket
google {
    zone                      = params.google_zone
    location                  = params.google_location
    project                   = params.project_id
    batch.network             = params.custom_vpc
    batch.subnetwork          = params.custom_subnet
    batch.usePrivateAddress   = params.use_private_ip
    batch.debug               = params.google_debug
    batch.serviceAccountEmail = params.workers_service_account
    batch.bootDiskSize        = params.boot_disk
    batch.preemptible         = params.google_preemptible
}

process.executor = 'google-batch'
if (google.batch.preemptible) {
    process.errorStrategy = { task.exitStatus in [8,10,14] ? 'retry' : 'terminate' }
    process.maxRetries = 5
}
process.machineType = { task.memory > task.cpus * 6.GB ? ['custom', task.cpus, task.cpus * 6656].join('-') : null }

Run a nf-core pipeline via wb

When you choose an nf-core pipeline to run, the pipeline definition will automatically be fetched (and stored under ~/.nextflow/assets/nf-core). For every pipeline, the test profile can be used in conjunction with the google profile (or any other config) to run the pipeline with some test data.

For this example, you'll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.

You can first confirm your config by running the following command in the Terminal. This will check out the given pipeline from its repo and make it available locally if need be. Run this command from the nf-core-configs (repository checkout) directory.

Choose the appropriate job executor for the profile. The following commands select Google Batch API. For Cloud Life Sciences API, specify -profile test,google.

cd ~/repos/nf-core-configs
wb nextflow config nf-core/viralrecon -profile test,google-batch

Then, run the pipeline, still in the nf-core-configs repo checkout directory in the Terminal. Before you run the following command, edit it to replace <your_bucket_name> with your bucket resource name in the --outdir param. If you used the suggested bucket name in Step #2, <your_bucket_name> would be replaced with nf_files. The outdir holds run results, so use a different outdir path for each run of the pipeline.

# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile test,google-batch --outdir '$WORKBENCH_<your_bucket_name>'/viralrecon_outdir1
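A note on the quoting above: the single quotes keep the $WORKBENCH_... token literal, so your local shell does not try to expand it and wb performs the substitution itself. A minimal sketch (nf_files is the suggested resource name from Step #2):

```shell
# Single quotes: the shell passes the token through unexpanded, leaving
# the substitution to wb. (Unquoted or double-quoted, the shell would
# try to expand the variable itself, likely to an empty string.)
echo '$WORKBENCH_nf_files'/viralrecon_outdir1
# prints: $WORKBENCH_nf_files/viralrecon_outdir1
```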

Summary

This tutorial showed two examples of running Nextflow pipelines on Workbench. One was an example from https://github.com/nextflow-io/, and the other showed how to set up and use an nf-core config file with any nf-core pipeline.

Workbench makes it easy to set up config files that need minimal editing to work in any Workbench app and to run pipeline tasks scalably in the cloud.

Last Modified: 20 June 2025