Get started with Nextflow on Verily Workbench

Step-by-step instructions for running a Nextflow pipeline

Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow

Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.


Introduction

Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.

The following guide will help you create a workspace with a cloud app that runs Nextflow. There are two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline; the nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.

Both examples include configurations for Google Batch API as the Nextflow pipeline process.executor. This allows the pipelines to be run at scale, with Nextflow processes executed on separate cloud virtual machines.

Step-by-step guide

1. Create a workspace

If you don't already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or Workbench UI.

Log in to the Workbench UI and follow the Create a new workspace instructions.

First, check wb status. If you're not logged in, or no workspace is set, log in and set the workspace you want to work in. Otherwise, create a new workspace.

wb status
wb auth login  # if need be
wb workspace list

To create a new workspace:

wb workspace create --name=<workspace-name>

To set the Workbench CLI to use an existing workspace:

wb workspace set --id=<workspace-id>

2. Create workspace resources

If you haven't already, you'll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.

We'll also create a Git resource that points to the Nextflow example repo. Any JupyterLab apps that you subsequently create in your workspace will automatically clone that repo for you.

To create a Cloud Storage bucket resource via the Workbench UI, go to the Resources tab, select + Add new resource, and then select New Cloud Storage bucket (see detailed instructions here). Note the name of this resource, which you'll need below. In this example, we'll name it nf_files.

Then, go to the Apps tab in your workspace and select + Add repository in the Git repositories box (see details here).

Add the following repositories:

https://github.com/nextflow-io/rnaseq-nf.git
https://github.com/nf-core/configs.git

These repositories are public, so for this example, you don't need to set up the Workbench SSH keys.

First, create a GCS bucket for Nextflow run logs and outputs.

wb resource create gcs-bucket --id=nf_files  \
  --description="Bucket for Nextflow run logs and output."
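Within a Workbench app, a bucket resource is exposed through an environment variable named after its resource id; this is how the $WORKBENCH_nf_files references in the config files used later resolve to your bucket. A minimal sketch, using a hypothetical bucket value for illustration:

```shell
# Inside a Workbench app, a bucket resource with id nf_files is exposed as
# $WORKBENCH_nf_files. The value below is hypothetical, for illustration only.
WORKBENCH_nf_files="gs://example-workspace-bucket"
echo "Nextflow staging root: $WORKBENCH_nf_files"
```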

Then, create referenced resources to the Git repositories we'll use for these examples:

wb resource add-ref git-repo --id=rnaseq-nf --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core --repo-url=https://github.com/nf-core/configs.git

You can list your newly created resources:

wb resource list

3. Create an app to run Nextflow

Next, create a Workbench notebook app to run the Nextflow examples.

Your app will have Nextflow pre-installed, and any workspace Git repo resources — such as the ones we just defined — will be automatically cloned. However, if you want to run the example on your local machine, you can install Nextflow yourself.

Go to the Apps tab, select + New app instance, and then select JupyterLab (see detailed instructions here).

Once your app is running, you can select the link next to it to bring up JupyterLab on the new app instance.

Run the following to create a JupyterLab app:

wb app create gcp --app-config=jupyter-lab \
  --id=nextflow-jupyterlab \
  --description="JupyterLab notebook for running Nextflow"

After your notebook resource is created, you can see its details via:

wb resource describe --id <notebook_resource_id>

Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new app.

The info in the resource description also indicates whether your app is RUNNING or TERMINATED. You can stop your app when you’re not using it, and then restart it, via: wb app stop --id <notebook_resource_id> and wb app start --id <notebook_resource_id>.

4. Configure and run the example 'rnaseq-nf' Nextflow pipeline

Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new app, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it's not there, you can run wb git clone --all.)

In the JupyterLab Terminal window, change to the rnaseq-nf directory:

cd rnaseq-nf

Edit the Nextflow configuration file

Next, we'll use the Google Batch API as the pipeline process executor.

In the rnaseq-nf directory, edit the google-batch entry in the nextflow.config file. Edit the workDir line to replace <your_bucket_resource_name> with the name of the Cloud Storage bucket resource you created earlier. If you’ve forgotten the bucket resource name, you can find it in the Resources tab of your workspace or by running wb resource list.

'google-batch' {
    // Workflow params
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'

    // Google Batch config
    process.executor = 'google-batch'
    process.container = 'nextflow/rnaseq-nf:latest'
    // Edit the following line for your bucket resource
    // Use a unique name for each run to avoid cluttering results
    workDir = "$WORKBENCH_nf_files/rnaseq/output1"

    google.region  = "$PROJECT_DEFAULT_REGION"
    google.project = "$GOOGLE_CLOUD_PROJECT"

    google.batch.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    google.batch.usePrivateAddress = true
    google.batch.network = 'global/networks/network'
    google.batch.subnetwork = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
}
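If you prefer to script the substitution rather than edit the file by hand, a sed one-liner can do it. This is a sketch that operates on a stand-in file, assuming the placeholder text matches your copy of the config; in practice you would point sed at rnaseq-nf/nextflow.config:

```shell
# Sketch only: substitute your bucket resource name into the config.
# Uses GNU sed (-i edits in place); operates on a stand-in file here.
cat > /tmp/nextflow.config.example <<'EOF'
workDir = "$WORKBENCH_<your_bucket_resource_name>/rnaseq/output1"
EOF
sed -i 's/<your_bucket_resource_name>/nf_files/' /tmp/nextflow.config.example
cat /tmp/nextflow.config.example
```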

In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the instructions above), and sanity-check your config changes.

cd ..
wb nextflow config rnaseq-nf/main.nf -profile google-batch

You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.

Run the Nextflow example workflow via wb

After you check your config, you’re ready to run the Nextflow example via wb. wb will substitute the correct values for the environment variables in the config before it runs.

In the parent directory of rnaseq-nf, run:

wb nextflow run rnaseq-nf/main.nf -profile google-batch # the Google Batch API job executor is declared here

The workflow will take about 10 minutes to complete.

Find task logs and outputs

When Nextflow executes tasks, each task is assigned a unique hash. During execution, you'll see output like:

executor >  google-batch (2)
[71/2a7061] RNASEQ:INDEX (transcript)     [  0%] 0 of 1
[c7/621631] RNASEQ:FASTQC (FASTQC on gut) [  0%] 0 of 1

To find the output of individual tasks, navigate to your bucket's scratch directory and find the folder corresponding to the task hash. In this example, one task's output would be in ~/workspace/nf_files/rnaseq/output1/71 and another in ~/workspace/nf_files/rnaseq/output1/c7. For example, the MultiQC task's folder contains a multiqc_report.html with the aggregated report.
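The mapping from a task's hash to its work directory is simple path construction: the hash's two-character prefix and its remainder are appended to workDir. A sketch, with a hypothetical bucket path:

```shell
# Each task's work directory is workDir plus its hash, e.g. 71/2a7061
# (the console shows a truncated hash; the directory name starts with it).
# The bucket path below is hypothetical; use your own workDir value.
workdir="gs://my-bucket/rnaseq/output1"
task_hash="71/2a7061"
task_dir="$workdir/$task_hash"
echo "$task_dir"
```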

5. Configure and run an nf-core example

The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so once your config file is set up for one, you should be able to test any of them.

Edit the nf-core googlebatch.config configuration file

In the left-hand File navigator, navigate to the ~/repos/nf-core/conf directory. (If you gave the repo a different name in Step #2, find its folder instead under repos.)

In the ~/repos/nf-core/conf/googlebatch.config file, update google_zone to the zone your VM was created in.

// Nextflow config file for running on Google Batch API
// Edit the 'google_bucket' param before using.
params {
    config_profile_description = 'Google Cloud Batch API Profile'
    config_profile_contact     = 'Hatem Nawar @hnawar'
    config_profile_url         = 'https://cloud.google.com/batch'

    //project
    project_id                 = "$GOOGLE_CLOUD_PROJECT"
    location                   = "$PROJECT_DEFAULT_REGION"
    google_zone                = '<your zone>'
    // Use a unique name for each run to avoid cluttering results
    workdir_bucket             = "$WORKBENCH_nf_files/nf-core/output1"

    //compute
    use_spot                   = false
    boot_disk                  = '100 GB'
    workers_service_account    = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"

    //networking
    use_private_ip             = true
    // Custom VPC should be in this format 'global/networks/[custom_VPC]'
    custom_vpc                 = 'global/networks/network'
    //Custom subnet should be in this format 'regions/[GCP_Region]/subnetworks/[custom_subnet]'
    custom_subnet              = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"

    google_debug = true
    google_preemptible = true
}

workDir = params.workdir_bucket

process {
    executor = 'google-batch'
}
google {
    zone                      = params.google_zone
    location                  = params.location
    project                   = params.project_id
    batch.network             = params.custom_vpc
    batch.subnetwork          = params.custom_subnet
    batch.usePrivateAddress   = params.use_private_ip
    batch.spot                = params.use_spot
    batch.serviceAccountEmail = params.workers_service_account
    batch.bootDiskSize        = params.boot_disk
    batch.debug               = params.google_debug
    batch.preemptible         = params.google_preemptible
}
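To find the zone your app VM is running in (for the google_zone setting above), you can query the GCE metadata server from a terminal on the VM. A sketch, with an example response parsed in shell:

```shell
# On a Google Cloud VM, the metadata server reports the VM's zone:
#   curl -s -H "Metadata-Flavor: Google" \
#     http://metadata.google.internal/computeMetadata/v1/instance/zone
# The response is a path like the example below; keep only the last segment.
zone_path="projects/123456789/zones/us-central1-a"  # example response
zone="${zone_path##*/}"
echo "$zone"
```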

Run an nf-core pipeline via wb

For this example, you'll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.

You can first confirm your config by running the following command in the terminal. This will check out the given pipeline from its repo and make it available locally if need be. Run this command from the nf-core (repository checkout) directory.

Choose the appropriate job executor for the profile. The following commands select Google Batch API.

cd ~/repos/nf-core
wb nextflow config nf-core/viralrecon -profile googlebatch # the Google Batch API job executor is declared here

Then, run the pipeline, still in the nf-core repo checkout directory in the terminal. The outdir holds run results, so use a different 'outdir' path for each run. wb will substitute the correct values for the environment variables in the config before it runs.

# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile googlebatch --outdir '$WORKBENCH_nf_files'/viralrecon_outdir1

This run takes approximately 1 hour.
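One way to keep each run's outdir unique is to timestamp it. A sketch; the naming scheme is just a choice, not anything Workbench requires:

```shell
# Generate a unique outdir name per run by appending a timestamp.
outdir_name="viralrecon_outdir_$(date +%Y%m%d_%H%M%S)"
echo "$outdir_name"
```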

Find nf-core task logs and outputs

You can find the per-task run output in ~/workspace/nf_files/nf-core/output1, organized by the hashes generated during the run. The final output of the workflow will be in viralrecon_outdir1.

Last Modified: 12 September 2025