Get started with Nextflow on Verily Workbench
Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.
Introduction
Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.
The following guide will help you create a workspace with a cloud app that runs Nextflow. There are two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline. The nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.
Both examples include configurations for the Google Batch API as the Nextflow pipeline process.executor. This allows the pipelines to run at scale, with Nextflow processes executed on separate cloud virtual machines.
Note
On Workbench, the easiest way to get started using Nextflow is via an app, where Nextflow and the Workbench CLI (command-line interface) are already installed for you. This tutorial walks through that process.
Step-by-step guide
1. Create a workspace
If you don't already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or Workbench UI.
Log in to the Workbench UI and follow the Create a new workspace instructions.
First, check wb status. If you're not logged in, or no workspace is set, log in and set the workspace that you want to work in. Otherwise, create a new workspace.
wb status
wb auth login # if need be
wb workspace list
To create a new workspace:
wb workspace create --name=<workspace-name>
To set the Workbench CLI to use an existing workspace:
wb workspace set --id=<workspace-id>
2. Create workspace resources
If you haven't already, you'll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.
We'll also create a Git resource that points to the Nextflow example repo. Any JupyterLab apps that you subsequently create in your workspace will automatically clone that repo for you.
To create a Cloud Storage bucket
resource via the Workbench UI, go to the Resources tab, select + Add new resource,
and then select New Cloud Storage bucket (see detailed instructions
here). Note the name
of this resource, which you'll need below. In this example, we'll name it nf_files.
Then, go to the Apps tab in your workspace and select + Add repository in the Git repositories box (see details here).
Add the following repositories:
- Repository URL https://github.com/nextflow-io/rnaseq-nf.git, named rnaseq-nf.
- Repository URL https://github.com/nf-core/configs.git, named nf-core.
These repositories are public, so for this example, you don't need to set up the Workbench SSH keys.
First, create a GCS bucket for Nextflow run logs and outputs.
wb resource create gcs-bucket --id=nf_files \
--description="Bucket for Nextflow run logs and output."
Then, create referenced resources to the Git repositories we'll use for these examples:
wb resource add-ref git-repo --id=rnaseq-nf --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core --repo-url=https://github.com/nf-core/configs.git
You can list your newly created resources:
wb resource list
3. Create an app to run Nextflow
Next, create a Workbench notebook app to run the Nextflow examples. Your app will have Nextflow pre-installed, and any workspace Git repo resources (such as the ones we just defined) will be automatically cloned. However, if you want to run the example on your local machine, you can install Nextflow yourself.
Go to the Apps tab, select + New app instance, and then select JupyterLab (see detailed instructions here).
Once your app is running, you can select the link next to it to bring up JupyterLab on the new app instance.
Run the following to create a JupyterLab app:
wb app create gcp --app-config=jupyter-lab \
--id=nextflow-jupyterlab \
--description="JupyterLab notebook for running Nextflow"
After your notebook resource is created, you can see its details via:
wb resource describe --id <notebook_resource_id>
Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new app.
The info in the resource description also indicates whether your app is RUNNING or TERMINATED.
You can stop your app when you’re not using it, and then restart it, via:
wb app stop --id <notebook_resource_id> and wb app start --id <notebook_resource_id>.
4. Configure and run the example 'rnaseq-nf' Nextflow pipeline
Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new app, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it's not there, you can run wb git clone --all.)
In the JupyterLab Terminal window, change to
the rnaseq-nf directory:
cd rnaseq-nf
Edit the Nextflow configuration file
Next, we'll use the Google Batch API as the pipeline process executor.
In the rnaseq-nf directory, edit the google-batch entry in the nextflow.config file. Edit the
workDir line to replace <your_bucket_resource_name> with the name of the Cloud
Storage bucket resource you created earlier. If you’ve forgotten the bucket resource name, you can
find it in the Resources tab of your workspace or by running wb resource list.
'google-batch' {
// Workflow params
params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
params.multiqc = 'gs://rnaseq-nf/multiqc'
// Google Batch config
process.executor = 'google-batch'
process.container = 'nextflow/rnaseq-nf:latest'
// Edit the following line for your bucket resource
// Use a unique name for each run to avoid cluttering results
workDir = "$WORKBENCH_nf_files/rnaseq/output1"
google.region = "$PROJECT_DEFAULT_REGION"
google.project = "$GOOGLE_CLOUD_PROJECT"
google.batch.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
google.batch.usePrivateAddress = true
google.batch.network = 'global/networks/network'
google.batch.subnetwork = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
}
Note
Workbench supports running Nextflow via the wb nextflow "passthrough" command. When you use this construct, you can add Workbench-specific environment variables to Nextflow configuration files. For example, you can use the $WORKBENCH_nf_files construct, and it will be expanded to gs://<underlying_GCS_bucket>. In addition, in a Workbench app, variables like $GOOGLE_SERVICE_ACCOUNT_EMAIL and $GOOGLE_CLOUD_PROJECT will be set.
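As a rough sketch of what this substitution looks like (the bucket URL below is a made-up value, and wb's actual mechanism may differ), a config line referencing $WORKBENCH_nf_files expands to the underlying bucket URL:

```shell
#!/usr/bin/env sh
# Illustration only: how a $WORKBENCH_<bucket_id> reference in a config line
# expands to the underlying bucket URL. The bucket URL is an assumed example.
WORKBENCH_nf_files="gs://example-underlying-bucket"
line='workDir = "$WORKBENCH_nf_files/rnaseq/output1"'
# Replace the literal variable reference with its value.
expanded=$(printf '%s\n' "$line" | sed "s|\$WORKBENCH_nf_files|$WORKBENCH_nf_files|")
echo "$expanded"
# workDir = "gs://example-underlying-bucket/rnaseq/output1"
```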
In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the
instructions above), and sanity-check your config changes.
cd ..
wb nextflow config rnaseq-nf/main.nf -profile google-batch
You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.
Run the Nextflow example workflow via wb
After you check your config, you’re ready to run the Nextflow example via wb. wb will substitute
the correct values for the environment variables in the config before it runs.
In the parent directory of rnaseq-nf, run:
wb nextflow run rnaseq-nf/main.nf -profile google-batch # the Google Batch API job executor is declared here
The workflow will take about 10 minutes to complete.
Find task logs and outputs
When Nextflow executes tasks, each task is assigned a unique hash. During execution, you'll see output like:
executor > google-batch (2)
[71/2a7061] RNASEQ:INDEX (transcript) [ 0%] 0 of 1
[c7/621631] RNASEQ:FASTQC (FASTQC on gut) [ 0%] 0 of 1
To find the output of individual tasks, navigate to your bucket's scratch directory and find the corresponding folder. In this example, one task's output would be in folder ~/workspace/nf_files/rnaseq/output1/71 and another in folder ~/workspace/nf_files/rnaseq/output1/c7. You can find a multiqc_report.html in the output of the corresponding task.
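The two-character prefix of each task hash is the top-level folder name under the pipeline's workDir. A small sketch of pulling that prefix out of a status line (the status line below is copied from the example output above):

```shell
#!/usr/bin/env sh
# Parse the task hash from a Nextflow status line and derive the
# work-directory prefix (e.g. "71" for hash 71/2a7061).
status_line='[71/2a7061] RNASEQ:INDEX (transcript) [  0%] 0 of 1'
hash=${status_line#\[}       # strip the leading "["      -> 71/2a7061] ...
hash=${hash%%\]*}            # strip "]" and what follows -> 71/2a7061
prefix=${hash%%/*}           # directory prefix           -> 71
echo "Task work files live under: ~/workspace/nf_files/rnaseq/output1/$prefix"
```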
5. Configure and run a nf-core example
The nf-core project provides a curated set of analysis pipelines built using Nextflow. Because nf-core pipelines adhere to strict guidelines, once your config file is set up for one pipeline, you should be able to test any nf-core pipeline with it.
Edit the nf-core googlebatch.config configuration file
In the left-hand File navigator, navigate to the ~/repos/nf-core/conf directory. (If you gave the
repo a different name in Step #2, find its folder instead under repos.)
In the ~/repos/nf-core/conf/googlebatch.config file, update google_zone to the zone your VM was
created in.
// Nextflow config file for running on Google Batch API
// Edit the 'google_bucket' param before using.
params {
config_profile_description = 'Google Cloud Batch API Profile'
config_profile_contact = 'Hatem Nawar @hnawar'
config_profile_url = 'https://cloud.google.com/batch'
//project
project_id = "$GOOGLE_CLOUD_PROJECT"
location = "$PROJECT_DEFAULT_REGION"
google_zone = '<your zone>'
// Use a unique name for each run to avoid cluttering results
workdir_bucket = "$WORKBENCH_nf_files/nf-core/output1"
//compute
use_spot = false
boot_disk = '100 GB'
workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
//networking
use_private_ip = true
// Custom VPC should be in this format 'global/networks/[custom_VPC]'
custom_vpc = 'global/networks/network'
//Custom subnet should be in this format 'regions/[GCP_Region]/subnetworks/[custom_subnet]'
custom_subnet = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
google_debug = true
google_preemptible = true
}
workDir = params.workdir_bucket
process {
executor = 'google-batch'
}
google {
zone = params.google_zone
location = params.location
project = params.project_id
batch.network = params.custom_vpc
batch.subnetwork = params.custom_subnet
batch.usePrivateAddress = params.use_private_ip
batch.spot = params.use_spot
batch.serviceAccountEmail = params.workers_service_account
batch.bootDiskSize = params.boot_disk
batch.debug = params.google_debug
batch.preemptible = params.google_preemptible
}
Run a nf-core pipeline via wb
For this example, you'll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.
You can first confirm your config by running the following command in the terminal. This will check
out the given pipeline from its repo and make it available locally if need be. Run this command from
the nf-core (repository checkout) directory.
Choose the appropriate job executor for the profile. The following commands select Google Batch API.
cd ~/repos/nf-core
wb nextflow config nf-core/viralrecon -profile googlebatch # the Google Batch API job executor is declared here
Then, run the pipeline, still in the nf-core repo checkout directory in the terminal. The outdir holds run results, so use a different 'outdir' path each time you run the pipeline. Running wb will substitute the correct values for the environment variables in the config before it runs.
# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile googlebatch --outdir '$WORKBENCH_nf_files'/viralrecon_outdir1
This run takes approximately 1 hour.
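Since each run needs a distinct outdir, one simple approach (a sketch only; the naming scheme is just a suggestion, not part of wb or nf-core) is to derive the suffix from a timestamp:

```shell
#!/usr/bin/env sh
# Generate a unique outdir path per run by appending a timestamp suffix.
# The single quotes keep $WORKBENCH_nf_files literal so that wb, not the
# shell, expands it -- matching the command shown above.
run_id=$(date +%Y%m%d_%H%M%S)
outdir='$WORKBENCH_nf_files'/viralrecon_outdir_$run_id
echo "$outdir"
# Then pass it along, e.g.:
#   wb nextflow run nf-core/viralrecon -profile googlebatch --outdir "$outdir"
```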
Find nf-core task logs and outputs
You can find all the run output per task in ~/workspace/nf_files/nf-core/output1 using the hash
generated during the run. The final output of the workflow will be in viralrecon_outdir1.
Last Modified: 12 September 2025