Get started with Nextflow on Verily Workbench
Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.
Introduction
Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.
The following guide will help you create a workspace with a cloud app that runs Nextflow. There are two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline. The nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.
Both examples include configurations for the Google Batch API as the Nextflow pipeline process.executor. This allows the pipelines to be run at scale, with Nextflow processes executed on separate cloud virtual machines.
Note
On Workbench, the easiest way to get started using Nextflow is via an app, where Nextflow and the Workbench CLI (command-line interface) are already installed for you. This tutorial walks through that process.
Step-by-step guide
1. Create a workspace
If you don't already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or Workbench UI.
Log in to the Workbench UI and follow the Create a new workspace instructions.
Check wb status. If you're not logged in, or no workspace is set, log in and set the workspace you want to work in. Otherwise, create a new workspace.
wb status
wb auth login # if need be
wb workspace list
To create a new workspace:
wb workspace create --id=<workspace-name> --pod=<pod-id>
To set the Workbench CLI to use an existing workspace:
wb workspace set --id=<workspace-id>
2. Create workspace resources
If you haven't already, you'll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.
We'll also create a Git resource that points to the Nextflow example repo. Any JupyterLab apps that you subsequently create in your workspace will automatically clone that repo for you.
To create a Cloud Storage bucket resource via the Workbench UI, go to the Resources tab, select + Add new resource, and then select New Cloud Storage bucket (see detailed instructions here). Note the name of this resource, which you'll need below. In this example, we'll name it nf_files.
Then, go to the Apps tab in your workspace and select + Add repository in the Git repositories box (see details here).
Add the following repositories:
- Repository URL https://github.com/nextflow-io/rnaseq-nf.git, named rnaseq-nf.
- Repository URL https://github.com/nf-core/configs.git, named nf-core.
These repositories are public, so for this example, you don't need to set up the Workbench SSH keys.
First, create a GCS bucket for Nextflow run logs and outputs.
wb resource create gcs-bucket --id=nf_files \
--description="Bucket for Nextflow run logs and output."
Then, create referenced resources to the Git repositories we'll use for these examples:
wb resource add-ref git-repo --id=rnaseq-nf --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core --repo-url=https://github.com/nf-core/configs.git
You can list your newly created resources:
wb resource list
3. Create an app to run Nextflow
Next, create a Workbench notebook app to run the Nextflow examples. Your app will have Nextflow pre-installed, and any workspace Git repo resources (such as the ones we just defined) will be automatically cloned. However, if you want to run the example on your local machine, you can install nextflow yourself.
Go to the Apps tab, select + New app instance, and then select JupyterLab (see detailed instructions here).
Once your app is running, you can select the link next to it to bring up JupyterLab on the new app instance.
Run the following to create a JupyterLab app:
wb app create gcp --app-config=jupyter-lab \
--id=nextflow-jupyterlab \
--description="JupyterLab notebook for running Nextflow"
After your notebook resource is created, you can see its details via:
wb resource describe --id <notebook_resource_id>
Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new app.
The info in the resource description also indicates whether your app is RUNNING or TERMINATED.
You can stop your app when you’re not using it, and then restart it, via:
wb app stop --id <notebook_resource_id> and wb app start --id <notebook_resource_id>.
4. Configure and run the example 'rnaseq-nf' Nextflow pipeline
Because you created Git repo resources, the rnaseq-nf example should be automatically cloned into your new app, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it's not there, you can run wb git clone --all.)
In the JupyterLab Terminal window, change to the rnaseq-nf directory:
cd repos/rnaseq-nf
Edit the Nextflow configuration file
Next, we'll use the Google Batch API as the pipeline process executor.
In the rnaseq-nf directory, edit the google-batch entry in the nextflow.config file. In the workDir line, replace the placeholder bucket path with the full name of the Cloud Storage bucket resource you created earlier; it should be something like gs://nf-files-<google-project-id>. You can find the full bucket name in the Resources tab of your workspace or by running wb resource list.
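If you prefer to construct the full bucket URI in the terminal, the following sketch derives it from the resource name. It assumes Workbench replaces underscores with hyphens and appends the project ID when naming the bucket (inferred from the example name above; confirm the real name with wb resource list):

```shell
# Sketch: derive the full bucket URI from the resource ID and project ID.
# ASSUMPTION: the generated bucket name is the resource ID with underscores
# replaced by hyphens, followed by the project ID. Verify with `wb resource list`.
RESOURCE_ID="nf_files"
GOOGLE_CLOUD_PROJECT="my-project"   # set automatically inside a Workbench app
BUCKET="gs://$(echo "$RESOURCE_ID" | tr '_' '-')-${GOOGLE_CLOUD_PROJECT}"
echo "$BUCKET"
# → gs://nf-files-my-project
```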
In addition, add the google.project and four google.batch values indicated below.
'google-batch' {
    // Workflow params
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    // Google Batch config
    process.executor = 'google-batch'
    process.container = 'nextflow/rnaseq-nf:latest'
    // Edit the following line for your bucket resource.
    // Use a unique name for each run to avoid cluttering results. Here, we append /output1.
    workDir = "gs://nf-files-<your-google-project-id>/output1"
    google.region = "$PROJECT_DEFAULT_REGION"
    google.project = "$GOOGLE_CLOUD_PROJECT"
    google.batch.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    google.batch.usePrivateAddress = true
    google.batch.network = 'global/networks/network'
    google.batch.subnetwork = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
}
Note
In a Workbench app, environment variables like $GOOGLE_SERVICE_ACCOUNT_EMAIL, $GOOGLE_CLOUD_PROJECT, and $PROJECT_DEFAULT_REGION will automatically be set.
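You can verify this from the app's terminal. The snippet below prints each variable, falling back to a visible "<unset>" marker when a variable is missing (as it would be outside a Workbench app):

```shell
# Print the Workbench-provided environment variables. Each should show a
# non-empty value inside a Workbench app; ${VAR:-<unset>} prints "<unset>"
# if the variable is not set (e.g., when run outside Workbench).
echo "Project:         ${GOOGLE_CLOUD_PROJECT:-<unset>}"
echo "Region:          ${PROJECT_DEFAULT_REGION:-<unset>}"
echo "Service account: ${GOOGLE_SERVICE_ACCOUNT_EMAIL:-<unset>}"
```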
In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the
instructions above), and sanity-check your config changes.
cd ..
wb nextflow config rnaseq-nf/main.nf -profile google-batch
You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.
Run the Nextflow example workflow via wb
After you check your config, you’re ready to run the Nextflow example via wb. wb will substitute
the correct values for the environment variables in the config before it runs.
In the parent directory of rnaseq-nf, run:
wb nextflow run rnaseq-nf/main.nf -profile google-batch # the Google Batch API job executor is declared here
The workflow will take about 40 minutes to complete.
Find task logs and outputs
When Nextflow executes tasks, each task is assigned a unique hash. During execution, you'll see output like:
executor > google-batch (2)
[71/2a7061] RNASEQ:INDEX (transcript) [ 0%] 0 of 1
[c7/621631] RNASEQ:FASTQC (FASTQC on gut) [ 0%] 0 of 1
To find the output of individual tasks, navigate to your bucket's scratch directory and find the corresponding folder. In this example, one task's output would be in the folder ~/workspace/nf_files/output1/71 and another in ~/workspace/nf_files/output1/c7. For example, you can find a multiqc_report.html in the folder of the task that produced it.
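The bracketed field at the start of each status line is the task's short hash, which doubles as the prefix of that task's work directory under workDir. A small sketch (the bucket path and hash below are illustrative placeholders; substitute your own workDir and a hash from your run output):

```shell
# Map a task's short hash (from the Nextflow status line) to its work
# directory prefix under workDir. Values below are placeholders.
WORKDIR="gs://nf-files-my-project/output1"   # your workDir from nextflow.config
TASK_HASH="71/2a7061"                        # short hash printed by Nextflow
echo "${WORKDIR}/${TASK_HASH}"
# → gs://nf-files-my-project/output1/71/2a7061
# The actual directory name extends the second component, so list it with
# a wildcard (requires gsutil and access to the bucket):
# gsutil ls "${WORKDIR}/${TASK_HASH}"*
```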
5. Configure and run a nf-core example
The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so if your setup works for one of them, it should work for all of them. Once your config file is set up, you should be able to test any nf-core pipeline.
Edit the nf-core googlebatch.config configuration file
In the left-hand File navigator, navigate to the ~/repos/nf-core/conf directory. (If you gave the
repo a different name in Step #2, find its folder instead under repos.)
Open the googlebatch.config file and copy and paste the following. The value for google_zone
should be the zone your VM was created in, and the value for workdir_bucket should be your GCS
bucket's gsutil URI.
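If you're unsure of your VM's zone, the GCE metadata server reports it (assuming your app runs on a Compute Engine VM with metadata access). It returns a full resource path, so keep only the trailing component; a sample value stands in for a live response below so the snippet is self-contained:

```shell
# The metadata server returns the zone as a full resource path; keep only
# the last path component. Sample value used for illustration.
ZONE_PATH="projects/123456/zones/us-central1-a"
echo "${ZONE_PATH##*/}"
# → us-central1-a

# On the VM itself you could fetch the live value with:
# curl -s -H "Metadata-Flavor: Google" \
#   "http://metadata.google.internal/computeMetadata/v1/instance/zone"
```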
// Nextflow config file for running on Google Batch API
params {
    config_profile_description = 'Google Cloud Batch API Profile'
    config_profile_contact = 'Hatem Nawar @hnawar'
    config_profile_url = 'https://cloud.google.com/batch'
    // project
    project_id = "$GOOGLE_CLOUD_PROJECT"
    location = "$PROJECT_DEFAULT_REGION"
    google_zone = '<your-zone>'
    // Use a unique name for each run to avoid cluttering results. Here, we append /output1.
    workdir_bucket = "gs://nf-files-<your-google-project-id>/output1"
    // compute
    use_spot = false
    boot_disk = '100 GB'
    workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    // networking
    use_private_ip = true
    // Custom VPC should be in this format: 'global/networks/[custom_VPC]'
    custom_vpc = 'global/networks/network'
    // Custom subnet should be in this format: 'regions/[GCP_Region]/subnetworks/[custom_subnet]'
    custom_subnet = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
    google_debug = true
    google_preemptible = true
}

workDir = params.workdir_bucket

process {
    executor = 'google-batch'
}

google {
    zone = params.google_zone
    location = params.location
    project = params.project_id
    batch.network = params.custom_vpc
    batch.subnetwork = params.custom_subnet
    batch.usePrivateAddress = params.use_private_ip
    batch.spot = params.use_spot
    batch.serviceAccountEmail = params.workers_service_account
    batch.bootDiskSize = params.boot_disk
    batch.debug = params.google_debug
    batch.preemptible = params.google_preemptible
}
Run a nf-core pipeline via wb
For this example, you'll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.
You can first confirm your config by running the following command in the terminal. This will check
out the given pipeline from its repo and make it available locally if need be. Run this command from
the nf-core (repository checkout) directory.
wb nextflow config nf-core/viralrecon -profile googlebatch # the Google Batch API job executor is declared here
Next, download the test input data, create a sample sheet, and stage the reference genome in your Cloud Storage bucket.
cd ~/repos/nf-core
mkdir datasets
cd datasets
# Download sequencing reads
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/enterovirus/SRR13266665_1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/enterovirus/SRR13266665_2.fastq.gz
# Create the sample sheet
echo -e "sample,fastq_1,fastq_2\nSAMPLE_1,datasets/SRR13266665_1.fastq.gz,datasets/SRR13266665_2.fastq.gz" > samplesheet.csv
# Upload reference genome to gcs
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/genome/NC_002058.3/GCF_000861165.1_ViralProj15288_genomic.fna.gz -O reference.fna.gz
gsutil cp reference.fna.gz gs://nf-files-<your-google-project-id>/
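As a quick sanity check of the sample sheet format (every row should have exactly three comma-separated columns), you can run something like the following; it recreates the file so the snippet is self-contained:

```shell
# Recreate the sample sheet, then verify every row has exactly three columns.
printf 'sample,fastq_1,fastq_2\nSAMPLE_1,datasets/SRR13266665_1.fastq.gz,datasets/SRR13266665_2.fastq.gz\n' > samplesheet.csv
awk -F',' 'NF != 3 { bad = 1 } END { exit bad }' samplesheet.csv && echo "samplesheet OK"
# → samplesheet OK
```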
Then, run the pipeline from the nf-core repo directory in the terminal. The outdir flag value
holds run results, so each time you run the pipeline, use a different outdir path (e.g.,
viralrecon_outdir2).
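One simple way to get a fresh outdir for every run is to append a timestamp to the path (the bucket name below is a placeholder; use your own bucket's URI):

```shell
# Build a unique output path per run by appending a timestamp.
# BUCKET is a placeholder; substitute your bucket's gsutil URI.
BUCKET="gs://nf-files-my-project"
OUTDIR="${BUCKET}/viralrecon_outdir_$(date +%Y%m%d_%H%M%S)"
echo "$OUTDIR"
# e.g. gs://nf-files-my-project/viralrecon_outdir_20250101_120000
```

You can then pass "$OUTDIR" as the --outdir value in the run command below.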
# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile googlebatch \
  --outdir "gs://nf-files-<your-google-project-id>/viralrecon_outdir1" \
  --input "datasets/samplesheet.csv" \
  --platform illumina \
  --project_id="$GOOGLE_CLOUD_PROJECT" \
  --workdir_bucket="gs://nf-files-<your-google-project-id>/work" \
  --fasta "gs://nf-files-<your-google-project-id>/reference.fna.gz"
This run takes approximately 1 hour.
Find nf-core task logs and outputs
You can find the per-task run output in ~/workspace/nf_files/nf-core/output1, organized by the hashes generated during the run. The final output of the workflow will be in viralrecon_outdir1.
Last Modified: 8 December 2025