Get started with Nextflow on Verily Workbench
Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.
Introduction
Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.
The following guide will help you create a workspace with a cloud app that runs Nextflow. There are two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline. The nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.
Both examples include configurations for the Google Batch API as the Nextflow pipeline process.executor. This allows the pipelines to be run at scale, with Nextflow processes executed on separate cloud virtual machines.
Note
On Workbench, the easiest way to get started using Nextflow is via an app, where Nextflow and the Workbench CLI (command-line interface) are already installed for you. This tutorial walks through that process.
Step-by-step guide
1. Create a workspace
If you don't already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or Workbench UI.
Log in to the Workbench UI and follow the Create a new workspace instructions.
Check wb status. If you're not logged in, or no workspace is set, log in and set the workspace you want to work in. Otherwise, create a new workspace.
wb status
wb auth login # if need be
wb workspace list
To create a new workspace:
wb workspace create --id=<workspace-name> --pod=<pod-id>
To set the Workbench CLI to use an existing workspace:
wb workspace set --id=<workspace-id>
2. Create workspace resources
If you haven't already, you'll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.
We'll also create a Git resource that points to the Nextflow example repo. Any JupyterLab apps that you subsequently create in your workspace will automatically clone that repo for you.
To create a Cloud Storage bucket resource via the Workbench UI, go to the Resources tab, select + Add new resource, and then select New Cloud Storage bucket (see detailed instructions here). Note the name of this resource, which you'll need below. In this example, we'll name it nf_files.
Then, go to the Apps tab in your workspace and select + Add repository in the Git repositories box (see details here).
Add the following repositories:
- Repository URL https://github.com/nextflow-io/rnaseq-nf.git, named rnaseq-nf.
- Repository URL https://github.com/nf-core/configs.git, named nf-core.
These repositories are public, so for this example, you don't need to set up the Workbench SSH keys.
First, create a GCS bucket for Nextflow run logs and outputs.
wb resource create gcs-bucket --id=nf_files \
--description="Bucket for Nextflow run logs and output."
Then, create referenced resources to the Git repositories we'll use for these examples:
wb resource add-ref git-repo --id=rnaseq-nf --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core --repo-url=https://github.com/nf-core/configs.git
You can list your newly created resources:
wb resource list
3. Create an app to run Nextflow
Next, create a Workbench notebook app to run the Nextflow examples. Your app will have Nextflow pre-installed, and any workspace Git repo resources (such as the ones we just defined) will be automatically cloned. However, if you want to run the example on your local machine, you can install nextflow yourself.
Go to the Apps tab, select + New app instance, and then select JupyterLab (see detailed instructions here).
Once your app is running, you can select the link next to it to bring up JupyterLab on the new app instance.
Run the following to create a JupyterLab app:
wb app create gcp --app-config=jupyter-lab \
--id=nextflow-jupyterlab \
--description="JupyterLab notebook for running Nextflow"
After your notebook resource is created, you can see its details via:
wb resource describe --id <notebook_resource_id>
Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new app.
The info in the resource description also indicates whether your app is RUNNING or TERMINATED.
You can stop your app when you’re not using it, and then restart it, via:
wb app stop --id <notebook_resource_id> and wb app start --id <notebook_resource_id>.
4. Configure and run the example 'rnaseq-nf' Nextflow pipeline
Because you created Git repo resources, the rnaseq-nf example should be automatically cloned into your new app, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it's not there, you can run wb git clone --all.)
In the JupyterLab Terminal window, change to the rnaseq-nf directory:
cd repos/rnaseq-nf
Edit the Nextflow configuration file
Next, we'll use the Google Batch API as the pipeline process executor.
In the rnaseq-nf directory, edit the google-batch entry in the nextflow.config file. In the workDir line, replace the placeholder bucket path with the full name of the Cloud Storage bucket resource you created earlier; it should be something like gs://nf-files-<google-project-id>. You can find the full bucket name in the Resources tab of your workspace or by running wb resource list.
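If you prefer to construct the full bucket URI in the terminal, the following sketch derives it from the resource name. It assumes Workbench replaces underscores with hyphens and appends the project ID when naming the bucket (inferred from the example name above; confirm the real name with wb resource list):

```shell
# Sketch: derive the full bucket URI from the resource ID and project ID.
# ASSUMPTION: the generated bucket name is the resource ID with underscores
# replaced by hyphens, followed by the project ID. Verify with `wb resource list`.
RESOURCE_ID="nf_files"
GOOGLE_CLOUD_PROJECT="my-project"   # set automatically inside a Workbench app
BUCKET="gs://$(echo "$RESOURCE_ID" | tr '_' '-')-${GOOGLE_CLOUD_PROJECT}"
echo "$BUCKET"
# → gs://nf-files-my-project
```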
In addition, add the google.project and four google.batch values indicated below.
'google-batch' {
    // Workflow params
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    // Google Batch config
    process.executor = 'google-batch'
    process.container = 'nextflow/rnaseq-nf:latest'
    // Edit the following line for your bucket resource.
    // Use a unique name for each run to avoid cluttering results. Here, we append /output1.
    workDir = "gs://nf-files-<your-google-project-id>/output1"
    google.region = "$PROJECT_DEFAULT_REGION"
    google.project = "$GOOGLE_CLOUD_PROJECT"
    google.batch.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    google.batch.usePrivateAddress = true
    google.batch.network = 'global/networks/network'
    google.batch.subnetwork = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
}
Note
In a Workbench app, environment variables like $GOOGLE_SERVICE_ACCOUNT_EMAIL, $GOOGLE_CLOUD_PROJECT, and $PROJECT_DEFAULT_REGION will automatically be set.
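You can verify this from the app's terminal. The snippet below prints each variable, falling back to a visible "<unset>" marker when a variable is missing (as it would be outside a Workbench app):

```shell
# Print the Workbench-provided environment variables. Each should show a
# non-empty value inside a Workbench app; ${VAR:-<unset>} prints "<unset>"
# if the variable is not set (e.g., when run outside Workbench).
echo "Project:         ${GOOGLE_CLOUD_PROJECT:-<unset>}"
echo "Region:          ${PROJECT_DEFAULT_REGION:-<unset>}"
echo "Service account: ${GOOGLE_SERVICE_ACCOUNT_EMAIL:-<unset>}"
```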
In a terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the
instructions above), and sanity-check your config changes.
cd ..
wb nextflow config rnaseq-nf/main.nf -profile google-batch
You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.
Run the Nextflow example workflow via wb
After you check your config, you’re ready to run the Nextflow example via wb. wb will substitute
the correct values for the environment variables in the config before it runs.
In the parent directory of rnaseq-nf, run:
wb nextflow run rnaseq-nf/main.nf -profile google-batch # the Google Batch API job executor is declared here
The workflow will take about 40 minutes to complete.
Find task logs and outputs
When Nextflow executes tasks, each task is assigned a unique hash. During execution, you'll see output like:
executor > google-batch (2)
[71/2a7061] RNASEQ:INDEX (transcript) [ 0%] 0 of 1
[c7/621631] RNASEQ:FASTQC (FASTQC on gut) [ 0%] 0 of 1
To find the output of individual tasks, navigate to your bucket's scratch directory and find the corresponding folder. In this example, one task's output would be in the folder ~/workspace/nf_files/output1/71 and another in ~/workspace/nf_files/output1/c7. For example, you can find a multiqc_report.html in the folder of the task that produced it.
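The bracketed field at the start of each status line is the task's short hash, which doubles as the prefix of that task's work directory under workDir. A small sketch (the bucket path and hash below are illustrative placeholders; substitute your own workDir and a hash from your run output):

```shell
# Map a task's short hash (from the Nextflow status line) to its work
# directory prefix under workDir. Values below are placeholders.
WORKDIR="gs://nf-files-my-project/output1"   # your workDir from nextflow.config
TASK_HASH="71/2a7061"                        # short hash printed by Nextflow
echo "${WORKDIR}/${TASK_HASH}"
# → gs://nf-files-my-project/output1/71/2a7061
# The actual directory name extends the second component, so list it with
# a wildcard (requires gsutil and access to the bucket):
# gsutil ls "${WORKDIR}/${TASK_HASH}"*
```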
5. Configure and run a nf-core example
The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so if your setup works for one of them, it should work for all of them. Once your config file is set up, you should be able to test any nf-core pipeline.
Edit the nf-core googlebatch.config configuration file
In the left-hand File navigator, navigate to the ~/repos/nf-core/conf directory. (If you gave the
repo a different name in Step #2, find its folder instead under repos.)
Open the googlebatch.config file and copy and paste the following. The value for google_zone
should be the zone your VM was created in, and the value for workdir_bucket should be your GCS
bucket's gsutil URI.
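If you're unsure of your VM's zone, the GCE metadata server reports it (assuming your app runs on a Compute Engine VM with metadata access). It returns a full resource path, so keep only the trailing component; a sample value stands in for a live response below so the snippet is self-contained:

```shell
# The metadata server returns the zone as a full resource path; keep only
# the last path component. Sample value used for illustration.
ZONE_PATH="projects/123456/zones/us-central1-a"
echo "${ZONE_PATH##*/}"
# → us-central1-a

# On the VM itself you could fetch the live value with:
# curl -s -H "Metadata-Flavor: Google" \
#   "http://metadata.google.internal/computeMetadata/v1/instance/zone"
```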
// Nextflow config file for running on Google Batch API
params {
    config_profile_description = 'Google Cloud Batch API Profile'
    config_profile_contact = 'Hatem Nawar @hnawar'
    config_profile_url = 'https://cloud.google.com/batch'
    // project
    project_id = "$GOOGLE_CLOUD_PROJECT"
    location = "$PROJECT_DEFAULT_REGION"
    google_zone = '<your-zone>'
    // Use a unique name for each run to avoid cluttering results. Here, we append /output1.
    workdir_bucket = "gs://nf-files-<your-google-project-id>/output1"
    // compute
    use_spot = false
    boot_disk = '100 GB'
    workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
    // networking
    use_private_ip = true
    // Custom VPC should be in this format: 'global/networks/[custom_VPC]'
    custom_vpc = 'global/networks/network'
    // Custom subnet should be in this format: 'regions/[GCP_Region]/subnetworks/[custom_subnet]'
    custom_subnet = "regions/$PROJECT_DEFAULT_REGION/subnetworks/subnetwork"
    google_debug = true
    google_preemptible = true
}

workDir = params.workdir_bucket

process {
    executor = 'google-batch'
}

google {
    zone = params.google_zone
    location = params.location
    project = params.project_id
    batch.network = params.custom_vpc
    batch.subnetwork = params.custom_subnet
    batch.usePrivateAddress = params.use_private_ip
    batch.spot = params.use_spot
    batch.serviceAccountEmail = params.workers_service_account
    batch.bootDiskSize = params.boot_disk
    batch.debug = params.google_debug
    batch.preemptible = params.google_preemptible
}
Run a nf-core pipeline via wb
For this example, you'll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.
You can first confirm your config by running the following command in the terminal. This will check
out the given pipeline from its repo and make it available locally if need be. Run this command from
the nf-core (repository checkout) directory.
wb nextflow config nf-core/viralrecon -profile googlebatch # the Google Batch API job executor is declared here
Next, download the test input data, create a sample sheet, and stage the reference genome in your Cloud Storage bucket.
cd ~/repos/nf-core
mkdir datasets
cd datasets
# Download sequencing reads
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/enterovirus/SRR13266665_1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/enterovirus/SRR13266665_2.fastq.gz
# Create the sample sheet
echo -e "sample,fastq_1,fastq_2\nSAMPLE_1,datasets/SRR13266665_1.fastq.gz,datasets/SRR13266665_2.fastq.gz" > samplesheet.csv
# Upload reference genome to gcs
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/genome/NC_002058.3/GCF_000861165.1_ViralProj15288_genomic.fna.gz -O reference.fna.gz
gsutil cp reference.fna.gz gs://nf-files-<your-google-project-id>/
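As a quick sanity check of the sample sheet format (every row should have exactly three comma-separated columns), you can run something like the following; it recreates the file so the snippet is self-contained:

```shell
# Recreate the sample sheet, then verify every row has exactly three columns.
printf 'sample,fastq_1,fastq_2\nSAMPLE_1,datasets/SRR13266665_1.fastq.gz,datasets/SRR13266665_2.fastq.gz\n' > samplesheet.csv
awk -F',' 'NF != 3 { bad = 1 } END { exit bad }' samplesheet.csv && echo "samplesheet OK"
# → samplesheet OK
```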
Then, run the pipeline from the nf-core repo directory in the terminal. The outdir flag value
holds run results, so each time you run the pipeline, use a different outdir path (e.g.,
viralrecon_outdir2).
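One simple way to get a fresh outdir for every run is to append a timestamp to the path (the bucket name below is a placeholder; use your own bucket's URI):

```shell
# Build a unique output path per run by appending a timestamp.
# BUCKET is a placeholder; substitute your bucket's gsutil URI.
BUCKET="gs://nf-files-my-project"
OUTDIR="${BUCKET}/viralrecon_outdir_$(date +%Y%m%d_%H%M%S)"
echo "$OUTDIR"
# e.g. gs://nf-files-my-project/viralrecon_outdir_20250101_120000
```

You can then pass "$OUTDIR" as the --outdir value in the run command below.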
# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile googlebatch \
  --outdir "gs://nf-files-<your-google-project-id>/viralrecon_outdir1" \
  --input "datasets/samplesheet.csv" \
  --platform illumina \
  --project_id="$GOOGLE_CLOUD_PROJECT" \
  --workdir_bucket="gs://nf-files-<your-google-project-id>/work" \
  --fasta "gs://nf-files-<your-google-project-id>/reference.fna.gz"
This run takes approximately 1 hour.
Find nf-core task logs and outputs
You can find the per-task run output in ~/workspace/nf_files/nf-core/output1, organized by the hashes generated during the run. The final output of the workflow will be in viralrecon_outdir1.
Last Modified: 8 December 2025