Get started with Nextflow on Verily Workbench
Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Purpose: This document provides detailed instructions for configuring and running Nextflow pipelines in Verily Workbench.
Introduction
Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.
To get set up, you will:
- create a Workbench workspace
- create resources in the workspace
- create a cloud environment in the workspace on which to run Nextflow
The following sections walk you through that setup, then show how wb makes it easy to configure and run a Nextflow pipeline. This tutorial has two examples: one runs an example pipeline from the nextflow-io GitHub org, and the other runs an nf-core pipeline. The nf-core project provides a community-generated, curated collection of analysis pipelines built using Nextflow.
Both examples use the Google Cloud Life Sciences API as the Nextflow pipeline process.executor. This allows the pipelines to run scalably, with Nextflow processes executed on separate cloud virtual machines.
Note
On Verily Workbench, the easiest way to get started using Nextflow is via a cloud environment, where Nextflow and the Workbench CLI are already installed for you. This tutorial walks through that process. If you like, you can also use a local installation of the Workbench CLI.
1. Create a workspace
If you don’t already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or the web UI.
To create a workspace via the web UI, see the instructions here.
First, check wb status. If you are not logged in, log in first. Then either set the existing workspace that you want to work in, or create a new one.
wb status
wb auth login # if need be
wb workspace list
To create a new workspace:
wb workspace create --name=<workspace-name>
To set the Workbench CLI to use an existing workspace:
wb workspace set --id=<workspaceid>
2. Create workspace resources: GitHub repos and a Cloud Storage bucket
If you haven’t already, you’ll need to create a Cloud Storage bucket resource, which will be used by Nextflow for staging and logging.
We’ll also create a Git resource that points to the Nextflow example repo. Any notebook instances that you subsequently create in your workspace will automatically clone that repo for you.
To create a Cloud Storage bucket resource via the web UI, see the instructions here. Note the name of this resource, which you’ll need below. For example, name it nf_files.
Then, create referenced resources for the example Git repositories, as described here. The repository URLs to use are:
- https://github.com/nextflow-io/rnaseq-nf.git. Give it the name: rnaseq-nf-repo
- https://github.com/nf-core/configs.git. Give it the name: nf-core-configs
These repositories are public, so for this example, you don’t need to set up the Workbench SSH keys.
If you do not already have a bucket resource that you want to use, you can create one as follows. The name of this resource will be nf_files.
wb resource create gcs-bucket --id=nf_files \
--description="Bucket for Nextflow run logs and output."
Then, create referenced resources to the Git repositories we’ll use for these examples:
wb resource add-ref git-repo --id=rnaseq-nf-repo --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
wb resource add-ref git-repo --id=nf-core-configs --repo-url=https://github.com/nf-core/configs.git
You can list your newly created resources:
wb resource list
3. Create a cloud environment to run Nextflow
Next, create a Workbench notebook environment on which to run the Nextflow examples.
Your cloud environment will have Nextflow pre-installed, and any workspace Git repo resources, such as the ones we just defined, will be automatically cloned. However, if you want to run the example on your local machine, you can install Nextflow yourself.
To create a cloud environment via the web UI, see the instructions here.
Once your environment is running, you can click on the link next to it to bring up JupyterLab on the new notebook instance. To reduce costs, you can STOP the instance from its ‘three-dot’ menu when you’re not using it, and restart it later.
Create a new cloud environment:
wb resource create gcp-notebook --id=<notebook_resource_id> \
--description=<description>
After your notebook resource is created, you can see its details via:
wb resource describe --id <notebook_resource_id>
Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new cloud environment.
Tip: The info in the resource description also indicates whether your cloud environment is ACTIVE or STOPPED. You can stop your cloud environment when you’re not using it, and then restart it, via:
wb notebook stop --id <notebook_resource_id>
and
wb notebook start --id <notebook_resource_id>
Using Workbench environment variables in Nextflow config files
Workbench supports running Nextflow via a ‘passthrough’ command (e.g., wb nextflow ...). When you use this construct, you can add Workbench-specific environment variables to Nextflow configuration files. For example, you can use the $WORKBENCH_<bucket_resource_name> construct, and it will be expanded to gs://<underlying_GCS_bucket>. In addition, in a Workbench cloud environment, variables like $GOOGLE_SERVICE_ACCOUNT_EMAIL and $GOOGLE_CLOUD_PROJECT will be set.
In the examples below, we’ll leverage this capability when we create the Nextflow config files.
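As a rough illustration of that expansion, the following shell sketch mimics what happens to a workDir value. The bucket name example-underlying-bucket is made up, and the variable is set by hand purely for illustration; in practice wb supplies the value, and this sketch makes no claim about how wb implements the substitution.

```shell
# Hypothetical value: in a real workspace, wb supplies this for a bucket
# resource named nf_files. The bucket name below is invented for illustration.
WORKBENCH_nf_files="gs://example-underlying-bucket"

# A config line such as: workDir = "$WORKBENCH_nf_files/nf"
# would then resolve to the path built here.
work_dir="${WORKBENCH_nf_files}/nf"
echo "workDir resolves to: ${work_dir}"
```

The resource name becomes part of the variable name, so a bucket resource called nf_files is referenced as $WORKBENCH_nf_files.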
Configure and run the example ‘rnaseq-nf’ Nextflow pipeline
Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new cloud environment, and you should see its directory, rnaseq-nf, at the top level of your file system. (If it is not there, you can run: wb git clone --resource=rnaseq-nf-repo)
In the JupyterLab Terminal window, change to the rnaseq-nf directory and check out the tagged v2.1 version.
cd rnaseq-nf
git checkout v2.1
Edit the nextflow configuration file
Then, still in the rnaseq-nf directory, edit the nextflow.config file as follows. Replace the gls entry with the following snippet. Edit the workDir line to replace <your_bucket_resource_name> with the name of the Cloud Storage bucket resource you created.
As you’ll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct values for the environment variables in the config before it runs.
gls {
  params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
  params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
  params.multiqc = 'gs://rnaseq-nf/multiqc'
  process.executor = 'google-lifesciences'
  process.container = 'nextflow/rnaseq-nf:latest'
  // edit the following line for your bucket resource
  workDir = "$WORKBENCH_<your_bucket_resource_name>/nf"
  google.location = 'us-central1'
  google.region = 'us-central1'
  google.project = "$GOOGLE_CLOUD_PROJECT"
  google.lifeSciences.usePrivateAddress = true
  google.lifeSciences.network = 'network'
  google.lifeSciences.subnetwork = 'subnetwork'
  google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
}
If you’ve forgotten the name of the bucket resource you created, you can find it via:
wb resource list # this shows the resource names
or in the “Resources” tab of your workspace.
Then, in a Terminal window, change to the parent directory of rnaseq-nf (~/repos if you followed the instructions above), and sanity-check your config changes.
cd ..
wb nextflow config rnaseq-nf/main.nf -profile gls
You should see output that shows instantiated values for your workspace project, Cloud Storage bucket, and service account email.
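A complementary check, sketched here under assumptions, is to confirm that no template placeholder was left unedited in the config file. The function name has_placeholder is hypothetical and is not part of wb or Nextflow; it simply searches a file for the literal text <your_bucket. The demo runs against a throwaway file so it does not touch your checkout.

```shell
# has_placeholder is a hypothetical helper, not a wb or Nextflow command.
# It succeeds (exit 0) if the given file still contains an unedited
# "<your_bucket..." placeholder from this tutorial.
has_placeholder() {
  grep -q '<your_bucket' "$1"
}

# Demo on a throwaway file so the sketch runs anywhere.
demo_config="$(mktemp)"
printf 'workDir = "$WORKBENCH_<your_bucket_resource_name>/nf"\n' > "$demo_config"

if has_placeholder "$demo_config"; then
  placeholder_status="placeholder still present"
else
  placeholder_status="no placeholder found"
fi
echo "$placeholder_status"
rm -f "$demo_config"
```

In this tutorial's layout, the equivalent call from ~/repos would be has_placeholder rnaseq-nf/nextflow.config.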
Run the Nextflow example workflow via wb
After you check your config, you’re ready to run the Nextflow example.
In the parent directory of rnaseq-nf, run:
wb nextflow run rnaseq-nf/main.nf -profile gls
The workflow will take about 10 minutes to complete.
Configure and run an nf-core example
The nf-core project provides a curated set of analysis pipelines built using Nextflow. The nf-core pipelines adhere to strict guidelines, so if one works for you, any of them should. Once your config file is set up, you should be able to test any nf-core pipeline.
Edit the nf-core google.config configuration file
In the left-hand File navigator, navigate to the ~/repos/nf-core-configs/conf directory. Find the google.config file in the listing and double-click on it to edit it. (If you gave the repo a different name in Step 2, find its folder under repos instead.)
Edit the google.config file to be the following. Edit the google_bucket line to replace <your_bucket_name> with the name of the Cloud Storage bucket resource you will be using. For example, if you used the suggested bucket name in Step 2, <your_bucket_name> would be replaced with nf_files.
As you’ll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct values for the environment variables in the config before it runs.
// Nextflow config file for running on Google Cloud Life Sciences
// Edit the 'google_bucket' param before using.
params {
  config_profile_description = 'Google Cloud Life Sciences Profile'
  config_profile_contact = 'Evan Floden, Seqera Labs (@evanfloden)'
  config_profile_url = 'https://cloud.google.com/life-sciences'
  google_zone = 'us-central1-c'
  google_bucket = "$WORKBENCH_<your_bucket_name>/nf-core"
  google_debug = true
  google_preemptible = true
  boot_disk = '100 GB'
  workers_service_account = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
  project_id = "$GOOGLE_CLOUD_PROJECT"
}
process.executor = 'google-lifesciences'
google.zone = params.google_zone
google.project = params.project_id
google.lifeSciences.serviceAccountEmail = params.workers_service_account
google.lifeSciences.usePrivateAddress = true
google.lifeSciences.debug = params.google_debug
workDir = params.google_bucket
google.lifeSciences.preemptible = params.google_preemptible
google.lifeSciences.network = 'network'
google.lifeSciences.subnetwork = 'subnetwork'
if (google.lifeSciences.preemptible) {
  process.errorStrategy = { task.exitStatus in [8,10,14] ? 'retry' : 'terminate' }
  process.maxRetries = 5
}
process.machineType = { task.memory > task.cpus * 6.GB ? ['custom', task.cpus, task.cpus * 6656].join('-') : null }
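The final process.machineType line builds a custom machine type name when a task requests more than 6 GB of memory per CPU. As a worked example of that string construction (a shell sketch, not part of the config), a hypothetical task with 4 CPUs would get the name computed below. The variable names are illustrative only, and treating 6656 as megabytes follows the Compute Engine custom machine type naming rather than anything stated in the config itself.

```shell
# Hypothetical task size, chosen only for illustration.
cpus=4

# Mirrors the config expression ['custom', task.cpus, task.cpus * 6656].join('-')
mem_mb=$((cpus * 6656))
machine_type="custom-${cpus}-${mem_mb}"
echo "$machine_type"   # prints custom-4-26624
```

When the memory request is at or below 6 GB per CPU, the config expression returns null instead of a custom name.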
Run an nf-core pipeline with a test profile via wb
When you choose an nf-core pipeline to run, the pipeline definition will automatically be fetched (and stored under ~/.nextflow/assets/nf-core). For every pipeline, the test profile can be used in conjunction with the google profile (or any other config) to run the pipeline with some test data.
For this example, you’ll run the viralrecon pipeline, which does assembly and intrahost/low-frequency variant calling for viral samples.
You can first confirm your config by running the following command in the Terminal. This will check out the given pipeline from its repo and make it available locally if need be. Run this command from the nf-core-configs (repository checkout) directory.
cd ~/repos/nf-core-configs
wb nextflow config nf-core/viralrecon -profile test,google
Then, run the pipeline, still in the nf-core-configs repo checkout directory in the Terminal. Before you run the following command, edit it to replace <your_bucket_name> with your bucket resource name in the --outdir param. If you used the suggested bucket name in Step 2, <your_bucket_name> would be replaced with nf_files. The outdir holds run results, so each time you run the pipeline, use a different outdir path.
# Edit the 'outdir' bucket name first
wb nextflow run nf-core/viralrecon -profile test,google --outdir '$WORKBENCH_<your_bucket_name>'/viralrecon_outdir1
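One way to get a fresh outdir on every run is to build the path from a timestamp. The sketch below is a convenience built on assumptions, not a Workbench feature: it assumes the bucket resource is named nf_files, and the viralrecon_ prefix and the run_id variable are arbitrary choices. The command is only echoed, so nothing is launched.

```shell
# Assumes a bucket resource named nf_files; adjust to your resource name.
# Single quotes keep the shell from expanding the variable, leaving it for wb.
bucket_var='$WORKBENCH_nf_files'

# Timestamp-based suffix so each run writes to a different path.
run_id="$(date +%Y%m%d-%H%M%S)"
outdir="${bucket_var}/viralrecon_${run_id}"

# Echo the command for review rather than running it.
echo "wb nextflow run nf-core/viralrecon -profile test,google --outdir '${outdir}'"
```

Two runs started within the same second would get the same path, so add another distinguishing suffix if you launch runs in quick succession.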
Summary
This tutorial showed two examples of running Nextflow pipelines on Workbench. One was an example from https://github.com/nextflow-io/, and the other showed how to set up and use an nf-core config file for any nf-core pipeline.
Workbench makes it easy to set up config files that need minimal editing to work in any Workbench cloud environment, and to run pipeline tasks scalably in the cloud.
Last Modified: 4 October 2024