Getting started with Nextflow on Verily Workbench

How to use Nextflow with the Workbench CLI.

Introduction

Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.

On Verily Workbench, the easiest way to get started with Nextflow is in a cloud environment, where Nextflow and the Workbench CLI are already installed for you. However, you can also use a local installation of the Workbench CLI.

To get set up, first create a Workbench workspace if you don’t already have one, then create some resources associated with that workspace. Then (optional but recommended), spin up a cloud environment on which to run Nextflow.

The following sections walk you through that setup, then show how to use the Workbench CLI (wb) to configure and run a Nextflow pipeline.

Create a workspace

If you don’t already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or the web UI.

Using the Web UI

To create a workspace via the web UI, see the instructions here.

Using the CLI

First, run wb status. If you are not logged in, log in. Then list your workspaces and either create a new workspace or set an existing one as the workspace to work in.

wb status
wb auth login  # if need be
wb workspace list
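If you run the commands above from a script rather than interactively, it can help to confirm first that the CLI is on your PATH. The following is a minimal sketch; require_cmd is a hypothetical helper name used for illustration and is not part of the Workbench CLI.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH" >&2
    return 1
  fi
}

# Example use (commented out so the snippet has no side effects):
# require_cmd wb && wb status
```

In a Workbench cloud environment the check should always pass, since the CLI is pre-installed there; it is mainly useful on a local machine.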

To create a new workspace:

wb workspace create --name=<workspace-name>

To set the Workbench CLI to use an existing workspace:

wb workspace set --id=<workspaceid>

Create some workspace resources

If you haven’t already, you’ll need to create a GCS bucket resource, which will be used by Nextflow for staging and logging.

We’ll also create a Git resource that points to the nextflow example repo. Any notebook instances that you subsequently create in your workspace will automatically clone that repo for you.

Using the Web UI

To create a GCS bucket resource via the web UI, see the instructions here. Note the name of this resource, which you’ll need below.

Then, create a referenced resource to the example Git repository, as described here. The repo URL to use is: https://github.com/nextflow-io/rnaseq-nf.git. Give it the name: rnaseq-nf-repo. (This repo is public, so for this example, you don’t need to set up the Workbench SSH keys.)

Using the CLI

If you do not already have a bucket resource that you want to use, you can create one as follows. The name of this resource is ws_files.

wb resource create gcs-bucket --id=ws_files --bucket-name=${GOOGLE_CLOUD_PROJECT}-ws-files \
  --description="Bucket for reports and provenance records."
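The bucket name in the command above is built from the GOOGLE_CLOUD_PROJECT environment variable. If that variable is empty (for example, on a local machine where it has not been set), the name would collapse to "-ws-files". The sketch below guards against that; ws_bucket_name is a hypothetical helper name, not a Workbench command.

```shell
# Hypothetical helper: derive the bucket name used above from a project ID.
# Fails instead of producing a name like "-ws-files" when the ID is empty.
ws_bucket_name() {
  project="${1:-}"
  if [ -z "$project" ]; then
    echo "error: project ID is empty; is GOOGLE_CLOUD_PROJECT set?" >&2
    return 1
  fi
  printf '%s-ws-files\n' "$project"
}

# Example use (commented out; requires the Workbench CLI):
# name="$(ws_bucket_name "${GOOGLE_CLOUD_PROJECT:-}")" &&
#   wb resource create gcs-bucket --id=ws_files --bucket-name="$name"
```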

Then, create a referenced resource to the Git repository we’ll use for this example:

wb resource add-ref git-repo --id=rnaseq-nf-repo --repo-url=https://github.com/nextflow-io/rnaseq-nf.git

You can list your newly created resources:

wb resource list

Create a cloud environment to run Nextflow (recommended)

Next, create a Workbench cloud environment on which to run the Nextflow example.

Your cloud environment will have Nextflow pre-installed, and any workspace Git repo resources, such as the one we just defined, will be automatically cloned. However, if you want to run the example on your local machine, you can install Nextflow yourself.

Using the Web UI

To create a cloud environment via the web UI, see the instructions here.

Once your environment is running, you can click on the link next to it to bring up JupyterLab on the new notebook instance. To reduce costs, you can STOP the instance from its “three-dot” menu when you’re not using it, and restart it later.

Using the CLI

Create a new cloud environment:

wb resource create gcp-notebook --id=<notebook_resource_id> \
  --description=<description>

After your notebook resource is created, you can see its details via:

wb resource describe --id <notebook_resource_id>

Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new cloud environment.

Tip: The resource description also indicates whether your cloud environment is ACTIVE or STOPPED. You can stop your cloud environment when you’re not using it, and restart it later, via:

wb notebook stop --id <notebook_resource_id>
wb notebook start --id <notebook_resource_id>

Configure and run the example Nextflow pipeline

Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new cloud environment, and you should see its directory, rnaseq-nf, at the top level of your file system. If it is not there, you can run:

wb git clone --resource=rnaseq-nf-repo

In the JupyterLab Terminal window, change to the rnaseq-nf directory and check out the tagged v2.1 version.

cd rnaseq-nf
git checkout v2.1

Then, still in the rnaseq-nf directory, edit the nextflow.config file as follows. Replace the gls entry with the following snippet. Edit the workDir line to replace <your_bucket_resource_name> with the name of the GCS bucket resource you created.

As you’ll see below, you will run the Nextflow pipeline via wb, and wb will substitute the correct value for the environment variables in the config before it runs.

  gls {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-lifesciences'
    process.container = 'nextflow/rnaseq-nf:latest'
    // edit the following line for your bucket resource
    workDir = "$TERRA_<your_bucket_resource_name>/nf"
    google.location = 'us-central1'
    google.region = 'us-central1'
    google.project = "$GOOGLE_CLOUD_PROJECT"
    google.lifeSciences.network = 'network'
    google.lifeSciences.subnetwork = 'subnetwork'
    google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
  }

If you’ve forgotten the name of the bucket resource you created, you can find it via:

wb resource list  # this shows the resource names
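Once you know the resource name, the workDir line follows the pattern shown in the config snippet: the name prefixed with TERRA_. The sketch below prints the edited line for a given name. It assumes that prefix pattern holds, and workdir_line is a hypothetical helper name. Because this guide does not say how names containing characters such as hyphens map to variable names, the helper rejects them rather than guessing.

```shell
# Hypothetical helper: print the workDir line for a bucket resource name.
# Assumption: the variable is the resource name prefixed with TERRA_,
# following the pattern in the config snippet above.
workdir_line() {
  resource="${1:-}"
  case "$resource" in
    ''|*[!A-Za-z0-9_]*)
      # Empty, or contains characters not valid in a variable name.
      echo "error: unsupported resource name: '$resource'" >&2
      return 1
      ;;
  esac
  # Single quotes keep $TERRA_ literal; only %s is substituted.
  printf 'workDir = "$TERRA_%s/nf"\n' "$resource"
}
```

For the ws_files bucket created earlier, calling workdir_line ws_files prints the line to paste into nextflow.config.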

Then, change to the parent directory of rnaseq-nf and sanity-check your config changes.

cd ..
wb nextflow config rnaseq-nf/main.nf -profile gls

You should see output that shows instantiated values for your workspace project, GCS bucket, and service account email.
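Rather than reading the output by eye, you can scan it for values that were not substituted. This is a minimal sketch under the assumption that a fully resolved config no longer contains any $TERRA_ or $GOOGLE_ references or the <your_bucket_resource_name> placeholder; config_is_resolved is a hypothetical helper name.

```shell
# Hypothetical helper: read config text on stdin and fail if any
# $TERRA_ / $GOOGLE_ reference or the bucket placeholder is still present.
config_is_resolved() {
  # grep exits 0 when it finds a match, i.e. when something is unresolved.
  if grep -n -E '\$(TERRA|GOOGLE)_|<your_bucket_resource_name>' >&2; then
    echo "error: unresolved values found (lines shown above)" >&2
    return 1
  fi
}

# Example use (commented out; requires the Workbench CLI):
# wb nextflow config rnaseq-nf/main.nf -profile gls | config_is_resolved
```

A non-zero exit status suggests that the workDir line was not edited or that an expected environment variable was not available when the config was evaluated.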

Run the Nextflow example workflow via `wb`

After you check your config, you’re ready to run the Nextflow example. In the parent directory of rnaseq-nf, run:

wb nextflow run rnaseq-nf/main.nf -profile gls

The workflow will take about 10 minutes to complete.

Last Modified: 12 May 2024
