Getting started with Nextflow on Verily Workbench
Introduction
Nextflow is a framework for creating data-driven computational pipelines. It allows you to create scalable and reproducible scientific workflows using software containers.
On Verily Workbench, the easiest way to get started with Nextflow is via a cloud environment, where Nextflow and the Workbench CLI are already installed for you. However, you can also use a local installation of the Workbench CLI.
To get set up, you will first need to create a Workbench workspace (if you don’t already have one), and then create some resources associated with that workspace. Then (optional but recommended), spin up a cloud environment on which to run Nextflow.
The following sections walk you through that setup, then show how terra makes it easy to configure and run a Nextflow pipeline.
Create a workspace
If you don’t already have a Workbench workspace that you want to use, you can create one via either the Workbench CLI or the web UI.
To create a workspace via the web UI, see the instructions here.
To use the Workbench CLI instead, first run terra status. If you are not logged in, log in. Then either set the workspace that you want to work in, or create a new one.
terra status
terra auth login # if need be
terra workspace list
To create a new workspace:
terra workspace create --name=<workspace-name>
To set the Workbench CLI to use an existing workspace:
terra workspace set --id=<workspaceid>
Create some workspace resources
If you haven’t already, you’ll need to create a GCS bucket resource, which will be used by Nextflow for staging and logging.
We’ll also create a Git resource that points to the Nextflow example repo. Any notebook instances that you subsequently create in your workspace will automatically clone that repo for you.
To create a GCS bucket resource via the web UI, see the instructions here. Note the name of this resource, which you’ll need below.
Then, create a referenced resource to the example Git repository, as described here.
The repo URL to use is: https://github.com/nextflow-io/rnaseq-nf.git. Give it the name rnaseq-nf-repo.
(This repo is public, so for this example, you don’t need to set up the Workbench SSH keys.)
If you do not already have a bucket resource that you want to use, you can create one as follows. The name of this resource is ws_files.
terra resource create gcs-bucket --name=ws_files --bucket-name=${GOOGLE_CLOUD_PROJECT}-ws-files \
--description="Bucket for reports and provenance records."
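The command above derives the bucket name from GOOGLE_CLOUD_PROJECT, so an unset variable or an unusual project ID produces a name that Cloud Storage rejects. The following is a minimal sketch of a pre-flight check. The helper names ws_bucket_name and is_valid_bucket_name are hypothetical and not part of the Workbench CLI, and the check covers only the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit).

```shell
# Hypothetical helper: build the bucket name used above from a project ID.
ws_bucket_name() {
  local project="$1"
  if [ -z "$project" ]; then
    echo "project ID is empty; is GOOGLE_CLOUD_PROJECT set?" >&2
    return 1
  fi
  printf '%s-ws-files\n' "$project"
}

# Hypothetical helper: check a candidate name against the basic
# Cloud Storage bucket naming rules (length and allowed characters).
is_valid_bucket_name() {
  local name="$1"
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] &&
    printf '%s' "$name" | grep -Eq '^[a-z0-9][a-z0-9._-]*[a-z0-9]$'
}
```

For example, is_valid_bucket_name "$(ws_bucket_name "$GOOGLE_CLOUD_PROJECT")" succeeds when the derived name passes these basic checks, and fails for a project ID that contains characters such as a colon.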
Then, create a referenced resource to the Git repository we’ll use for this example:
terra resource add-ref git-repo --name=rnaseq-nf-repo --repo-url=https://github.com/nextflow-io/rnaseq-nf.git
You can list your newly created resources:
terra resource list
Create a cloud environment to run Nextflow (recommended)
Next, create a Workbench cloud environment on which to run the Nextflow example. Your cloud environment will have Nextflow pre-installed, and any workspace Git repo resources, such as the one we just defined, will be automatically cloned. However, if you want to run the example on your local machine, you can install Nextflow yourself.
To create a cloud environment via the web UI, see the instructions here.
Once your environment is running, you can click on the link next to it to bring up JupyterLab on the new notebook instance. To reduce costs, you can STOP the instance from its three-dot menu when you’re not using it, and restart it later.
Create a new cloud environment:
terra resource create gcp-notebook --name=<notebook_resource_name> \
--description=<description>
After your notebook resource is created, you can see its details via:
terra resource describe --name <notebook_resource_name>
Included in that description is a Proxy URL. Visit that URL in your browser (logged in with your Workbench user account) to bring up JupyterLab on your new cloud environment.
Tip: The info in the resource description also indicates whether your cloud environment is ACTIVE or STOPPED. You can stop your cloud environment when you’re not using it, and then restart it, via:
terra notebook stop --name <notebook_resource_name>
and
terra notebook start --name <notebook_resource_name>
Configure and run the example Nextflow pipeline
Because you created a Git repo resource, the rnaseq-nf example should be automatically cloned into your new cloud environment, and you should see its directory, rnaseq-nf, at the top level of your file system. If it is not there, you can run:
terra git clone --resource=rnaseq-nf-repo
In the JupyterLab Terminal window, change to the rnaseq-nf directory and check out the tagged v2.1 version.
cd rnaseq-nf
git checkout v2.1
Then, still in the rnaseq-nf directory, edit the nextflow.config file as follows. Replace the gls entry with the following snippet. Edit the workDir line to replace <your_bucket_resource_name> with the name of the GCS bucket resource you created.
As you’ll see below, you will run the Nextflow pipeline via terra, and terra will substitute the correct values for the environment variables in the config before it runs.
gls {
  params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
  params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
  params.multiqc = 'gs://rnaseq-nf/multiqc'
  process.executor = 'google-lifesciences'
  process.container = 'nextflow/rnaseq-nf:latest'
  // edit the following line for your bucket resource
  workDir = "$TERRA_<your_bucket_resource_name>/nf"
  google.location = 'us-central1'
  google.region = 'us-central1'
  google.project = "$GOOGLE_CLOUD_PROJECT"
  google.lifeSciences.network = 'network'
  google.lifeSciences.subnetwork = 'subnetwork'
  google.lifeSciences.serviceAccountEmail = "$GOOGLE_SERVICE_ACCOUNT_EMAIL"
}
If you’ve forgotten the name of the bucket resource you created, you can find it via:
terra resource list # this shows the resource names
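To make the workDir edit concrete, here is a minimal sketch that prints the line you would place in the gls entry for a given bucket resource name. It assumes the resource is named ws_files, as in the bucket example earlier in this guide; the helper name workdir_line is hypothetical and not part of the Workbench CLI.

```shell
# Hypothetical helper: print the workDir config line for a bucket resource name.
# The single-quoted format string keeps $TERRA_... literal, so the variable
# is left for terra to resolve when the pipeline is run.
workdir_line() {
  local resource_name="$1"
  printf 'workDir = "$TERRA_%s/nf"\n' "$resource_name"
}

workdir_line ws_files
```

With a resource named ws_files, the edited line in nextflow.config would therefore reference the variable TERRA_ws_files.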
Then, change to the parent directory of rnaseq-nf and sanity-check your config changes.
cd ..
terra nextflow config rnaseq-nf/main.nf -profile gls
You should see output that shows instantiated values for your workspace project, GCS bucket, and service account email.
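If you prefer to script this check rather than read the output by eye, the following is a rough sketch. It assumes that the resolved configuration is printed with a line of the form workDir = 'gs://...'; the exact output format may differ, so treat the pattern as an assumption to adjust. The helper name workdir_is_gcs is hypothetical.

```shell
# Hypothetical helper: read config text on stdin and succeed only if a
# workDir line points at a gs:// path (i.e. the bucket variable was resolved).
workdir_is_gcs() {
  grep -Eq "^[[:space:]]*workDir[[:space:]]*=[[:space:]]*['\"]gs://"
}

# Example usage (not run here, since it needs the Workbench CLI):
#   terra nextflow config rnaseq-nf/main.nf -profile gls | workdir_is_gcs \
#     && echo "workDir resolved to a GCS path"
```

A non-zero exit status from the helper suggests that the workDir line still needs editing, or that the output format differs from what this sketch assumes.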
Run the Nextflow example workflow via terra
After you check your config, you’re ready to run the Nextflow example.
In the parent directory of rnaseq-nf, run:
terra nextflow run rnaseq-nf/main.nf -profile gls
The workflow will take about 10 minutes to complete.
Last Modified: 16 November 2023