Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Verily Workbench enables you to run workflows at scale using tools that are popular across the life sciences community. In this document, you'll learn how to choose the right workflow engine for your work, and where you can find and run existing workflows. It also provides architectural insight into how tools such as Cromwell, dsub, and Nextflow execute within the Workbench environment, so that you can run workflows on your data at scale.
Running Cromwell workflows
Verily Workbench provides built-in support for running and monitoring WDL-based workflows via the Cromwell workflow engine. Right within the UI, you can add workflows, run them with a set of inputs, and monitor their execution. Visit Using the Cromwell engine to run WDL workflows on Workbench for details.
However, you can use other workflow engines on Workbench as well. This article discusses when you might want to use a particular framework.
Choosing a workflow engine
Though Verily Workbench provides first-class support for running WDL-based workflows on Cromwell, you are not locked into this technology. With multiple workflow engines available, how do you choose which one to use? Perhaps you want to make use of Verily Workbench's built-in support for Cromwell; in other cases the decision may be made for you by:
- Standardized engine(s) selected by your organization or team
- Existing workflows for the tools you want to use already being written for a particular engine
There are a few important things to know about choosing an engine for your work.
Single-stage workflows (tasks)
- Are you only looking to run single-stage “tasks” at scale?
- Are you comfortable with running shell commands, but less comfortable learning new domain-specific languages, such as WDL or Nextflow scripting?
In either of the above cases, dsub may be the right choice for you. dsub is a task runner rather than a full workflow engine: it has no built-in capabilities for sequencing multiple workflow steps. Where dsub excels is in its simplicity and ease of use.
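As a sketch of that simplicity, a single-stage dsub task can be submitted with one command. The project, bucket, and image names below are placeholders, and the flags shown reflect common dsub usage rather than a Workbench-specific recipe:

```
# Hypothetical single-task submission; replace project/bucket/image values.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --regions us-central1 \
  --logging gs://my-bucket/logs/ \
  --input INPUT_FILE=gs://my-bucket/input.bam \
  --output OUTPUT_FILE=gs://my-bucket/output/flagstat.txt \
  --image quay.io/biocontainers/samtools:1.17--h00cdaf9_0 \
  --command 'samtools flagstat "${INPUT_FILE}" > "${OUTPUT_FILE}"'
```

The command packages a shell one-liner, an input, and an output; dsub handles provisioning the VM, localizing the input, and delocalizing the output.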
Multi-stage or multi-environment workflows
If you’re looking to run multi-stage workflows, where each stage runs on a separate VM, and want to take advantage of more sophisticated built-in capabilities (such as automatic horizontal scaling or stage result caching), then Cromwell/WDL and Nextflow may be better choices.
If you are looking to write workflows that you can run in multiple environments, such as Workbench and your institutional HPC cluster, then Cromwell/WDL and Nextflow may again be better choices; both Cromwell and the Nextflow engine support many different backends and executors.
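To illustrate that portability, Nextflow selects its executor via configuration profiles, so the same pipeline script can target different environments. The profile names below are assumptions; actual names depend on the pipeline's configuration:

```
# Run the same (hypothetical) pipeline against different backends:
nextflow run main.nf -profile standard   # local executor
nextflow run main.nf -profile google     # Google Cloud executor
```

Cromwell selects its backend similarly, but in its configuration file rather than on the command line.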
Note that Verily Workbench currently only supports Cromwell through the UI. However, all workflow engines (including Cromwell) can be executed on VMs within a workspace, as described below.
You have flexibility
It’s worth noting that using any one of these workflow engines on Workbench isn’t a deep commitment. These frameworks use the Life Sciences API on Google Cloud to execute individual tasks, which means that each is built around:
- Packaged code in Docker images
- Input files localized from Google Cloud Storage
- Output files delocalized to Google Cloud Storage
Moving a workflow (or a subset of workflow tasks) to a different workflow engine is therefore largely a matter of rewriting the orchestration, rather than rewriting each individual task.
Where to find existing workflows
Cromwell supports the orchestration of complex workflows written in the Workflow Definition Language (WDL). WDL is used for workflows across the community and many existing WDL workflows can be found in Dockstore.
Nextflow supports the orchestration of complex workflows written in the Nextflow scripting language. Nextflow is used for workflows across the community and many existing Nextflow workflows can be found in Dockstore.
How workflows execute on Workbench
Executing workflows on Workbench involves bringing together a workflow definition and supporting code, along with configuration and data for processing as represented in this overview image:
This section describes each of these components in more detail.
Orchestration and task execution
The image below highlights key elements of a single workflow job executing on Workbench:
- Workflow graph
- Workflow orchestration
- Task execution
The left-hand side shows an example of a moderately advanced workflow, which includes multiple tasks that must be executed in a particular order (with some tasks executing in parallel). Such workflows can be orchestrated by sophisticated workflow engines such as Cromwell or Nextflow. For less complex (especially single-task) workflows, you can write a script that uses dsub to launch those tasks.
To scale execution on Workbench, orchestration and task execution occur on different machines. The Cloud Life Sciences API is used by dsub, Cromwell, and Nextflow to schedule and monitor task-specific VMs.
Each of the workflow engines documented here uses the Google Cloud Life Sciences API to run task VMs in your Workbench workspace. There are two key credentials to be aware of:
- Privileges to call the Life Sciences API
- Credentials used on each Task VM
When you submit a workflow from your cloud environment, you call the Life Sciences API with your pet service account credentials. Each Task VM also executes with pet service account credentials.
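If you want to confirm which identity is active on your cloud environment before submitting a workflow, the standard gcloud commands show the account and project in use (on Workbench, the account is typically your pet service account):

```
# Show the active credentials and project for this environment
gcloud auth list
gcloud config get-value project
```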
Workflow client and server software
dsub, Cromwell, and Nextflow each provide client software to submit and monitor workflows.
Cromwell and Nextflow provide server software that you can run on your Workbench cloud environment to scale up execution and management of large numbers of workflows.
The software for dsub, Cromwell, and Nextflow is pre-installed on Workbench cloud environments.
Whether it’s a community-published workflow that you discover in Dockstore, a workflow shared within your organization, or your own private workflow, permanent storage of your workflow code will typically be in source control. To run the workflow, you’ll first need to copy the workflow code from its storage location to your Workbench cloud environment such that it is available to the workflow client.
Workbench supports integrated access to GitHub. Adding a GitHub repo to your workspace will automatically sync that repo to any cloud environment that you create in that workspace. This makes it easy to launch workflows in that workspace, update workflow code, or commit input parameters and logs to source control.
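Whether you rely on the automatic repo sync or fetch code yourself, launching a workflow follows the same shape: clone (or pull) the code, then invoke the engine's client against it. The repository URL and parameter file below are hypothetical:

```
# Fetch workflow code and launch it with the Nextflow client
git clone https://github.com/my-org/my-workflow.git
cd my-workflow
nextflow run main.nf -params-file params.yaml
```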
Code for each workflow task is typically embedded in the workflow description, and those commands run inside a Docker container. The Docker images for your workflows can come from any Docker registry accessible to the Task VM, including Docker Hub, Quay.io, GitHub, Google Container Registry, or Artifact Registry.
Workbench supports authenticated access to Google Container Registry and Artifact Registry using your pet service account credentials.
When you submit workflows from your Workbench cloud environment, you’ll need to provide a list of parameters for each workflow, including input paths and output paths (discussed further below).
Each workflow engine includes a way to provide lists of inputs. For example, you can submit a batch of genomic samples to be processed. These lists of input parameters will be stored on your Workbench cloud environment.
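As one concrete example, dsub accepts a tab-separated task file via its `--tasks` flag: the header row names the parameters, and each subsequent row launches one task. The bucket paths and image below are placeholders:

```shell
# Build a dsub task file: header names the parameters,
# each following tab-separated row is one task.
printf '%s\t%s\n' \
  '--input INPUT_VCF' '--output OUTPUT_STATS' > tasks.tsv
printf '%s\t%s\n' \
  'gs://my-bucket/sample1.vcf' 'gs://my-bucket/out/sample1.stats' >> tasks.tsv
printf '%s\t%s\n' \
  'gs://my-bucket/sample2.vcf' 'gs://my-bucket/out/sample2.stats' >> tasks.tsv

# One submission then runs one task VM per row (placeholder values):
# dsub --provider google-cls-v2 --project my-project --regions us-central1 \
#   --logging gs://my-bucket/logs/ --image ubuntu:22.04 \
#   --tasks tasks.tsv --command 'bcftools stats "${INPUT_VCF}" > "${OUTPUT_STATS}"'
```

Cromwell and Nextflow achieve the same fan-out with inputs JSON files and channels/sample sheets, respectively.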
While you can create workflows that read from database tables and other sources, the native preferences for dsub, Nextflow, and Cromwell are to use files as inputs. On Workbench, those files are stored in Cloud Storage buckets.
Depending on the workflows that you run, you'll often use some combination of files from public datasets, shared-access datasets (where you've been given access), and private datasets (that you own). Each of the workflow engines has native support for files referenced by cloud-native storage path URLs (such as gs://bucket/path/file), localizing those files from the storage bucket to a Task VM's local disk, including built-in support for requester pays buckets.
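For reference, accessing a requester-pays bucket outside of a workflow engine uses the same convention: a billing project must be named. With gsutil this is the `-u` flag (bucket and project names below are placeholders):

```
# Copy from a requester-pays bucket, billing access charges to my-project
gsutil -u my-project cp gs://some-requester-pays-bucket/path/file.vcf .
```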
To read from any shared-access or private cloud buckets, your pet service account needs to have been granted Read access. Your pet service account automatically has access to each bucket in your Workbench workspace.
You can create workflows that write to database tables and other destinations; however, the native capabilities of dsub, Nextflow, and Cromwell use files as outputs, and on Workbench those files are stored in Cloud Storage buckets.
Typically, you will write intermediate and final outputs to a bucket in your workspace. Should you wish to have your workflows write outputs to a bucket outside of your workspace, your pet service account will need to be granted Write access.
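As a sketch, the owner of the external bucket could grant that access with a gsutil IAM binding; the service account and bucket names below are placeholders:

```
# Grant a pet service account write access to an external bucket
gsutil iam ch \
  serviceAccount:pet-12345@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://external-bucket
```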
- Visit Using the Cromwell engine to run WDL workflows on Workbench.
- Walk through running a Nextflow example: Getting started with Nextflow on Workbench.
Last Modified: 9 February 2024