Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Verily Workbench enables you to run workflows at scale using tools that are popular across the life sciences community. In this document, you’ll discover how to choose the right workflow engine for your work, and where you can find and run existing workflows. You’ll also learn how tools like dsub, Cromwell, and Nextflow execute within the Workbench environment. By the end of this document, you’ll have the architectural insights and knowledge needed to run workflows on your data at scale.
This article will cover a few key areas to get you started:
- Choosing a workflow engine
- Where to find existing workflows
- How workflows execute on Workbench
Choosing a workflow engine
With multiple workflow engines available, how do you choose which one to use?
In some cases, the decision will be made for you by:
- Standardized engine(s) selected by your organization or team
- Existing workflows for the tools you need that are already written for a particular engine
If you’re writing workflows from scratch and have neither of the above constraints, there are a few important things to know about choosing an engine for your work.
Single-stage workflows (tasks)
- Are you only looking to run single-stage “tasks” at scale?
- Are you comfortable with running shell commands, but less comfortable learning new domain-specific languages, such as WDL or Nextflow scripting?
In either of the above cases, dsub may be the right choice for you. dsub is more of a task runner than a workflow engine. It doesn’t have built-in capabilities for sequencing multiple workflow steps; where dsub excels is in its simplicity and ease of use.
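As a rough illustration of what a single-stage task looks like, the sketch below assembles a dsub command line in Python without running it. The project, region, bucket paths, and image are hypothetical placeholders, and the provider name assumes dsub's Cloud Life Sciences provider (google-cls-v2); check the flags against the dsub version installed in your environment.

```python
import shlex


def build_dsub_command(project, region, logging_path, image, command,
                       inputs=None, outputs=None):
    """Return the argv list for one dsub task. Nothing is executed here."""
    argv = [
        "dsub",
        "--provider", "google-cls-v2",  # assumption: Cloud Life Sciences provider
        "--project", project,
        "--regions", region,
        "--logging", logging_path,
        "--image", image,
    ]
    # Each input is localized to the task VM and exposed as an environment variable.
    for name, uri in (inputs or {}).items():
        argv += ["--input", f"{name}={uri}"]
    # Each output is copied back to Cloud Storage when the command finishes.
    for name, uri in (outputs or {}).items():
        argv += ["--output", f"{name}={uri}"]
    argv += ["--command", command, "--wait"]
    return argv


# All values below are hypothetical placeholders.
argv = build_dsub_command(
    project="my-project",
    region="us-central1",
    logging_path="gs://my-bucket/logs",
    image="ubuntu:22.04",
    command='md5sum "${INPUT_FILE}" > "${OUTPUT_FILE}"',
    inputs={"INPUT_FILE": "gs://my-bucket/in/sample.bam"},
    outputs={"OUTPUT_FILE": "gs://my-bucket/out/sample.md5"},
)
print(shlex.join(argv))
```

The task itself is an ordinary shell command; the only dsub-specific parts are the flags that name the image and the files to move.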
Multi-stage or multi-environment workflows
If you’re looking to run multi-stage workflows, where each stage runs on a separate VM, and want to take advantage of more sophisticated built-in capabilities (such as automatic horizontal scaling or stage result caching), then Cromwell/WDL and Nextflow may be better choices. If you are looking to write workflows that you can run in multiple environments, such as Workbench and your institutional HPC cluster, then Cromwell/WDL and Nextflow may again be better choices; both Cromwell and the Nextflow engine support many different backends and executors.
You have flexibility
It’s worth noting that using any one of these workflow engines on Workbench isn’t a deep commitment. These frameworks use the Life Sciences API on Google Cloud to execute individual tasks, which means that each is built around:
- Packaged code in Docker images
- Input files localized from Google Cloud Storage
- Output files delocalized to Google Cloud Storage
Moving a workflow (or a subset of workflow tasks) to a different workflow engine can be focused on rewriting the orchestration, rather than the need to rewrite each individual task.
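One way to picture this portability is to keep the task description (image, command, output file) separate from the engine-specific wrapper. The sketch below is illustrative only: TaskSpec, the @IN@ placeholder, and both rendering functions are hypothetical helpers, and the generated WDL and Nextflow text is a minimal skeleton rather than a production-ready workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Engine-independent description of one task (hypothetical helper)."""
    name: str
    image: str        # Docker image holding the packaged code
    command: str      # shell command; @IN@ marks the localized input file
    output_file: str  # file the command writes, to be delocalized


def to_wdl(task: TaskSpec) -> str:
    """Render the task as a minimal WDL 1.0 task definition."""
    command = task.command.replace("@IN@", "~{in_file}")
    return "\n".join([
        "version 1.0",
        f"task {task.name} {{",
        "  input {",
        "    File in_file",
        "  }",
        "  command <<<",
        f"    {command}",
        "  >>>",
        "  output {",
        f'    File out = "{task.output_file}"',
        "  }",
        "  runtime {",
        f'    docker: "{task.image}"',
        "  }",
        "}",
    ])


def to_nextflow(task: TaskSpec) -> str:
    """Render the same task as a minimal Nextflow process."""
    command = task.command.replace("@IN@", "${in_file}")
    return "\n".join([
        f"process {task.name} {{",
        f"  container '{task.image}'",
        "  input:",
        "  path in_file",
        "  output:",
        f"  path '{task.output_file}'",
        "  script:",
        '  """',
        f"  {command}",
        '  """',
        "}",
    ])


spec = TaskSpec(name="md5", image="ubuntu:22.04",
                command="md5sum @IN@ > md5.txt", output_file="md5.txt")
print(to_wdl(spec))
print(to_nextflow(spec))
```

The image and command are identical in both renderings; only the surrounding orchestration syntax differs.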
Where to find existing workflows
Cromwell supports the orchestration of complex workflows written in the Workflow Definition Language (WDL). WDL is used for workflows across the community and many existing WDL workflows can be found in Dockstore.
Nextflow supports the orchestration of complex workflows written in the Nextflow scripting language. Nextflow is used for workflows across the community and many existing Nextflow workflows can be found in Dockstore.
How workflows execute on Workbench
Executing workflows on Workbench involves bringing together a workflow definition and supporting code, along with configuration and data for processing as represented in this overview image:
This section describes each of these components in more detail.
Orchestration and task execution
The image below highlights key elements of a single workflow job executing on Workbench:
- Workflow graph
- Workflow orchestration
- Task execution
The left-hand side shows an example of a moderately advanced workflow, which includes multiple tasks that must be executed in a particular order (with some tasks executing in parallel). Workflows can be orchestrated by sophisticated workflow engines such as Cromwell or Nextflow. For less complex (especially single-task) workflows, you can write a script that uses dsub to launch those tasks.
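To make the idea of ordered and parallel tasks concrete, the sketch below groups a small dependency graph into "waves" of tasks that could run at the same time. The task names are hypothetical, and this is only a simplified model of what an orchestrator does, not how Cromwell or Nextflow are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow graph: each task maps to the set of tasks it depends on.
graph = {
    "align_a": {"fetch"},
    "align_b": {"fetch"},
    "merge": {"align_a", "align_b"},
    "report": {"merge"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves


waves = execution_waves(graph)
print(waves)
# [['fetch'], ['align_a', 'align_b'], ['merge'], ['report']]
```

Here the two alignment tasks fall in the same wave, so an orchestrator could launch them on separate task VMs at once.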
To scale execution on Workbench, orchestration and task execution occur on different machines. The Cloud Life Sciences API is used by dsub, Cromwell, and Nextflow to schedule and monitor task-specific VMs.
Each of the workflow engines documented here uses the Google Cloud Life Sciences API to run task VMs in your Workbench workspace. There are two key credentials to be aware of:
- Privileges to call the Life Sciences API
- Credentials used on each Task VM
When you submit a workflow for execution from your cloud environment, you’ll be calling the Life Sciences API with your pet service account credentials. Each Task VM also executes with your pet service account credentials.
Workflow client and server software
dsub, Cromwell, and Nextflow each provide client software to submit and monitor workflows.
Cromwell and Nextflow provide server software that you can run on your Workbench cloud environment to scale up execution and management of large numbers of workflows.
The software for dsub, Cromwell, and Nextflow is pre-installed on Workbench cloud environments.
Whether it’s a community-published workflow that you discover in Dockstore, a workflow shared within your organization, or your own private workflow, permanent storage of your workflow code will typically be in source control. To run the workflow, you’ll first need to copy the workflow code from its storage location to your Workbench cloud environment such that it is available to the workflow client.
Workbench supports integrated access to GitHub. Adding a GitHub repo to your workspace will automatically sync that repo to any cloud environment that you create in that workspace. This makes it easy to launch workflows in that workspace, update workflow code, or commit input parameters and logs to source control.
Code for each workflow task is typically embedded in the workflow description, and those commands run inside of a Docker container. The Docker images for your workflows can come from any Docker registry accessible to the Task VM. This can include repositories such as Docker Hub, Quay.io, GitHub, Google Container Registry or Artifact Registry.
Workbench supports authenticated access to Google Container Registry and Artifact Registry using your pet service account credentials.
When you submit workflows from your Workbench cloud environment, you’ll need to provide a list of parameters for each workflow, including input paths and output paths (discussed further below).
Each workflow engine includes a way to provide lists of inputs. For example, you can submit a batch of genomic samples to be processed. These lists of input parameters will be stored on your Workbench cloud environment.
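As one example of such a list, dsub accepts a tab-separated file through its --tasks flag, where the header row declares each column as an --env, --input, or --output parameter. The sketch below writes such a file for a batch of samples; the sample IDs, bucket paths, and column names are hypothetical.

```python
import csv
import io


def write_tasks_tsv(samples, input_prefix, output_prefix, handle):
    """Write one row per sample in the TSV layout used by dsub --tasks."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header: each column names its parameter type and variable name.
    writer.writerow(["--env SAMPLE_ID", "--input INPUT_BAM", "--output OUTPUT_TXT"])
    for sample in samples:
        writer.writerow([
            sample,
            f"{input_prefix}/{sample}.bam",
            f"{output_prefix}/{sample}.txt",
        ])


# Hypothetical sample IDs and bucket paths; an in-memory buffer stands in for a file.
buffer = io.StringIO()
write_tasks_tsv(["S1", "S2"], "gs://my-bucket/bams", "gs://my-bucket/results", buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

In practice you would write to a file in your cloud environment and pass its path to the workflow client.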
While you can create workflows that read from database tables and other sources, the native preferences for dsub, Nextflow, and Cromwell are to use files as inputs. On Workbench, those files are stored in Cloud Storage buckets.
Depending on the workflows that you run, you’ll often use some combination of files from public datasets, shared access datasets (where you’ve been given access), and private datasets (that you own). Each of the workflow engines has native support for files using cloud-native storage path URLs (such as gs://bucket/path/file), localizing those files from the storage bucket to a Task VM’s local disk, including built-in support for requester pays.
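To illustrate what localization means for a path, the sketch below splits a gs:// URL into its bucket and object name and maps it to a location on a task VM's disk. The /mnt/data/input root is an assumption chosen for illustration; each engine decides its own local layout.

```python
import posixpath
from urllib.parse import urlsplit


def split_gs_url(url):
    """Split gs://bucket/path/file into (bucket, object_name)."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a Cloud Storage URL: {url!r}")
    # Note: object names containing '?' or '#' would need extra handling,
    # because urlsplit treats those characters as query/fragment markers.
    return parts.netloc, parts.path.lstrip("/")


def local_path(url, root="/mnt/data/input"):
    """Map a gs:// URL to a path under a (hypothetical) local mount root."""
    bucket, object_name = split_gs_url(url)
    return posixpath.join(root, "gs", bucket, object_name)


print(split_gs_url("gs://bucket/path/file"))
print(local_path("gs://bucket/path/file"))
```

Keeping the bucket name in the local path avoids collisions when two buckets contain objects with the same name.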
For reading from any shared access or private cloud buckets, your pet service account will need to have been granted Read access. Your pet service account will have access to each bucket in your Workbench workspace.
You can create workflows that write to database tables and other destinations; however, the “native” capabilities for dsub, Nextflow, and Cromwell use files as outputs, and on Workbench those files are stored in Cloud Storage buckets.
Typically, you will write intermediate and final outputs to a bucket in your workspace. Should you wish to have your workflows write outputs to a bucket outside of your workspace, you will need to have Write access granted to your pet service account.
This tutorial walks you through running a Nextflow example: Getting started with Nextflow on Workbench.
Last Modified: 16 November 2023