Workflows in Verily Workbench: Cromwell, dsub, and Nextflow

Introduction

Verily Workbench enables you to run workflows at scale using tools that are popular across the life sciences community. In this document, you’ll learn how to choose the right workflow engine for your work and where to find existing workflows to run. You’ll also find architectural detail on how tools such as Cromwell, dsub, and Nextflow execute within the Workbench environment, so that you can run workflows on your data at scale.

Running Cromwell workflows

Verily Workbench provides built-in support for running and monitoring WDL-based workflows via the Cromwell workflow engine. Directly within the UI, you can add workflows, run them with a set of inputs, and monitor their execution. Visit Using the Cromwell engine to run WDL workflows on Workbench for details.

However, you can use other workflow engines on Workbench as well. This article discusses when you might want to use a particular framework.

Choosing a workflow engine

Though Verily Workbench provides first-class support for running WDL-based workflows on Cromwell, users are not locked into this technology. With multiple workflow engines available, how do you choose which one to use? Perhaps you want to make use of Verily Workbench’s built-in support for Cromwell; in other cases the decision may be made for you by:

  • Standardized engine(s) selected by your organization or team
  • Desired tools with existing workflows written for a particular engine

There are a few important things to know about choosing an engine for your work.

Single-stage workflows (tasks)

  • Are you only looking to run single-stage “tasks” at scale?
  • Are you comfortable with running shell commands, but less comfortable learning new domain-specific languages, such as WDL or Nextflow scripting?

In either of the above cases, dsub may be the right choice for you. dsub is a task runner rather than a full workflow engine: it doesn’t have built-in capabilities for sequencing multiple workflow steps. Where dsub excels is in its simplicity and ease of use.
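
For example, here’s a minimal sketch of a single dsub task; the project ID, bucket paths, and file names are placeholders for your own values.

    # Count the non-header lines of a VCF on a transient task VM.
    # dsub localizes INPUT_VCF and delocalizes OUTPUT_STATS for you.
    dsub \
      --provider google-cls-v2 \
      --project "${PROJECT_ID}" \
      --regions us-central1 \
      --logging gs://my-workspace-bucket/logs/ \
      --image ubuntu:22.04 \
      --input INPUT_VCF=gs://my-workspace-bucket/data/sample.vcf \
      --output OUTPUT_STATS=gs://my-workspace-bucket/results/sample.stats \
      --command 'grep -v "^#" "${INPUT_VCF}" | wc -l > "${OUTPUT_STATS}"' \
      --wait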

Multi-stage or multi-environment workflows

If you’re looking to run multi-stage workflows, where each stage runs on a separate VM, and want to take advantage of more sophisticated built-in capabilities (such as automatic horizontal scaling or stage result caching), then Cromwell/WDL and Nextflow may be better choices.

If you are looking to write workflows that you can run in multiple environments, such as Workbench and your institutional HPC cluster, then Cromwell/WDL and Nextflow may again be better choices; both Cromwell and Nextflow support many different backends and executors.
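
As an illustration, the hypothetical Nextflow configuration below lets the same pipeline target either Google Cloud (from Workbench) or a Slurm cluster; the project ID, bucket, and scratch paths are placeholders.

    cat > nextflow.config <<'EOF'
    profiles {
      gcp {
        process.executor = 'google-lifesciences'
        google.project   = 'my-project-id'
        google.region    = 'us-central1'
        workDir          = 'gs://my-workspace-bucket/nextflow-work'
      }
      hpc {
        process.executor = 'slurm'
        workDir          = '/scratch/nextflow-work'
      }
    }
    EOF

    # Choose the backend at launch time:
    nextflow run main.nf -profile gcp    # on Workbench
    nextflow run main.nf -profile hpc    # on your institutional cluster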

Note that Verily Workbench currently supports only Cromwell through the UI. However, all workflow engines (including Cromwell) can be executed on VMs within a workspace, as described below.

You have flexibility

Using any one of these workflow engines on Workbench isn’t a deep commitment. These frameworks all use the Life Sciences API on Google Cloud to execute individual tasks, which means that each is built around:

  • Packaged code in Docker images
  • Input files localized from Google Cloud Storage
  • Output files delocalized to Google Cloud Storage

Moving a workflow (or a subset of its tasks) to a different workflow engine therefore mostly means rewriting the orchestration, not each individual task.
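
As a concrete (hypothetical) illustration, the WDL below packages a single containerized command for Cromwell; the same Docker image and gs:// files could just as easily be driven by dsub or a Nextflow process. All names and paths are placeholders.

    cat > count_variants.wdl <<'EOF'
    version 1.0

    workflow count_variants {
      input {
        File vcf
      }
      call count { input: vcf = vcf }
      output {
        File count_file = count.result
      }
    }

    task count {
      input {
        File vcf                    # localized from gs:// to the task VM
      }
      command <<<
        grep -v "^#" ~{vcf} | wc -l > count.txt
      >>>
      output {
        File result = "count.txt"   # delocalized back to Cloud Storage
      }
      runtime {
        docker: "ubuntu:22.04"      # packaged code in a Docker image
      }
    }
    EOF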

Where to find existing workflows

Cromwell supports the orchestration of complex workflows written in the Workflow Description Language (WDL). WDL is used for workflows across the community, and many existing WDL workflows can be found in Dockstore.

Nextflow supports the orchestration of complex workflows written in the Nextflow scripting language. Nextflow is used for workflows across the community, and many existing Nextflow workflows can be found in Dockstore.

How workflows execute on Workbench

Executing workflows on Workbench involves bringing together a workflow definition and supporting code, along with configuration and data for processing as represented in this overview image:

Diagram showing workflow execution and relationships among task virtual machines, buckets, Workbench cloud environments, and packaged code.

This section describes each of these components in more detail.

Orchestration and task execution

The image below highlights key elements of a single workflow job executing on Workbench:

Diagram showing how the workflow graph, workflow orchestration, and task execution relate to one another when a single workflow job is executed on Workbench.

The left-hand side shows an example of a moderately complex workflow, which includes multiple tasks that must be executed in a particular order (with some tasks executing in parallel). Such workflows can be orchestrated by sophisticated workflow engines such as Cromwell or Nextflow. For less complex (especially single-task) workflows, you can write a script that uses dsub to launch those tasks.

To scale execution on Workbench, orchestration and task execution occur on different machines. dsub, Cromwell, and Nextflow each use the Cloud Life Sciences API to schedule and monitor task-specific VMs.
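
For example, dstat (a companion tool that ships with dsub) can report the status of task VMs launched through the Life Sciences API. In this sketch, PROJECT_ID and JOB_ID are placeholders for your own values.

    # Show the status of every task in a previously submitted job.
    dstat \
      --provider google-cls-v2 \
      --project "${PROJECT_ID}" \
      --jobs "${JOB_ID}" \
      --status '*'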

Workflow credentials

Each of the workflow engines documented here uses the Google Cloud Life Sciences API to run task VMs in your Workbench workspace. There are two key credentials to be aware of:

  • Privileges to call the Life Sciences API
  • Credentials used on each task VM

When you submit a workflow from your cloud environment, you call the Life Sciences API with your pet service account credentials. Each task VM likewise executes with your pet service account credentials.
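
As a quick sanity check, you can confirm which identity is active on a cloud environment; on Workbench, the active account is typically your pet service account.

    # The account marked active is used for Life Sciences API calls.
    gcloud auth list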

Workflow client and server software

dsub, Cromwell, and Nextflow each provide client software to submit and monitor workflows.

Cromwell and Nextflow provide server software that you can run on your Workbench cloud environment to scale up execution and management of large numbers of workflows.

The software for dsub, Cromwell, and Nextflow is pre-installed on Workbench cloud environments.
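
One simple way to confirm this is to check each client’s version. Note that the Cromwell jar path below is an assumption; the actual location may differ on your cloud environment.

    dsub --version
    nextflow -version
    # Cromwell ships as a jar; replace the path with its actual location.
    java -jar /path/to/cromwell.jar --version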

Workflow code

Whether it’s a community-published workflow that you discover in Dockstore, a workflow shared within your organization, or your own private workflow, permanent storage of your workflow code will typically be in source control. To run the workflow, you’ll first need to copy the workflow code from its storage location to your Workbench cloud environment so that it’s available to the workflow client.
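
In the simplest case, this is a git clone onto the cloud environment (the repository URL below is a placeholder); the GitHub integration described next can automate this step.

    # Fetch the workflow code so that the workflow client can read it.
    git clone https://github.com/my-org/my-workflows.git
    cd my-workflows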

Workbench supports integrated access to GitHub. Adding a GitHub repo to your workspace will automatically sync that repo to any cloud environment that you create in that workspace. This makes it easy to launch workflows in that workspace, update workflow code, or commit input parameters and logs to source control.

Task code

Code for each workflow task is typically embedded in the workflow description, and those commands run inside a Docker container. The Docker images for your workflows can come from any Docker registry accessible to the task VM, including Docker Hub, Quay.io, GitHub Container Registry, Google Container Registry, and Artifact Registry.

Workbench supports authenticated access to Google Container Registry and Artifact Registry using your pet service account credentials.
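
If you also pull or push images from the cloud environment itself (for example, when building task images), you can configure Docker to authenticate with those same credentials; the registry host below is a placeholder for your region.

    # Register gcloud as a Docker credential helper for this registry host.
    gcloud auth configure-docker us-central1-docker.pkg.dev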

Workflow configuration

When you submit workflows from your Workbench cloud environment, you’ll need to provide a list of parameters for each workflow, including input paths and output paths (discussed further below).

Each workflow engine includes a way to provide lists of inputs. For example, you can submit a batch of genomic samples to be processed. These lists of input parameters will be stored on your Workbench cloud environment.
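
To illustrate, a Cromwell inputs file for the hypothetical WDL workflow sketched earlier might look like the following; the bucket path is a placeholder. dsub serves the same purpose with a --tasks TSV file, one row per task.

    cat > inputs.json <<'EOF'
    {
      "count_variants.vcf": "gs://my-workspace-bucket/data/sample.vcf"
    }
    EOF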

Workflow inputs

While you can create workflows that read from database tables and other sources, the native preferences for dsub, Nextflow, and Cromwell are to use files as inputs. On Workbench, those files are stored in Cloud Storage buckets.

Depending on the workflows that you run, you’ll often use some combination of files from public datasets, shared-access datasets (where you’ve been given access), and private datasets (that you own). Each workflow engine has native support for cloud storage path URLs (such as gs://bucket/path/file), localizing those files from the storage bucket to a task VM’s local disk; this includes built-in support for Requester Pays buckets.

To read from any shared-access or private cloud buckets, your pet service account must be granted Read access. (Your pet service account automatically has access to each bucket in your Workbench workspace.)
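
For example, the owner of an external bucket could grant read access to your pet service account as follows; the service account address and bucket name are placeholders.

    # Grant object-read access on the bucket to the pet service account.
    gsutil iam ch \
        serviceAccount:pet-12345@my-project.iam.gserviceaccount.com:objectViewer \
        gs://shared-dataset-bucket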

Workflow outputs

You can create workflows that write to database tables and other destinations; however, the native capabilities of dsub, Nextflow, and Cromwell use files as outputs, and on Workbench those files are stored in Cloud Storage buckets.

Typically, you will write intermediate and final outputs to a bucket in your workspace. If you want your workflows to write outputs to a bucket outside of your workspace, your pet service account must be granted Write access.
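
Analogous to the read grant shown earlier, the owner of an external bucket could grant write access as follows (again, the names are placeholders).

    # objectCreator allows the workflow to write new objects to the bucket.
    gsutil iam ch \
        serviceAccount:pet-12345@my-project.iam.gserviceaccount.com:objectCreator \
        gs://external-results-bucket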

Next steps

To get started with Workbench’s built-in workflow support, see Using the Cromwell engine to run WDL workflows on Workbench.

Last Modified: 12 May 2024