Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Purpose: This document provides information on creating workflows in Verily Workbench using Cromwell, dsub, and Nextflow.
Introduction
Verily Workbench enables you to run workflows at scale using tools that are popular across the life sciences community. In this document, you’ll discover how to choose the right workflow engine for your work, and where you can find and run existing workflows. This document provides architectural insights into how tools such as Cromwell, dsub, and Nextflow execute within the Workbench environment so that you can run workflows on your data at scale.
What are workflows and what capabilities do they provide?
A computational workflow is a sequence of computational steps that are used to process data. It's a formalized description of how data is input, how it flows between steps, and how it is output. Computational workflows are widely used in data analysis, scientific computing, and engineering.
Using workflows for your analyses offers the following advantages over running computational steps individually:
- Reproducibility: Using workflows helps ensure that the results of an analysis can be reproduced by others. This is important for scientific research, where it is essential that the results of an experiment can be verified by other researchers.
- Portability and sharing: If you use a widely-supported workflow language, you can run the same workflow on different platforms, which gives you the freedom to choose the platform you want to use. It also enables you to share your workflows with others who use different platforms, and to use workflows developed by others in the community.
- Scalability: Workflows enable you to handle large datasets by automating execution and reducing the possibility of error.
Running Cromwell workflows
Note that Verily Workbench provides built-in support for running and monitoring WDL-based workflows via the Cromwell workflow engine. Right within the UI, you can add workflows, run them with a set of inputs, and monitor their execution. Visit Using the Cromwell engine to run WDL workflows on Workbench for details.
However, you can use other workflow engines on Workbench as well. This article discusses when you might want to use a particular framework.
Choosing a workflow engine
Though Verily Workbench provides first-class support for running WDL-based workflows on Cromwell, you are not locked into this technology. With multiple workflow engines available, how do you choose which one to use? Perhaps you want to make use of Verily Workbench's built-in support for Cromwell; in other cases the decision may be made for you by:
- Standardized engine(s) selected by your organization or team
- Desired tools with existing workflows written for a particular engine
There are a few important things to know about choosing an engine for your work.
Single-stage workflows (tasks)
- Are you only looking to run single-stage "tasks" at scale?
- Are you comfortable with running shell commands, but less comfortable learning new domain-specific languages, such as WDL or Nextflow scripting?
In either of the above cases, dsub may be the right choice for you. dsub is more of a task runner than a workflow engine. It doesn’t have built-in capabilities for sequencing multiple workflow steps; instead, dsub excels in its simplicity and ease of use.
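As a rough illustration of what a single-stage dsub task looks like, the following Python sketch assembles a dsub command line without running it. The flags shown (--provider, --project, --regions, --logging, --image, --input, --output, --command, --wait) come from dsub's own documentation; the project ID, bucket paths, region, and image are placeholders, and build_dsub_command is a hypothetical helper, not part of dsub or Workbench. Your Workbench app may need additional or different flags.

```python
import shlex


def build_dsub_command(project, logging_path, image, command,
                       inputs=None, outputs=None, region="us-central1"):
    """Return the argument list for one single-stage dsub task (hypothetical helper)."""
    args = [
        "dsub",
        "--provider", "google-cls-v2",  # dsub's Cloud Life Sciences provider
        "--project", project,
        "--regions", region,
        "--logging", logging_path,
        "--image", image,
    ]
    # Each input is localized to the task VM and exposed as an environment variable.
    for name, path in (inputs or {}).items():
        args += ["--input", f"{name}={path}"]
    # Each output is delocalized back to Cloud Storage when the command finishes.
    for name, path in (outputs or {}).items():
        args += ["--output", f"{name}={path}"]
    args += ["--command", command, "--wait"]
    return args


# Placeholder values: replace them with your own project, bucket, and image.
cmd = build_dsub_command(
    project="my-project",
    logging_path="gs://my-bucket/logs",
    image="ubuntu:22.04",
    command='wc -l "${INPUT_FILE}" > "${OUTPUT_FILE}"',
    inputs={"INPUT_FILE": "gs://my-bucket/input/sample.txt"},
    outputs={"OUTPUT_FILE": "gs://my-bucket/output/line_count.txt"},
)

# Print a shell-quoted command that you could paste into a terminal.
print(shlex.join(cmd))
```

The task itself is an ordinary shell command; dsub only adds the description of where the inputs come from and where the outputs go.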
Multi-stage or multi-environment workflows
If you’re looking to run multi-stage workflows, where each stage runs on a separate VM, and want to take advantage of more sophisticated built-in capabilities (such as automatic horizontal scaling or stage result caching), then Cromwell/WDL and Nextflow may be better choices.
If you are looking to write workflows that you can run in multiple environments, such as Workbench and your institutional HPC cluster, then Cromwell/WDL and Nextflow may again be better choices; both Cromwell and the Nextflow engine support many different backends and executors.
Note that Verily Workbench currently only supports Cromwell through the UI. However, all workflow engines (including Cromwell) can be executed on VMs within a workspace, as described below.
You have flexibility
It’s worth noting that using any one of these workflow engines on Workbench isn’t a deep commitment. These frameworks use the Life Sciences API on Google Cloud to execute individual tasks, which means that each is built around:
- Packaged code in Docker images
- Input files localized from Google Cloud Storage
- Output files delocalized to Google Cloud Storage
Because the tasks share these building blocks, moving a workflow (or a subset of workflow tasks) to a different workflow engine is largely a matter of rewriting the orchestration rather than rewriting each individual task.
Where to find existing workflows
Cromwell supports the orchestration of complex workflows written in the Workflow Definition Language (WDL). WDL is used for workflows across the community and many existing WDL workflows can be found in Dockstore.
Nextflow supports the orchestration of complex workflows written in the Nextflow scripting language. Nextflow is used for workflows across the community and many existing Nextflow workflows can be found in Dockstore.
How workflows execute on Workbench
Executing workflows on Workbench involves bringing together a workflow definition and supporting code, along with configuration and data for processing as represented in this overview image:
This section describes each of these components in more detail.
Orchestration and task execution
The image below highlights key elements of a single workflow job executing on Workbench:
- Workflow graph
- Workflow orchestration
- Task execution
The left-hand side shows an example of a moderately advanced workflow, which includes multiple tasks that must be executed in a particular order (with some tasks executing in parallel). Workflows can be orchestrated by sophisticated workflow engines such as Cromwell or Nextflow. For less complex (especially single-task) workflows, you can write a script that uses dsub to launch those tasks.
To scale execution on Workbench, orchestration and task execution occur on different machines. The Cloud Life Sciences API is used by dsub, Cromwell, and Nextflow to schedule and monitor task-specific VMs.
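To make the idea of a workflow graph concrete, here is a minimal Python sketch that groups tasks into stages whose members could run in parallel, using the standard library's graphlib module (Python 3.9+). The task names and dependencies are invented for illustration. Real engines such as Cromwell and Nextflow schedule tasks more flexibly than this lockstep grouping, and they launch each task on its own VM rather than in the orchestrating process.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow graph: each task maps to the set of tasks it depends on.
workflow = {
    "align_sample_a": set(),
    "align_sample_b": set(),
    "call_variants_a": {"align_sample_a"},
    "call_variants_b": {"align_sample_b"},
    "joint_genotype": {"call_variants_a", "call_variants_b"},
}


def execution_stages(graph):
    """Group tasks into stages; tasks in the same stage have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that can start now
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


stages = execution_stages(workflow)
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

In this sketch the two alignment tasks form the first stage, the two variant-calling tasks form the second, and the joint step runs last, once both of its inputs exist.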
Workflow credentials
Each of the workflow engines documented here uses the Google Cloud Life Sciences API to run task VMs in your Workbench workspace. There are two key credentials to be aware of:
- Privileges to call the Life Sciences API
- Credentials used on each task VM
When you submit a workflow for execution from your cloud app, you’ll be calling the Life Sciences API with your pet service account credentials. Each task VM also executes with your pet service account credentials.
Workflow client and server software
dsub, Cromwell, and Nextflow each provide client software to submit and monitor workflows.
Cromwell and Nextflow provide server software that you can run on your Workbench app to scale up execution and management of large numbers of workflows.
The software for dsub, Cromwell, and Nextflow is pre-installed on Workbench apps.
Workflow code
Whether it’s a community-published workflow that you discover in Dockstore, a workflow shared within your organization, or your own private workflow, permanent storage of your workflow code will typically be in source control. To run the workflow, you’ll first need to copy the workflow code from its storage location to your Workbench app so that it's available to the workflow client.
Workbench supports integrated access to GitHub. Adding a GitHub repo to your workspace will automatically sync that repo to any app that you create in that workspace. This makes it easy to launch workflows in that workspace, update workflow code, or commit input parameters and logs to source control.
Task code
Code for each workflow task is typically embedded in the workflow description, and those commands run inside of a Docker container. The Docker images for your workflows can come from any Docker registry accessible to the task VM. This can include registries such as Docker Hub, Quay.io, GitHub, Google Container Registry, or Artifact Registry.
Workbench supports authenticated access to Google Container Registry and Artifact Registry using your pet service account credentials.
Workflow configuration
When you submit workflows from your Workbench app, you’ll need to provide a list of parameters for each workflow, including input paths and output paths (discussed further below).
Each workflow engine includes a way to provide lists of inputs. For example, you can submit a batch of genomic samples to be processed. These lists of input parameters will be stored on your Workbench app.
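As one example of such a list, dsub accepts a tab-separated tasks file (passed with its --tasks flag) whose header row names each column as an --env, --input, or --output parameter, with one row per task. The sketch below generates such a file in memory. The sample IDs, bucket paths, and column names are placeholders, and write_tasks_tsv is a hypothetical helper. Cromwell and Nextflow have their own input mechanisms, which are not shown here.

```python
import csv
import io


def write_tasks_tsv(samples, input_prefix, output_prefix, stream):
    """Write one dsub task row per sample to a text stream (hypothetical helper)."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    # Header row: each column declares a parameter type and its variable name.
    writer.writerow(["--env SAMPLE_ID", "--input INPUT_BAM", "--output OUTPUT_VCF"])
    for sample in samples:
        writer.writerow([
            sample,
            f"{input_prefix}/{sample}.bam",
            f"{output_prefix}/{sample}.vcf",
        ])


# Placeholder sample IDs and bucket paths.
buffer = io.StringIO()
write_tasks_tsv(
    samples=["sample-001", "sample-002"],
    input_prefix="gs://my-bucket/bams",
    output_prefix="gs://my-bucket/vcfs",
    stream=buffer,
)
tasks_tsv = buffer.getvalue()
print(tasks_tsv, end="")
```

Writing to a stream keeps the sketch free of side effects; in practice you would open a file on your Workbench app and pass that instead.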
Workflow inputs
While you can create workflows that read from database tables and other sources, the native preferences for dsub, Nextflow, and Cromwell are to use files as inputs. On Workbench, those files are stored in Cloud Storage buckets.
Depending on the workflows that you run, you'll often use some combination of files from public datasets, shared access datasets (where you've been given access), and private datasets (that you own). Each of the workflow engines has native support for files using cloud-native storage path URLs (such as gs://bucket/path/file), localizing those files from the storage bucket to a task VM's local disk, including built-in support for Requester Pays buckets.
For reading from any shared access or private cloud buckets, your pet service account will need to have been granted Read access. Your pet service account will have access to each bucket in your Workbench workspace.
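The following Python sketch illustrates what localizing a gs:// path means: the URL is split into a bucket and an object name, and the object is mapped to a path on the task VM's local disk. The local root directory and the gs/bucket/object layout are illustrative assumptions; each engine chooses its own layout, and the actual file copy is performed by the engine, not by this code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def split_gcs_url(url):
    """Split a gs:// URL into (bucket, object name)."""
    parts = urlsplit(url)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a Cloud Storage URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


def local_path_for(url, root="/mnt/data/input"):
    """Map a gs:// URL to a local path under root (the layout is an assumption)."""
    bucket, object_name = split_gcs_url(url)
    return str(PurePosixPath(root) / "gs" / bucket / object_name)


print(split_gcs_url("gs://bucket/path/file"))
print(local_path_for("gs://bucket/path/file"))
```

Keeping the bucket name in the local path avoids collisions when a task reads files with the same name from different buckets.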
Workflow outputs
You can create workflows that write to database tables and other destinations; however, the "native" capabilities for dsub, Nextflow, and Cromwell use files as outputs, and on Workbench those files are stored in Cloud Storage buckets.
Typically, you will write intermediate and final outputs to a bucket in your workspace. Should you wish to have your workflows write outputs to a bucket outside of your workspace, you will need to have Write access granted to your pet service account.
Next steps
- Visit Using the Cromwell engine to run WDL workflows on Workbench.
- This tutorial walks you through running a Nextflow example: Getting started with Nextflow on Workbench.
Last Modified: 13 November 2024