Using Dataproc (Spark) and Hail on Verily Workbench

How to use Dataproc (managed Spark) environments and run Hail on Verily Workbench

Introduction

Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and many other tools and frameworks. The Verily Workbench Dataproc environment also supports one-click Hail installation. Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics.

This document gives an overview of how to create and use Dataproc clusters, optionally with Hail, in Workbench.

Starting up a Dataproc cluster from the Workbench web UI

Navigate to the Environments tab in your workspace, click on + New cloud environment, then select the JupyterLab Spark cluster option.

Screenshot of Select app dialog, the first step when creating a cloud environment, with JupyterLab Spark cluster option selected.
Creating a Dataproc cluster environment.

Next, choose a cloud environment configuration. You should see a default configuration, as well as any other configurations you previously created. Click Next.

Screenshot of Choose configuration dialog showing cost estimate, the second step when creating a cloud environment. The default configuration is selected.
Choosing a cloud environment configuration for your cluster.

On the next screen, you can customize various aspects of your cluster:

Screenshot of Customize dialog, the third step when creating a cloud environment, showing all of the cluster and node options you can customize.
Customizing cluster options.

Selected parts of this form are highlighted in more detail below. After you’ve configured your new cluster, click Next.

Finally, you can review your environment details and enter an ID, name, and description (optional) for your environment. You can also optionally add a cost estimate:

Screenshot of Review details dialog, the last step when creating a cloud environment.
Reviewing a newly created cluster.

Click Create environment.

Note: Cluster creation may take ~10 minutes.

Autoscaling policies

The cluster creation form allows you to specify an autoscaling policy. The autoscaling policy must be created before you create the cluster.

You can read more about setting up autoscaling policies here. Once you’ve created one or more policies, the Autoscaling policy dropdown in the cluster creation dialog will be auto-populated with those policies, and you can select the one you want.

Screenshot of 'Autoscaling policy' dropdown showing three different options.
Selecting a defined autoscaling policy.

The autoscaling policies are common to the workspace’s underlying project. So, other users with access to the workspace will be able to view the policies, and users with Writer or Owner workspace privileges will be able to modify them. You can create policies via the Cloud console UI, or via the gcloud SDK from the command line.
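For example, you can list the policies already defined in the workspace's underlying project with gcloud (this assumes the us-central1 region used elsewhere in this document; adjust as needed):

```shell
# List the autoscaling policies defined in the current project.
# Assumes gcloud is configured for the workspace's underlying project;
# otherwise, add --project=<project_id>.
gcloud dataproc autoscaling-policies list --region=us-central1
```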

Creating an autoscaling policy via the Cloud console

Visit https://console.cloud.google.com/dataproc/autoscalingPolicies:

Screenshot of Autoscaling policies page in Google Cloud console, with 'Autoscaling policies' link highlighted in left menu.
You can create autoscaling policies via the Cloud console UI.

Creating an autoscaling policy via the gcloud SDK

Create a yaml config file (below is an example). See the Dataproc documentation for more details.

workerConfig:
  # Best practice: keep min and max values identical for primary workers
  # https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#avoid_scaling_primary_workers
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  maxInstances: 50
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h

Import a yaml autoscaling config to your project via gcloud. You can run this command from a workspace notebook environment, or from your local machine. (Locally, you may need to set the --project flag as well, if gcloud is not configured for the correct project.)

gcloud dataproc autoscaling-policies import <autoscaling_policy_name> \
   --source=<your_autoscaling_policy>.yaml \
   --region=us-central1
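To confirm that the import succeeded, you can describe the policy by the name you gave it (a sketch, assuming the same region as the import command above):

```shell
# Print the imported policy's configuration back out.
gcloud dataproc autoscaling-policies describe <autoscaling_policy_name> \
    --region=us-central1
```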

Installing Hail

To install Hail on the cluster, select it from the Software to install dropdown in the cluster creation dialog.

Screenshot of Advanced options section in Customize dialog, showing 'Software to install' dropdown and highlighting Hail option.
Install Hail on the Dataproc cluster.

Note: In the future, it will also be possible to provide your own initialization-actions script(s) to support other custom installations.

Dataproc GCS buckets

The first time a Dataproc cluster is created in a workspace, two controlled resource Google Cloud Storage (GCS) buckets are automatically created: a staging bucket, and a ‘temp’ bucket.

Their resource names are dataproc-staging-<workspace-name> and dataproc-temp-<workspace-name>, where <workspace-name> is replaced with the name of your workspace.

These buckets are used by default for all clusters created within that workspace, though when you create a cluster, you can instead configure it to use other controlled resource buckets from the workspace. Any other buckets must already exist before you create the cluster; existing buckets appear in the Storage buckets dropdowns in the cluster creation dialog.

Screenshot of Storage buckets section in Customize dialog, showing the 'Resource ID' input field with a staging bucket name.
Selecting GCS buckets to use for a Dataproc cluster.

Setting scheduled deletion

To delete your clusters after an idle period, check the following box and select the time period. By default, this box is not checked.

Currently, “autopause” is not supported, though you can stop a cluster and later restart it (see below for more details). Note that scheduled deletion also applies to stopped clusters.

Screenshot of Scheduled deletion section in Customize dialog, showing the option checked off and a cluster idle time of 2 hours set.
You can set a cluster to auto-delete after a defined idle period.

Starting up a Dataproc cluster from the Workbench CLI

You can also create a Dataproc cluster via the Workbench CLI.

Run wb resource create dataproc-cluster to see the options. Many of the parameters of this command are passed through from the gcloud dataproc command. See the Dataproc docs and reference documentation for more details on these parameters.

Start a cluster from the command line without Hail

If you don’t want to install Hail on your Dataproc cluster, then use the following command as a starting point.

Before running the command, replace <your_cluster_name> with the name of your new cluster.

wb resource create dataproc-cluster \
  --id=<your_cluster_name>

Default buckets are auto-created when you first create a cluster in the workspace. Their resource names are dataproc-staging-<workspace-name> and dataproc-temp-<workspace-name>, where <workspace-name> is replaced with the name of your workspace. However, you can specify other existing buckets:

wb resource create dataproc-cluster \
  --id=<your_cluster_name> \
  --bucket=<staging_bucket_name> \
  --temp-bucket=<temp_bucket_name>

You can also include parameters to indicate --initialization-actions scripts, set time to deletion when idle (--idle-delete-ttl, specified in seconds), set an --autoscaling-policy, indicate components to install, and much more.

For example, this command template sets up four primary and eight secondary workers, specifies an autoscaling policy, and indicates deletion after two hours of idle time:

wb resource create dataproc-cluster \
  --id=<your_cluster_name> \
  --bucket=<staging_bucket_name> \
  --temp-bucket=<temp_bucket_name> \
  --num-workers=4 --num-secondary-workers=8 \
  --autoscaling-policy=<your_policy_id> --idle-delete-ttl=7200s

Creating a cluster from the command line with Hail installed

If you’d like to install Hail on your Dataproc cluster, then start from this command instead:

wb resource create dataproc-cluster \
  --id=<your_cluster_name> --software-framework=HAIL

As with the example in the section above, you can add additional configuration parameters (e.g., add --idle-delete-ttl 7200s to delete after two hours of idle time).

Working with a Dataproc cluster

After a cluster is up and running, you’ll see a card like this under the Environments tab:

Screenshot of a dataproc_tests environment card in the Environments tab, highlighting the cluster's name and the Edit, Open cloud console, and Delete options.
Managing a running cluster.

Click the cluster name (here, dataproc_tests) to open the JupyterLab server that is running on the “main” cluster node. Click the three-dot menu to open the Dataproc dashboard in the Google Cloud console, or to delete the cluster.

In the left-hand File navigator, the top-level directory maps to /home/dataproc. The notebook user is dataproc, and if you open a terminal window, it will be set to the /home/dataproc directory as well.

Screenshot of JupyterLab file navigator showing list of files including 'repos' and 'workspace' subdirectories.
Navigating the JupyterLab file system

In /home/dataproc, you’ll see a repos subdirectory and a workspace subdirectory.

The repos directory holds clones of any GitHub repositories that you’ve added to your workspace, as described here.

The workspace directory holds mounted workspace buckets and referenced folder resources, as described here.

The first time a Dataproc cluster is created in a workspace, two controlled resource buckets are automatically created: a staging bucket, and a temp bucket. The files in these buckets will be automounted under /home/dataproc/workspace, along with automounts of other resources you’ve created.
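As a quick orientation, opening a terminal on the cluster's JupyterLab server and inspecting the home directory should show something like the following (the exact mount names under ~/workspace depend on your workspace's resources):

```shell
# In a JupyterLab terminal on the cluster's main node:
whoami            # dataproc
pwd               # /home/dataproc
ls ~              # repos  workspace
ls ~/workspace    # mounted bucket and folder resources, e.g. the
                  # dataproc-staging and dataproc-temp buckets
```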

Persisting notebooks across clusters

When a Dataproc cluster is deleted, its node’s disks are deleted as well.

There are multiple approaches for persisting your work across clusters:

  • You can create notebook files and other artifacts in the mounted resource directories under ~/workspace. This will save your files in GCS, as discussed above and described in more detail here. Note that all users with Writer or Owner privileges for the workspace have write access to these controlled resource buckets.
  • You can work with git, and check in your changes to a GitHub repo before you shut down the cluster.

These two features work the same on the Dataproc JupyterLab server as they do in a notebook environment server.
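As a sketch of the git-based approach, assuming a repo you added to the workspace has been cloned under ~/repos (the repo and file names below are placeholders):

```shell
# Commit and push your work before shutting down the cluster.
cd ~/repos/<your_repo>
git add <your_notebook>.ipynb
git commit -m "Save analysis before cluster shutdown"
git push
```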

Accessing the Workbench CLI from on-cluster notebooks

The Workbench CLI utility, wb, is installed on the Dataproc JupyterLab server node, and initialized for the current workspace (similarly to how the notebook VM servers are configured).

You can use wb from the terminal window or from a notebook cell to do things like create clusters (as described above), create resources, resolve resource names, and more.
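For example, from a terminal (or a notebook cell, prefixed with !) you might run commands like the following. The resolve flag shown is an assumption based on the CLI usage elsewhere in this document; check `wb resource resolve --help` for the exact syntax:

```shell
# List the resources defined in the current workspace.
wb resource list

# Resolve a resource name to its underlying cloud ID
# (e.g., a bucket resource's gs:// URL).
wb resource resolve --id=<your_resource_id>
```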

See the example notebooks for an example of this.

Pausing (stopping), restarting, or deleting a cluster

You can stop (pause) a running cluster, restart a stopped cluster, or delete it in the Workbench web UI.

Screenshot of cloud environment card, highlighting the 'Stop' button.
Stop a running cluster from the Workbench web UI by clicking the Stop button.
Screenshot of cloud environment card, highlighting the 'Start' button.
Restart a stopped cluster from the Workbench web UI by clicking the Start button.

Screenshot of cloud environment card, highlighting the Edit, Open cloud console, and Delete options in three-dot menu.
Delete a cluster from the Workbench web UI by opening the three-dot menu and clicking Delete.

You can also stop/restart/delete a cluster from the Cloud console, as described below.

Scheduled deletion still applies to stopped clusters.

If a cluster has any secondary workers running, it cannot be stopped. You will first need to edit your cluster configuration, as described in the next section, to set the number of secondary workers to 0. Note that if the cluster is “UPDATING” (e.g., while autoscaling is adding or removing nodes), you will need to wait until the update finishes before making edits.
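One way to scale down secondary workers from the command line is with gcloud (you can also edit the secondary worker count in the Workbench UI or the Cloud console; note that changes made outside Workbench may not be reflected there immediately):

```shell
# Scale secondary workers to zero so the cluster can be stopped.
gcloud dataproc clusters update <your_cluster_name> \
    --num-secondary-workers=0 \
    --region=us-central1
```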

Changing the configuration of an existing Dataproc cluster

You can change some aspects of a cluster’s configuration after it has been created. Specifically, you can edit the cluster’s environment name (distinct from the cluster’s name in the Cloud console), change the number of primary and secondary (spot) worker nodes, select a different autoscaling policy, and update the scheduled deletion value. Note that machine type and disk size cannot be modified after creation.

To make changes to a cluster environment, click on the three-dot menu for the cluster, then select Edit.

Screenshot of cloud environment card, highlighting the Edit option in three-dot menu.
Edit a cluster to change some aspects of its configuration.

The figure below shows how to edit the number of spot workers:

Screenshot of Node configuration section in Customize dialog, highlighting the configurable 'Number of workers' field in 'secondary worker nodes' subsection.
Changing the number of a cluster's spot workers.

Note: The cluster needs to be in a RUNNING state to edit nearly all configuration settings. Only the cluster’s environment name can be edited if the cluster is in a STOPPED or UPDATING state.

Screenshot of cloud environment card, highlighting the 'Updating' status.
A cluster with UPDATING status.

Accessing the Dataproc dashboard in the Google Cloud console

You can open the Dataproc dashboard in the Google Cloud console by clicking the Open Cloud console link under the three-dot menu for the running cluster. You can also click on the project link on your workspace’s Overview tab to open the Cloud console and navigate to the Dataproc panel.

The console’s Dataproc dashboard gives you access to a range of monitoring panels, logs, and web interfaces, and allows submission of batch jobs.

Screenshot of Dataproc dashboard in Google Cloud console showing various line graphs regarding monitoring metrics including memory, CPU utilization, and network bytes.
The Dataproc dashboard in the Cloud console.

You can also stop and start a cluster via the Cloud console, and edit a cluster config to change its number of primary and secondary workers.

Note: Cluster creation must be done via the Verily Workbench web UI or CLI, not the Cloud console.

How to submit a batch job to a Dataproc cluster

You can submit a batch job using gcloud like this:

gcloud dataproc jobs submit pyspark --cluster=<your_cluster_name> --region=us-central1 \
    <your_batch_script>.py

See this notebook file for an example. You can run this command from a workspace notebook environment (either a “regular” notebook environment or the Dataproc cluster environment), or from your local machine. (Locally, you may need to set the --project flag as well, if gcloud is not configured for the correct project.)

You can also submit a batch job from the Dataproc dashboard in the Cloud console.

Dataproc in shared workspaces

Where a workspace is shared with multiple users, those with Writer or Owner privileges can create Dataproc clusters in the workspace. Those with Reader privileges can’t create clusters, though they will be able to duplicate (clone) the workspace and create clusters in the cloned workspace.

Users cannot see the details of each other’s clusters or access others’ notebook servers.

As described above, two controlled bucket resources are automatically created for you on first creation of a Dataproc cluster. Their resource names are dataproc-staging-<workspace-name> and dataproc-temp-<workspace-name>, where <workspace-name> is replaced with the name of your workspace. Those buckets are used by default for all Dataproc clusters in the workspace (unless a user selects a different bucket).

Examples

Some example notebooks are available here:

Last Modified: 21 May 2024