Getting started with Verily Workbench

Getting started with Verily Workbench (web UI).

Purpose: This document walks you through how to use basic features of Verily Workbench, including creating a workspace, loading data and analyzing it in a notebook.



Synopsis

The goal of this walkthrough is to show you what you need to know to achieve the following through the Verily Workbench web UI:

  • assemble a workspace on your own;
  • bring in some data;
  • access the data from a notebook running in a cloud environment;
  • perform some basic data manipulation and analysis operations;
  • save and share your results.

Note: This walkthrough is intentionally very concise. To learn more about each step and additional options that may be available, see the documentation referenced at each step.

Prerequisites

This walkthrough assumes that you have already signed up for an account and that you have a billing project that you can use to perform operations that incur costs, such as creating workspaces, creating and running cloud environments, and storing data on the cloud. If that is not the case, please contact support.

It is helpful, though not required, that you be familiar with basic JupyterLab notebook usage and R code.

Step-by-step

1. Create a workspace

First, you need to create a workspace.

In the Workbench web UI, navigate to the page that lists your workspaces. The simplest way to do this is to click on the Workspaces icon in the left-hand menu bar.

Click the green button in the top right corner labeled + New workspace to open the workspace creation dialog.

The workspace name is the only field that requires your input; everything else is either optional or prefilled.

Once you’ve filled out the three (very short) screens, click the Create workspace button on the third screen. It should take less than a minute for the system to create your workspace. Once it’s done, your browser will load your new workspace’s overview page.

Creating a new workspace.

ℹ️ Workspace operations (web UI) > Create a new workspace

2. Add storage resources to the workspace

The bucket resources you add to a workspace will be automounted to the file system of any workspace cloud environments that you create.

ℹ️ Accessing workspace files and folders > Bucket automounting

2.1 Connect to a storage bucket containing data

First, you’re going to connect an external cloud storage bucket to your workspace, which will make it easy to access data in that bucket from your workspace. This bucket will be treated as a referenced resource.

To do so, click on the Resources tab on your workspace page. You’ll see three green buttons in the top right corner of that tab; click on the one labeled + Cloud resource to open the menu of options, and select Reference Cloud Storage bucket to open the appropriate resource addition dialog.

Make sure to give your resource a memorable, descriptive name; the resource name is what will show up in your workspace’s list of resources.

Enter the name of the storage bucket you want to connect to in the Bucket name field; do not include the gs:// prefix. This should be an existing bucket that you have access to. For testing purposes, you can use genomics-public-data, a public storage bucket that includes data from the 1000 Genomes Project.
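
Bucket locations copied from other tools often arrive as full gs:// URIs. The following is a minimal Python sketch of one way to reduce such a string to the bare name that the Bucket name field expects. The helper name normalize_bucket_name is hypothetical and is not part of Workbench.

```python
def normalize_bucket_name(user_input: str) -> str:
    """Return the bare bucket name from a pasted bucket location.

    Hypothetical helper for illustration only; not a Workbench API.
    """
    name = user_input.strip()
    # Drop the URI scheme, since the form expects the name without gs://.
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    # Keep only the bucket component and discard any object or folder path.
    name = name.split("/", 1)[0]
    if not name:
        raise ValueError("no bucket name found in input")
    return name


print(normalize_bucket_name("gs://genomics-public-data"))
# prints: genomics-public-data
```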

Creating a GCS bucket reference.

Once you’ve filled out the form, click the Add to resources button. You should see the new resource appear immediately in the list of resources. You can test that you can access the bucket by clicking on the white Browse button in the information panel on the right of the Resources tab.

Note: You can also create a referenced resource to an object or folder in a Cloud Storage bucket, instead of pointing to the whole bucket. We’ll see that in the example notebook referenced below.

ℹ️ Data resource operations > Reference a storage bucket

2.2 Create a storage bucket for outputs

Now you have a workspace with data connected to it, but nowhere to store results of your analyses. Next, you’re going to create a bucket that will be attached to the workspace itself, where you can store your outputs, as well as any other files, inputs and so on that you might wish to associate with the workspace. This bucket will be treated as a controlled resource.

From the Resources tab of your workspace, click on + Cloud resource again but this time, select New Cloud Storage bucket to open the relevant dialog.

The dialog should look familiar at this point; it’s similar to the previous one, except this time you’re creating a bucket instead of pointing to an existing one.

Creating a controlled storage bucket.

When you create the bucket, you will specify a resource name, which must be unique within the workspace; in Workbench, you can refer to the bucket by this resource name. (The example notebook below uses a bucket with a resource name of ws_files; you can create it now if you like.)

By default, the system will generate an underlying bucket name (URI) for you, but you can specify a name you prefer. Note that you will NOT be able to change the name of your bucket later, and it has to be unique across all of Google Cloud.
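
Because the bucket name cannot be changed later, it can be worth checking a candidate name before you submit the form. The sketch below covers only a subset of the Cloud Storage naming rules: lowercase letters, digits, dashes and underscores, 3 to 63 characters, starting and ending with a letter or digit, and no "goog" prefix or "google" substring. Names containing dots are allowed by Cloud Storage under additional rules, but this sketch rejects them for simplicity. It cannot tell you whether a name is already taken. The function name is hypothetical.

```python
import re

# Dotless bucket names only: 3-63 characters, lowercase letters, digits,
# dashes and underscores, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Partial, local check of a candidate bucket name (illustrative only)."""
    # Cloud Storage reserves the "goog" prefix and names containing "google".
    if name.startswith("goog") or "google" in name:
        return False
    return _BUCKET_NAME_RE.fullmatch(name) is not None


for candidate in ["my-lab-outputs-2024", "My_Bucket", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by the service, for example because another project already uses it.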

Creating a controlled storage bucket. Here, the resource will be added under the "experimental data" folder.

Once you’ve filled out the form, click the Create bucket button. You should see the new resource appear immediately in the list of resources. Again, you can test that you can access the bucket by clicking on the white Browse button on the right. Since it’s brand new, it should be empty.

ℹ️ Data resource operations > Create a storage bucket

3. Create a reference to a GitHub repository

Click on the workspace Environments tab, and in the Git repositories card, add this public repo: https://github.com/DataBiosphere/terra-axon-examples.git. Give it the name: terra-axon-examples.

Adding a GitHub repo to a workspace.

This repo will be automatically cloned to any cloud environments that you create, which you’ll do next.

4. Create a cloud environment

You’re almost ready to start working; you just need to get the actual computer up and running! In this step, you’re going to create and launch a cloud environment that will consist of a virtual machine (VM) with some preinstalled software and a local storage drive associated with it, called a persistent disk.

Creating a new cloud environment.

To do so, click on the Environments tab on your workspace page, then click on the green + New cloud environment button to open the menu of options; select the JupyterLab option labeled Vertex AI Workbench instance.

Selecting a cloud environment option.

Once again, be sure to give your environment a memorable name. Feel free to customize the ID or accept the one generated automatically by the system.

On the second screen of the cloud environment creation dialog, you can accept the default configuration or customize it to suit your needs. For the purposes of this walkthrough, leave the R (Latest) option selected, since the analysis example below will use R code.

If you need more compute power, you can increase the number of CPUs to allocate; the memory allocated will scale accordingly. You can also attach GPUs to your environment (if compatible with the selected environment image). Keep in mind that the running cost of your environment will scale with the computing resources allocated to it.

Once you’re satisfied with the confirguration, click the Create environment button. You should see a card for the new resource appear in the list of cloud environments, with a status indicator. It may take a few minutes for the system to get your cloud environment ready, depending on what you’ve requested.

ℹ️ Cloud environment operations > Create a new cloud environment

5. Open a Jupyter Notebook

When your environment is ready to use, the status indicator will turn green, with the label RUNNING. Everything is now ready for you to get to work.

Click on the name of your environment to open it; this will open a JupyterLab session in a new browser window, displaying the JupyterLab Launcher.

Open the link to a running cloud environment

5.1 Open a new notebook

For the purposes of this walkthrough, we’re going to use example code written in R, so click on the R logo in the list of Notebook options. This will create a new R notebook file stored on your environment’s persistent disk, and open it for editing.

Start by renaming your new notebook to something meaningful. Right-click (or control-click) on the file name (which should be Untitled.ipynb) to open the contextual menu and select the Rename option.

You’ll find the file in the $HOME directory, which is displayed by default in the JupyterLab file explorer (left side panel). The file explorer allows you to access and organize any files stored on the persistent disk itself, and to access resources that are mounted to your environment.
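
Mounted resources can also be inspected from code, since they appear as ordinary directories. The following Python sketch lists the first few entries of a mounted bucket. The mount root ~/workspace is an assumption made for illustration; check the bucket automounting documentation for the actual location in your environment. The resource name ws_files matches the optional bucket from Step 2.2, and the function name is hypothetical.

```python
from pathlib import Path


def list_mounted_entries(mount_root, resource_name, limit=10):
    """Return up to `limit` sorted entry names from a mounted resource.

    `mount_root` is an assumed location, not a documented Workbench path.
    Returns an empty list if the directory does not exist.
    """
    resource_dir = Path(mount_root).expanduser() / resource_name
    if not resource_dir.is_dir():
        return []
    names = sorted(entry.name for entry in resource_dir.iterdir())
    return names[:limit]


# Assumed mount root; prints an empty list if nothing is mounted there.
print(list_mounted_entries("~/workspace", "ws_files"))
```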

Now that you know how to create a new notebook, you won’t actually use it for the example below. Instead, you’ll open a pre-existing R notebook.

5.2 Open an existing notebook from an examples repo

Next, open an existing notebook from the examples repo you added to the workspace in Step 3. This repo, terra-axon-examples, was automatically cloned to the JupyterLab server when you created your cloud environment. You will find it under /repos in the file explorer (or /home/jupyter/repos in the Terminal).

In the file explorer, open terra-axon-examples > 1kgenomes_examples, then click on the R_1kgenomes.ipynb file to open it.

Note: if you did not add the GitHub resource previously, you can manually run git clone https://github.com/DataBiosphere/terra-axon-examples.git to add the repo to your notebook server.
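
If you prefer to script the manual clone, a minimal Python sketch might look like the following. It runs the same git clone command only when the target directory is missing. The function names are hypothetical, and the /home/jupyter/repos parent directory is taken from the Terminal path mentioned above.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/DataBiosphere/terra-axon-examples.git"


def build_clone_command(repo_url, parent_dir):
    """Return the git command and target directory for cloning `repo_url`."""
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    target = Path(parent_dir) / repo_name
    return ["git", "clone", repo_url, str(target)], target


def ensure_repo(repo_url, parent_dir):
    """Clone the repo unless its directory already exists; return the path."""
    command, target = build_clone_command(repo_url, parent_dir)
    if not target.exists():
        # Raises CalledProcessError if git exits with a non-zero status.
        subprocess.run(command, check=True)
    return target


command, _ = build_clone_command(REPO_URL, "/home/jupyter/repos")
print(" ".join(command))
```

Calling ensure_repo(REPO_URL, "/home/jupyter/repos") from a notebook cell would then perform the clone only if it is needed.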

6. Run an example notebook

This example R notebook, R_1kgenomes.ipynb, walks through a principal component analysis (PCA) of genomic variant data across one chromosome from 2,504 people in the 1000 Genomes Project. (The notebook is adapted from http://bwlewis.github.io/1000_genomes_examples/PCA.html.)
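
To give a rough idea of what computing a principal component involves, here is a small, dependency-free Python sketch. It is not the method used in the R notebook: it uses plain power iteration on the covariance matrix of a tiny, made-up matrix in which rows stand for samples and columns for variants. All names and values are illustrative assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the first principal component by power iteration.

    Returns (direction, scores). Illustrative only: real genomic data
    calls for sparse matrices and a truncated SVD routine.
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two samples")
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector; power iteration fails only if the start
    # happens to be orthogonal to the leading eigenvector.
    v = [1.0 + j for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Made-up genotype counts (0, 1 or 2 copies of an allele per variant).
toy_genotypes = [[0, 0, 1], [0, 1, 1], [2, 2, 1], [2, 1, 1]]
direction, scores = first_principal_component(toy_genotypes)
print(direction, scores)
```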

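In addition to its interesting computational example, the notebook demonstrates several Workbench concepts, including creating a referenced resource for an object or folder in a Cloud Storage bucket and writing data and notebook files to your workspace bucket.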

7. Stop the cloud environment

Once you’re done working and you’re sure you saved everything appropriately, you can close the JupyterLab browser window and return to the browser window that had your workspace open to the Environments tab. If you closed it, that’s okay; just navigate back to it now. There is one more thing to do before you can sign out for the day: stop your cloud environment.

To stop a cloud environment that is currently running, select Stop in the environment card. This will immediately send the instruction to stop the environment; there is no confirmation step. However, there may be a lag of a few seconds before the status is updated in the graphical user interface.

Stop a running cloud environment

Make sure you don’t forget this step, because the cloud provider will continue to charge you as long as your environment is running, even if it’s not doing anything!

ℹ️ Cloud environment operations > Stop cloud environment

When you are ready to resume work, click Start in the environment card. It may take a few minutes for a cloud environment to restart after being stopped.

Start a stopped cloud environment.

8. Delete the cloud environment

If you are done with your cloud environment, you can Delete it from the action menu of the environment card.

Deletion of the cloud environment deletes its underlying disk as well. So, before you do this, make sure that you’ve preserved your work. You can do this by writing data and notebook files to your workspace bucket; this was shown in the example notebook. You can also commit work back to a GitHub repo, or download files to your local machine.
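
As one way to preserve notebooks before deleting the environment, the Python sketch below copies matching files from a source directory into a destination directory. It assumes that the workspace bucket is mounted in the file system with write access, and that you pass its mounted path as the destination; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def copy_notebooks(source_dir, dest_dir, pattern="*.ipynb"):
    """Copy files matching `pattern` from source_dir to dest_dir.

    Returns the sorted names of the files copied. Only the top level of
    source_dir is searched, and existing files in dest_dir are overwritten.
    """
    source = Path(source_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob(pattern)):
        if path.is_file():
            # copyfile transfers content only, with no metadata or permissions.
            shutil.copyfile(path, dest / path.name)
            copied.append(path.name)
    return copied
```

For example, copy_notebooks("~", "<mounted path of ws_files>/notebooks") would copy the notebooks in your home directory into a notebooks folder of that bucket, where the mounted path is whatever location your environment uses.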

Last Modified: 16 April 2024