Getting started with Verily Workbench
Purpose: This document walks you through how to use basic features of Verily Workbench, including creating a workspace, loading data and analyzing it in a notebook.
The goal of this walkthrough is to show you what you need to know to achieve the following through the Verily Workbench web UI:
- assemble a workspace on your own;
- bring in some data;
- access the data from a notebook running in a cloud environment;
- perform some basic data manipulation and analysis operations;
- save and share your results.
Note: This walkthrough is intentionally very concise. To learn more about each step and additional options that may be available, see the documentation referenced at each step.
This walkthrough assumes that you have already signed up for an account and that you have a billing project that you can use to perform operations that incur costs, such as creating workspaces, creating and running cloud environments, and storing data on the cloud. If that is not the case, please contact support.
It is helpful, though not required, to be familiar with basic JupyterLab notebook usage and R code.
1. Create a workspace
First, you need to create a workspace.
In the Workbench web UI, navigate to the page that lists your workspaces. The simplest way to do this is to click on the Workspaces icon in the left-hand menu bar.
Click the green button in the top right corner labeled + New workspace to open the workspace creation dialog.
The workspace name is the only field that requires your input; everything else is either optional or prefilled.
Once you’ve filled out the three (very short) screens, click the Create workspace button on the third screen. It should take less than a minute for the system to create your workspace. Once it’s done, your browser will load your new workspace’s overview page.
2. Add storage resources to the workspace
2.1 Connect to a storage bucket containing data
First, you’re going to connect an external cloud storage bucket to your workspace, which will make it easy to access data in that bucket from your workspace. This bucket will be treated as a referenced resource.
To do so, click on the Resources tab on your workspace page. You’ll see three green buttons in the top right corner of that tab; click on the one labeled + Cloud resource to open the menu of options, and select Reference Cloud Storage bucket to open the appropriate resource addition dialog.
Make sure to give your resource a memorable, descriptive name; the resource name is what will show up in your workspace’s list of resources.
Enter the name of the storage bucket you want to connect to in the Bucket name field; do not include the gs:// prefix. This should be an existing bucket that you have access to. For testing purposes, you can use genomics-public-data, a public storage bucket that includes data from the 1000 Genomes Project.
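If you usually copy bucket locations as full gs:// URIs, the small Python sketch below shows one way to reduce a URI to the bare name that the Bucket name field expects. The function name bucket_name_from_uri is a hypothetical helper made up for this illustration; it is not part of Workbench.

```python
def bucket_name_from_uri(uri: str) -> str:
    """Return the bare bucket name from a gs:// URI or a plain bucket name.

    Hypothetical helper for illustration only; not part of Workbench.
    """
    name = uri.strip()
    # Remove the scheme prefix if it is present.
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    # Keep only the bucket part, dropping any object path or trailing slash.
    return name.split("/", 1)[0]


print(bucket_name_from_uri("gs://genomics-public-data"))  # prints: genomics-public-data
```

A name that already has no prefix passes through unchanged, so the helper is safe to apply to either form.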
Once you’ve filled out the form, click the Add to resources button. You should see the new resource appear immediately in the list of resources. You can test that you can access the bucket by clicking on the white Browse button in the information panel on the right of the Resources tab.
Note: You can also create a referenced resource to an object or folder in a Cloud Storage bucket, instead of pointing to the whole bucket. We’ll see that in the example notebook referenced below.
2.2 Create a storage bucket for outputs
Now you have a workspace with data connected to it, but nowhere to store results of your analyses. Next, you’re going to create a bucket that will be attached to the workspace itself, where you can store your outputs, as well as any other files, inputs and so on that you might wish to associate with the workspace. This bucket will be treated as a controlled resource.
From the Resources tab of your workspace, click on + Cloud resource again but this time, select New Cloud Storage bucket to open the relevant dialog.
The dialog should look familiar at this point; it’s similar to the previous one, except this time you’re creating a bucket instead of pointing to an existing one.
When you create the bucket, you will specify a resource name, which must be unique within the workspace; in Workbench, you can refer to the bucket by this resource name. (The example notebook below uses a bucket with a resource name of ws_files; you can create it now if you like.)
By default the system will generate an underlying bucket name (URI) for you, but you can specify a name you prefer. Note that you will NOT be able to change the name of your bucket later, and it has to be unique across all of Google Cloud.
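Because the bucket name cannot be changed later, it can be worth sanity-checking a custom name before you submit the form. The Python sketch below checks a simplified subset of Google Cloud Storage's bucket naming rules: 3 to 63 characters; lowercase letters, digits, dashes, underscores and dots only; a letter or digit at each end; no "goog" prefix and no "google" in the name. The helper is_plausible_bucket_name is hypothetical. It does not cover the extra rules for names containing dots, and it cannot tell you whether the name is already taken; only Google Cloud can confirm uniqueness when the bucket is created.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, '.', '_' and '-',
# starting and ending with a letter or digit.
_BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Check a name against a simplified subset of the bucket naming rules.

    Hypothetical helper for illustration; passing this check does not
    guarantee that the name is available or accepted.
    """
    if not _BUCKET_NAME_PATTERN.fullmatch(name):
        return False
    # Names may not begin with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True


print(is_plausible_bucket_name("my-lab-results-2023"))  # prints: True
print(is_plausible_bucket_name("My_Bucket"))            # prints: False
```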
Once you’ve filled out the form, click the Create bucket button. You should see the new resource appear immediately in the list of resources. Again, you can test that you can access the bucket by clicking on the white Browse button on the right. Since it’s brand new, it should be empty.
3. Create a reference to a GitHub repository
Click on the workspace Environments tab, and in the Git repositories card, add this public repo:
https://github.com/DataBiosphere/terra-axon-examples.git. Give it the name:
This repo will be automatically cloned to any cloud environments that you create, which you’ll do next.
4. Create a cloud environment
You’re almost ready to start working; you just need to get the actual computer up and running! In this step, you’re going to create and launch a cloud environment that will consist of a virtual machine (VM) with some preinstalled software and a local storage drive associated with it, called a persistent disk.
To do so, click on the Environments tab on your workspace page, then click on the green + New cloud environment button to open the menu of options; select the JupyterLab option labeled Vertex AI Workbench instance.
Once again, be sure to give your environment a memorable name. Feel free to customize the ID or accept the one generated automatically by the system.
On the second screen of the cloud environment creation dialog, you can accept the default configuration or customize it to suit your needs. For the purposes of this walkthrough, leave the R (Latest) option selected, since the analysis example below will use R code.
If you need more compute power, you can increase the number of CPUs to allocate; the memory allocated will scale accordingly. You can also attach GPUs to your environment (if compatible with the selected environment image). Keep in mind that the running cost of your environment will scale with the computing resources allocated to it.
Once you’re satisfied with the configuration, click the Create environment button. You should see a card for the new resource appear in the list of cloud environments, with a status indicator. It may take a few minutes for the system to get your cloud environment ready, depending on what you’ve requested.
5. Open a Jupyter Notebook
When your environment is ready to use, the status indicator will turn green, with the label RUNNING. Everything is now ready for you to get to work.
Click on the name of your environment to open it; this will open a JupyterLab session in a new browser window, displaying the JupyterLab Launcher.
5.1 Open a new notebook
For the purposes of this walkthrough, we’re going to use example code written in R, so click on the R logo in the list of Notebook options. This will create a new R notebook file stored on your environment’s persistent disk, and open it for editing.
Start by renaming your new notebook to something meaningful. Right-click (or control-click) on the file name (which should be Untitled.ipynb) to open the contextual menu and select the Rename option.
You’ll find the file in the $HOME directory, which is displayed by default in the JupyterLab file explorer (left side panel). The file explorer allows you to access and organize any files stored on the persistent disk itself, and to access resources that are mounted to your environment.
Now that you know how to create a new notebook, you won’t actually use it for the example below. Instead, you’ll open a pre-existing R notebook.
5.2 Open an existing notebook from an examples repo
Next, open an existing notebook from the examples repo you added to the workspace in Step 3. This repo, terra-axon-examples, was automatically cloned to the JupyterLab server when you created your cloud environment. You will find it under /repos in the file explorer (or /home/jupyter/repos in the Terminal).
In the file navigator, click into 1kgenomes_examples, then click on the R_1kgenomes.ipynb file to open it.
Note: If you did not add the GitHub repository previously, you can manually run git clone https://github.com/DataBiosphere/terra-axon-examples.git to add the repo to your notebook server.
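If you are unsure where to look for the notebook, the Python sketch below derives the expected location from the repo URL. It relies on the /home/jupyter/repos directory mentioned above and on Git's default behavior of naming the clone directory after the last URL component without its .git suffix. The helper expected_clone_path is hypothetical. Note that a manual git clone places the repo in whatever directory you run the command from, so the path may differ in that case.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Directory where the walkthrough says automatically cloned repos appear.
REPOS_DIR = PurePosixPath("/home/jupyter/repos")


def expected_clone_path(repo_url: str, repos_dir: PurePosixPath = REPOS_DIR) -> PurePosixPath:
    """Return the directory that a default `git clone` of repo_url would create in repos_dir.

    Hypothetical helper for illustration only.
    """
    repo_dir_name = PurePosixPath(urlparse(repo_url).path).name
    # Git drops a trailing ".git" when choosing the directory name.
    if repo_dir_name.endswith(".git"):
        repo_dir_name = repo_dir_name[: -len(".git")]
    return repos_dir / repo_dir_name


repo = expected_clone_path("https://github.com/DataBiosphere/terra-axon-examples.git")
print(repo / "1kgenomes_examples" / "R_1kgenomes.ipynb")
# prints: /home/jupyter/repos/terra-axon-examples/1kgenomes_examples/R_1kgenomes.ipynb
```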
6. Run an example notebook
This example R notebook, R_1kgenomes.ipynb, walks through a principal component analysis (PCA) of genomic variant data across one chromosome from 2,504 people from the 1000 Genomes Project.
(The notebook is adapted from http://bwlewis.github.io/1000_genomes_examples/PCA.html).
In addition to its interesting computational example, the notebook walks through several Workbench concepts:
- Creating bucket resources using the Workbench CLI
- Accessing data from automounted Cloud Storage bucket resources (see Accessing workspace files and folders > Bucket automounting)
- Saving results locally, and to Cloud Storage via mounted resources
7. Turn down the cloud environment
Once you’re done working and you’re sure you saved everything appropriately, you can close the JupyterLab browser window and return to the browser window that had your workspace open to the Environments tab. If you closed it, that’s okay; just navigate back to it now. There is one more thing to do before you can sign out for the day: turn down your cloud environment.
To stop a cloud environment that is currently running, select Stop in the environment card. This will immediately send the instruction to stop the environment; there is no confirmation step. However, there may be a lag of a few seconds before the status is updated in the graphical user interface.
Make sure you don’t forget this step, because the cloud provider will continue to charge you as long as your environment is running, even if it’s not doing anything!
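To get a feel for what a forgotten environment might cost, you can multiply its hourly rate by the time it stays up. The Python sketch below does exactly that. The 0.20 USD per hour rate is a made-up placeholder, not a Workbench or Google Cloud price; substitute the rate that applies to your own configuration. The estimate also ignores disk storage and any other charges.

```python
def running_cost_usd(hourly_rate_usd: float, hours_running: float) -> float:
    """Estimate the compute charge for an environment left running, rounded to cents.

    Hypothetical helper for illustration; real billing may include other items.
    """
    if hourly_rate_usd < 0 or hours_running < 0:
        raise ValueError("rate and hours must be non-negative")
    return round(hourly_rate_usd * hours_running, 2)


# Placeholder rate, assumed for this example only.
HYPOTHETICAL_RATE_USD_PER_HOUR = 0.20

# Environment left running from Friday 18:00 to Monday 10:00, which is 64 hours.
print(running_cost_usd(HYPOTHETICAL_RATE_USD_PER_HOUR, 64))  # prints: 12.8
```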
When you are ready to resume work, click Start in the environment card. It may take a few minutes for a cloud environment to restart after being stopped.
8. Delete the cloud environment
If you are done with your cloud environment, you can Delete it from the action menu of the environment card.
Deletion of the cloud environment deletes its underlying disk as well. So, before you do this, make sure that you’ve preserved your work. You can do this by writing data and notebook files to your workspace bucket; this was shown in the example notebook. You can also commit work back to a GitHub repo, or download files to your local machine.
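As one way to preserve notebooks before deleting the environment, the Python sketch below copies matching files from a local directory into a destination directory, which could be a folder inside a bucket that is mounted in your environment. The helper back_up_files and the mount path in the usage note are assumptions made for illustration; check where your workspace bucket is actually mounted before relying on them.

```python
import shutil
from pathlib import Path


def back_up_files(source_dir, dest_dir, patterns=("*.ipynb",)):
    """Copy files in source_dir that match any pattern into dest_dir.

    Hypothetical helper for illustration. Only the top level of source_dir
    is searched, and existing files with the same name are overwritten.
    Returns the list of copied file names.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(source.glob(pattern)):
            if path.is_file():
                # copyfile copies contents only, without file metadata.
                shutil.copyfile(path, dest / path.name)
                copied.append(path.name)
    return copied
```

For example, assuming the ws_files bucket is mounted at a path such as /home/jupyter/workspace/ws_files (an assumption, not confirmed by this walkthrough), you could call back_up_files("/home/jupyter", "/home/jupyter/workspace/ws_files/backups") and then confirm in the Browse view of the bucket that the files arrived.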
Last Modified: 16 November 2023