How to Work with Data
Introduction
You can use a Workspace to collect, organize, and share data underlying your research. A typical workspace includes a mix of data references, pointing to cloud buckets or databases containing source data related to your work, and workspace data, containing file uploads, intermediate results, and outputs.
Add data to your Verily Workbench workspace
Verily Workbench supports adding three types of data to your workspace:
- Data reference – add a reference to a single cloud-based data resource.
- Data collection – import a collection of data resources from the data catalog.
- Workspace data – create a data resource owned by the workspace.
You can use either the web UI or the CLI to add data to your workspace. Most users prefer to use the web UI for adding data, but the CLI can be useful for automating repetitive workspace data setup tasks in advanced scenarios.
Add a data reference
Use a data reference to store a named reference to a cloud-based data source in your workspace. Workbench currently supports Cloud Storage buckets and objects, and BigQuery datasets and tables.
As workspace resources, data references have a name and description, and can be organized within folders in your workspace.
To add a data reference:
- Click the “+ Cloud Resource” button in the Resources tab, and choose the type of data reference you want to add.
- Fill in the dialog, specifying a name, short description, and cloud resource location for the new data reference.
- Click “Add” to save the reference to your workspace.
Depending on the type of data reference, you provide the cloud resource location in a different format:

| Data reference type | Description | Example cloud resource location |
| --- | --- | --- |
| Cloud Storage bucket | A reference to a top-level Google Cloud Storage bucket | Bucket name: `genomics-public-data` |
| Cloud Storage object | A reference to a file or folder within a Google Cloud Storage bucket | Cloud object URL: `gs://genomics-public-data` |
| BigQuery dataset | A reference to a BigQuery dataset | Project ID: `bigquery-public-data`; dataset name: `human_genome_variants` |
| BigQuery table | A reference to a BigQuery table within a dataset | Project ID: `bigquery-public-data`; dataset name: `human_genome_variants`; table name: `1000_genomes_phase_3_variants_20150220` |
Or, using the CLI:

- Locate the Workbench subcommand related to the type of data reference you are trying to create. Run `terra resource add-ref` to see a list of supported commands:
$ terra resource add-ref
Missing required subcommand
Usage: terra resource add-ref [COMMAND]
Add a new referenced resource.
Commands:
bq-dataset Add a referenced BigQuery dataset.
bq-table Add a referenced BigQuery Data Table.
gcs-bucket Add a referenced GCS bucket.
gcs-object Add a referenced GCS bucket object.
git-repo Add a referenced git repository.
- Provide the required fields, a resource name, and optionally a description string. Workbench will create the reference and return a summary of the new resource:
$ terra resource add-ref bq-table --dataset-id=samples --project-id=bigquery-public-data --table-id=github_timeline --name=github_timeline
Successfully added referenced BigQuery data table.
Name: github_timeline
Description:
Type: BQ_TABLE
Stewardship: REFERENCED
Cloning: COPY_REFERENCE
GCP project id: bigquery-public-data
BigQuery dataset id: samples
BigQuery table id: github_timeline
# Rows: 6219749
Share controlled-access data with Workbench users
If you add a reference to a controlled-access data resource in Workbench, you must ensure that the appropriate Workbench user accounts are granted access to the controlled external data. Work with an administrator who can modify the access control settings on the source data as you set up the initial access configuration.
To share an external Google Cloud data resource with Workbench users:
- Work with your data and organization administrators to identify the set of users who should have access to the data. This may include all active researchers in your organization, only researchers participating in a specific study, or a specific set of individuals.
- To grant access to a group of users:
  - Ensure there is a Workbench Group representing the appropriate group of users. This group may be manually managed or synced to your organization’s internal user directory. Find the email address corresponding to the Workbench Group and save it for the next step. Hint: a Workbench Group email should look something like `my-user-group@verily-bvdp.com`.
  - Use the appropriate Google Cloud mechanism to grant read-only or read-write access to the Workbench Group email address found above. Details vary by resource type; see the cloud documentation for BigQuery and Cloud Storage.
- To grant access to a single user:
  - Identify the Workbench “proxy” group that contains the user account and all associated robot accounts. The proxy group for your current user can be found by running the CLI command `terra auth status`. Hint: the proxy group for a user should look something like `PROXY_2631740767397ab04fed6@verily-bvdp.com`.
  - Use the appropriate Google Cloud mechanism to grant read-only or read-write access to the proxy group found above. Details vary by resource type; see the cloud documentation for BigQuery and Cloud Storage.
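As a sketch, granting a Workbench Group read-only access to a Cloud Storage bucket can be done with `gsutil iam ch`. The group and bucket names below are placeholders; your data administrator may prefer the Cloud Console or infrastructure-as-code tooling instead.

```shell
# Grant read-only object access on the source bucket to the Workbench Group.
# "my-user-group@verily-bvdp.com" and "my-source-bucket" are placeholders.
gsutil iam ch group:my-user-group@verily-bvdp.com:objectViewer gs://my-source-bucket

# For read-write access, grant objectAdmin instead:
gsutil iam ch group:my-user-group@verily-bvdp.com:objectAdmin gs://my-source-bucket
```

To grant access to a single user, substitute the proxy group email found via `terra auth status` for the Workbench Group email above.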
For more details on user and group management in Workbench, see [Manage users and groups].
Add a data collection from the Data Catalog
A data collection is a grouping of cloud-based resources related to a well-known dataset or study (e.g. public datasets like 1000 Genomes or TCGA, or data sources and studies internal to your organization). Import a data collection into your workspace to start working with the underlying cloud resources in your analysis.
Data collections may also include policy annotations and associated resources to facilitate data exploration and analysis. Policy annotations are not optional: you can choose which resources to include when you add a data collection to a workspace, but all policy annotations will be added to your workspace policy set.
Import references to resources from a data collection
To add a data collection to your workspace via the web UI, click on the ‘+ Data from catalog’ button in the top right of the Resources section. This will open a resource addition dialog; use it as detailed below.
- Browse the data catalog and select a data collection of interest. You’ll be able to see information about the most recent version of the data collection and when it was published. Click ‘Next’. This opens a dialog showing the contents of the collection.
- After you’ve clicked into the data collection, select the version you’d like to import.
- Select which resources you would like to import from the data collection version. You can expand folders by clicking on the triangle to the left of the folder name. If you do not select all resources in a data collection, you will still have the option of adding them later. Once you have finalized your selection, click ‘Next’. This opens a dialog showing the data policies associated with the resources you have selected.
- Review the policy requirements. Click ‘Next’. This displays a list of selected resources and destination options.
- Review your selection and choose the workspace folder where you want to add the resources. You can select an existing folder from the dropdown menu or create a new folder. Click ‘Add to your workspace’. The selected resources should now appear in your workspace resources view.

You can manage and access these resources as you would any other resource in your workspace.
View the lineage of resources imported from a data collection
You can view the data collection lineage of each resource. This displays provenance information, including a link to the collection of origin as well as the date and time when the resource was added to the workspace.
To view lineage information, click on the resource you want to inspect in the Resources list, then click on the ’lineage’ tab in the information pane on the right.
Create workspace data
You can also create cloud-native data resources within a workspace, also referred to as “controlled resources”. This data is shared with workspace collaborators, and the data lifecycle matches that of the workspace. If the workspace is deleted, this data is also deleted. If the workspace is cloned, this data is also cloned.
To create a workspace data resource:
- Click the “Add” button and choose the resource type (e.g. Cloud Bucket or BigQuery Dataset).
- Fill in the dialog, specifying a name, description, and cloud-specific details of the new resource. Note that many cloud resources have specific naming restrictions or requirements.
- Click “Create” to create the resource. It should appear in the “Resources” tab after completion.
Or, using the CLI:

- Locate the Workbench subcommand related to the type of workspace data resource you are trying to create. Run `terra resource create` to see a list of supported commands:

$ terra resource create
Missing required subcommand
Usage: terra resource create [COMMAND]
Add a new controlled resource.
Commands:
  gcp-notebook  Add a controlled GCP notebook instance resource. For a detailed
                  explanation of some parameters, see
                  https://cloud.google.com/vertex-ai/docs/workbench/reference/rest/v1/projects.locations.instances#Instance.
  bq-dataset    Add a controlled BigQuery dataset.
  gcs-bucket    Add a controlled GCS bucket.
- Provide the required fields, a resource name, and optionally a description string. Workbench will create the resource and return a summary of the new workspace data resource:

$ terra resource create gcs-bucket --name=scratch-data --description="Scratch space for working data."
Successfully added controlled GCS bucket.
Name: scratch-data
Description: Scratch space for working data.
Type: GCS_BUCKET
Stewardship: CONTROLLED
Cloning: COPY_RESOURCE
Access scope: SHARED_ACCESS
Managed by: USER
GCS bucket name: scratch-data-terra-vdevel-clean-pear-3014
Location: US-CENTRAL1
# Objects: 0
The terra CLI allows a wider range of configuration options than the UI does. For example, if you enter just `terra resource create gcs-bucket`, you’ll see usage information that indicates how to set a bucket lifecycle rule, change the resource’s cloning mode (which determines how the resource is handled when its associated workspace is duplicated), and more.
Access and browse data
Workbench provides a variety of features to browse and interact with data in your workspace. Depending on your role in a project, you may be interested in browsing reference data, incorporating data resources into a Jupyter notebook or Nextflow script for analysis, or viewing data files and results. The subsections below will help you get started on these activities.
Locate and download data
All data resources in your workspace have an underlying cloud-based location, such as the `gs://` URLs for Google Cloud Storage buckets. It is often useful to pass these global identifiers on to other cloud-native tools or systems.
To locate a data resource:
- Click on the data resource to show the details pane.
- Look for the “Source” row, which shows the underlying cloud location.

Note: some cloud resources show a link next to the cloud location. Click this link to open the resource in the cloud-native file or database browser, if your workspace policy allows.
Or, using the CLI:

- Use the `terra resource list` command to list resources in your workspace and find the name of the data resource of interest:

$ terra resource list
NAME                             RESOURCE TYPE  STEWARDSHIP TYPE  DESCRIPTION
1000-genomes-example-notebooks   GIT_REPO       REFERENCED        (unset)
bam-folder                       GCS_OBJECT     REFERENCED        (unset)
code                             GIT_REPO       REFERENCED        (unset)
cram-folder                      GCS_OBJECT     REFERENCED        (unset)

- Use the `terra resource resolve` command to print out the underlying cloud location:

$ terra resource resolve --name=bam-folder
gs://genomics-public-data/ftp-trace.ncbi.nih.gov
Using a Jupyter Notebook
- Within a Jupyter notebook, use the shell magic prefix `!` to invoke the Workbench CLI to resolve a data reference.
- You can assign the resolved location to a Python variable and use it later in your analysis, or pass the location to cloud-native tools such as the `gsutil` command to list files within a Google Cloud Storage data reference.
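A minimal sketch, assuming a `bam-folder` reference like the one resolved above (in a notebook, prefix each line with `!`, or run the cell with the `%%bash` magic):

```shell
# Resolve the data reference to its underlying gs:// URL
BAM_FOLDER=$(terra resource resolve --name=bam-folder)

# Pass the resolved location to gsutil to list the files it contains
terra gsutil ls "${BAM_FOLDER}"
```

In a Python cell, `bam_folder = !terra resource resolve --name=bam-folder` captures the resolved location in a variable for later use in your analysis.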
Check data access with the CLI
Since some data references may be controlled-access, it can be helpful to verify that your user account has access to the data required for your analysis. The Workbench CLI `check-access` command provides a simple way to check access.
To use the CLI to check data access:
- List resources in your workspace to find the name of the data resource of interest:
$ terra resource list
NAME RESOURCE TYPE STEWARDSHIP TYPE DESCRIPTION
1000-genomes-example-notebooks GIT_REPO REFERENCED (unset)
bam-folder GCS_OBJECT REFERENCED (unset)
code GIT_REPO REFERENCED (unset)
cram-folder GCS_OBJECT REFERENCED (unset)
- Use the
terra resource check-access
command to verify that your account has access:
$ terra resource check-access --name=bam-folder
User's pet SA in their proxy group (PROXY_2631740767397aa04fec6@verily-bvdp.com) DOES have access to this resource.
Tip
You can combine this CLI command with a small amount of Python code to loop over all resources in a workspace and check access one-by-one:
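A minimal sketch of such a loop (the helper functions below are illustrative, not part of the Workbench CLI, and assume `terra` is on your PATH):

```python
import subprocess

def resource_names(listing: str) -> list[str]:
    """Extract resource names (the first column) from `terra resource list`
    output, skipping the header row and any blank lines."""
    rows = [line for line in listing.splitlines() if line.strip()]
    return [row.split()[0] for row in rows[1:]]

def check_all_access() -> None:
    """Run `terra resource check-access` for every resource in the workspace."""
    listing = subprocess.run(
        ["terra", "resource", "list"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in resource_names(listing):
        # check-access may report an error for resource types it does not
        # support (e.g. git repos); print whichever stream has the message.
        result = subprocess.run(
            ["terra", "resource", "check-access", "--name", name],
            capture_output=True, text=True,
        )
        print(f"{name}: {(result.stdout or result.stderr).strip()}")
```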
Browse a storage bucket
To quickly browse the contents of a cloud storage bucket from a workspace, use the built-in storage browser from the web UI:
- Open the workspace resources tab and navigate to the bucket resource of interest.
- Click on the resource to view the details pane.
- Click the “Browse” button to browse the bucket contents in a new window.
View file details
Click on an individual file or folder to view its details in the browser window. The details pane will show file details such as last modified date, file size, and a download button.
Preview file contents
Certain supported file types, such as .ipynb
notebook files and .csv
tabular data, will show a “Preview” button. Click the button to open a preview of the file.
The following file types are supported for preview (values are the file extensions):
- Images: jpeg, jpg, png, tiff, gif, bmp, svg
- Renderables: md, pdf, html, ipynb, rmb
- Tabular: csv, tsv
- Text: txt, wdl, nf, sh, log, stdout, stderr, script, rc, json
- IGV: bam, bed, bedgraph, bb, bw, birdseye_canary_calls, broadpeak, seg, cbs, sam, vcf, linear, logistic, assoc, qassoc, gwas, gct, cram
Browse BigQuery data
Workbench does not have a built-in browser for Google BigQuery data. If your workspace policy allows it, you can follow a link to Google’s native BigQuery data browser:
- Click on the BigQuery dataset or table resource to show the details pane.
- Click the “Browse in BigQuery” button to open the dataset or table in Google’s BigQuery data browser.
Save and share data
It is critical to be able to bring processed files and research results from your compute environment (whether that is your laptop, a Jupyter notebook in the cloud, or a compute node running a workflow task) back to the shared storage space in your workspace. The utilities that Workbench provides to locate and download data, demonstrated above, also play a role in making it possible to upload local files to your cloud-native workspace data storage.
Save data to your workspace
When you run a tool or analysis script on your laptop or in a personal compute environment, results are usually stored as private files attached to that device. To archive your results or share them with collaborators, you will need to transfer data back to a shared storage resource in your workspace. A typical Workbench workspace might have a `results` or `shared` cloud storage bucket designed for this purpose, or a database resource for collecting tabular analysis outputs.
Upload a file to cloud storage with the CLI
The Workbench CLI features a `terra gsutil` command that wraps Google’s `gsutil` command-line utility. When this command is invoked, `terra` sets the correct cloud credentials and Google Cloud project ID before passing arguments to the underlying `gsutil` executable.
To upload a file from your laptop or cloud environment to a workspace storage bucket:
- On your local computer, navigate to the directory containing the file you wish to upload.
- Identify the name of the cloud storage resource that will be your destination.
- Use a combination of `terra gsutil cp` and `terra resource resolve` to copy the file to Workbench’s cloud-native storage:

$ terra gsutil cp iris.csv $(terra resource resolve --name=scratch)/
Setting the gcloud project to the workspace project
Updated property [core/project].
Copying file://iris.csv [Content-Type=text/csv]...
/ [0 files][    0.0 B/  3.9 KiB]
/ [1 files][  3.9 KiB/  3.9 KiB]
Operation completed over 1 objects/3.9 KiB.
Restoring the original gcloud project configuration: terra-vdevel-clean-pear-3014
Updated property [core/project].
Load a CSV file into BigQuery with the CLI
The Workbench CLI features a `terra bq` command that wraps Google’s `bq` command-line utility. When this command is invoked, the CLI sets the correct cloud credentials and Google Cloud project ID before passing arguments to the underlying `bq` executable. For example, to load a local CSV file into a table in a workspace BigQuery dataset:
$ terra bq load --source_format=CSV --autodetect $(terra resource resolve --name=results_dataset).iris_data iris.csv
Setting the gcloud project to the workspace project
Updated property [core/project].
Upload complete.
Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (0s) Current status: RUNNING
Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (1s) Current status: RUNNING
Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (1s) Current status: DONE
Restoring the original gcloud project configuration: terra-vdevel-clean-pear-3014
Updated property [core/project].
Share a link to output data
To share your output, use the Workbench web UI to find a stable URL linking to a file within a workspace data resource:
- Open the workspace “Resources” tab and locate the data resource containing your file of interest. Click the resource to view the details pane.
- Click the “Browse” button to open Workbench’s storage browsing window.
- Navigate to the file of interest and click on it to view the file details pane.
- In your browser, select the current URL and copy it.
- Share the URL with a collaborator who has access to the same workspace. This link should open the Workbench bucket browser to the same file location.
Last Modified: 16 November 2023