Data resources overview
Purpose: This document provides a high-level understanding of data resources that can be used in Verily Workbench, and how to make them available for analysis in the context of a workspace.
What are data resources?
Resources comprise a variety of entities whose chief purpose is to facilitate analysis. In many cases, resources are simply multimodal data that can be managed within the workspace, but they aren’t limited to data exclusively. Inside a workspace, the “Resources” tab is where the data resources associated with that workspace are found.
Types of data resources
At the highest level, we make a distinction between object-based data resources that are or contain files and folders (storage buckets and objects) and tabular data resources (BigQuery datasets and tables).
The following table summarizes the four subtypes of data resources and gives an example of how each is identified (by location or ID strings).
Data resource type | Example location or identifiers |
A top-level storage bucket | Bucket name: “genomics-public-data” |
A file or folder within a storage bucket | Cloud object URL: “gs://genomics-public-data” |
A BigQuery dataset | Project ID: bigquery-public-data; Dataset name: human_genome_variants |
A table within a BigQuery dataset | Project ID: bigquery-public-data; Dataset name: human_genome_variants; Table name: 1000_genomes_phase_3_variants_20150220 |
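To make the identifier formats concrete, here is a minimal Python sketch that splits a Cloud Storage URL into its bucket and object path, and joins BigQuery identifiers into the dotted `project.dataset.table` form. The helper names `parse_gcs_url` and `bigquery_id` are hypothetical and are not part of Workbench or any Google library; the object path in the usage example is invented for illustration.

```python
from typing import Optional, Tuple


def parse_gcs_url(url: str) -> Tuple[str, str]:
    """Split a gs:// URL into (bucket name, object path).

    The object path is an empty string when the URL names only a bucket.
    Hypothetical helper for illustration; not a Workbench API.
    """
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URL: {url!r}")
    bucket, _, object_path = url[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {url!r}")
    return bucket, object_path


def bigquery_id(project: str, dataset: str, table: Optional[str] = None) -> str:
    """Join identifiers into the dotted form: project.dataset[.table]."""
    parts = [project, dataset]
    if table:
        parts.append(table)
    return ".".join(parts)


# Usage with the example identifiers; "some/folder/file.txt" is a made-up path.
bucket_only = parse_gcs_url("gs://genomics-public-data")
bucket_and_path = parse_gcs_url("gs://genomics-public-data/some/folder/file.txt")
dataset_id = bigquery_id("bigquery-public-data", "human_genome_variants")
table_id = bigquery_id(
    "bigquery-public-data",
    "human_genome_variants",
    "1000_genomes_phase_3_variants_20150220",
)
```

A bucket-only URL yields an empty object path, which is one way to tell a bucket-level resource from a file or folder within it.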
Referenced vs. workspace-controlled data resources
All four subtypes of data resources can be further distinguished as either “referenced” or “controlled” as defined below.
Referenced resources, or simply references, represent data and other elements in Verily Workbench by pointing to a source that exists outside of the current workspace. While references are functionally identical to their source, they afford more flexibility and less risk, as anything done to a reference has no effect on its source.
For example, suppose there is a BigQuery dataset you want to work with in Workbench. By creating a reference, you can bring that dataset into the workspace and run analyses and workflows against it. You can safely delete the reference, or make new references in other workspaces, with no effect on the original dataset. There is no limit to the number of references you can create, as long as access to the source is maintained.
Controlled resources are cloud resources that are created or managed by Verily Workbench within the current workspace, such as a Cloud Storage bucket created from within your workspace. To use the same bucket in a different workspace, create a reference to the original controlled resource in that other workspace. In other words, a controlled resource is its own source, and is native to the Workbench workspace it exists within. If the workspace or the controlled resource is deleted, the resource no longer exists.
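The difference in lifecycle can be illustrated with a small conceptual model in Python. This is only a sketch of the behavior described above, not how Workbench is implemented: the `Workspace` class, its methods, and the bucket name `gs://example-bucket` are all hypothetical.

```python
from typing import Dict, Set


class Workspace:
    """Toy model of a workspace; `cloud` stands in for the set of existing cloud resources."""

    def __init__(self, name: str, cloud: Set[str]) -> None:
        self.name = name
        self.cloud = cloud
        self.controlled: Set[str] = set()      # resources this workspace owns
        self.references: Dict[str, str] = {}   # reference name -> source location

    def create_controlled(self, location: str) -> None:
        # A controlled resource is created by, and lives with, this workspace.
        self.cloud.add(location)
        self.controlled.add(location)

    def add_reference(self, ref_name: str, location: str) -> None:
        # A reference only points at a source that already exists elsewhere.
        if location not in self.cloud:
            raise LookupError(f"source does not exist: {location}")
        self.references[ref_name] = location

    def delete_reference(self, ref_name: str) -> None:
        # Removing the pointer leaves the source untouched.
        del self.references[ref_name]

    def delete_controlled(self, location: str) -> None:
        # Deleting a controlled resource removes the underlying resource itself.
        self.controlled.remove(location)
        self.cloud.discard(location)


cloud: Set[str] = set()
owner = Workspace("owner-workspace", cloud)
other = Workspace("other-workspace", cloud)

owner.create_controlled("gs://example-bucket")
other.add_reference("shared-bucket", "gs://example-bucket")

other.delete_reference("shared-bucket")
exists_after_reference_delete = "gs://example-bucket" in cloud

owner.delete_controlled("gs://example-bucket")
exists_after_controlled_delete = "gs://example-bucket" in cloud
```

In this model, deleting the reference in the second workspace leaves the bucket in place, while deleting the controlled resource in its owning workspace removes it.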
Data catalog and collections
The data catalog is an integrated tool within Verily Workbench that streamlines data discovery. Browse data collections curated by data owners, using filters to reduce the time it takes to find data relevant to your study. Export entire or partial collections to your Workbench workspace for use in interactive analysis or workflows. Version tracking and optional updates help all collaborators stay in sync.
Data collections are diverse datasets, available from Verily Workbench’s data catalog for use as referenced data in your own workspace. Collections are curated by data owners who ensure data quality, reproducibility, and associated lineage. Many collections have policies attached that determine how the data may be accessed and used. Collections you have access to may be entirely or partially referenced for use in your Workbench workspace.
To learn more about using the data catalog and browsing data collections, see this page.
Data resource operations
You can perform the following operations on data resources through the web UI:
- List your data resources
- Create a new resource
- Add a reference to an existing resource
- Add a data collection from the data catalog
- Manage your data resources (organize, view, browse contents, etc.)
For instructions on how to perform these operations through the web UI, see this page.
For equivalent CLI instructions, see the Workbench CLI reference.
Last Modified: 16 July 2024