Data collections in Workbench
Prior reading: Data resources overview
Purpose: This document explains the purpose of data collections in Verily Workbench and how you can create, publish, and manage them via the Workbench user interface.
What is a data collection?
Data collections represent multimodal datasets that you can publish to Verily Workbench’s data catalog so that users can reference the data in their workspaces. Collections are curated by data owners who ensure data quality, reproducibility, and associated lineage. Many collections have policies attached that determine how the data may be accessed and used. You can reference all or part of any collection you have access to in your Workbench workspace.
A data collection is a grouping of cloud-based resources related to a specific project, study, or purpose (e.g., public datasets like 1000 Genomes or TCGA, or data sources and studies internal to an organization). Researchers can browse and add data collections to their workspaces.
Data collections can…
- Be curated and set up by data stewards or researchers who are responsible for the governance/usage of the data
- Be referenced across multiple workspaces
- Consist of more than just “data”; they can contain notebook and text files, images, etc.
- Be associated with policies that govern the usage of data
Users can add resources from a data collection to their own workspaces through the UI, from within the context of a workspace. See this page for information on how to add an existing data collection to your workspace.
At times you may also want to create your own data collections. This document describes how to do that.
Why create a data collection?
If you have a collection of data (tables and files) that you would like to package and share across a large number of users, then creating a data collection could be the right way to address your use case.
Packaging your data as a data collection has a few benefits:
- Allow users to easily discover, browse, and import your data:
  - Allow users to work with multiple data collections in one workspace, in a policy-compliant way, facilitating cross-analysis
- Define policies to associate with your data, and be assured that workspaces and users who reference subsets of your data must comply
- Centrally manage how your reference data appears in Workbench, regardless of how many workspaces (and clones) reference it:
  - Define a new “version” of your data collection, and inform users in all workspaces that reference your data
- Manage the “discovery” of your data collections, ranging from widely discoverable to highly private; allow users to get a summary of what a data collection offers, without granting full access permissions
How to create a data collection and manage its versions
Note
Data collections have one or more versions associated with them. An end user can select the version from which they want to import resources.
Create and manage a data collection using the Workbench UI
If you like, you can manage most steps in the lifecycle of a data collection yourself using the Workbench UI. This includes creating and publishing new versions of the data collection. If you’d prefer using the Workbench CLI, please see Creating a data collection with the Workbench CLI.
Step 1: Create a data collection
First, you need to create a data collection.
In the Workbench web UI, click the Data Collection icon in the left-hand menu bar. This page lists all data collections for which you’re a Writer or Owner, and which you can therefore manage and modify.
Click the green button in the top right corner labeled + New data collection to open the data collection creation dialog. (Note: This button will be disabled if you’re not placed in a pod. See Setting up pods and billing for more details.) The dialog will take you through four easy steps.
In the first step, “Enter collection details,” the data collection name is the only field that requires your input; everything else is either optional or prefilled. However, we encourage you to enter the publisher name and email address that you would like researchers to see when they find your data collection in the catalog. Typically this would be a subject matter expert who can answer questions about the schema, use cases, and access controls around the data.
In step 2, you’ll have the option to add group, region, and perimeter policies to your data collection. A group policy limits access to your data collection to members of a group. A region policy allows researchers using your data collection to create resources only within specific regions. A perimeter policy allows data to be accessed only within a particular perimeter.
Note
Once these policies are added, they cannot be subsequently removed from your collection.
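The three policy kinds and the add-only rule in the note above can be pictured with a small model. This is only an illustrative sketch: the class names, the `value` field, and the example region string are hypothetical and are not part of any Workbench API.

```python
from dataclasses import dataclass, field
from enum import Enum


class PolicyKind(Enum):
    GROUP = "group"          # limits who can access the collection
    REGION = "region"        # limits where resources may be created
    PERIMETER = "perimeter"  # limits where data may be accessed


@dataclass(frozen=True)
class Policy:
    kind: PolicyKind
    value: str  # hypothetical: a group, region, or perimeter name


@dataclass
class CollectionPolicies:
    """Illustrative add-only policy set; not a Workbench API."""
    _policies: set = field(default_factory=set)

    def add(self, policy: Policy) -> None:
        self._policies.add(policy)

    def remove(self, policy: Policy) -> None:
        # Mirrors the note above: a policy cannot be removed once added.
        raise PermissionError("policies cannot be removed once added")

    def allowed_regions(self) -> set:
        return {p.value for p in self._policies if p.kind is PolicyKind.REGION}


policies = CollectionPolicies()
policies.add(Policy(PolicyKind.REGION, "us-central1"))  # example value only
print(policies.allowed_regions())
```

Because policies can only accumulate in this model, it may help to decide on them before you create the collection.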
In step 3, you provide information for the first version of your data collection. Remember that data collections have one or more versions associated with them, and an end user can select the version from which they want to import resources. When you create a new data collection, the UI creates your first version automatically. You’re only required to enter a version name. You can optionally enter a URL that links the user to release notes describing the changes and updates in your data collection.
Note
At any given time, you can only have one version in draft form.
Tip
Follow a naming convention for your data collection versions that ensures that each name is unique and is easily understood by the researchers who will be exploring your data collections (e.g., <data collection name> <date of data release>).
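As a minimal sketch of the naming convention in the tip above, the helper below joins a collection name and a release date and rejects duplicates. The function names and the choice of ISO date format are assumptions made for illustration; Workbench does not require any particular format.

```python
from datetime import date
from typing import Iterable


def version_name(collection_name: str, release_date: date) -> str:
    """Build '<data collection name> <date of data release>'."""
    # ISO 8601 dates (YYYY-MM-DD) are an assumed choice; they sort
    # chronologically and read unambiguously across locales.
    return f"{collection_name.strip()} {release_date.isoformat()}"


def ensure_unique(name: str, existing_names: Iterable[str]) -> str:
    """Return the name unchanged, or raise if it is already in use."""
    if name in set(existing_names):
        raise ValueError(f"version name already used: {name!r}")
    return name


existing = ["1000 Genomes 2024-01-15"]
name = ensure_unique(version_name("1000 Genomes", date(2024, 10, 21)), existing)
print(name)  # 1000 Genomes 2024-10-21
```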
To finish creating your data collection and its first version, click the Create data collection button on the third screen. It should take less than a minute for the system to create your data collection. Once it’s done, your browser will load your new data collection’s overview page.
Step 2: Add and organize resources in the data collection version
You can add controlled and referenced resources to your data collection version. These resources can be organized in folders.
Click on the Versions tab on the Data collections page, then click the + New resource button. From the dropdown, click New folder.
A New folder details dialog will open. Here, you can enter a folder name, select a folder path, and provide a description (optional). Click Create folder.
You should now see your new folder in the Versions tab. From here, you can edit the folder name and description, move the folder, or delete the folder.
To add a resource, click the + New resource button and select the resource type you’d like to add. In the example below, we’ll add a Cloud Storage bucket.
You’ll be prompted to add a resource ID and an optional description. The bucket name will be prefilled based on your resource ID, but you can change it if you wish. You can also select the folder path for the resource. Click Create bucket.
Once the bucket is created, it will be listed in the Versions tab. You can click on the bucket name to view additional details such as the gsutil URI and description. In addition, you can browse the bucket’s contents; add a file to the bucket via URL; open the bucket in GCP; and move, edit, or delete the bucket.
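The gsutil URI shown in the bucket details follows the standard Cloud Storage form `gs://<bucket name>/<object path>`. The helper below is a small sketch of that format; the function name and the example bucket and object names are hypothetical.

```python
def gs_uri(bucket_name: str, object_path: str = "") -> str:
    """Format a Cloud Storage URI for a bucket or an object inside it."""
    path = object_path.lstrip("/")  # avoid a doubled slash after the bucket
    return f"gs://{bucket_name}/{path}" if path else f"gs://{bucket_name}"


print(gs_uri("my-collection-bucket"))
print(gs_uri("my-collection-bucket", "cohort/samples.csv"))
```

A URI built this way is the kind of value you would pass to tools that read from Cloud Storage once the bucket has been added to a workspace.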
Step 3: Share your data collection with collaborators
You can invite collaborators to edit your data collection. Click the Share button in the upper right corner of any Data collections page. A dialog will prompt you to enter the email address of the user you want to share your data collection with and select Writer or Owner permissions.
Note
Be sure to only invite collaborators you trust, since they will have editing rights in the data collection.
After you invite a collaborator, you can change their access level or remove them as a collaborator by using the dropdown options. Click the Share button in the upper right corner of any Data collections page to view collaborators.
Step 4: Manage researcher access to your data collection
To make your data collection visible to other researchers, you can grant Reader or Discover access. Readers can view resources and add them to their workspaces, while Discover users can only see collection metadata.
Click on the Access tab on the Data collections page. Here you can see a list of people who can read and discover your data collection. You can also change a researcher’s access level via the dropdown or remove their access by clicking the trash can button.
To grant access, click the Grant access button. A dialog will prompt you to enter the email address of the user you want to grant access to. You’ll see an error if the email address doesn’t exist or if it doesn’t belong to a registered Workbench user. As a reminder, if you added a group policy to your data collection, only users in this group will see the data collection.
Select whether the user should be granted Reader or Discover access. Click Apply.
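The difference between Reader and Discover access, and the effect of a group policy, can be summarized in a short model. This is a sketch under stated assumptions: the function names and the way group membership is represented are hypothetical, and the real checks are performed by Workbench itself.

```python
from enum import Enum
from typing import Optional, Set


class Access(Enum):
    DISCOVER = "Discover"  # can only see collection metadata
    READER = "Reader"      # can view resources and add them to workspaces


def can_see_collection(access: Optional[Access],
                       user_groups: Set[str],
                       group_policy: Optional[str]) -> bool:
    """True if the user can at least see the collection's metadata."""
    if access is None:
        return False
    # With a group policy attached, only members of that group see the collection.
    if group_policy is not None and group_policy not in user_groups:
        return False
    return True


def can_add_resources(access: Optional[Access],
                      user_groups: Set[str],
                      group_policy: Optional[str]) -> bool:
    """True if the user can add the collection's resources to a workspace."""
    return (can_see_collection(access, user_groups, group_policy)
            and access is Access.READER)


print(can_add_resources(Access.READER, {"study-team"}, "study-team"))    # True
print(can_add_resources(Access.DISCOVER, {"study-team"}, "study-team"))  # False
print(can_see_collection(Access.READER, set(), "study-team"))            # False
```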
Step 5: Publish your data collection
Once your data collection is ready for others to access, you can publish it.
Click the Versions tab on the Data collections page, then click the Publish button. A dialog will prompt you to review the details of the data collection version you wish to publish. You can edit the version details and confirm the policies and resources associated with the version. If everything looks OK, tick the I’m ready box and click Publish version. Users will now be able to view the published versions of the data collection in the data catalog from their workspace.
Note
Users will still technically be able to view draft versions through the CLI. See this page for more details.
You’ll see a message saying that the version was published. You’ll also be able to click the + New version button to create a new draft data collection version.
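The version lifecycle described in this document (a first draft created with the collection, at most one draft at a time, and a new draft only after publishing) can be sketched as a small state model. The classes and method names below are hypothetical and only illustrate the rules; they are not a Workbench API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    name: str
    published: bool = False


@dataclass
class DataCollection:
    """Illustrative model of draft and published versions."""
    name: str
    versions: List[Version] = field(default_factory=list)

    def draft(self) -> Optional[Version]:
        # Return the current draft version, if there is one.
        return next((v for v in self.versions if not v.published), None)

    def new_version(self, version_name: str) -> Version:
        if self.draft() is not None:
            raise RuntimeError("only one version can be in draft form at a time")
        if any(v.name == version_name for v in self.versions):
            raise ValueError("version names should be unique")
        version = Version(version_name)
        self.versions.append(version)
        return version

    def publish_draft(self) -> Version:
        current = self.draft()
        if current is None:
            raise RuntimeError("there is no draft version to publish")
        current.published = True
        return current


collection = DataCollection("Example study")
collection.new_version("Example study 2024-10-21")  # first draft
collection.publish_draft()                           # draft becomes published
collection.new_version("Example study 2025-01-15")  # a new draft is now allowed
print([(v.name, v.published) for v in collection.versions])
```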
Edit your data collection
To edit your data collection’s settings, click the Edit button.
You can update the data collection’s name, summary, ID, and description. You can add policies to further limit your collection’s visibility; this could impact existing collaborators and users who have access to your data collection. You can also change the resource region; changes will apply to new resources and environments.
Delete your data collection
To delete your data collection, expand the three-dot menu and click Delete.
A dialog will appear asking you to confirm deletion. All controlled resources and cloud environments will be deleted, and everyone with access to the data collection will be affected.
Warning
All draft and published versions associated with the data collection will be deleted. Any resources that researchers may have added to any of that data collection’s versions will remain in their workspaces, but the links to those resources will be broken.
Last Modified: 21 October 2024