Data Collections in Workbench

How to create and manage a data collection via the Workbench UI

What is a data collection?

Data collections represent multimodal datasets that you can publish to Verily Workbench’s data catalog, so users can reference these data in workspaces. Collections are curated by data owners who ensure data quality, reproducibility, and associated lineage. Many collections will have policies attached that determine how the data may be accessed and used. You can reference all or part of any collection you have access to in your Workbench workspace.

A data collection is a grouping of cloud-based resources related to a specific project, study, or purpose (e.g., public datasets like 1000 Genomes or TCGA, or data sources and studies internal to an organization). Researchers can browse and add data collections to their workspaces.

Data collections can…

  • Be curated and set up by data stewards or researchers who are responsible for the governance/usage of the data
  • Be referenced across multiple workspaces
  • Consist of more than just “data”; they can contain notebook and text files, images, etc.
  • Be associated with policies that govern the usage of data

Users can add resources from a data collection to their own workspaces through the UI, from within the context of a workspace. See this page for information on how to add an existing data collection to your workspace.

At times you may also want to create your own data collections. This document describes how to do that.

Why create a data collection?

If you have a collection of data (tables and files) that you would like to package and share across a large number of users, then creating a data collection could be the right way to address your use case.

Packaging your data as a data collection has a few benefits:

  • Allow users to easily discover, browse, and import your data:
    • Consolidate resources across destinations in the cloud into one data collection, via a schema of your preference
    • Support users in easily browsing through the list of resources available in a data collection, and selecting only the ones most relevant to them
  • Allow users to work with multiple data collections in one workspace, in a policy-compliant way, facilitating cross-analysis
  • Define policies to associate with your data, and be assured that workspaces and users who reference subsets of your data must comply
  • Centrally manage how your reference data appears in Workbench, regardless of all the workspaces (and clones) that reference it:
    • Define a new “version” of your data collection, and inform users in all workspaces that reference your data
    • Manage the “discovery” of your data collections, ranging from widely discoverable to highly private; allow users to get a summary of what a data collection offers, without granting full access permissions

How to create a data collection and manage its versions

Note: Data collections have one or more versions associated with them. An end user can select the version from which they want to import resources.

Create and manage a data collection using the Workbench UI

If you like, you can manage most steps in the lifecycle of a data collection yourself using the Workbench UI. This includes creating and publishing new versions of the data collection. If you’d prefer using the Workbench CLI, please see Creating a data collection with the Workbench CLI.

Step 1: Create a data collection

First, you need to create a data collection.

In the Workbench web UI, click on the Data Collection icon in the left-hand menu bar. This page lists all of the data collections for which you’re a Writer or Owner, and which you can therefore manage and modify.

Screenshot of Data collections page.
Main Data collections page.

Click the green button in the top right corner labeled + New data collection to open the data collection creation dialog. (Note: This button will be disabled if you’re not placed in a pod. See Setting up pods and billing for more details.) The dialog will take you through three easy steps.

In the first step called “Enter collection details,” the data collection name is the only field that requires your input; everything else is either optional or prefilled. However, we encourage you to enter a publisher name and email address that you would like researchers to see when they find your data collection in the catalog. Typically this would be the name of a subject matter expert who can answer questions regarding the schema, use cases, and access controls around the data.

Screenshot of Enter collection details dialog, the first step when creating a data collection.
Enter details for your new data collection.

On step 2, you’ll have the option to add group, region, and perimeter policies to your data collection. A group policy limits access to your data collection. A region policy allows researchers using your data collection to create resources only within specific regions. A perimeter policy allows data to be accessed only within a particular perimeter.

Note: Once these policies are added, they cannot be subsequently removed from your collection.

Screenshot of Set policies dialog, the second step when creating a data collection.
Set group, region, and perimeter policies for your new data collection.
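
The add-only behavior described in the note above can be pictured with a small sketch. This is a hypothetical in-memory model written for illustration only; the class name, method names, and example values are assumptions and are not part of the Workbench UI, CLI, or any Workbench API.

```python
class CollectionPolicies:
    """Hypothetical model of a data collection's policies (illustration only)."""

    # The three policy kinds named in the text above.
    ALLOWED_KINDS = {"group", "region", "perimeter"}

    def __init__(self):
        self._policies = set()

    def add(self, kind, value):
        """Attach a policy; adding the same policy twice has no extra effect."""
        if kind not in self.ALLOWED_KINDS:
            raise ValueError(f"unknown policy kind: {kind}")
        self._policies.add((kind, value))

    def remove(self, kind, value):
        """Mirror the note above: a policy cannot be removed once added."""
        raise PermissionError("policies cannot be removed once added")

    def listing(self):
        """Return the attached policies as a sorted list of (kind, value) pairs."""
        return sorted(self._policies)


policies = CollectionPolicies()
policies.add("region", "us-central1")        # example value, chosen arbitrarily
policies.add("group", "my-research-group")   # hypothetical group name
print(policies.listing())
```

Because removal is never possible in this model, it can help to decide on the full set of policies before adding any of them.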

On step 3, you’re asked to give information for the first version of your data collection. Remember that data collections have one or more versions associated with them. An end user can select the version from which they want to import resources. When you create a new data collection, your first version is created automatically by the UI. You’re only required to enter a version name. You can optionally enter a URL that links the user to release notes describing the changes and updates in your data collection.

Note: At any given time, you can only have one version in draft form.

Screenshot of Provide version details dialog, the last step when creating a data collection.
Enter details about your data collection's first version.
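
As a rough mental model of the versioning rules above (a first version created with the collection, and at most one draft at any time), consider the following sketch. It is not Workbench code: the class, its methods, and the "draft" and "published" status strings are assumptions made for illustration, and it assumes the automatically created first version starts out as a draft.

```python
class DataCollection:
    """Hypothetical model of a collection's version list (illustration only)."""

    def __init__(self, name, first_version):
        self.name = name
        # Assumption: the automatically created first version starts as a draft.
        self.versions = {first_version: "draft"}

    def draft(self):
        """Return the name of the current draft version, or None if there is none."""
        for version_name, status in self.versions.items():
            if status == "draft":
                return version_name
        return None

    def new_version(self, version_name):
        """Start a new draft; refused while another draft already exists."""
        if self.draft() is not None:
            raise RuntimeError("only one version can be in draft form at a time")
        if version_name in self.versions:
            raise ValueError(f"version already exists: {version_name}")
        self.versions[version_name] = "draft"

    def publish(self):
        """Publish the current draft and return its name."""
        current = self.draft()
        if current is None:
            raise RuntimeError("there is no draft version to publish")
        self.versions[current] = "published"
        return current


collection = DataCollection("demo-collection", "v1")
collection.publish()          # v1 becomes published
collection.new_version("v2")  # allowed now that no draft exists
print(collection.versions)
```

In this model, a second draft can only be started after the existing draft has been published.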

To finish creating your data collection and its first version, click the Create data collection button on the third screen. It should take less than a minute for the system to create your data collection. Once it’s done, your browser will load your new data collection’s overview page.

Screenshot of the Overview page of a newly created data collection.
Overview page of a data collection.

Step 2: Add and organize resources in the data collection version

You can add controlled and referenced resources to your data collection version. These resources can be organized in folders.

Click on the Versions tab on the Data collections page, then click the + New resource button. From the dropdown, click New folder.

Screenshot of the Versions page of a data collection, with the '+ New resource' and 'New Folder' buttons highlighted.
Adding a new folder to a data collection.

A New folder details dialog will open. Here, you can enter a folder name, select a folder path, and provide a description (optional). Click Create folder.

Screenshot of the New folder details dialog where users can add a folder name and description.
Add details about your new folder.

You should now see your new folder in the Versions tab. From here, you can edit the folder name and description, move the folder, or delete the folder.

Screenshot of Versions tab of a data collection, showing details of a newly created folder.
View details of your new folder and available actions.

To add a resource, click the + New resource button and select the resource type you’d like to add. In the example below, we’ll add a Cloud Storage bucket.

You’ll be prompted to add a resource ID and an optional description. The bucket name will be prefilled based on your resource ID, but you can change it if you wish. You can also select the folder path for the resource. Click Create bucket.

Screenshot of the Creating Cloud Storage bucket dialog.
Add details about your new Cloud Storage bucket.

Once the bucket is created, it will be listed in the Versions tab. You can click on the bucket name to view additional details such as the gsutil URI and description. In addition, you can browse the bucket’s contents; add a file to the bucket via URL; open the bucket in GCP; and move, edit, or delete the bucket.

Screenshot of Versions tab of a data collection, showing details of a newly created bucket and the location of the Move, Edit, and Delete options.
View bucket details and available actions.
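
If you change the prefilled bucket name, it still has to be a valid Cloud Storage bucket name. The sketch below checks a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit; no "goog" prefix and no "google" substring) and builds the gs:// URI form shown in the bucket details. The function names are hypothetical, the check does not cover every rule (for example, the extra rules for dotted names), and it makes no claim about how Workbench derives the prefilled name from the resource ID.

```python
import re

# Simplified pattern: 3-63 characters, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name):
    """Return True if the name passes a simplified subset of the naming rules."""
    if _BUCKET_NAME.fullmatch(name) is None:
        return False
    # Names may not begin with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True


def gs_uri(bucket, object_path=""):
    """Build a gs:// URI for a bucket, or for an object inside it."""
    if object_path:
        return f"gs://{bucket}/{object_path.lstrip('/')}"
    return f"gs://{bucket}"


print(is_plausible_bucket_name("my-study-bucket"))
print(gs_uri("my-study-bucket", "data/samples.csv"))
```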

Step 3: Share your data collection with collaborators

You can invite collaborators to edit your data collection. Click the Share button in the upper right corner of any Data collections page. A dialog will prompt you to enter the email address of the user you want to share your data collection with and select Writer or Owner permissions.

Screenshot of the Share data collection dialog showing a valid email address and an invalid email address with an error message.
Share data collection dialog. An error appears if the entered email address doesn't exist or doesn't belong to a registered Workbench user.

Note: Be sure to only invite collaborators you trust since they will have editing rights in the data collection.

After you invite a collaborator, you can change their access level or remove them as a collaborator by using the dropdown options. Click the Share button in the upper right corner of any Data collections page to view collaborators.

Screenshot of the Share data collection dialog showing a list of users the data collection's been shared with, their access levels, and the dropdown to change the access level.
Change a collaborator's access level or remove their access.

Step 4: Manage researcher access to your data collection

To make your data collection visible to other researchers, you can grant Reader or Discover access. Readers can view resources and add them to their workspaces, while Discover users can only see collection metadata.
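
The difference between the access levels can be summarized as a small capability table. The sketch below is a hypothetical model for illustration: the capability names are invented, and it assumes that Writers and Owners can do everything a Reader can in addition to editing. This document does not spell out how Writer and Owner differ, so the model gives them the same capabilities.

```python
from enum import Enum


class Role(Enum):
    DISCOVER = "Discover"
    READER = "Reader"
    WRITER = "Writer"
    OWNER = "Owner"


# Hypothetical capability names; only the Reader/Discover split comes from the text.
_READER_CAPS = {"view_metadata", "view_resources", "add_to_workspace"}
_EDITOR_CAPS = _READER_CAPS | {"edit_collection"}  # assumption: editors can also read

CAPABILITIES = {
    Role.DISCOVER: {"view_metadata"},
    Role.READER: _READER_CAPS,
    Role.WRITER: _EDITOR_CAPS,
    Role.OWNER: _EDITOR_CAPS,
}


def can(role, capability):
    """Return True if the given role includes the capability in this model."""
    return capability in CAPABILITIES[role]


print(can(Role.READER, "add_to_workspace"))
print(can(Role.DISCOVER, "view_resources"))
```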

Click on the Access tab on the Data collections page. Here you can see a list of people who can read and discover your data collection. You can also change a researcher’s access level via the dropdown or remove their access by clicking the trash can button.

Screenshot of Access tab of a data collection showing two users listed with Discover and Reader access.
Access overview page showing list of users with Discover or Reader access to your data collection.

To grant access, click the Grant access button. A dialog will prompt you to enter the email address of the user you want to grant access to. You’ll see an error if the email address doesn’t exist or if it doesn’t belong to a registered Workbench user. As a reminder, if you added a group policy to your data collection, only users in this group will see the data collection.

Screenshot of Grant collection access to people or groups dialog, showing email input field and dropdown with Discover and Reader access options.
Add user's email address and select data collection access permission level.

Select whether the user should be granted Reader or Discover access. Click Apply.
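
The reminder about group policies above means that an access grant alone may not be enough for a researcher to see the collection. A minimal sketch of that rule follows, assuming visibility requires both a Reader or Discover grant and, when a group policy exists, membership in the policy's group. The function and parameter names are hypothetical, and the sketch ignores Writers and Owners.

```python
def can_see_collection(user_email, granted_emails, policy_group_members=None):
    """Return True if the user would see the collection in this simplified model.

    granted_emails: emails that were granted Reader or Discover access.
    policy_group_members: members of the group policy's group, or None when
        the collection has no group policy.
    """
    if user_email not in granted_emails:
        return False
    if policy_group_members is None:
        # No group policy: the grant alone is enough.
        return True
    return user_email in policy_group_members


granted = {"ana@example.com", "raj@example.com"}
group = {"ana@example.com"}
print(can_see_collection("ana@example.com", granted, group))
print(can_see_collection("raj@example.com", granted, group))
```

Under these assumptions, a user who was granted access but is not in the policy group would not see the collection until they are added to that group.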

Step 5: Publish your data collection

Once your data collection is ready for others to access, you can publish it.

Click on the Versions tab on the Data collections page, then click on the Publish button. A dialog will prompt you to review the details of the data collection version you wish to publish. You can edit the Version details and confirm the policies and resources associated with the version. If everything looks OK, tick the I’m ready box and click Publish version. Users will now be able to view the data collection’s published versions in the data catalog from their workspace.

Note: Users will still technically be able to view draft versions through the CLI. See this page for more details.

Screenshot of Publishing draft version dialog showing version details and checklist to review prior to publishing a version.
Add version details and confirm policies and resources before publishing the version.

You’ll see a message saying that the version was published. You’ll also be able to click the + New version button to create a new draft data collection version.

Screenshot of Versions tab of a data collection showing a 'published' message and green 'published' status next to version name.
Versions tab showing successfully published data collection version.

Edit your data collection

To edit your data collection’s settings, click the Edit button.

Screenshot of Overview tab of a data collection, with Edit button highlighted.
Edit your data collection settings.

You can update the data collection’s name, summary, ID, and description. You can add policies to further limit your collection’s visibility; this could impact existing collaborators and users who have access to your data collection. You can also change the resource region; changes will apply to new resources and environments.

Screenshot of Edit data collection dialog showing editable fields.
Edit data collection details.

Delete your data collection

To delete your data collection, expand the three-dot menu and click Delete.

Screenshot of Overview tab of a data collection, with Delete option highlighted.
Delete a data collection.

A dialog will appear asking you to confirm deletion. All controlled resources and cloud environments will be deleted, and everyone with access to the data collection will be affected.

Note: All draft and published versions associated with the data collection will be deleted. Any resources that researchers may have added to any of that data collection’s versions will remain in their workspaces, but the links to those resources will be broken.

Screenshot of Delete data collection dialog.
Delete a data collection and all of its contents.

Last Modified: 10 June 2024