Creating data collections

How to create Verily Workbench data collections

What is a data collection?

Data collections are diverse datasets, available from Verily Workbench’s data catalog for use as referenced data in your own workspace. Collections are curated by data stewards who ensure data quality, reproducibility, and associated lineage. Many collections have policies attached that determine how the data may be accessed and used. You can reference all or part of any collection you have access to in your Workbench workspace.

A data collection is a grouping of cloud-based resources related to a specific project, study or purpose (e.g. public datasets like 1000 Genomes or TCGA, or data sources and studies internal to an organization). Researchers can browse and add data collections to their workspaces.

Data collections can…

  • Be curated and set up by data stewards or researchers who are responsible for the governance / usage of the data
  • Be referenced across multiple workspaces
  • Consist of more than just “data”; they can contain notebook and text files, images, etc.
  • Be associated with policies that govern the usage of data

Users can add resources from a data collection to their own workspaces through the UI, from within a workspace. See this page for information on how to add an existing data collection to your workspace.

At times you may also want to create your own data collections. This document describes how to do that.

Why create a data collection?

If you have a collection of data (tables and files) that you would like to package and share across a large number of users, then creating a data collection could be the right way to address your use case.

Packaging your data as a data collection has a few benefits:

  • Allow users to easily discover, browse, and import your data:
    • Consolidate resources across destinations in the cloud into one data collection, via a schema of your preference
    • Provide more flexible ways for users to access your resources, whether they are starting in a new or existing workspace
    • Support users in easily browsing through the list of resources available in a data collection, and select only the ones most relevant to them
    • Allow users to work with multiple data collections in one workspace, facilitating cross-analysis
  • Ensure that policies are always associated with and are ‘sticky’ for the resources in your data collection:
    • Ensure that workspaces that reference your resources, no matter who has set them up, comply with the policies you have defined around your data
    • Whether a user is adding your entire data collection, or just one resource, be assured that the policies you have defined are propagating to each and every resource
    • Allow users to analyze data from multiple data collections, with the comfort that the policies on each collection are appropriately merging and being adhered to
  • Centrally manage how your reference data appears in Workbench, regardless of all the workspaces (and clones) that reference it:
    • Define a new ‘version’ of your data collection, and inform users in all workspaces that reference your data
    • Depending on your permissions, work with us to set up a dashboard to view reports on all ‘activity’ around your data in Workbench: which workspaces have referenced it, which resources are imported most often, etc.
    • Manage the ‘discovery’ of your data collections, ranging from widely discoverable to highly private; allow users to get a summary of what a data collection offers, without granting full access permissions

How to create a data collection and manage its versions

Note: Data collection creation and management will be supported in the Workbench UI in ~Q4 2023. Until then, use the CLI-based process described below.

Data collections are workspace-based. To create a data collection, you first create a workspace, then convert it to a data collection. Once your data collection has been published, other users can then access the data collection via the “Data Catalog”, as described here, and do not see the underlying workspace.

Tip: If you would like to also create a demo workspace to share with your users, it’s best to first create a data collection, then create workspace(s) that reference your data collection.

Data collections have one or more versions associated with them. An end-user can select the version from which they want to import resources. Data collection versions are based on workspace Folders, as created under the “Resources” tab: to create a data collection version, you populate and publish its corresponding folder.

In future, you will be able to perform all operations related to creating data collections via the Verily Workbench web UI. Currently, you can either:

  • ask the Workbench support team to help you build a data collection, or
  • create the data collection yourself using the Workbench CLI

Create and manage a data collection yourself

If you like, you can manage most steps in the lifecycle of a data collection yourself, using the Workbench CLI. This includes creating and publishing new versions of the data collection.

The exception (described below) is optionally adding a region policy to the data collection. The Workbench support team can help with that step if it’s relevant to your use case; this only needs to be done once.

Step 0: Install or access the Workbench CLI

You will need to use the Workbench CLI to define and update a data collection. You can install it locally, or create a cloud environment, where it will be pre-installed for you.

Step 1: Create the workspace that will underlie the data collection

Data collections are based on workspaces. So, the first step in creating a data collection is to create the workspace that will underlie the data collection. It’s best to use a new, not an existing, workspace for this.

The workspace name is the data collection name that users will see.

Step 2: Create a ‘version1’ folder in the data collection workspace

Under the Resources tab in the Workbench web UI, create a top-level Folder with the name of the first version of your data collection. You can name it what you like; however, this is the name that other users will see for the version, so it is best to name it something intuitive (e.g. version1).

You can add resources to this version folder now, or do it later.

Creating a Resources folder for the first version of a data collection.

Step 3: Set the property that turns a workspace into a data collection

Via the Workbench CLI, run the following command, first setting WORKSPACE_ID to the ID of your data collection workspace. You can find this ID in the Workspace details on its Overview page.

terra workspace set-property --workspace=${WORKSPACE_ID} --properties="terra-type=data-collection"

The workspace will not be treated as a data collection until you publish its first version, which we’ll do next.
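As a minimal sketch, Step 3 could be wrapped in a small shell function that refuses to run with an empty workspace ID. The function name `mark_as_data_collection`, the `TERRA_CMD` dry-run switch, and the placeholder ID are illustrative assumptions and not part of the Workbench CLI; only the `terra workspace set-property` command itself comes from this page.

```shell
# Dry run by default: the command is printed, not executed.
# Set TERRA_CMD=terra in your environment to run it for real.
TERRA_CMD="${TERRA_CMD:-echo terra}"

# mark_as_data_collection is a hypothetical helper name.
mark_as_data_collection() {
  workspace_id="$1"
  # Refuse to continue with an empty ID so the flag is never passed blank.
  if [ -z "$workspace_id" ]; then
    echo "error: workspace ID is required" >&2
    return 1
  fi
  $TERRA_CMD workspace set-property \
    --workspace="$workspace_id" \
    --properties="terra-type=data-collection"
}

# Placeholder ID; replace it with the ID from your workspace Overview page.
mark_as_data_collection "my-collection-workspace"
```

In dry-run mode the function only echoes the command, which makes it easy to check the arguments before touching a real workspace.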

Step 4: Publish the first version of the data collection

Publish the folder that you created as the first version of your data collection. Any resources that you add to the folder, either now or later, will appear as part of that version of the data collection. Publishing a version makes its resources accessible to users in the UI. Note, however, that unpublished versions and the resources in them remain accessible to users via the API: all data access is managed at the data collection level, so every resource in a data collection (published or not) is accessible.

Note that for any referenced resource you add that points to an external (non-workspace-controlled) resource, you must ensure that your end-users have appropriate access to that resource.

  1. Copy and run the command to set the workspace, which you can find on the overview page of your data collection workspace:

    terra workspace set --id=<YOUR_WORKSPACE_ID>
    
  2. Then, run:

    terra folder tree
    

    This will show the folder structure for your data collection workspace. Copy the UUID that is associated with the name you assigned to the ‘version’ folder (e.g. version1).

  3. Next, run:

    terra folder set-property --properties=terra-published-date=${DATE} --id=${VERSION_FOLDER_ID}
    

    Replace ${DATE} with the current date in format yyyy-mm-dd, e.g. 2023-09-20, and replace ${VERSION_FOLDER_ID} with the UUID you copied in the step above.

Any resources later added to this folder will be included in that version of the data collection.
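The publishing steps above can be sketched as a shell helper that checks the date format before calling the CLI. The helper names `is_iso_date` and `publish_version`, the `TERRA_CMD` dry-run switch, and the all-zero UUID are hypothetical; the check only verifies the yyyy-mm-dd shape, not that the date is a real calendar date.

```shell
# Dry run by default: the command is printed, not executed.
# Set TERRA_CMD=terra in your environment to run it for real.
TERRA_CMD="${TERRA_CMD:-echo terra}"

# is_iso_date (hypothetical): succeeds only for strings shaped like yyyy-mm-dd.
is_iso_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# publish_version (hypothetical): sets terra-published-date on a version folder.
publish_version() {
  version_folder_id="$1"
  # Default to today's date when no date is given.
  published_date="${2:-$(date +%Y-%m-%d)}"
  if [ -z "$version_folder_id" ]; then
    echo "error: version folder UUID is required" >&2
    return 1
  fi
  if ! is_iso_date "$published_date"; then
    echo "error: date must be yyyy-mm-dd, got '$published_date'" >&2
    return 1
  fi
  $TERRA_CMD folder set-property \
    --properties=terra-published-date="$published_date" \
    --id="$version_folder_id"
}

# Placeholder UUID; use the one shown by `terra folder tree`.
publish_version "00000000-0000-0000-0000-000000000000" "2023-09-20"
```

Omitting the second argument publishes with the current date, which matches the instruction to use the current date for `${DATE}`.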

Optional: add “release notes”

If you want, you can add a “release notes” URL to a data collection version. To do this, set the workspace and copy the version folder’s UUID as described above. Then, run the following command, replacing ${VERSION_FOLDER_ID} with the UUID you copied, and replacing ${URL} with a full URL, including the scheme (e.g. https://www.google.com).

terra folder set-property --properties=terra-release-notes-url=${URL} --id=${VERSION_FOLDER_ID}

If you do so, the user will see a clickable link to view the notes in the version information:

Adding release notes to a data collection version.

Optional: add “Organization Name”

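You can also add your organization name to a data collection; users will see it when viewing the data collection in the UI. Run the following command, replacing ${WORKSPACE_ID} with the ID of your data collection workspace and ${ORG_NAME} with your organization name.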

terra workspace set-property --workspace=${WORKSPACE_ID} --properties="terra-organization-name=${ORG_NAME}"
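The two optional properties above (release notes URL and organization name) can be sketched as shell helpers. The helper names, the `TERRA_CMD` dry-run switch, and all example values are hypothetical. Accepting only `https://` URLs is a conservative assumption based on the example format given on this page; it is not a documented CLI requirement.

```shell
# Dry run by default: the command is printed, not executed.
# Set TERRA_CMD=terra in your environment to run it for real.
TERRA_CMD="${TERRA_CMD:-echo terra}"

# set_release_notes (hypothetical): attaches a release notes URL to a version.
set_release_notes() {
  version_folder_id="$1"
  url="$2"
  # Assumption: only accept URLs that start with https:// and have a host part.
  case "$url" in
    https://?*) ;;
    *) echo "error: expected an https:// URL, got '$url'" >&2; return 1 ;;
  esac
  $TERRA_CMD folder set-property \
    --properties=terra-release-notes-url="$url" \
    --id="$version_folder_id"
}

# set_organization_name (hypothetical). Quoting the whole --properties value
# keeps a name containing spaces together as one shell argument.
set_organization_name() {
  workspace_id="$1"
  org_name="$2"
  $TERRA_CMD workspace set-property \
    --workspace="$workspace_id" \
    --properties="terra-organization-name=$org_name"
}

# Placeholder values; replace them with your own.
set_release_notes "00000000-0000-0000-0000-000000000000" "https://example.com/release-notes"
set_organization_name "my-collection-workspace" "Example Research Institute"
```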

Step 4.5: View your new data collection

In the “Resources” tab for a workspace, visit the “Data Catalog” to see your new data collection as it will appear to other users.

View your newly created data collection from the "Data Catalog" pane.

Step 5: [optional] set a data collection policy

You can set policies to control which groups can access your data collection, and to impose region constraints. Do this before you share your data collection. See below for more detail.

If you would like to set a region constraint policy, we will work with you to do this before you share your data collection.

Step 6: Share the data collection workspace

Now you are ready to share the data collection with other users (or groups of users) as appropriate for your use case. To do this, click “Share” for the data collection’s underlying workspace. For details on how to share a workspace, see the Workspace Operations page. This page describes the privileges for the different access levels. For details on how to define Workbench groups, see this page.

Typically, you would share with “READER” permissions, which means that consumers of the data collection can view the resources but not modify them. (You may want to give others on your team “WRITER” access.)

When an end-user imports resources from your data collection, referenced resources are created in their workspace for any controlled resources in the data collection. That is, none of the resources in the data collection are copied; instead, references to them are created.

Step 7: Publish additional versions of a data collection

At some later point, you may want to publish additional versions of a data collection. The process is the same as that described above:

  • Create a new top-level version folder in the data collection workspace (say, version2)
  • Find the UUID of that new folder (via terra folder tree) and publish the new version via setting the terra-published-date property as described above.

Users who have imported resources from your data collection into a workspace will see a notification in that workspace when a new version is published:

A notification of a new version of a data collection.

The newest version of a data collection (based on its publication date) will be the one shown by default to users in the Workbench web UI, but they can select other older versions.

Selecting a data collection version.
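Because the default version is chosen by publication date, it helps that yyyy-mm-dd dates sort chronologically when sorted as plain text. The sketch below illustrates this with a made-up version list; the list and the `newest_version` helper are hypothetical, and the snippet does not query Workbench or reflect how Workbench implements the ordering.

```shell
# Hypothetical version list: "<folder name> <terra-published-date>" per line.
versions="version1 2023-09-20
version3 2024-02-01
version2 2023-11-16"

# newest_version (hypothetical): prints the folder name whose date sorts last.
newest_version() {
  printf '%s\n' "$1" | sort -t' ' -k2,2 | tail -n 1 | cut -d' ' -f1
}

newest_version "$versions"   # prints: version3
```

A date written in another format, such as 20-09-2023, would not sort this way, which is one practical reason to keep to the yyyy-mm-dd format when publishing.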

If you have set a policy for the data collection, it holds across all versions.

Including a resource in multiple data collection versions

Currently, it is not possible for two resources in the same workspace to have the same name. This means that if you want to include a resource in, say, two data collection versions (so that the same resource resides in two different ‘version’ folders), you will need to create a referenced resource in the second version that points to the resource but gives it a different name.

See this page for more information on creating referenced resources.
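When planning several versions, a quick check for name clashes can be sketched in shell. The planned layout below and the `duplicate_names` helper are hypothetical; the snippet works on a hand-written list and does not read anything from Workbench.

```shell
# Hypothetical planned layout: "<version folder>/<resource name>" per line.
resources="version1/samples_table
version1/readme
version2/samples_table
version2/changelog"

# duplicate_names (hypothetical): prints each resource name used more than once.
duplicate_names() {
  printf '%s\n' "$1" | cut -d/ -f2 | sort | uniq -d
}

duplicate_names "$resources"   # prints: samples_table
```

Any name this prints would need a distinct name in the later version folder, for example samples_table_v2 for the referenced resource in version2.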

Add a data collection policy

You can set policies to control which Workbench groups the data collection can be shared with, and to constrain which region the data may reside in.

Group policies

A group policy restricts workspace and data sharing to users who are members of all selected groups. A group policy does not grant access, but can be used as an additional layer of access control. Like other policy types, a group policy can’t be removed once it’s been applied, and carries over to any duplicates.

You can set the group policy for the data collection’s backing workspace via the Workbench web UI. See this page for more detail.

Region constraint policies

A region constraint policy limits which regions may be used to create cloud resources and environments. For example, if you use data from a collection that has a region constraint policy, your cloud environment and analysis outputs must be kept within the regions specified by the policy. When a region constraint policy is applied to a workspace outside of the prescribed regions, the default resource region must be updated to comply with the policy requirements. You don’t need to migrate data that was in the workspace before the policy was applied, and references to data aren’t affected.

A region constraint only applies to controlled resources. Referenced resources are not constrained. You can add a region constraint policy after you’ve added some resources to your version folder (or more broadly, to the workspace), but note that application of the policy will fail if you have already added resources that violate the constraint.

Currently, we will need to set a data collection’s region constraint policy for you, as its configuration is not user-facing. (This will change in the future.) Once a policy has been set, all subsequent versions of a data collection use that policy, so you can add new versions yourself if you like.

Reach out to workbench-support@verily.com, or your primary Workbench contact, for support in setting a data collection’s region constraint policy. Do this before you share the data collection.

Let us create the data collection and its versions for you

Reach out to workbench-support@verily.com, or your primary Workbench contact, for support in creating a data collection. In future, Workbench will add UI support so data stewards and researchers can easily create data collections themselves.

Removing a data collection or one of its versions

There are multiple ways to remove part or all of a data collection, depending upon your goals:

  • If you delete the terra-type: data-collection property from a data collection’s underlying workspace, the data collection will no longer appear in the Data Catalog. However, for any users who have already imported resources from that data collection, those resources will still be available to them as referenced resources. The lineage information that these users see in their workspaces will still point to the data collection’s underlying workspace.

  • If you delete a version folder from a data collection’s underlying workspace, this version will no longer show up as an option when a user browses the available versions for the data collection.
    Note that if you delete a folder that contains controlled resources, these resources will be deleted as well.

  • If you move a resource out of a version folder, it will no longer be listed in the data collection, and will no longer be imported by users accessing that collection. However, that resource will still be accessible to the user, as noted above.

  • If you delete a resource that is part of a data collection, then users who have imported that resource (as a named reference) will still see the reference listed, but will get an error when they try to access the resource.

  • If you delete the data collection’s underlying workspace, all of its controlled resources will be deleted as well, and the lineage information will display “Unknown workspace”.

Last Modified: 16 November 2023