Data resource operations

Operations that can be performed on data resources through the Verily Workbench web UI

Prior reading: Data resources overview

Purpose: This document provides detailed instructions for performing operations on data resources through the Verily Workbench web UI.

Note: These instructions all assume that you have already opened a workspace in the Verily Workbench web UI and navigated to the Resources tab.

All of the operations described below can also be performed via the Workbench CLI. See the CLI reference for details.



List your data resources

Your data resources are listed in the Resources tab of the workspace. If your resources are organized in folders, the folders may be displayed as collapsed by default. Click on the triangle to the left of the folder name to expand or collapse the view.

For more information about folders, see Organize resources into folders below. For other operations such as previewing contents and editing resource details, see Manage your data resources.


Create a new controlled resource

You can create empty storage buckets and BigQuery datasets directly from the Verily Workbench web UI. Any resource created in this way will be treated as a controlled resource, meaning that access to the resource is controlled by Workbench. This is in contrast to a referenced resource (see below).

Note that controlled data resources are tightly associated with the workspace where they are created. They are automatically shared with any collaborators who have been granted access to the workspace, and their data lifecycle matches that of the workspace. If the workspace is deleted, its controlled data resources are also deleted. If the workspace is cloned, its controlled data resources are also cloned.

Create a storage bucket

To create a new storage bucket via the web UI, click on the + New resource button in the Resources pane and select New Cloud Storage bucket. This will open a resource creation dialog; fill it out as detailed below.

Screenshot of a workspace's Resources page, with the New Cloud Storage bucket option highlighted.
Creating a controlled storage bucket.
  1. Enter an ID for your new resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.

  2. Use the Folder path dropdown menu to select a folder. You’ll be able to move the bucket to a different folder after creation if desired.

  3. Provide a brief description of the resource. This is optional but highly recommended.

  4. The system will suggest a bucket name, generated automatically based on the resource ID and the Google Project associated with the workspace. The bucket name will be the name of the bucket as listed in Google Cloud (displayed in the Resource details in Workbench). You can modify or replace the suggested bucket name in the creation dialog, but note that the bucket name must be globally unique across all of Google Cloud. You will not be able to edit the bucket name once it has been created (though you may change the resource ID if you like). Click the Create bucket button.

Screenshot of the Creating Cloud Storage bucket dialog, showing the folder path for the newly created bucket.
Creating a controlled storage bucket. Here, the resource will be added under the "experimental data" folder.
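Because the bucket name cannot be edited after creation, it can help to check a candidate name before submitting the dialog. The following is a minimal sketch, not part of Workbench: the function name `bucket_name_problems` is hypothetical, and it checks only a subset of the Cloud Storage naming rules (names containing dots have additional rules that are not modelled). It cannot check global uniqueness, which only Google Cloud can confirm.

```python
import re

# Subset of the Cloud Storage bucket naming rules. Names containing dots
# follow extra rules (longer limits, domain verification) not modelled here.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def bucket_name_problems(name: str) -> list[str]:
    """Return human-readable problems; an empty list means the name looks valid."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("must be 3-63 characters long")
    if not _BUCKET_NAME.fullmatch(name):
        problems.append(
            "use only lowercase letters, digits, '-' and '_', "
            "and start and end with a letter or digit"
        )
    if name.startswith("goog"):
        problems.append("must not start with 'goog'")
    if "google" in name:
        problems.append("must not contain 'google'")
    return problems


# Hypothetical candidate names, for illustration only.
for candidate in ["experimental-data-my-project", "My_Bucket", "goog-data"]:
    print(candidate, "->", bucket_name_problems(candidate) or "looks valid")
```

An empty result means only that the name passes these local checks; creation can still fail if another Google Cloud user already owns the name.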

Create a BigQuery dataset

To create a new BigQuery dataset via the web UI, click on the + New resource button in the Resources pane and select New BigQuery dataset. This will open a resource creation dialog; fill it out as detailed below.

Screenshot of a workspace's Resources page, with the New BigQuery dataset option highlighted.
Creating a controlled BigQuery dataset.
  1. Enter an ID for your new resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.

  2. Use the Folder path dropdown menu to select a folder. You’ll be able to move the dataset to a different folder after creation if desired.

  3. Provide a brief description of the resource. This is optional but highly recommended.

  4. The system will suggest a dataset ID, generated automatically based on the resource ID. The dataset ID will be the name of the dataset as listed in Google Cloud (displayed in the Resource details in Workbench). You can modify or replace the suggested dataset ID in the creation dialog, but note that the dataset ID must be unique within your Google Cloud project (but not across all of Google Cloud). You will not be able to edit the dataset ID once it has been created.

Screenshot of the Creating BigQuery dataset dialog.
Creating a controlled BigQuery dataset. This resource will be stored in the "test data" folder.
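Dataset IDs are more restricted than resource IDs in one notable way: BigQuery dataset names accept letters, numbers, and underscores, but not dashes. The sketch below shows a conservative ASCII-only check and one possible way to derive a dataset ID from a resource ID. Both function names are hypothetical, and the substitution rule is an assumption for illustration; it is not necessarily how Workbench generates its suggestion.

```python
import re

# Conservative check: up to 1,024 ASCII letters, digits, and underscores.
_DATASET_ID = re.compile(r"[A-Za-z0-9_]{1,1024}")


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the ID uses only characters BigQuery accepts."""
    return _DATASET_ID.fullmatch(dataset_id) is not None


def suggest_dataset_id(resource_id: str) -> str:
    """Hypothetical helper: replace unsupported characters with underscores."""
    return re.sub(r"[^A-Za-z0-9_]", "_", resource_id)


print(is_valid_dataset_id("test-data"))   # False: dashes are not allowed
print(suggest_dataset_id("test-data"))    # test_data
```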

Add a reference to an existing resource

You can reference existing storage buckets and files as well as existing BigQuery datasets and tables based on their Google Cloud identifiers. Any resource added in this way will be treated as a referenced resource, meaning that access to the resource is not controlled by Workbench. This is in contrast to a controlled resource (see above).

Note that as a result, access to referenced data resources is not automatically granted to collaborators who have been granted access to the workspace. For information about sharing a referenced data resource with collaborators, visit Access levels and privileges.

Access a resource in the Google Cloud console

If you need the name or ID of a referenced resource, you can access it through the Google Cloud console. To do so, select the resource in the Resources list and click on the Open in GCP button in the information pane on the right. This will open a new tab or window in your web browser.

Reference a storage bucket

To reference an existing storage bucket via the web UI, click on the + New resource button in the Resources pane and select Reference Cloud Storage bucket. This will open a resource addition dialog; fill it out as detailed below.

  1. Enter an ID for your resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.

  2. Use the Folder path dropdown menu to select a folder. You’ll be able to move the bucket to a different folder after creation if desired.

  3. Provide a brief description of the resource. This is optional but highly recommended.

  4. Enter the name of the bucket you want to reference. You can find this information in the Google Cloud console. Do not include the gs:// prefix.

Screenshot of the Adding Cloud Storage bucket dialog
Creating a Cloud Storage bucket reference.

Once created, the resource should look similar to the following:

Screenshot showing details of a Cloud Storage bucket referenced resource.
A Cloud Storage bucket reference.
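Bucket names are often copied from places that include the gs:// prefix. As a minimal sketch, a small helper can normalize such input before it is pasted into the dialog. The function name `normalize_bucket_name` is hypothetical and not part of Workbench.

```python
def normalize_bucket_name(value: str) -> str:
    """Strip an optional gs:// prefix and trailing slash from a bucket name."""
    name = value.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    if "/" in name:
        # An object or folder path was given where a bare bucket name is expected.
        raise ValueError(f"expected a bucket name, got an object path: {value!r}")
    return name


print(normalize_bucket_name("gs://genomics-public-data/"))  # genomics-public-data
```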

Reference a file or a folder in a bucket

To reference an existing file or folder in a storage bucket via the web UI, click on the + New resource button in the Resources pane and select Reference Cloud Storage object. This will open a resource addition dialog; fill it out as detailed above under Reference a storage bucket for steps 1-3, then as detailed below for step 4.

  4. Enter the gs:// URI of the file or folder you want to reference. You can find this information in the Google Cloud console (or in the Details panel for Workbench workspace resources). Click Add to resources.
Screenshot of the Adding Cloud Storage object dialog
Creating a Cloud Storage bucket folder reference.

Once created, the resource should look similar to the following:

Screenshot showing details of a Cloud Storage object referenced resource.
A Cloud Storage bucket folder reference.
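Unlike a bucket reference, an object reference takes the full gs:// URI. The sketch below splits such a URI into its bucket and object path so the parts can be checked before they are entered. The names `GcsObjectRef` and `parse_gs_uri` are hypothetical, and treating a trailing slash as "folder" is a common convention assumed here, not a rule stated by Workbench.

```python
from typing import NamedTuple


class GcsObjectRef(NamedTuple):
    bucket: str
    path: str        # object name or folder prefix inside the bucket
    is_folder: bool  # assumption: a trailing "/" marks a folder


def parse_gs_uri(uri: str) -> GcsObjectRef:
    """Split gs://bucket/path into its components."""
    if not uri.startswith("gs://"):
        raise ValueError("URI must start with gs://")
    bucket, _, path = uri[len("gs://"):].partition("/")
    if not bucket or not path:
        raise ValueError("URI must name both a bucket and an object or folder")
    return GcsObjectRef(bucket, path, path.endswith("/"))


# Hypothetical URIs, for illustration only.
print(parse_gs_uri("gs://my-bucket/reads/sample1.bam"))
print(parse_gs_uri("gs://my-bucket/reads/"))
```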

Reference a BigQuery dataset

To reference an existing BigQuery dataset via the web UI, click on the + New resource button in the Resources pane and select Reference BigQuery dataset. This will open a resource addition dialog; fill it out as detailed below.

  1. Enter an ID for your resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.

  2. Use the Folder path dropdown menu to select a folder. You’ll be able to move the dataset to a different folder after creation if desired.

  3. Provide a brief description of the resource. This is optional but highly recommended.

  4. Enter the ID of the BigQuery dataset you want to reference. You can find this information in the Google Cloud console.

Screenshot of a list of resources with Table ID information highlighted in Google Cloud console.
You can find project, dataset, and table identifiers in the Google Cloud BigQuery console.
  5. Enter the ID of the Google Project associated with the BigQuery dataset you want to reference. You can find this information in the Google Cloud console.
Screenshot of the Adding a BigQuery dataset dialog.
Creating a BigQuery dataset reference.

Reference a BigQuery table

To reference an existing BigQuery table via the web UI, click on the + New resource button in the Resources pane and select Reference BigQuery table. This will open a resource addition dialog; fill it out as detailed above under Reference a BigQuery dataset for steps 1-5, then as detailed below for step 6.

Screenshot of the Adding BigQuery table dialog.
Creating a BigQuery table reference.
  6. Enter the ID of the BigQuery table you want to reference. You can find this information in the Google Cloud BigQuery console. Click Add to resources.

Once created, the table reference details will look like the following:

Screenshot showing details of a BigQuery table referenced resource.
A BigQuery table reference.
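The BigQuery console typically shows a table's identifier as a single dotted string, while the dialog asks for the project, dataset, and table IDs separately. The following sketch splits such a string into the three fields. The names `TableRef` and `parse_table_path` are hypothetical, and the sketch does not handle domain-scoped project IDs that themselves contain dots or colons.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    project_id: str
    dataset_id: str
    table_id: str


def parse_table_path(path: str) -> TableRef:
    """Split 'project.dataset.table' (or 'project:dataset.table') into parts."""
    parts = path.strip().replace(":", ".", 1).split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {path!r}")
    return TableRef(*parts)


# Hypothetical identifier, for illustration only.
ref = parse_table_path("my-project.test_data.pedigree_table")
print(ref.project_id, ref.dataset_id, ref.table_id)
```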

Using Cloud Storage “Managed Folders” with referenced resources

If you have a non-Workbench-managed Cloud Storage bucket that you would like to reference as a Verily Workbench resource, but want to share only certain (sub-)folders of the bucket with other users or groups, you may find Cloud Storage Managed Folders useful.

Managed folders are a type of folder on which you can grant IAM roles, so you have more fine-grained access control over specific groups of objects within a bucket. To use this feature, your bucket must be set to uniform bucket-level access.

When you configure a Managed Folder as a resource, the Workbench UI may display some warning notifications, depending on bucket permissions, but users will still have access to the managed folders via the Cloud console as well as via command-line utilities such as gsutil and gcloud.

To set up managed folder access as a Workbench workspace or data collection resource:

  1. Create a Workbench group whose members are the users (identified by email) and/or other Workbench groups that should have managed folder access. Create a separate Workbench group for each distinct set of users that needs access to one or more managed folders. It is important to use Workbench groups rather than adding users’ account emails directly, because the groups also include the users’ pet service accounts (SAs).

  2. Ensure that the bucket-level permissions are set appropriately for the bucket you want to manage, to disallow access for those users who should not be able to view full bucket contents. As noted above, this bucket must also be set to use uniform bucket-level access.

  3. Select the bucket you want to manage in the Cloud Console by visiting the Cloud Storage panel, then navigate to a folder that you want to set up as a Managed folder.

  4. Follow the Cloud Storage documentation on managed folders to set up permissions for the folder. Briefly, select “Edit Access” from the right-hand “three-dot” menu, and add the Workbench group with the desired permission settings. For example, to allow viewing but not modifying the folder contents, use the “Storage Object Viewer” role.

    Edit access to a Cloud Storage 'Managed Folder'.

    Repeat the process for each bucket subdirectory that you want to set up as a managed folder, giving the appropriate Workbench groups access to the folder.

  5. Create a referenced object resource for the managed folder in your Workbench workspace or data collection, by specifying the path to the managed folder.

    Create a referenced object resource that points to the managed folder.

    Note: Depending upon the bucket permissions of the user adding the referenced resource, the error notification below may be shown.

    Depending upon bucket-level permissions, the user may see an error when adding the referenced resource, but this does not mean that the folder is inaccessible.

    However, after clicking “Add to resources”, the user should see an indication of “Permissions: Granted” if the Managed Folder permissions were set up to give access.

    The 'Permissions: Granted' tag indicates that the user has access to the referenced Managed Folder.
  6. Share the workspace or data collection with the appropriate Workbench groups when ready.

    Note that sharing a workspace or data collection does not in itself affect access to referenced resources, only to controlled resources. Access to the referenced managed folders (and other referenced resources) is determined by the IAM permissions you define in the Google Cloud project that holds the resources, e.g., as described in Step 4.
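The folder-level grant from Step 4 corresponds to an IAM policy binding that pairs a role ID with a list of members. As a rough sketch, the snippet below builds that binding as plain data so the intended access can be reviewed before it is applied in the Cloud Console. The helper `folder_binding` and the group address `study-readers@example.com` are hypothetical; the role IDs are the standard identifiers for the two roles named in this document.

```python
import json

# Role titles used in this document, mapped to their IAM role IDs.
ROLE_IDS = {
    "Storage Object Viewer": "roles/storage.objectViewer",
    "Storage Legacy Bucket Reader": "roles/storage.legacyBucketReader",
}


def folder_binding(role_title: str, group_emails: list[str]) -> dict:
    """Build one IAM binding that grants a role to one or more groups."""
    return {
        "role": ROLE_IDS[role_title],
        "members": sorted(f"group:{email}" for email in group_emails),
    }


# Hypothetical Workbench group address, for illustration only.
policy = {
    "bindings": [
        folder_binding("Storage Object Viewer", ["study-readers@example.com"]),
    ]
}
print(json.dumps(policy, indent=2))
```

Keeping one binding per Workbench group mirrors the advice in Step 1 to use a separate group for each set of users.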

When a user selects the referenced resource for the managed folder, they will be able to view the folder contents by clicking on the “Open in GCP” button and browsing the folder in the Cloud Console. Currently, the Workbench “Browse” button may show the following error if the user has permission to list objects only in certain folders rather than in the whole bucket.

<figure>
  <img src="/images/data_resources/browse_issue.png" alt="Error shown by the Browse button when the user cannot list all bucket objects" width="60%">
  <figcaption><smaller><i>The Browse button may show this error depending upon bucket permissions, but the user can browse the folder by clicking on “Open in GCP”.</i></smaller></figcaption>
</figure>

If you like, you can enable the Workbench “Browse” panel by giving users the “Storage Legacy Bucket Reader” role at the top bucket level (in contrast to setting folder-level access). Click on the “PERMISSIONS” tab for the bucket in the Cloud Console, then select “GRANT ACCESS”.

<figure>
  <img src="/images/data_resources/bucket_perms.png" alt="Granting the 'Storage Legacy Bucket Reader' role from the bucket's PERMISSIONS tab" width="50%">
  <figcaption><smaller><i>Granting the “Storage Legacy Bucket Reader” role at the bucket level enables the Workbench “Browse” panel.</i></smaller></figcaption>
</figure>

This role will allow those users to list all bucket objects, including those in folders to which they don’t have access, but they will not be able to view the contents of any objects outside the folders to which they’ve been given access.
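The difference between listing and reading can be illustrated with a toy model. This is a simplified sketch, not how Google Cloud evaluates IAM: it assumes a user holds “Storage Object Viewer” on some managed folder prefixes and, optionally, “Storage Legacy Bucket Reader” on the bucket. The function names and object names are hypothetical.

```python
def can_list(object_name: str, folder_prefixes: list[str], has_bucket_reader: bool) -> bool:
    """Listing works bucket-wide with the bucket role, else only in granted folders."""
    in_granted_folder = any(object_name.startswith(p) for p in folder_prefixes)
    return has_bucket_reader or in_granted_folder


def can_read(object_name: str, folder_prefixes: list[str]) -> bool:
    """Reading contents always requires a folder-level grant in this model."""
    return any(object_name.startswith(p) for p in folder_prefixes)


# Hypothetical bucket contents and grants, for illustration only.
objects = ["cohort_a/reads.bam", "cohort_b/reads.bam"]
granted = ["cohort_a/"]

for name in objects:
    print(name, "list:", can_list(name, granted, True), "read:", can_read(name, granted))
```

In this model, adding the bucket-level role changes only what can be listed; the set of readable objects stays the same.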


Add a data collection from the data catalog

Import references to resources from a data collection

To add a data collection to your workspace via the web UI, click on the + Data from catalog button in the Resources pane. This will open a resource addition dialog; use it as detailed below.

  1. Browse the data catalog and select a data collection of interest. You’ll be able to see information about the most recent version of the data collection and when it was published. Click Next.

    Screenshot of the Select collection dialog, the first step when adding a data collection from the data catalog.

    This will lead to a dialog showing the contents of the collection.

  2. After you’ve opened the data collection, select the version you’d like to import.

    Screenshot of the Select resource dialog, the second step when adding a data collection from the data catalog. It highlights selecting a specific version.
  3. Select which resources you would like to import from the data collection version. You can expand folders by clicking on the triangle to the left of the folder name. If you do not select all resources in a data collection, you’ll still have the option of adding them later. Once you have finalized your selection, click Next.

    Screenshot of a nested list of resources within a specific data collection version, with certain resources selected for importing.

    This will lead to a dialog showing the data policies associated with the resources you have selected.

  4. Review the policy requirements. Click Next.

    Screenshot of the Review policies dialog, the third step when adding a data collection from the data catalog.

    This will display a list of selected resources and destination options.

  5. Review your selection and choose the workspace folder where you want to add them. You can select an existing folder from the dropdown menu or create a new folder. Click Add to your workspace.

    Screenshot of the Review selection dialog, the last step when adding a data collection from the data catalog.

The selected resources should now appear in your workspace resources view.

Screenshot of nested workspace resources, which includes resources selected for importing from a data collection in the data catalog.

You can manage and access these resources as you would any other resource in your workspace, as described below in Manage your data resources.

View the lineage of resources imported from a data collection

You can view the data collection lineage of each resource. This displays provenance information, including a link to the collection of origin as well as the time or date when the resource was added to the workspace.

To view lineage information, click on the resource you want to inspect in the Resources list, then click on the Lineage tab in the information pane on the right.

Screenshot of a list of resources, with the pedigree-table resource selected and showing its lineage information.
Data lineage for a referenced resource.

Manage your data resources

Organize resources into folders

You can organize your data resources in hierarchical folders.

To create a new folder, click the + New resource button in the Resources tab and select New folder. This will bring up a folder creation dialog.

The following screencast shows creation of a new folder, then creation of a controlled Cloud Storage bucket resource within that folder.

To move a resource or folder to a different folder, select it and click on the Move to button in the information pane on the right. This will bring up a folder organization dialog (which also allows you to create a new folder if needed).

The following screencast shows moving a resource (bucket1) to a new folder, created as part of the Move dialog. When creating a new folder, you have the option of where to place it. In this case we didn’t locate the new folder under the current one, but created it at the top level.

View and edit resource details

You can edit the resource name and description of any of your resources at any time. To do so, select the resource and click on the Edit details button in the information pane on the right. This will bring up the editing dialog.

Note that you cannot edit external identifiers such as bucket path, project ID, dataset ID, or table ID after a resource has been created. If you realize you made a mistake in one of these identifiers when you created or added the resource, you’ll need to delete the erroneous entry and repeat the process of creating or adding that resource to your workspace as described above. For instructions on deleting a resource, see Delete a resource below.

Browse buckets and preview file contents

You can browse the contents of buckets and preview file contents for certain file types directly in Workbench.

To browse the contents of a bucket, select it in the list of resources and click Browse in the information pane on the right. This will bring up a browser pane that you can use to explore the contents of the bucket.

Screenshot showing a list of folders belonging in a genomics-public-data Google Cloud Storage referenced bucket.
Browsing a referenced Cloud Storage bucket.

Note that you can select an object within the bucket browser and add a direct reference to it in your resource list by clicking on Add as reference in the information pane on the right.

Screenshot showing details of a selected file, with the 'Preview' and 'Add as reference' buttons highlighted.
File details while browsing a referenced Cloud Storage bucket.

To preview a file, select the file in the list of resources or in the bucket browser and click on the Preview button in the information pane on the right. This will display a preview of the file contents.

Here’s an example of previewing a bam file:

Screenshot of a preview of a .bam file, with the 'Copy preview link' button highlighted.
Preview of a bam file.

Workbench supports the following file types for preview, as identified by their file extensions:

  • Images: jpeg, jpg, png, tiff, gif, bmp, svg
  • Renderables: md, pdf, html, ipynb, rmd
  • Tabular: csv, tsv
  • Text: txt, wdl, nf, sh, log, stdout, stderr, script, rc, json
  • Bioinformatics-related data formats: bam, bed, bedgraph, bb, bw, birdseye_canary_calls, broadpeak, seg, cbs, sam, vcf, linear, logistic, assoc, qassoc, gwas, gct, cram
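Because preview support is determined by file extension, it is straightforward to check in advance which files in a listing are likely to be previewable. The sketch below uses an abridged transcription of the list above. The function name `preview_category` is hypothetical, and matching extensions case-insensitively is an assumption rather than documented Workbench behavior.

```python
import os

# Abridged transcription of the preview list above; extend as needed.
PREVIEW_TYPES = {
    "Images": {"jpeg", "jpg", "png", "tiff", "gif", "bmp", "svg"},
    "Renderables": {"md", "pdf", "html", "ipynb"},
    "Tabular": {"csv", "tsv"},
    "Text": {"txt", "wdl", "nf", "sh", "log", "stdout", "stderr", "script", "rc", "json"},
    "Bioinformatics": {"bam", "bed", "bedgraph", "bb", "bw", "sam", "vcf", "seg", "gct", "cram"},
}


def preview_category(filename: str):
    """Return the preview category for a file name, or None if unsupported."""
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    for category, extensions in PREVIEW_TYPES.items():
        if ext in extensions:
            return category
    return None


for name in ["reads/sample1.bam", "table.CSV", "notes.docx"]:
    print(name, "->", preview_category(name))
```

Only the final extension is considered, so a compressed file such as a .vcf.gz is reported as unsupported by this sketch.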

Note that you cannot upload files into your buckets through the Workbench web UI. To do so, please use the Google Cloud console, or the gsutil command-line utility.

Delete a resource

When you delete a controlled resource (one managed by your workspace), the underlying data is permanently deleted and cannot be recovered.

In contrast, when you delete a referenced resource, you’re removing only the reference. The resource to which the reference pointed is not affected.

To delete a resource, select it in the list of resources, click the three-dot menu (three vertically stacked dots) to display additional actions, and select Delete.

Screenshot of a bucket's details, with the 'Delete' option highlighted.
Deleting a controlled resource.

This will bring up a dialog that summarizes what will happen upon deletion. To confirm that you want to delete the resource, click the confirmation checkbox and click Delete resource.

Screenshot of the dialog that appears when a user chooses to delete a controlled resource.
Deleting a controlled resource.
Screenshot of the dialog that appears when a user chooses to delete a referenced resource.
Deleting a referenced resource.

Note that deletion of referenced resources and controlled resources has different effects as described above; please make sure that you understand the difference before deleting any resources.
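The distinction can be summarized as a simple rule keyed on how the resource was added to the workspace. The following toy model is only an illustration of that rule: the `Resource` class, its `kind` field, and the resource IDs are hypothetical and do not correspond to a Workbench API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    resource_id: str
    kind: str  # "controlled" or "referenced"


def deletion_destroys_data(resource: Resource) -> bool:
    """Return True if deleting the resource also deletes the underlying data."""
    if resource.kind == "controlled":
        return True   # the bucket or dataset itself is deleted
    if resource.kind == "referenced":
        return False  # only the reference is removed
    raise ValueError(f"unknown resource kind: {resource.kind!r}")


# Hypothetical workspace contents, for illustration only.
resources = [Resource("bucket1", "controlled"), Resource("pedigree-table", "referenced")]
needs_backup = [r.resource_id for r in resources if deletion_destroys_data(r)]
print("Back up before deleting:", needs_backup)
```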


Note on button locations

The resource management operations described above are available through buttons or selector menus located in the information pane that is displayed on the right when a resource is selected.

Screenshot of a workspace's Resources tab, showing a list of resources and a BigQuery dataset's details, with the Move and Delete options highlighted.
Move and Delete in the information pane for a resource.

The exact layout and appearance of the information may vary with the type of resource selected. For example, the information pane displayed for a storage bucket will include a Browse button, while the one displayed for a BigQuery dataset will not.

Last Modified: 28 October 2024