Data resource operations
Prior reading: Data resources overview
Purpose: This document provides detailed instructions for performing operations on data resources through the Verily Workbench web UI.
Note: These instructions all assume that you have already opened a workspace in the Verily Workbench web UI and navigated to the Resources tab.
All of the operations described below can also be performed via the Workbench CLI. See the CLI reference for details.
List your data resources
Your data resources are listed in the Resources tab of the workspace. If your resources are organized in folders, the folders may be displayed as collapsed by default. Click on the triangle to the left of the folder name to expand or collapse the view.
For more information about folders, see Organize resources into folders below. For other operations such as previewing contents and editing resource details, see Manage your data resources.
Create a new controlled resource
You can create empty storage buckets and BigQuery datasets directly from the Verily Workbench web UI. Any resource created in this way will be treated as a controlled resource, meaning that access to the resource is controlled by Workbench. This is in contrast to a referenced resource (see below).
Note that controlled data resources are tightly associated with the workspace where they are created. They are automatically shared with any collaborators who have been granted access to the workspace, and their data lifecycle matches that of the workspace. If the workspace is deleted, its controlled data resources are also deleted. If the workspace is cloned, its controlled data resources are also cloned.
Create a storage bucket
To create a new storage bucket via the web UI, click on the + New resource button in the Resources pane and select New Cloud Storage bucket. This will open a resource creation dialog; fill it out as detailed below.
-
Enter an ID for your new resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.
-
Use the Folder path dropdown menu to select a folder. You’ll be able to move the bucket to a different folder after creation if desired.
-
Provide a brief description of the resource. This is optional but highly recommended.
-
The system will suggest a bucket name, generated automatically based on the resource ID and the Google Project associated with the workspace. The bucket name will be the name of the bucket as listed in Google Cloud (displayed in the Resource details in Workbench). You can modify or replace the suggested bucket name in the creation dialog, but note that the bucket name must be globally unique across all of Google Cloud. You will not be able to edit the bucket name once it has been created (though you may change the resource ID if you like). Click the Create bucket button.
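Because the bucket name cannot be edited after creation, it can help to sanity-check a candidate name before typing it into the dialog. The Python sketch below checks a subset of Google Cloud's documented bucket naming rules (length, allowed characters, and the reserved "goog" prefix and "google" substring). The function name `bucket_name_problems` is hypothetical and is not part of Workbench. Global uniqueness cannot be checked locally; only Google Cloud can confirm it when the bucket is created.

```python
import re


def bucket_name_problems(name: str) -> list[str]:
    """Return the naming problems found; an empty list means none were found.

    Only a subset of the Cloud Storage naming rules is checked. Names that
    contain dots have additional rules that this sketch ignores.
    """
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be between 3 and 63 characters")
    if not re.fullmatch(r"[a-z0-9._-]*", name):
        problems.append("only lowercase letters, digits, '-', '_' and '.' are allowed")
    if name and not (re.fullmatch(r"[a-z0-9]", name[0]) and re.fullmatch(r"[a-z0-9]", name[-1])):
        problems.append("must start and end with a letter or digit")
    if name.startswith("goog"):
        problems.append('must not start with "goog"')
    if "google" in name:
        problems.append('must not contain "google"')
    return problems


if __name__ == "__main__":
    # Hypothetical candidate names, for illustration only.
    for candidate in ["my-project-raw-data", "My_Bucket", "goog-data"]:
        print(candidate, "->", bucket_name_problems(candidate) or "no problems found")
```

An empty result only means the name passed these local checks; the creation dialog may still reject it if the name is already taken.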
Create a BigQuery dataset
To create a new BigQuery dataset via the web UI, click on the + New resource button in the Resources pane and select New BigQuery dataset. This will open a resource creation dialog; fill it out as detailed below.
-
Enter an ID for your new resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.
-
Use the Folder path dropdown menu to select a folder. You’ll be able to move the dataset to a different folder after creation if desired.
-
Provide a brief description of the resource. This is optional but highly recommended.
-
The system will suggest a dataset ID, generated automatically based on the resource ID. The dataset ID will be the name of the dataset as listed in Google Cloud (displayed in the Resource details in Workbench). You can modify or replace the suggested dataset ID in the creation dialog, but note that the dataset ID must be unique within your Google Cloud project (but not across all of Google Cloud). You will not be able to edit the dataset ID once it has been created.
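If you replace the suggested dataset ID with your own, a quick local check can catch characters that BigQuery does not accept. The sketch below is a conservative check that assumes dataset IDs contain only ASCII letters, digits, and underscores, up to 1,024 characters. Both function names are hypothetical, and `to_dataset_id` is only an illustration of deriving an ID from a resource ID; it is not the algorithm Workbench uses for its suggestion.

```python
import re


def is_plausible_dataset_id(dataset_id: str) -> bool:
    """Conservative check: ASCII letters, digits, underscores; 1 to 1024 chars."""
    return re.fullmatch(r"[A-Za-z0-9_]{1,1024}", dataset_id) is not None


def to_dataset_id(resource_id: str) -> str:
    """Illustrative only: replace each disallowed character with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", resource_id)[:1024]


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    print(is_plausible_dataset_id("clinical_data_v2"))
    print(is_plausible_dataset_id("clinical-data"))
    print(to_dataset_id("my-dataset 1"))
```

Uniqueness within the Google Cloud project still has to be confirmed by the creation dialog itself.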
Add a reference to an existing resource
You can reference existing storage buckets and files as well as existing BigQuery datasets and tables based on their Google Cloud identifiers. Any resource added in this way will be treated as a referenced resource, meaning that access to the resource is not controlled by Workbench. This is in contrast to a controlled resource (see above).
Note that as a result, access to referenced data resources is not automatically granted to collaborators who have been granted access to the workspace. For information about sharing a referenced data resource with collaborators, visit Access levels and privileges.
Access a resource in the Google Cloud console
If you need the name or ID of a referenced resource, you can access it through the Google Cloud console. To do so, select the resource in the Resources list and click on the Open in GCP button in the information pane on the right. This will open a new tab or window in your web browser.
Reference a storage bucket
To reference an existing storage bucket via the web UI, click on the + New resource button in the Resources pane and select Reference Cloud Storage bucket. This will open a resource addition dialog; fill it out as detailed below.
-
Enter an ID for your resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.
-
Use the Folder path dropdown menu to select a folder. You’ll be able to move the bucket to a different folder after creation if desired.
-
Provide a brief description of the resource. This is optional but highly recommended.
-
Enter the name of the bucket you want to reference. You can find this information in the Google Cloud console. Do not include the gs:// prefix.
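Bucket names are often copied in URI form, so the prefix is easy to paste by accident. As a minimal sketch, the hypothetical helper below turns either form into the bare bucket name that the dialog expects, and rejects values that point at an object rather than a bucket.

```python
def bucket_name_for_dialog(value: str) -> str:
    """Return the bare bucket name from 'my-bucket' or 'gs://my-bucket/'."""
    prefix = "gs://"
    value = value.strip()
    if value.startswith(prefix):
        value = value[len(prefix):]
    value = value.rstrip("/")
    if not value:
        raise ValueError("no bucket name given")
    if "/" in value:
        raise ValueError("expected a bucket name, but got an object path")
    return value


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(bucket_name_for_dialog("gs://my-bucket/"))
```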
Once created, the resource should look similar to the following:
Reference a file or a folder in a bucket
To reference an existing file or folder in a storage bucket via the web UI, click on the + New resource button in the Resources pane and select Reference Cloud Storage object. This will open a resource addition dialog; fill it out as detailed above under Reference a storage bucket for steps 1-3, then as detailed below for step 4.
- Enter the gs:// URI of the file or folder you want to reference. You can find this information in the Google Cloud console (or in the Details panel for Workbench workspace resources). Click Add to resources.
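A gs:// URI combines a bucket name and an object path. The sketch below splits a URI into those two parts so you can confirm it points where you expect before adding the reference. The names `GcsObjectRef` and `parse_gs_uri` are hypothetical, and treating a trailing slash as "folder" is an assumption based on common Cloud Storage convention, not on Workbench behavior.

```python
from typing import NamedTuple


class GcsObjectRef(NamedTuple):
    bucket: str
    path: str
    is_folder: bool  # assumption: a trailing "/" marks a folder


def parse_gs_uri(uri: str) -> GcsObjectRef:
    """Split 'gs://bucket/path/to/object' into bucket and object path."""
    prefix = "gs://"
    uri = uri.strip()
    if not uri.startswith(prefix):
        raise ValueError("URI must start with gs://")
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError("URI must include both a bucket and an object path")
    return GcsObjectRef(bucket=bucket, path=path, is_folder=path.endswith("/"))


if __name__ == "__main__":
    # Hypothetical URIs, for illustration only.
    print(parse_gs_uri("gs://my-bucket/data/sample.bam"))
    print(parse_gs_uri("gs://my-bucket/data/"))
```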
Once created, the resource should look similar to the following:
Reference a BigQuery dataset
To reference an existing BigQuery dataset via the web UI, click on the + New resource button in the Resources pane and select Reference BigQuery dataset. This will open a resource addition dialog; fill it out as detailed below.
-
Enter an ID for your resource. This will be the ID displayed when you list your resources in Workbench. The resource ID must be unique within the workspace.
-
Use the Folder path dropdown menu to select a folder. You’ll be able to move the dataset to a different folder after creation if desired.
-
Provide a brief description of the resource. This is optional but highly recommended.
-
Enter the ID of the BigQuery dataset you want to reference. You can find this information in the Google Cloud console.
- Enter the ID of the Google Project associated with the BigQuery dataset you want to reference. You can find this information in the Google Cloud console.
Reference a BigQuery table
To reference an existing BigQuery table via the web UI, click on the + New resource button in the Resources pane and select Reference BigQuery table. This will open a resource addition dialog; fill it out as detailed above under Reference a BigQuery dataset for steps 1-5, then as detailed below for step 6.
- Enter the ID of the BigQuery table you want to reference. You can find this information in the Google Cloud BigQuery console. Click Add to resources.
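The dialog asks for the project ID, dataset ID, and table ID separately, while the BigQuery console typically shows them joined as `project.dataset.table`. As a minimal sketch, the hypothetical helper below splits that form into the three values. It assumes the standard dotted form and does not handle project IDs that themselves contain a dot or colon.

```python
def split_table_id(full_table_id: str) -> dict:
    """Split 'project.dataset.table' into the three IDs the dialog asks for."""
    parts = full_table_id.strip().split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected an ID of the form project.dataset.table")
    project_id, dataset_id, table_id = parts
    return {"project_id": project_id, "dataset_id": dataset_id, "table_id": table_id}


if __name__ == "__main__":
    # Hypothetical table ID, for illustration only.
    print(split_table_id("my-project.clinical.samples"))
```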
Once created, the table reference details will look like the following:
Using Cloud Storage “Managed Folders” with referenced resources
If you have a non-Workbench-managed Cloud Storage bucket that you would like to reference as a Verily Workbench resource, but want to share only certain (sub-)folders of the bucket with other users or groups, you may find Cloud Storage Managed Folders useful.
Managed folders are a type of folder on which you can grant IAM roles, so you have more fine-grained access control over specific groups of objects within a bucket. To use this feature, your bucket must be set to uniform bucket-level access.
Note
You can’t make changes to IAM roles for controlled resource buckets. Access is controlled by the associated workspace and its policies, and can’t be modified independently.
When you configure a Managed Folder as a resource, the Workbench UI may display some warning notifications, depending upon bucket permissions, but users will still have access to the managed folders via the Cloud console as well as via command-line utilities like gsutil and gcloud.
To set up managed folder access as a Workbench workspace or data collection resource:
-
Create a Workbench group whose members are the emails of the users, and/or other Workbench groups, for which you want to provide managed folder access. You’ll want a separate Workbench group for each distinct set of users to whom you’ll give managed folder access. Note that it is important to use Workbench groups, instead of directly adding users’ account emails, because the groups include the users’ pet service accounts (SAs) as well.
-
Ensure that the bucket-level permissions are set appropriately for the bucket you want to manage, to disallow access for those users who should not be able to view full bucket contents. As noted above, this bucket must also be set to use uniform bucket-level access.
-
Select the bucket you want to manage in the Cloud Console by visiting the Cloud Storage panel, then navigate to a folder that you want to set up as a Managed folder.
-
Follow the Google Cloud instructions for managed folder access to set up permissions for the folder. Briefly, select “Edit Access” from the right-hand “three-dot” menu, and add the Workbench group with the desired permission settings. For example, to allow viewing but not modifying the folder contents, use the “Storage Object Viewer” role.
Repeat the process for each bucket subdirectory that you want to set up as a managed folder, giving the appropriate Workbench groups access to the folder.
-
Create a referenced object resource for the managed folder in your Workbench workspace or data collection, by specifying the path to the managed folder.
Note: Depending upon the bucket permissions of the user adding the referenced resource, the error notification below may be shown.
However, after clicking “Add to resources”, the user should see an indication of “Permissions: Granted” if the Managed Folder permissions were set up to give access.
-
Share the workspace or data collection with the appropriate Workbench groups when ready.
Note that sharing a workspace or data collection affects access only to controlled resources, not to referenced resources. Access to the referenced managed folders (and other referenced resources) is determined by the IAM permissions you define in the Google Cloud project that holds the resources, e.g. as described in Step 4.
When a user selects the referenced resource for the managed folder, they will be able to view the folder contents by clicking on the “Open in GCP” button, and browsing the folder in the Cloud Console. Currently, the Workbench “Browse” button may show the following error if the user does not have permissions to list all bucket objects, only the objects in certain folders.
<figure>
<img src="/images/data_resources/browse_issue.png" alt="Error shown by the Browse button when the user cannot list all bucket objects" width="60%">
<figcaption><smaller><i>The Browse button may show this error depending upon bucket permissions, but the user can browse the folder by clicking on “Open in GCP”.</i></smaller></figcaption>
</figure>
If you like, you can enable the Workbench “Browse” panel by giving users the “Storage Legacy Bucket Reader” role at the top bucket level (in contrast to setting folder-level access). Click on the “PERMISSIONS” tab for the bucket in the Cloud Console, then select “GRANT ACCESS”.
<figure>
<img src="/images/data_resources/bucket_perms.png" alt="Granting a bucket-level role from the PERMISSIONS tab in the Cloud Console" width="50%">
<figcaption><smaller><i>Use the “PERMISSIONS” tab for the bucket in the Cloud Console to grant the “Storage Legacy Bucket Reader” role.</i></smaller></figcaption>
</figure>
This role will allow those users to list all bucket objects (including those in folders to which they don’t have access), but they will not be able to view the contents of any objects not under the folders to which they’ve been given access.
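The difference between listing and reading can be made concrete with a toy model. The Python sketch below is not the Cloud IAM evaluation algorithm; it only mirrors the behavior described in this section, under the assumption that folder-level grants allow listing and reading under that folder, while a bucket-level listing role allows listing everywhere but grants no read access. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UserBucketAccess:
    """Toy model of one user's access to a single bucket."""

    managed_folders: list = field(default_factory=list)  # folders with a viewer grant
    bucket_level_list: bool = False  # True if a bucket-level listing role was granted

    def _prefixes(self) -> list:
        # Normalize so that "team-a" does not also match "team-ab/...".
        return [f if f.endswith("/") else f + "/" for f in self.managed_folders]

    def can_read(self, object_path: str) -> bool:
        """Reading contents requires a grant on an enclosing managed folder."""
        return any(object_path.startswith(p) for p in self._prefixes())

    def can_list(self, object_path: str) -> bool:
        """Listing is allowed bucket-wide with the bucket-level role."""
        return self.bucket_level_list or self.can_read(object_path)


if __name__ == "__main__":
    # Hypothetical folder and object names, for illustration only.
    user = UserBucketAccess(managed_folders=["team-a"], bucket_level_list=True)
    print(user.can_list("team-b/results.csv"))  # listed, because of the bucket-level role
    print(user.can_read("team-b/results.csv"))  # not readable: no grant on team-b/
```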
Add a data collection from the data catalog
Import references to resources from a data collection
To add a data collection to your workspace via the web UI, click on the + Data from catalog button in the Resources pane. This will open a resource addition dialog; use it as detailed below.
-
Browse the data catalog and select a data collection of interest. You’ll be able to see information about the most recent version of the data collection and when it was published. Click Next.
This will lead to a dialog showing the contents of the collection.
-
After you’ve opened the data collection, select the version you’d like to import.
-
Select which resources you would like to import from the data collection version. You can expand folders by clicking on the triangle to the left of the folder name. If you do not select all resources in a data collection, you’ll still have the option of adding them later. Once you have finalized your selection, click Next.
This will lead to a dialog showing the data policies associated with the resources you have selected.
-
Review the policy requirements. Click Next.
This will display a list of selected resources and destination options.
-
Review your selection and choose the workspace folder where you want to add them. You can select an existing folder from the dropdown menu or create a new folder. Click Add to your workspace.
The selected resources should now appear in your workspace resources view.
You can manage and access these resources as you would any other resource in your workspace, as described below in Manage your data resources.
View the lineage of resources imported from a data collection
You can view the data collection lineage of each resource. This displays provenance information, including a link to the collection of origin as well as the time or date when the resource was added to the workspace.
To view lineage information, click on the resource you want to inspect in the Resources list, then click on the Lineage tab in the information pane on the right.
Manage your data resources
Organize resources into folders
You can organize your data resources in hierarchical folders.
To create a new folder, click the + New resource button in the Resources tab and select New folder. This will bring up a folder creation dialog.
The following screencast shows creation of a new folder, then creation of a controlled Cloud Storage bucket resource within that folder.
To move a resource or folder to a different folder, select it and click on the Move to button in the information pane on the right. This will bring up a folder organization dialog (which also allows you to create a new folder if needed).
The following screencast shows moving a resource (bucket1) to a new folder, created as part of the Move dialog. When creating a new folder, you have the option of where to place it. In this case we didn’t locate the new folder under the current one, but created it at the top level.
View and edit resource details
You can edit the resource name and description of any of your resources at any time. To do so, select the resource and click on the Edit details button in the information pane on the right. This will bring up the editing dialog.
Note that you cannot edit external identifiers such as bucket path, project ID, dataset ID or table ID after a resource has been created. If you realize you made a mistake in one of these identifiers when you created or added the resource, you’ll need to delete the erroneous entry and repeat the process of creating or adding that resource to your workspace as described above. For instructions on deleting a resource, see Delete a resource below.
Browse buckets and preview file contents
You can browse the contents of buckets and preview file contents for certain file types directly in Workbench.
To browse the contents of a bucket, select it in the list of resources and click Browse in the information pane on the right. This will bring up a browser pane that you can use to explore the contents of the bucket.
Note that you can select an object within the bucket browser and add a direct reference to it in your resource list by clicking on Add as reference in the information pane on the right.
To preview a file, select the file in the list of resources or in the bucket browser and click on the Preview button in the information pane on the right. This will display a preview of the file contents.
Here’s an example of previewing a bam file:
Workbench supports the below file types for preview, as identified by the file extensions:
- Images: jpeg, jpg, png, tiff, gif, bmp, svg
- Renderables: md, pdf, html, ipynb, rmb
- Tabular: csv, tsv
- Text: txt, wdl, nf, sh, log, stdout, stderr, script, rc, json
- Bioinformatics-related data formats: bam, bed, bedgraph, bb, bw, birdseye_canary_calls, broadpeak, seg, cbs, sam, vcf, linear, logistic, assoc, qassoc, gwas, gct, cram
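To check in bulk which files in a listing are previewable, the extension list above can be turned into a small lookup. The sketch below copies the extensions exactly as listed; the function name `preview_category` is hypothetical, and matching extensions case-insensitively is an assumption, not documented Workbench behavior.

```python
from typing import Optional

# Extensions copied from the list above, grouped by category.
PREVIEW_TYPES = {
    "Images": {"jpeg", "jpg", "png", "tiff", "gif", "bmp", "svg"},
    "Renderables": {"md", "pdf", "html", "ipynb", "rmb"},
    "Tabular": {"csv", "tsv"},
    "Text": {"txt", "wdl", "nf", "sh", "log", "stdout", "stderr", "script", "rc", "json"},
    "Bioinformatics": {
        "bam", "bed", "bedgraph", "bb", "bw", "birdseye_canary_calls",
        "broadpeak", "seg", "cbs", "sam", "vcf", "linear", "logistic",
        "assoc", "qassoc", "gwas", "gct", "cram",
    },
}


def preview_category(filename: str) -> Optional[str]:
    """Return the preview category for a file name, or None if not listed."""
    _, dot, extension = filename.rpartition(".")
    if not dot:
        return None  # no extension at all
    extension = extension.lower()  # assumption: case-insensitive match
    for category, extensions in PREVIEW_TYPES.items():
        if extension in extensions:
            return category
    return None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["reads/sample1.bam", "notes.md", "archive.zip"]:
        print(name, "->", preview_category(name))
```

Only the final extension is considered, so a compressed file such as a hypothetical `calls.vcf.gz` is reported as not listed.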
Note that you cannot upload files into your buckets through the Workbench web UI. To do so, please use the Google Cloud console, or the gsutil command-line utility.
Delete a resource
When you delete a controlled resource (one managed by your workspace), it is fully deleted and cannot be recovered.
In contrast, when you delete a referenced resource, you’re removing only the reference. The resource to which the reference pointed is not affected.
To delete a resource, select it in the list of resources, click the symbol showing three vertically stacked dots to display the menu of additional actions, and select Delete.
This will bring up a dialog that summarizes what will happen upon deletion. To confirm that you want to delete the resource, click the confirmation checkbox and click Delete resource.
Note that deletion of referenced resources and controlled resources has different effects as described above; please make sure that you understand the difference before deleting any resources.
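The two deletion behaviors can be summarized with a toy model. The Python sketch below is purely illustrative and does not call Workbench or Google Cloud; the `Resource` class, the `delete_resource` function, and the sample names are all hypothetical. It models the workspace as a dictionary of resource entries and the cloud as a set of underlying bucket names.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str   # ID shown in the workspace
    cloud_name: str    # name of the underlying cloud resource
    controlled: bool   # True for controlled, False for referenced


def delete_resource(workspace: dict, cloud: set, resource_id: str) -> None:
    """Remove the workspace entry; delete underlying data only if controlled."""
    resource = workspace.pop(resource_id)
    if resource.controlled:
        cloud.discard(resource.cloud_name)  # the data itself is gone


if __name__ == "__main__":
    cloud = {"ws-bucket", "shared-bucket"}
    workspace = {
        "my-bucket": Resource("my-bucket", "ws-bucket", controlled=True),
        "shared-ref": Resource("shared-ref", "shared-bucket", controlled=False),
    }
    delete_resource(workspace, cloud, "shared-ref")  # only the reference is removed
    delete_resource(workspace, cloud, "my-bucket")   # the bucket is deleted too
    print(sorted(cloud))
```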
Note on button locations
The resource management operations described above are available through buttons or selector menus located in the information pane that is displayed on the right when a resource is selected.
The exact layout and appearance of the information may vary with the type of resource selected. For example, the information pane displayed for a storage bucket will include a Browse button, while the one displayed for a BigQuery dataset will not.
Last Modified: 28 October 2024