Access, browse, save, and share data

How to access, browse, save, and share workspace data in Verily Workbench

Prior reading: Data resource operations

Purpose: This document describes ways you can access, browse, save, and share data in your Workbench workspace.



Introduction

Verily Workbench provides a variety of features to browse and interact with data in your workspace. It is equally important to be able to bring processed files and research results from your compute environment (whether that is your laptop, a Jupyter Notebook in the cloud, or a compute node running a workflow task) back to the shared storage space in your workspace. This document explains how to access, browse, save, and share your data.

Access and browse data

Depending on your role in a project, you may be interested in browsing reference data, incorporating data resources into a Jupyter Notebook or Nextflow script for analysis, or viewing data files and results. The subsections below will help you get started on these activities.

Locate and download data

Every data resource in your workspace has an underlying cloud-based location, such as a gs:// URL for a Google Cloud Storage bucket. It is often useful to pass these global identifiers on to other cloud-native tools or systems.

To locate a data resource in the web UI:

  1. Click on the data resource to show the details pane.

  2. Look for the Source row, which shows the underlying cloud location.

    Note: Some cloud resources show a link next to the cloud location. Click this link to open the resource in the cloud-native file or database browser, if your workspace policy allows.

To locate a data resource with the Workbench CLI:

  1. Use the wb resource list command to list resources in your workspace and find the name of the data resource of interest:

    $ wb resource list
    
    NAME                            RESOURCE TYPE         STEWARDSHIP TYPE      DESCRIPTION
    1000-genomes-example-notebooks  GIT_REPO              REFERENCED            (unset)
    bam-folder                      GCS_OBJECT            REFERENCED            (unset)
    code                            GIT_REPO              REFERENCED            (unset)
    cram-folder                     GCS_OBJECT            REFERENCED            (unset)
    
  2. Use the wb resource resolve command to print out the underlying cloud location:

    $ wb resource resolve --id=bam-folder
    gs://genomics-public-data/ftp-trace.ncbi.nih.gov
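
When the resource list is long, it can help to filter it in a script. The sketch below parses the plain-text table printed by wb resource list, assuming the layout shown in the sample output above (a header row, then columns separated by two or more spaces). The helper name parse_resource_list is hypothetical and not part of the Workbench CLI.

```python
import re
from typing import Dict, List


def parse_resource_list(text: str) -> List[Dict[str, str]]:
    """Parse the plain-text table printed by `wb resource list`.

    Assumes one header row followed by one row per resource, with
    columns separated by two or more spaces (as in the sample output).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    headers = re.split(r"\s{2,}", lines[0])
    rows = []
    for line in lines[1:]:
        # Limit the split so the last column (DESCRIPTION) stays in one piece.
        values = re.split(r"\s{2,}", line, maxsplit=len(headers) - 1)
        rows.append(dict(zip(headers, values)))
    return rows


# Sample text copied from the command output shown above.
SAMPLE = """\
NAME                            RESOURCE TYPE         STEWARDSHIP TYPE      DESCRIPTION
1000-genomes-example-notebooks  GIT_REPO              REFERENCED            (unset)
bam-folder                      GCS_OBJECT            REFERENCED            (unset)
code                            GIT_REPO              REFERENCED            (unset)
cram-folder                     GCS_OBJECT            REFERENCED            (unset)
"""

resources = parse_resource_list(SAMPLE)
gcs_names = [r["NAME"] for r in resources if r["RESOURCE TYPE"] == "GCS_OBJECT"]
print(gcs_names)
```

Each name in gcs_names can then be passed to wb resource resolve --id=NAME to obtain its cloud location.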
    

Using a Jupyter Notebook

  1. Within a Jupyter Notebook, use the shell magic prefix ! to invoke the Workbench CLI to resolve a data reference:

    [Screenshot: terra command to resolve a data reference using the shell magic prefix `!`.]

  2. You can assign the resolved location to a Python variable and use it later in your analysis, or pass the location to cloud-native tools. The example below demonstrates using the gsutil command to list files within a Google Cloud Storage data reference:

    [Screenshot: commands to print a list of files using Python and !terra resource resolve.]
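
Because the screenshots are not reproduced as text, here is a minimal sketch of the same idea in plain Python, using subprocess in place of the ! prefix. It assumes the wb and gsutil executables are on the notebook's PATH. The function names and the injectable run parameter are illustrative choices, not part of Workbench.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], str]


def run_command(args: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(list(args), check=True, capture_output=True, text=True)
    return result.stdout


def resolve_resource(name: str, run: Runner = run_command) -> str:
    """Return the cloud location printed by `wb resource resolve --id=NAME`."""
    return run(["wb", "resource", "resolve", f"--id={name}"]).strip()


def list_files(location: str, run: Runner = run_command) -> List[str]:
    """List the entries under a gs:// location using `gsutil ls`."""
    output = run(["gsutil", "ls", location])
    return [line for line in output.splitlines() if line.strip()]


# Typical use in a notebook cell (requires the CLI tools, so not run here):
#   bam_folder = resolve_resource("bam-folder")
#   for path in list_files(bam_folder):
#       print(path)
```

Passing the runner as a parameter keeps the helpers easy to exercise without cloud access; in a notebook the default is usually all that is needed.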

Check data access with the CLI

Since some data references may be controlled-access, it can be helpful to verify that your user account has access to data required for your analysis. The Workbench CLI check-access command provides a simple method to check access.

To use the CLI to check data access:

  1. List resources in your workspace to find the name of the data resource of interest:

    $ wb resource list

    NAME                            RESOURCE TYPE         STEWARDSHIP TYPE      DESCRIPTION
    1000-genomes-example-notebooks  GIT_REPO              REFERENCED            (unset)
    bam-folder                      GCS_OBJECT            REFERENCED            (unset)
    code                            GIT_REPO              REFERENCED            (unset)
    cram-folder                     GCS_OBJECT            REFERENCED            (unset)

  2. Use the wb resource check-access command to verify that your account has access:

    $ wb resource check-access --id=bam-folder
    User's pet SA in their proxy group (PROXY_2631740767397aa04fec6@verily-bvdp.com) DOES have access to this resource.
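
To check several resources before starting an analysis, the command can be wrapped in a short script. The following is a sketch only: it looks for the phrase "DOES have access" from the sample output above. The message printed when access is denied is not shown in this document, so treating any other output as "no access" is an assumption, as are the helper names.

```python
import subprocess
from typing import Callable, Dict, Iterable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_command(args: Sequence[str]) -> str:
    """Run a command and return stdout and stderr combined as text."""
    # No check=True: a failed check may exit non-zero, and we still want its text.
    result = subprocess.run(list(args), capture_output=True, text=True)
    return result.stdout + result.stderr


def has_access(output: str) -> bool:
    """Return True if the output contains the success phrase from the sample.

    Assumption: any output without this phrase means access was not confirmed.
    """
    return "DOES have access" in output


def check_resources(names: Iterable[str], run: Runner = run_command) -> Dict[str, bool]:
    """Run `wb resource check-access --id=NAME` for each name."""
    return {
        name: has_access(run(["wb", "resource", "check-access", f"--id={name}"]))
        for name in names
    }


# Typical use (requires the Workbench CLI, so not run here):
#   print(check_resources(["bam-folder", "cram-folder"]))
```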

Browse a storage bucket

To quickly browse the contents of a Cloud Storage bucket from a workspace, use the built-in storage browser from the web UI:

  1. Open the workspace Resources tab and navigate to the bucket resource of interest.
  2. Click on the resource to view the details pane.
  3. Click the Browse button to browse the bucket contents in a new window.

View file details

Click on an individual file or folder to view its details in the browser window. The details pane will show file details such as last modified date and file size, and allow you to download the file.

Preview file contents

Certain supported file types, such as .ipynb notebook files and .csv tabular data, will show a Preview button. Click the button to open a preview of the file.

The following file types are supported for preview (listed by file extension):

  • Images: jpeg, jpg, png, tiff, gif, bmp, svg
  • Renderables: md, pdf, html, ipynb, rmb
  • Tabular: csv, tsv
  • Text: txt, wdl, nf, sh, log, stdout, stderr, script, rc, json
  • IGV: bam, bed, bedgraph, bb, bw, birdseye_canary_calls, broadpeak, seg, cbs, sam, vcf, linear, logistic, assoc, qassoc, gwas, gct, cram
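
If you generate many output files, a small lookup can tell you in advance which ones the storage browser is likely to preview. The sketch below simply encodes the list above as a Python dictionary. The function name is hypothetical, and matching on the lowercased final extension is an assumption about how the browser decides.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions copied from the list above, grouped by preview category.
PREVIEW_TYPES = {
    "Images": ["jpeg", "jpg", "png", "tiff", "gif", "bmp", "svg"],
    "Renderables": ["md", "pdf", "html", "ipynb", "rmb"],
    "Tabular": ["csv", "tsv"],
    "Text": ["txt", "wdl", "nf", "sh", "log", "stdout", "stderr", "script", "rc", "json"],
    "IGV": [
        "bam", "bed", "bedgraph", "bb", "bw", "birdseye_canary_calls",
        "broadpeak", "seg", "cbs", "sam", "vcf", "linear", "logistic",
        "assoc", "qassoc", "gwas", "gct", "cram",
    ],
}

# Invert the mapping so each extension points at its category.
EXTENSION_TO_CATEGORY = {
    ext: category for category, exts in PREVIEW_TYPES.items() for ext in exts
}


def preview_category(filename: str) -> Optional[str]:
    """Return the preview category for a file name, or None if unsupported."""
    # Assumption: only the final extension matters, compared case-insensitively.
    suffix = PurePosixPath(filename).suffix.lstrip(".").lower()
    return EXTENSION_TO_CATEGORY.get(suffix)


print(preview_category("results/iris.csv"))
print(preview_category("archive.zip"))
```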

Browse BigQuery data

Workbench does not have a built-in browser for Google BigQuery data. If your workspace policy allows it, you can follow a link to Google’s native BigQuery data browser:

  1. Click on the BigQuery dataset or table resource to show the details pane.
  2. Click the Browse in BigQuery button to open the dataset or table in Google’s BigQuery data browser.

Save and share data

The same utilities that Workbench provides to locate and download data can also be used to upload local files to your workspace's cloud-native data storage.

Save data to your workspace

When you run a tool or analysis script on your laptop or in a personal compute environment, results are usually stored as private files attached to that device. To archive your results or share them with collaborators, you will need to transfer data back to a shared storage resource in your workspace. A typical Workbench workspace might have a results or shared Cloud Storage bucket designed for this purpose, or a database resource for collecting tabular analysis outputs.

Upload a file to Cloud Storage with the CLI

The Workbench CLI features a wb gsutil command that wraps around Google’s gsutil command-line utility. When this command is invoked, wb sets the correct cloud credentials and Google Cloud project ID before passing arguments to the underlying gsutil executable.

To upload a file from your laptop or cloud environment to a workspace storage bucket:

  1. In a terminal, change to the directory that contains the file you wish to upload.

  2. Identify the name of the Cloud Storage resource that will be your destination.

  3. Use a combination of wb gsutil cp and wb resource resolve to copy the file to Workbench’s cloud-native storage:

    $ wb gsutil cp iris.csv $(wb resource resolve --id=scratch)/
    Setting the gcloud project to the workspace project
    Updated property [core/project].
    Copying file://iris.csv [Content-Type=text/csv]...
    / [0 files][    0.0 B/  3.9 KiB]
    / [1 files][  3.9 KiB/  3.9 KiB]
    Operation completed over 1 objects/3.9 KiB.
    Restoring the original gcloud project configuration: terra-vdevel-clean-pear-3014
    Updated property [core/project].
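
The same upload can be driven from a Python script or notebook. This is a minimal sketch that assumes the wb executable is on the PATH and mirrors the shell command above; the function names are illustrative, not part of the Workbench CLI.

```python
import subprocess
from typing import List


def resolve_resource(name: str) -> str:
    """Return the location printed by `wb resource resolve --id=NAME`."""
    result = subprocess.run(
        ["wb", "resource", "resolve", f"--id={name}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_upload_command(local_path: str, bucket_url: str) -> List[str]:
    """Build the `wb gsutil cp` argument list for copying a file into a bucket."""
    # A trailing slash makes gsutil treat the destination as a folder.
    destination = bucket_url.rstrip("/") + "/"
    return ["wb", "gsutil", "cp", local_path, destination]


def upload_file(local_path: str, resource_name: str) -> None:
    """Resolve the bucket resource, then copy the local file into it."""
    command = build_upload_command(local_path, resolve_resource(resource_name))
    subprocess.run(command, check=True)


# Typical use (requires the Workbench CLI, so not run here):
#   upload_file("iris.csv", "scratch")
```

Passing the arguments as a list rather than a shell string avoids quoting problems when file names contain spaces.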
    

Load a CSV file into BigQuery with the CLI

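The Workbench CLI features a wb bq command that wraps around Google’s bq command-line utility. When this command is invoked, the CLI sets the correct cloud credentials and Google Cloud project ID before passing arguments to the underlying bq executable.

For example, the following command loads the local file iris.csv into a table named iris_data in the results_dataset dataset resource: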

    $ wb bq load --source_format=CSV --autodetect $(wb resource resolve --id=results_dataset).iris_data iris.csv
    Setting the gcloud project to the workspace project
    Updated property [core/project].
    Upload complete.
    Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (0s) Current status: RUNNING
    Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (1s) Current status: RUNNING
    Waiting on bqjob_r54fad8aedc10f440_000001844d0de670_1 ... (1s) Current status: DONE
    Restoring the original gcloud project configuration: terra-vdevel-clean-pear-3014
    Updated property [core/project].
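
When loading several CSV files, it may be convenient to assemble the command in Python. The sketch below only builds and prints the argument list, using the same flags as the command above. The dataset value is assumed to be whatever wb resource resolve prints for the dataset resource, and example_dataset is a placeholder, not a real resource.

```python
import shlex
from typing import List


def build_bq_load_command(dataset: str, table: str, csv_path: str) -> List[str]:
    """Build a `wb bq load` argument list for a CSV file with schema autodetection."""
    return [
        "wb", "bq", "load",
        "--source_format=CSV",
        "--autodetect",
        f"{dataset}.{table}",  # destination table inside the resolved dataset
        csv_path,
    ]


# "example_dataset" stands in for the output of `wb resource resolve`.
command = build_bq_load_command("example_dataset", "iris_data", "iris.csv")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) in an environment where the Workbench CLI is installed.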

Share data with collaborators

To share your results, use the Workbench web UI to find a stable URL linking to a file within a workspace data resource:

  1. Open the workspace Resources tab and locate the data resource containing your file of interest. Click the resource to view the details pane.
  2. Click the Browse button to open Workbench’s storage browsing window.
  3. Navigate to the file of interest and click on it to view the file details pane.
  4. In your browser, select the current URL and copy it.
  5. Share the URL with a collaborator who has access to the same workspace. This link should open the Workbench bucket browser to the same file location.

Last Modified: 16 July 2024