A
- API
- An application programming interface (API) is a way for programs to communicate with each other. A programmer may also interact with software applications through an API. Within the context of Verily Workbench, you might use an API to run scripts that automate certain actions within specific applications.
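For example, a script might call a REST API over HTTP. The sketch below assumes the Python `requests` package is installed; the endpoint, parameters, and token are placeholders, not a real Workbench API.
```python
import requests

# Placeholder endpoint; a real API documents its own URL, parameters, and auth.
response = requests.get(
    "https://api.example.com/v1/datasets",
    params={"limit": 10},
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},
    timeout=30,
)
response.raise_for_status()      # Fail loudly on HTTP errors.
for dataset in response.json():  # Most REST APIs return JSON.
    print(dataset)
```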
- automount
- Automount is a Verily Workbench feature that mounts workspace buckets to your cloud environment on startup. If you have referenced resources that point to one or more folders within a bucket, they’ll be mounted as well. Mounted buckets will be displayed in your environment in a tree-structured directory that matches the hierarchy in the resources tab of the workspace.
B
- batch analysis
- Batch analysis refers to setting up multiple jobs to be automated. Jobs are performed in parallel or sequential order, or some combination of the two. Generally batch analysis requires computational set-up, but minimizes or eliminates the oversight necessary for each individual job.
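As a minimal sketch of the idea in Python (the `run_job` function is a placeholder for the real per-sample work), several jobs can be submitted at once and run in parallel with no further oversight:
```python
from concurrent.futures import ProcessPoolExecutor

def run_job(sample: str) -> str:
    # Placeholder for the real work done per sample (alignment, QC, etc.).
    return f"{sample}: done"

if __name__ == "__main__":
    samples = ["sample_01", "sample_02", "sample_03", "sample_04"]

    # Run the jobs in parallel; each one completes without manual intervention.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(run_job, samples):
            print(result)
```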
- BigQuery
- BigQuery is Google’s fully managed data warehouse. BigQuery’s serverless architecture is distributed and highly scalable, allowing SQL-like queries to run on terabytes of data in only a few seconds. Datasets in BigQuery are stored in append-only tables, where you can control who can view and query your data. BigQuery also offers analytic tools, and integrations with third-party tools to help load and visualize data.
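For example, a query can be run from Python with the `google-cloud-bigquery` client library. This sketch assumes the library is installed and your Google Cloud credentials are configured; the public dataset shown is purely illustrative.
```python
from google.cloud import bigquery

client = bigquery.Client()  # Uses your default project and credentials.

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# The query executes on BigQuery's servers; only the results are returned.
for row in client.query(query).result():
    print(row.name, row.total)
```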
- bucket
- Buckets are used to store, organize, and control access to file data in cloud storage services. Files stored in a bucket are referred to as objects. A bucket is like a shared file server that can be accessed from your computer or virtual machines in the cloud. There’s no limit on the number of files a bucket can store. Each bucket has a globally unique name, and a storage region that’s specified on creation.
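For example, with the `google-cloud-storage` Python client (assuming it's installed and you have access to the bucket; the bucket, prefix, and object names are placeholders), you can list and download objects:
```python
from google.cloud import storage

client = storage.Client()

# List objects under a "folder" prefix; names here are placeholders.
for blob in client.list_blobs("my-example-bucket", prefix="inputs/"):
    print(blob.name, blob.size)

# Download a single object to the local disk.
bucket = client.bucket("my-example-bucket")
bucket.blob("inputs/data.csv").download_to_filename("data.csv")
```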
C
- central processing unit (CPU)
- The central processing unit (CPU), or simply processor, can be considered the “brain” of a computer. Every computational machine will have at least one CPU, which is connected to every part of the computer system. It’s the operational center that receives, executes, and delegates the instructions received from programs. CPUs also handle computational calculations and logic. Increasing the number of CPUs accelerates the processing of these tasks. Other types of processors (GPUs or TPUs) may be better suited for processing specialized tasks, such as parallel computing and machine learning.
- cloud environment
A cloud environment is a configurable pool of cloud computing resources. Cloud environments consist of a virtual machine and a persistent disk, with some useful libraries and tools preinstalled. They’re ideal for interactive analysis and data visualization, and can be finely tuned to suit analysis needs.
Cost is incurred while the cloud environment is running, based on your configuration. You can pause the environment when it’s not in use, but there’s still a charge for maintaining your disk.
- cloud resource
- Cloud resource is a broad term to describe resources that can be added or created via Cloud Storage services, or other cloud-native services such as BigQuery. Cloud resources can be either referenced or controlled resources.
- Cloud storage
Cloud storage is a service provided by cloud providers, which is used to store files. Cloud storage has two key cloud resources: buckets and objects. Buckets are used to store objects, and objects are simply files. Key features of cloud storage: unlimited storage, regional controls, global access, user managed access control, and pay only for what you use.
Google Cloud’s cloud storage service is called Google Cloud Storage (GCS). Amazon Web Services’ cloud storage service is called Simple Storage Service (S3).
- Cluster
A cluster is a group of computers that work as a collective unit. Individual computers in the cluster are referred to as nodes, with every node in the larger whole working on the same tasks.
Clusters are usually assembled on an ad-hoc basis to support tasks that require significant computing power.
- Command-line interface
- A command-line interface is a text-based interface that uses defined commands to execute user actions. Using Verily Workbench via the command line requires more computational knowledge than a graphical user interface (GUI), but the CLI can offer some distinct advantages. For example, familiarity with the CLI can greatly increase the efficiency of performing repetitive tasks, or automate them entirely through the use of scripts.
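For instance, a short Python script can drive a CLI to repeat the same task across many items. The command name and flags below are placeholders, not actual Workbench CLI syntax; substitute the real commands from the CLI's documentation.
```python
import subprocess

# Placeholder command and flags; replace with the real CLI invocation.
for name in ["results-a", "results-b", "results-c"]:
    subprocess.run(
        ["some-cli", "resource", "describe", "--name", name],
        check=True,  # Raise an error if the command fails.
    )
```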
- container
- A container is a software unit that enables quick and easy use of an application in a variety of computing environments. Within a container is an application and everything needed to run it, including code, runtime, system tools, and system libraries. By using containers, the same application can be used consistently in many environments without installing other packages.
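As an illustration, the Docker SDK for Python (assuming Docker and the `docker` package are installed locally) can start a container that runs a command inside its own isolated environment:
```python
import docker

client = docker.from_env()  # Connects to the local Docker daemon.

# Run a short-lived container from a public Python image and capture its output.
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "import sys; print(sys.version)"],
    remove=True,  # Clean up the container when it exits.
)
print(output.decode())
```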
- controlled resources
- Controlled resources are cloud resources that are managed or created by Verily Workbench within the current workspace, such as a Cloud Storage bucket that was made using your workspace. If you wanted to use the same bucket in a different workspace, a reference to the original controlled resource would need to be created in the other workspace. In other words, a controlled resource is its own source, and native to the Workbench workspace it exists within. If the controlled resource or its workspace is deleted, the resource ceases to exist.
D
- data catalog
- The data catalog is an integrated tool within Verily Workbench that streamlines the process of data discovery. Browse data collections curated by stewards, with powerful filters to minimize the amount of time taken to discover data relevant to your study. Export entire or partial collections to your Workbench workspace for use in interactive analysis or workflows. Easy version tracking and optioned updates ensure all collaborators can stay in sync.
- data collection
- Data collections are diverse datasets, available from Verily Workbench’s data catalog for use as referenced data in your own workspace. Collections are curated by data stewards who ensure data quality, reproducibility, and associated lineage. Many collections have policies attached that determine how the data may be accessed and used. Collections you have access to may be entirely or partially referenced for use in your Workbench workspace.
- Default resource region
- The default resource region is selected when you create a new workspace. Verily Workbench will automatically keep cloud resources & environments created in the workspace within this region, to help prevent unexpected egress fees. Once selected, the default resource region can only be changed by creating a new workspace.
- Docker
- Docker is an application that is used to build containers. Docker containers streamline the process of running applications in any computing environment.
- Docker image
- A Docker image is a read-only template used to build a container. The image is taken from a snapshot in time, and represents an application and its operating environment. By using an image to build containers, this operating environment can be consistently recreated for uniform testing and research.
- dsub
- dsub is a command-line tool for running tasks at scale. Learn more.
F
- FAIR principles
- The FAIR principles are a framework for data creation and management, intended to maximize machine-actionability with minimal human intervention. FAIR principles aim to make data more Findable, Accessible, Interoperable, and Reusable. Rich metadata and detailed provenance are examples of what FAIR data includes. Learn more.
G
- Git repository
- A Git repository is used for Git project storage, and tracks the history of changes made. You can add Git repositories to your Verily Workbench workspace as references. When you create a cloud environment for analysis, Workbench will clone your repository to the environment. This can help you better manage your source code, and enables you to easily pull in existing work & code.
- graphics processing unit (GPU)
- A graphics processing unit (GPU) is a specialized processor that excels at parallel computing, which means processing many tasks at once. While a central processing unit (CPU) must process tasks one by one, a GPU can split complex tasks into many pieces and work through them simultaneously. GPUs were traditionally used to accelerate video rendering, but their parallel computing ability also makes them ideal for tasks such as sequence alignment, AI, and machine learning.
- group
- Workbench groups connect one or more members under a shared label. They can be used to grant access to all members of a group. You can also restrict a workspace’s eligible access to members of selected groups only, by applying a group policy.
- group policy
- A group policy limits the eligible access of workspace and data sharing to members of all selected groups. A group policy does not grant access, but can be used as an additional layer of access control. Like other policy types, a group policy can’t be removed once it’s been applied, and carries over to any duplicates.
J
- job
- A job is a general term that describes a unit of work or execution in computing. Within Verily Workbench, a job refers to a running instance of a workflow.
- JupyterLab
- JupyterLab is an open-source web application that provides an interactive computational environment in a notebook interface. Notebooks are a convenient place for writing, testing, and/or executing code written in select languages, such as Python. Notebooks combine inputs and outputs into a single file, as well as displaying visualizations, statistical models, and many other types of rich media.
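A typical notebook cell might load data and plot it inline. This sketch assumes `pandas` and `matplotlib` are available; the file and column names are placeholders.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder input file; in a notebook the plot renders directly below the cell.
df = pd.read_csv("measurements.csv")
print(df.describe())  # Summary statistics appear as cell output.
df.plot(x="age", y="value", kind="scatter")
plt.show()
```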
L
- Lineage
- Lineage displays the end-to-end lifecycle of data while under the control of Verily Workbench’s services. Every resource in Workbench has lineage attached, beginning with the data source, to help you be aware of the complete path the data has taken to reach your workspace.
M
- memory
- Memory, also known as random access memory (RAM), is where programs and data that are currently in use are temporarily stored. The central processing unit (CPU) receives instructions & data from programs, which is kept in the computer’s memory while being used. Once the instructions are completed, or the program is no longer in use, the memory is freed up. If the computer system doesn’t have enough memory for all of the CPU’s instructions, the system’s performance will diminish and slow down. While the CPU is commonly thought of as a computer’s brain, you can think of memory as the attention span.
N
- node
- A node is one computer that’s part of a larger network known as a cluster. Each node in a cluster is a single virtual machine.
- Notebook
- A notebook is a type of digital document that provides an interactive computational environment. Notebooks combine code inputs and outputs into a single file. One of the key advantages notebooks provide is the ability to display visualizations & modeling alongside your code. Notebooks support a diverse range of rich media and are a powerful tool for conducting interactive analysis.
O
- object
- An object is a cloud storage term for what is more commonly referred to as a “file”. Objects are stored in buckets.
- Orchestration
- Orchestration describes the coordination of automated tasks across disparate systems. It defines the task flow of a workflow, automating multi-stage jobs by specifying details such as what runs, when it runs, how much compute it uses, and how errors are handled. The simplest way to visualize orchestration is with a flow chart. Just as its name suggests, proper cloud orchestration resembles a conductor directing an orchestra, calling on different parts at specific times to create a coherent whole.
P
- Persistent disk
- A persistent disk is a network storage device accessed by your virtual machine instances. When you’re finished with your virtual machine, you can detach the persistent disk to keep your data, or move it to a new VM. In other words, it’s like a digital flash drive whose storage scales to your needs. A persistent disk forms part of a cloud environment, together with a virtual machine and its preinstalled applications.
- pipeline
- A pipeline streamlines multi-stage data processing by executing tasks autonomously, in accordance with your specified inputs. In some cases, a pipeline may also use the outputs of a previous stage as inputs for the next stage. ‘Pipeline’ and ‘workflow’ are sometimes used interchangeably, as they share the larger concept of automating multiple tasks that process data.
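A minimal sketch of the idea in Python, where each stage's output becomes the next stage's input (the stage functions are placeholders for real processing steps):
```python
def extract(path: str) -> list[str]:
    # Read raw records from a text file, one per line.
    with open(path) as f:
        return [line.strip() for line in f]

def transform(records: list[str]) -> list[str]:
    # Clean and normalize the records from the previous stage.
    return [r.upper() for r in records if r]

def load(records: list[str]) -> None:
    # Final stage: deliver the processed records somewhere useful.
    print(f"loaded {len(records)} records")

# The pipeline chains the stages together and runs them in order.
load(transform(extract("input.txt")))
```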
- policy
- Policies are restrictions that may be attached to data collections, and dictate how the data can be accessed and used, chiefly for the purposes of privacy and legal compliance. When a collection or resource with a policy is brought into a workspace, it attaches to the workspace. Policies can’t be removed from workspaces, even if the associated resources are deleted, to ensure that any analysis outputs remain in compliance with the policy.
- Protected health information (PHI)
- Protected health information (PHI) refers to any information regarding health status, or provision & payment of health care, that may be linked to an individual. For safety and privacy reasons, medical information such as PHI can only be shared or revealed without being de-identified in specific legal contexts.
- proxy group
- Your proxy group contains service accounts that are necessary to work with external data that is protected by access control. Every Workbench user has their own proxy group; yours is personal to you.
R
- referenced resource
Referenced resources, or simply references, represent data and other elements in Verily Workbench by pointing to a source that exists outside of the current workspace. While references are functionally identical to their source, they afford more flexibility and less risk, as anything done to a reference has no effect on its source.
An example of a reference is a BigQuery dataset you want to work with in Workbench. By creating a reference, you can bring that dataset into your workspace and perform analysis and workflows using it. You can safely delete the reference, or make new references in other workspaces, with no effect on the original dataset. There’s no limit to the number of references you can create, as long as access to the source is maintained.
- Region constraint policy
- A region constraint policy is a type of policy that limits which regions may be used to create cloud resources & environments. For example, if you used data from a collection that had a region constraint policy, your cloud environment and analysis outputs must be kept within the regions specified by the policy. When a region constraint policy is applied to a workspace outside of the prescribed regions, the default resource region must be updated in order to comply with the policy requirements. You don’t need to migrate data that was in the workspace before the policy was applied, and references to data aren’t affected.
- resource
- Resources comprise a variety of entities whose chief purpose is to facilitate analysis. In many cases resources are simply multimodal data that can be managed within the workspace, but they aren’t limited to data exclusively. Inside a workspace, the “Resources” tab is where the data resources associated with that project are found.
S
- Scalability
- Scalability refers to the ability to quickly adjust to computing demands. When something is described as scalable, it can smoothly respond to a rapid increase in demand while continuing to provide a high-quality experience. Scalability is an important factor in avoiding bottlenecks without wasting energy and resources.
- Service Account
A service account is a special kind of account typically used by an application or compute workload, rather than a person. A service account is identified by its email address, which is unique to the account. Verily Workbench manages a set of service accounts, called “pet service accounts”, for each user.
You can create service accounts outside of Workbench and register those accounts for automating activities that need to interact with Workbench. For more on Google Cloud service accounts, see the overview.
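For example, a script can authenticate as a service account using an exported key file with the `google-auth` library. This sketch assumes the library is installed; the key file path and project ID are placeholders.
```python
from google.oauth2 import service_account
from google.cloud import storage

# Placeholder key file exported for a service account you manage.
credentials = service_account.Credentials.from_service_account_file(
    "my-service-account-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# The client now acts as the service account rather than as you.
client = storage.Client(credentials=credentials, project="my-project-id")
print([b.name for b in client.list_buckets()])
```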
T
- Task
- A task refers to a stage or activity executed within a workflow. Multi-stage workflows are divided into a series of tasks, while a single-stage workflow is one task itself.
- tree-structured directory
- A tree-structured directory is a means of organizing files and folders. The tree begins with a root directory, with each branch from the root being either a subdirectory or a file. Every file in the system has a unique path. An example in Verily Workbench is a workspace’s resources tab. The workspace is the root directory, with resources forming the tree structure. Resources can nest within subdirectories like folders, or simply be a child of the root.
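As an illustration, the short Python snippet below prints a directory tree starting from a root folder (the root path is a placeholder):
```python
from pathlib import Path

def print_tree(root: Path, indent: str = "") -> None:
    # Print the current entry, then recurse into any subdirectories.
    print(indent + root.name)
    if root.is_dir():
        for child in sorted(root.iterdir()):
            print_tree(child, indent + "  ")

print_tree(Path("my-workspace-resources"))  # Placeholder root directory.
```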
V
- virtual machine
A virtual machine (VM) is an emulation of a physical computer and can perform the same functions, such as running applications and other software. VM instances are created from pools of resources in cloud datacenters. You can specify your VM’s geographic region, compute, and storage resources to meet your job’s requirements.
You can create and destroy virtual machines at will to run applications of interest to you, such as interactive analysis environments or data processing algorithms. Virtual machines underlie Verily Workbench’s cloud environments.
W
- Workflow
- A workflow streamlines multi-stage data processing by executing tasks autonomously, in accordance with your specified inputs. Workflows can confer many advantages, especially for large datasets and outputs, such as increasing efficiency and reproducibility while reducing bottlenecks. Running instances of workflows are often called ‘jobs’, with each job being split into a series of tasks for the workflow to run through.
- Workflow Description Language (WDL)
- Workflow Description Language (WDL) is a programming language used for describing data processing workflows. WDL is designed to be as accessible and human-readable as possible. Using WDL, scientists without deep programming expertise can still specify complex analysis tasks to automate through workflows. Portability and accessibility are major cornerstones of WDL.
- workspace
- Workspaces are where researchers and teams connect and organize all the elements of their research. This includes data, code, analysis, documentation, and collaboration with others. Workspaces are also a way for data & tool providers to deliver data and tools to researchers, along with helpful resources such as documentation, code, and examples.