A
- API
- An application programming interface (API) is a way for programs to communicate with each other. A programmer may also interact with software applications through an API. Within the context of Verily Workbench, you might use an API to run scripts that automate certain actions within specific applications.
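For example, a script might call a REST API over HTTP. The sketch below assumes the Python `requests` package is installed; the endpoint, parameters, and token are placeholders, not a real Workbench API.
```python
import requests

# Placeholder endpoint; a real API documents its own URL, parameters, and auth.
response = requests.get(
    "https://api.example.com/v1/datasets",
    params={"limit": 10},
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},
    timeout=30,
)
response.raise_for_status()      # Fail loudly on HTTP errors.
for dataset in response.json():  # Most REST APIs return JSON.
    print(dataset)
```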
- automount
- Automount is a Verily Workbench feature that mounts workspace buckets to your cloud environment on startup. If you have referenced resources that point to one or more folders within a bucket, they’ll be mounted as well. Mounted buckets will be displayed in your environment in a tree-structured directory that matches the hierarchy in the resources tab of the workspace.
B
- batch analysis
- Batch analysis refers to setting up multiple jobs to be automated. Jobs are performed in parallel or sequential order, or some combination of the two. Generally batch analysis requires computational set-up, but minimizes or eliminates the oversight necessary for each individual job.
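As a minimal sketch of the idea in Python (the `run_job` function is a placeholder for the real per-sample work), several jobs can be submitted at once and run in parallel with no further oversight:
```python
from concurrent.futures import ProcessPoolExecutor

def run_job(sample: str) -> str:
    # Placeholder for the real work done per sample (alignment, QC, etc.).
    return f"{sample}: done"

if __name__ == "__main__":
    samples = ["sample_01", "sample_02", "sample_03", "sample_04"]

    # Run the jobs in parallel; each one completes without manual intervention.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(run_job, samples):
            print(result)
```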
- BigQuery
- BigQuery is Google’s fully managed data warehouse. BigQuery’s serverless architecture is distributed and highly scalable, allowing SQL-like queries to run on terabytes of data in only a few seconds. Datasets in BigQuery are stored in append-only tables, where you can control who can view and query your data. BigQuery also offers analytic tools, and integrations with third-party tools to help load and visualize data.
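For example, a query can be run from Python with the `google-cloud-bigquery` client library. This sketch assumes the library is installed and your Google Cloud credentials are configured; the public dataset shown is purely illustrative.
```python
from google.cloud import bigquery

client = bigquery.Client()  # Uses your default project and credentials.

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# The query executes on BigQuery's servers; only the results are returned.
for row in client.query(query).result():
    print(row.name, row.total)
```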
- bucket
- Buckets are used to store, organize, and control access to file data in cloud storage services. Files stored in a bucket are referred to as objects. A bucket is like a shared file server that can be accessed from your computer or virtual machines in the cloud. There’s no limit on the number of files a bucket can store. Each bucket has a globally unique name, and a storage region that’s specified on creation.
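For example, with the `google-cloud-storage` Python client (assuming it's installed and you have access to the bucket; the bucket, prefix, and object names are placeholders), you can list and download objects:
```python
from google.cloud import storage

client = storage.Client()

# List objects under a "folder" prefix; names here are placeholders.
for blob in client.list_blobs("my-example-bucket", prefix="inputs/"):
    print(blob.name, blob.size)

# Download a single object to the local disk.
bucket = client.bucket("my-example-bucket")
bucket.blob("inputs/data.csv").download_to_filename("data.csv")
```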
C
- central processing unit (CPU)
- The central processing unit (CPU), or simply processor, can be considered the “brain” of a computer. Every computational machine will have at least one CPU, which is connected to every part of the computer system. It’s the operational center that receives, executes, and delegates the instructions received from programs. CPUs also handle computational calculations and logic. Increasing the number of CPUs accelerates the processing of these tasks. Other types of processors (GPUs or TPUs) may be better suited for processing specialized tasks, such as parallel computing and machine learning.
- cloud environment
A cloud environment is a configurable pool of cloud computing resources. Cloud environments consist of a virtual machine and a persistent disk, with some useful libraries and tools preinstalled. They’re ideal for interactive analysis and data visualization, and can be finely tuned to suit analysis needs.
Cost is incurred while the cloud environment is running, based on your configuration. You can pause the environment when it’s not in use, but there’s still a charge for maintaining your disk.
- cloud resource
- Cloud resource is a broad term to describe resources that can be added or created via Cloud Storage services, or other cloud-native services such as BigQuery. Cloud resources can be either referenced or controlled resources.
- Cloud storage
Cloud storage is a service provided by cloud providers, which is used to store files. Cloud storage has two key cloud resources: buckets and objects. Buckets are used to store objects, and objects are simply files. Key features of cloud storage: unlimited storage, regional controls, global access, user managed access control, and pay only for what you use.
Google Cloud’s cloud storage service is called Google Cloud Storage (GCS). Amazon Web Services’ cloud storage service is called Simple Storage Service (S3).
- Cluster
A cluster is a group of computers that work as a collective unit. Individual computers in the cluster are referred to as nodes, with every node in the larger whole working on the same tasks.
Clusters are usually assembled on an ad-hoc basis to support tasks that require significant computing power.
- Command-line interface
- A command-line interface is a text-based interface that uses defined commands to execute user actions. Using Verily Workbench via the command line requires more computational knowledge than a graphical user interface (GUI), but the CLI can offer some distinct advantages. For example, familiarity with the CLI can greatly increase the efficiency of performing repetitive tasks, or automate them entirely through the use of scripts.
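For instance, a short Python script can drive a CLI to repeat the same task across many items. The command name and flags below are placeholders, not actual Workbench CLI syntax; substitute the real commands from the CLI's documentation.
```python
import subprocess

# Placeholder command and flags; replace with the real CLI invocation.
for name in ["results-a", "results-b", "results-c"]:
    subprocess.run(
        ["some-cli", "resource", "describe", "--name", name],
        check=True,  # Raise an error if the command fails.
    )
```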
- container
- A container is a software unit that enables quick and easy use of an application in a variety of computing environments. Within a container is an application and everything needed to run it, including code, runtime, system tools, and system libraries. By using containers, the same application can be used consistently in many environments without installing other packages.
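As an illustration, the Docker SDK for Python (assuming Docker and the `docker` package are installed locally) can start a container that runs a command inside its own isolated environment:
```python
import docker

client = docker.from_env()  # Connects to the local Docker daemon.

# Run a short-lived container from a public Python image and capture its output.
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "import sys; print(sys.version)"],
    remove=True,  # Clean up the container when it exits.
)
print(output.decode())
```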
- controlled resources
- Controlled resources are cloud resources that are managed or created by Verily Workbench within the current workspace, such as a Cloud Storage bucket that was made using your workspace. If you wanted to use the same bucket in a different workspace, a reference to the original controlled resource would need to be created in the other workspace. In other words, a controlled resource is its own source, and native to the Workbench workspace it exists within. If the controlled resource or its workspace is deleted, the resource ceases to exist.
D
- data catalog
- The data catalog is an integrated tool within Verily Workbench that streamlines the process of data discovery. Browse data collections curated by stewards, with powerful filters to minimize the amount of time taken to discover data relevant to your study. Export entire or partial collections to your Workbench workspace for use in interactive analysis or workflows. Easy version tracking and optioned updates ensure all collaborators can stay in sync.
- data collection
- Data collections are diverse datasets, available from Verily Workbench’s data catalog for use as referenced data in your own workspace. Collections are curated by data stewards who ensure data quality, reproducibility, and associated lineage. Many collections have policies attached that determine how the data may be accessed and used. Collections you have access to may be entirely or partially referenced for use in your Workbench workspace.
- Default resource region
- The default resource region is selected when you create a new workspace. Verily Workbench will automatically keep cloud resources & environments created in the workspace within this region, to help prevent unexpected egress fees. Once selected, the default resource region can only be changed by creating a new workspace.
- Docker
- Docker is an application that is used to build containers. Docker containers streamline the process of running applications in any computing environment.
- Docker image
- A Docker image is a read-only template used to build a container. The image is taken from a snapshot in time, and represents an application and its operating environment. By using an image to build containers, this operating environment can be consistently recreated for uniform testing and research.
- dsub
- dsub is a command-line tool for running tasks at scale. Learn more.
F
- FAIR principles
- The FAIR principles are a framework for data creation and management, intended to maximize machine-actionability with minimal human intervention. FAIR principles aim to make data more Findable, Accessible, Interoperable, and Reusable. Rich metadata and detailed provenance are examples of what FAIR data includes. Learn more.
G
- Git repository
- A Git repository is used for Git project storage, and tracks the history of changes made. You can add Git repositories to your Verily Workbench workspace as references. When you create a cloud environment for analysis, Workbench will clone your repository to the environment. This can help you better manage your source code, and enables you to easily pull in existing work & code.
- graphics processing unit (GPU)
- A graphics processing unit (GPU) is a specialized processor that excels at parallel computing, which means processing many tasks at once. While a central processing unit (CPU) must process tasks one by one, a GPU can split complex tasks into many pieces and work through them simultaneously. GPUs were traditionally used to accelerate video rendering, but their parallel computing ability also makes them ideal for tasks such as sequence alignment, AI, and machine learning.
- group
- Workbench groups connect one or more members under a shared label. They can be used to grant access to all members of a group. You can also restrict a workspace’s eligible access to members of selected groups only, by applying a group policy.
- group policy
- A group policy limits the eligible access of workspace and data sharing to members of all selected groups. A group policy does not grant access, but can be used as an additional layer of access control. Like other policy types, a group policy can’t be removed once it’s been applied, and carries over to any duplicates.
J
- job
- A job is a general term that describes a unit of work or execution in computing. Within Verily Workbench, a job refers to a running instance of a workflow.
- JupyterLab
- JupyterLab is an open-source web application that provides an interactive computational environment in a notebook interface. Notebooks are a convenient place for writing, testing, and/or executing code written in select languages, such as Python. Notebooks combine inputs and outputs into a single file, as well as displaying visualizations, statistical models, and many other types of rich media.
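A typical notebook cell might load data and plot it inline. This sketch assumes `pandas` and `matplotlib` are available; the file and column names are placeholders.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder input file; in a notebook the plot renders directly below the cell.
df = pd.read_csv("measurements.csv")
print(df.describe())  # Summary statistics appear as cell output.
df.plot(x="age", y="value", kind="scatter")
plt.show()
```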
L
- Lineage
- Lineage displays the end-to-end lifecycle of data while under the control of Verily Workbench’s services. Every resource in Workbench has lineage attached, beginning with the data source, to help you be aware of the complete path the data has taken to reach your workspace.
M
- memory
- Memory, also known as random access memory (RAM), is where programs and data that are currently in use are temporarily stored. The central processing unit (CPU) receives instructions & data from programs, which is kept in the computer’s memory while being used. Once the instructions are completed, or the program is no longer in use, the memory is freed up. If the computer system doesn’t have enough memory for all of the CPU’s instructions, the system’s performance will diminish and slow down. While the CPU is commonly thought of as a computer’s brain, you can think of memory as the attention span.
N
- node
- A node is one computer that’s part of a larger network known as a cluster. Each node in a cluster is a single virtual machine.
- Notebook
- A notebook is a type of digital document that provides an interactive computational environment. Notebooks combine code inputs and outputs into a single file. One of the key advantages notebooks provide is the ability to display visualizations & modeling alongside your code. Notebooks support a diverse range of rich media and are a powerful tool for conducting interactive analysis.
O
- object
- An object is a cloud storage term for what is more commonly referred to as a “file”. Objects are stored in buckets.
- Orchestration
- Orchestration describes the coordination of automated tasks across disparate systems. It defines the task flow of a workflow, automating multi-stage jobs by specifying details such as what runs, when it runs, how much compute it uses, and how errors are handled. The simplest way to visualize orchestration is with a flow chart. Just as its name suggests, proper cloud orchestration resembles a conductor directing an orchestra, calling on different parts at specific times to create a coherent whole.
P
- Persistent disk
- A persistent disk is a network storage device accessed by your virtual machine instances. When you’re finished with your virtual machine, you can detach the persistent disk to keep your data, or move it to a new VM. In other words, it’s like a digital flash drive whose storage scales to your needs. A persistent disk forms part of a cloud environment, together with a virtual machine and its preinstalled applications.
- pipeline
- A pipeline streamlines multi-stage data processing by executing tasks autonomously, in accordance with your specified inputs. In some cases, a pipeline may also use the outputs of a previous stage as inputs for the next stage. ‘Pipeline’ and ‘workflow’ are sometimes used interchangeably, as they share the larger concept of automating multiple tasks that process data.
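A minimal sketch of the idea in Python, where each stage's output becomes the next stage's input (the stage functions are placeholders for real processing steps):
```python
def extract(path: str) -> list[str]:
    # Read raw records from a text file, one per line.
    with open(path) as f:
        return [line.strip() for line in f]

def transform(records: list[str]) -> list[str]:
    # Clean and normalize the records from the previous stage.
    return [r.upper() for r in records if r]

def load(records: list[str]) -> None:
    # Final stage: deliver the processed records somewhere useful.
    print(f"loaded {len(records)} records")

# The pipeline chains the stages together and runs them in order.
load(transform(extract("input.txt")))
```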
- policy
- Policies are restrictions that may be attached to data collections, and dictate how the data can be accessed and used, chiefly for the purposes of privacy and legal compliance. When a collection or resource with a policy is brought into a workspace, it attaches to the workspace. Policies can’t be removed from workspaces, even if the associated resources are deleted, to ensure that any analysis outputs remain in compliance with the policy.
- Protected health information (PHI)
- Protected health information (PHI) refers to any information regarding health status, or provision & payment of health care, that may be linked to an individual. For safety and privacy reasons, medical information such as PHI can only be shared or revealed without being de-identified in specific legal contexts.
- proxy group
- Your proxy group contains service accounts that are necessary to work with external data that is protected by access control. Every Workbench user has their own proxy group; yours is personal to you.
R
- referenced resource
Referenced resources, or simply references, represent data and other elements in Verily Workbench by pointing to a source that exists outside of the current workspace. While references are functionally identical to their source, they afford more flexibility and less risk, as anything done to a reference has no effect on its source.
An example of a reference is a BigQuery dataset you want to work with in Workbench. By creating a reference, you can bring that dataset into your workspace and perform analysis and workflows using it. You can safely delete the reference, or make new references in other workspaces, with no effect on the original dataset. There’s no limit to the number of references you can create, as long as access to the source is maintained.
- Region constraint policy
- A region constraint policy is a type of policy that limits which regions may be used to create cloud resources & environments. For example, if you used data from a collection that had a region constraint policy, your cloud environment and analysis outputs must be kept within the regions specified by the policy. When a region constraint policy is applied to a workspace outside of the prescribed regions, the default resource region must be updated in order to comply with the policy requirements. You don’t need to migrate data that was in the workspace before the policy was applied, and references to data aren’t affected.
- resource
- Resources comprise a variety of entities whose chief purpose is to facilitate analysis. In many cases resources are simply multimodal data that can be managed within the workspace, but they aren’t limited to data exclusively. Inside a workspace, the “Resources” tab is where the data resources associated with that project are found.
S
- Scalability
- Scalability refers to the ability to quickly adjust to computing demands. When something is described as scalable, it can smoothly respond to a rapid increase in demand while continuing to provide a high-quality experience. Scalability is an important factor in avoiding bottlenecks without wasting energy and resources.
- Service Account
A service account is a special kind of account typically used by an application or compute workload, rather than a person. A service account is identified by its email address, which is unique to the account. Verily Workbench manages a set of service accounts, called “pet service accounts”, for each user.
You can create service accounts outside of Workbench and register those accounts for automating activities that need to interact with Workbench. For more on Google Cloud service accounts, see the overview.
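For example, a script can authenticate as a service account using an exported key file with the `google-auth` library. This sketch assumes the library is installed; the key file path and project ID are placeholders.
```python
from google.oauth2 import service_account
from google.cloud import storage

# Placeholder key file exported for a service account you manage.
credentials = service_account.Credentials.from_service_account_file(
    "my-service-account-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# The client now acts as the service account rather than as you.
client = storage.Client(credentials=credentials, project="my-project-id")
print([b.name for b in client.list_buckets()])
```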
T
- Task
- A task refers to a stage or activity executed within a workflow. Multi-stage workflows are divided into a series of tasks, while a single-stage workflow is one task itself.
- tree-structured directory
- A tree-structured directory is a means of organizing files and folders. The tree begins with a root directory, with each branch from the root being either a subdirectory or a file. Every file in the system has a unique path. An example in Verily Workbench is a workspace’s resources tab. The workspace is the root directory, with resources forming the tree structure. Resources can nest within subdirectories like folders, or simply be a child of the root.
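As an illustration, the short Python snippet below prints a directory tree starting from a root folder (the root path is a placeholder):
```python
from pathlib import Path

def print_tree(root: Path, indent: str = "") -> None:
    # Print the current entry, then recurse into any subdirectories.
    print(indent + root.name)
    if root.is_dir():
        for child in sorted(root.iterdir()):
            print_tree(child, indent + "  ")

print_tree(Path("my-workspace-resources"))  # Placeholder root directory.
```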
V
- virtual machine
A virtual machine (VM) is an emulation of a physical computer and can perform the same functions, such as running applications and other software. VM instances are created from pools of resources in cloud datacenters. You can specify your VM’s geographic region, compute, and storage resources to meet your job’s requirements.
You can create and destroy virtual machines at will to run applications of interest to you, such as interactive analysis environments or data processing algorithms. Virtual machines underlie Verily Workbench’s cloud environments.
W
- Workflow
- A workflow streamlines multi-stage data processing by executing tasks autonomously, in accordance with your specified inputs. Workflows can confer many advantages, especially for large datasets and outputs, such as increasing efficiency and reproducibility while reducing bottlenecks. Running instances of workflows are often called ‘jobs’, with each job being split into a series of tasks for the workflow to run through.
- Workflow Description Language (WDL)
- Workflow Description Language (WDL) is a programming language used for describing data processing workflows. WDL is designed to be as accessible and human-readable as possible. Using WDL, scientists without deep programming expertise can still specify complex analysis tasks to automate through workflows. Portability and accessibility are major cornerstones of WDL.
- workspace
- Workspaces are where researchers and teams connect and organize all the elements of their research. This includes data, code, analysis, documentation, and collaboration with others. Workspaces are also a way for data & tool providers to deliver data and tools to researchers, along with helpful resources such as documentation, code, and examples.