Run GVS on Verily Workbench
Prior reading: Workflows in Verily Workbench: Cromwell, dsub, and Nextflow
Introduction
Genomic Variant Store (GVS) is a WDL-based workflow developed by the Broad Institute. This tutorial shows you how to run GVS in your own workspace.
Step-by-step instructions
1 Create a Cloud Storage bucket to hold WDL files
Workbench currently requires the WDL files for your workflow to be stored in a Cloud Storage bucket. This example creates a new bucket.
- In the Resources tab, click “+ Cloud resource” -> “New Cloud Storage bucket”.
- Give it a name and click “Create bucket”. This example uses the name workflows_bucket.
2 Create a BigQuery dataset
The GVS workflow requires an existing BigQuery dataset that it can read from and write to. This example creates a new BigQuery dataset.
- In the Resources tab, click “+ Cloud resource” -> “New BigQuery dataset”.
- Give it a name and click “Create dataset”. This example uses the name gvs_1.
3 Get the WDLs into the bucket
The WDLs used are available on GitHub in the verily-src/workbench-examples repository. To copy them into the bucket from step 1, set $BUCKET_NAME to that bucket's gs:// URL and run:
git clone https://github.com/verily-src/workbench-examples.git
cd workbench-examples/cromwell_setup/gvs_wdls/
gsutil cp *.wdl $BUCKET_NAME
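Because gsutil cp treats a destination without the gs:// scheme as a local path, it can help to normalize the bucket reference before copying. The following is a minimal sketch, not part of the original tutorial: to_gs_url is a hypothetical helper, and workflows_bucket is only a placeholder default (the actual Cloud Storage bucket name may differ from the resource name shown in Workbench).

```shell
# Hypothetical helper (not part of the tutorial): prefix gs:// when it is missing.
to_gs_url() {
  case "$1" in
    gs://*) printf '%s\n' "$1" ;;
    *)      printf 'gs://%s\n' "$1" ;;
  esac
}

# Placeholder default; replace with the real bucket name from step 1.
BUCKET_URL="$(to_gs_url "${BUCKET_NAME:-workflows_bucket}")"
echo "Copy destination: $BUCKET_URL"

# Then, from the gvs_wdls directory:
#   gsutil cp *.wdl "$BUCKET_URL"
#   gsutil ls "$BUCKET_URL"
```

Listing the bucket after the copy is a quick way to confirm that GvsJointVariantCalling.wdl and the WDLs it imports all arrived.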
4 Add the Workflow
Navigate to the Workflows section and click “+Add workflow”. Add the WDL named “GvsJointVariantCalling.wdl”.
5 Create a new job
Click the “+New job” button, then continue to the “Prepare inputs” page.
6 Enter the inputs
Enter the following values:
Input key | Description | Example |
---|---|---|
GvsJointVariantCalling.call_set_identifier | Any string that identifies this callset. | "my_call_set_1" |
GvsJointVariantCalling.dataset_name | The dataset created in step 2. | "gvs_1" |
GvsJointVariantCalling.external_sample_names | The list of sample names. | ["2013050218", "2013050219"] |
GvsJointVariantCalling.input_vcf_indexes | The list of GCS locations of the VCF index file for each sample. | ["gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi", "gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi"] |
GvsJointVariantCalling.input_vcfs | The list of GCS locations of the VCF file for each sample. | ["gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", "gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"] |
GvsJointVariantCalling.project_id | The Google Cloud Project ID of the workspace. | "YOUR_PROJECT_ID" |
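For record keeping, the same values can be written out as a Cromwell-style inputs JSON file. This is a sketch under assumptions: the file name gvs_inputs.json and the shell variables are illustrative, and this tutorial does not state whether the Workbench form accepts such a file, so treat it as a reference copy of what to enter. Each index path is derived by appending .tbi to the matching VCF path, which keeps the two lists in the same order.

```shell
# Placeholder project ID; replace with the Google Cloud Project ID of the workspace.
PROJECT_ID="${PROJECT_ID:-YOUR_PROJECT_ID}"

# Example VCF locations taken from the table above.
BASE="gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
SUFFIX="phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
VCF_18="$BASE/ALL.chr18.$SUFFIX"
VCF_19="$BASE/ALL.chr19.$SUFFIX"

# Write the inputs file (illustrative name, not required by Workbench).
cat > gvs_inputs.json <<EOF
{
  "GvsJointVariantCalling.call_set_identifier": "my_call_set_1",
  "GvsJointVariantCalling.dataset_name": "gvs_1",
  "GvsJointVariantCalling.external_sample_names": ["2013050218", "2013050219"],
  "GvsJointVariantCalling.input_vcfs": ["${VCF_18}", "${VCF_19}"],
  "GvsJointVariantCalling.input_vcf_indexes": ["${VCF_18}.tbi", "${VCF_19}.tbi"],
  "GvsJointVariantCalling.project_id": "${PROJECT_ID}"
}
EOF

cat gvs_inputs.json
```

Note that the quotes must be straight ASCII quotes; typographic quotes pasted from a web page are not valid in JSON strings.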
7 Monitor the workflow
In the Workflows tab, a monitoring section shows the jobs as they run and complete.
8 Get outputs
Once the workflow completes, browse the workspace bucket and navigate to the task that contains the sharded VCF outputs.
Last Modified: 23 October 2024