Querying OMOP data in BigQuery using the CMS SynPuf dataset

Examples of querying OMOP data in BigQuery using the CMS SynPuf dataset

Introduction

This tutorial shows some examples of querying OMOP data in BigQuery. The examples query the CMS SynPuf dataset, a public synthetic patient data in OMOP, and show how you can do basic visualizations.

Two example queries are shown:

  • How to extract medical conditions based on ICD-10 codes
  • How to extract drug treatments containing a specific molecule

How to run the example queries

To run the examples, you can import the following notebooks into a Verily Workbench notebook cloud environment. You can take the defaults when creating the environment.

You can also add the example GitHub repo to a Workbench workspace, by adding this URL: https://github.com/DataBiosphere/terra-axon-examples.git. For step-by-step instructions, see Add a Git repository to a workspace.

When you do that, the repo will be auto-mounted in any new notebook environments that you create, and you will be able to find the notebook files in the
~/repos/terra-axon-examples/omop_examples/ directory in the notebook environment’s file system.

Imports and setup

The example queries and visualizations require the following imports:

import pandas as pd
from google.cloud import bigquery

import numpy as np
import seaborn as sns

# In JupyterLab, enable IPython to display matplotlib graphs.
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

Optional: create a referenced resource for the CMS SynPuf dataset

If you’re running this example in a Workbench workspace, you can create a referenced resource for the dataset. This gives it a more convenient name, and if you share your workspace with others, they will be able to access the workspace’s defined resources as well.

The URI for the CMS SynPuf dataset is: bigquery-public-data.cms_synthetic_patient_data_omop. In a Workbench workspace, you can define a referenced resource for this dataset like this, using cms_synthetic_patient_data_omop for the dataset and bigquery-public-data for the Project ID:

 

You can also define the referenced resource from using the wb command-line tool, as follows. Here we’re giving the resource the ID cms_synthetic_patient_data_omop, but you can name it what you like.

wb resource add-ref bq-dataset --id cms_synthetic_patient_data_omop --path bigquery-public-data.cms_synthetic_patient_data_omop

Here is a convenience function to resolve a referenced resource, once defined. This function assumes that the wb command-line tool is installed, which will be the case for workspace notebook environments.

'''
Define a function to resolve a BQ dataset from a referenced resource in a workspace.
'''
def get_bq_dataset_from_reference(resource_id):
    BQ_CMD_OUTPUT = !wb resolve --id={resource_id}
    BQ_DATASET = BQ_CMD_OUTPUT[0]
    return BQ_DATASET

Connect to the BQ database

Before you run queries, instantiate a BigQuery client object configured for the CMS SynPuf dataset. If you have created a referenced resource as described above, you can resolve it using the convenience function as shown below. If not, set BQ_dataset directly to the CMS dataset’s URI.

# The following line resolves the workspace resource named cms_synthetic_patient_data_omop.
# It will fail if the named resource does not exist
BQ_dataset = get_bq_dataset_from_reference('cms_synthetic_patient_data_omop')

If you did not define a referenced resource, point to BQ_dataset to the dataset’s URI instead:

BQ_dataset = 'bigquery-public-data.cms_synthetic_patient_data_omop'

Then, instantiate the client object, setting the default_dataset:

job_query_config = bigquery.QueryJobConfig(default_dataset=BQ_dataset)
client = bigquery.Client(default_query_job_config=job_query_config)

Example query: history of skin conditions

The purpose of this example is to calculate the number of participants with a diagnosis of skin condition based on a list of ICD-10 codes. As a starting point for this study, we define the diagnosis of skin condition as a condition coded with an ICD-10 code from the following list:

  • L70.*: Acne
  • L20.*: Atopic dermatitis
  • L40.*: Psoriasis
  • L80.*: Vitiligo

where “*” corresponds to any sequence of numbers.

For this purpose, it is necessary to perform the following steps:

  1. Converting the ICD-10 codes of skin condition to concept IDs: This step consists of finding the concept IDs associated with the ICD-10 codes in the following list: L70, L20, L40 and L80. To do this, we will use the concept table of the OMOP vocabulary and find all concept_code equal to one of the ICD-10 codes for the vocabulary_id equal to “ICD10CM”. See the CTE skin_condition_ICD_concept_ids in the SQL query below.

  2. Find the standard concept IDs linked to the ICD-10 concept IDs: The majority of conditions are coded with a standard concept ID (corresponding to a SNOMED code). Therefore, we need to find the standard concept IDs corresponding to the concept IDs previously extracted in step 1. The CTE standard_concept_ids in the SQL query below consists of retrieving only the standard concept IDs from the concept table (i.e., standard_concept equal to “S”) and the CTE skin_condition_standard_concept_ids consists of mapping the concept IDs from step 1 to the standard concept IDs (i.e., using the concept_relationship table with relationship_id equal to “Maps to”).

  3. Find all concept IDs linked to the standard skin condition concept IDs: The dataset may also contain non-standard concept IDs. To ensure that we include patients who have a diagnosis of skin condition using a non-standard concept ID, we need to find all concept IDs associated with the core set of standard concept IDs identified in the previous step. To do this, we again use the concept_relationship table, setting our standard concept IDs as concept_id_2 when using the relationship_id of “Maps to”. We union these with our standard concept IDs to get a full set of skin condition concept_ids. See the CTEs skin_condition_nonstandard_concept_ids and skin_condition_all_concept_ids in the SQL query below.

  4. Find all the descendants of the skin condition concept IDs: The condition concept IDs are organized under an ontology with different levels of precision. Therefore, to capture all concept IDs for skin condition, we must also find all descendants of the concept IDs previously extracted in step 3. To do this, we will use the concept_ancestor table with the ancestor_concept_id equal to the concept IDs from step 3 and extract all descendant_concept_id as the final list all_concept_ids_with_descendants in the SQL query below.

  5. Calculate the number of participants with a diagnosis of skin condition: Finally, the last step is to extract and count the participants with at least one diagnosis of skin condition. We will use the condition_occurrence table and filter only the conditions coded with a condition_concept_id corresponding to a concept ID of the previously extracted list. See the CTE nb_of_participants_diagnosed_with_skin_condition in the SQL query below.

  6. Calculate the percentage of participants with a diagnosis of skin condition: Finally, the last step is to calculate the percentage with at least one diagnosis of skin condition out of the total number of participants. We will use the number of participants in the person table and calculate the percentage. See the CTE nb_total_of_participants in the SQL query below.

query = """
    WITH skin_condition_ICD_concept_ids AS (
        SELECT
            concept_id,
            CASE concept_code
                WHEN 'L70' THEN 'Acne'
                WHEN 'L20' THEN 'Atopic dermatitis'
                WHEN 'L40' THEN 'Psoriasis'
                ELSE 'Vitiligo'
            END AS skin_condition
        FROM
            `concept`
        WHERE
            concept_code IN ('L70', 'L20', 'L40', 'L80')
            AND vocabulary_id = 'ICD10CM'
    ),
    standard_concept_ids AS (
        SELECT
            concept_id
        FROM
            `concept`
        WHERE
            standard_concept = 'S'
    ),
    skin_condition_standard_concept_ids AS (
        SELECT
            skin_condition,
            standard_concept_ids.concept_id
        FROM
            skin_condition_ICD_concept_ids
        INNER JOIN
            `concept_relationship`
        ON
            skin_condition_ICD_concept_ids.concept_id = concept_id_1
        INNER JOIN
            standard_concept_ids
        ON
            standard_concept_ids.concept_id = concept_id_2
        WHERE
            relationship_id = 'Maps to'
    ),
    skin_condition_nonstandard_concept_ids AS (
        SELECT
            skin_condition,
            concept_id_1 AS concept_id
        FROM
            `concept_relationship`
        INNE JOIN
            skin_condition_standard_concept_ids
        ON
            relationship_id = 'Maps to'
            AND concept_id_2 = concept_id
    ),
    skin_condition_all_concept_ids AS (
        SELECT DISTINCT
            skin_condition,
            concept_id
        FROM (
            SELECT
                *
            FROM
                skin_condition_standard_concept_ids
        ) UNION ALL (
            SELECT
                *
            FROM
                skin_condition_nonstandard_concept_ids
        )
    ),
    skin_condition_all_concept_ids_with_descendants AS (
        SELECT
            skin_condition,
            descendant_concept_id AS concept_id
        FROM
            skin_condition_all_concept_ids
        INNER JOIN
            `concept_ancestor`
        ON
            concept_id = ancestor_concept_id
    ),
    nb_of_participants_diagnosed_with_skin_condition AS (
        SELECT
            skin_condition,
            COUNT(DISTINCT person_id) AS nb_of_participants_with_skin_condition
        FROM
            `condition_occurrence`
        INNER JOIN
            skin_condition_all_concept_ids_with_descendants
        ON
            condition_concept_id = concept_id
        GROUP BY
            skin_condition
    ),
    nb_total_of_participants AS (
        SELECT
            COUNT(DISTINCT person_id) AS nb_of_participants
        FROM
            `person`
    )
    SELECT
        skin_condition,
        100*nb_of_participants_with_skin_condition/nb_of_participants AS percentage_of_participants,
    FROM
        nb_total_of_participants,
        nb_of_participants_diagnosed_with_skin_condition
"""

Execute the query

The code below will send a request to BigQuery to execute the query. The results will be stored in a Pandas dataframe.

df = client.query(query).result().to_dataframe()
df

Plot a visualization of the results

When run in a notebooks, the code below uses matplotlib to plot a simple histogram of the results.

ax = df.plot.bar(x='skin_condition', y='percentage_of_participants', title='Percentage of participants with a diagnosis of skin condition', rot=0, ylabel='', xlabel='', legend=False)
# Add bar labels
for p in ax.patches:
    ax.annotate('%.2f%%' % (p.get_height()), (p.get_x()+0.07, p.get_height()+0.03))
plt.show()

Example query: medications containing Quinapril

The purpose of this example is to calculate the number of participants who were treated with drugs containing the Quinapril molecule, a Beta Blocker. As a starting point for this study, we need to find the RxCUI code of the Quinapril molecule with the RxNav tool (https://mor.nlm.nih.gov/RxNav/search?searchBy=String&searchTerm=quinapril):

  • Quinapril RxCUI: 35208

For this purpose, it is necessary to perform the following steps:

  1. Converting the RxCUI code of Quinapril molecule to concept ID: This step consists of finding the concept ID associated with the 35208 RxCUI code. To do this, we will use the concept table of the OMOP vocabulary and the concept_code equal to the RxCUI code for the vocabulary_id equal to “RxNorm”. See the CTE quinapril_RxNorm_concept_id in the SQL query below.

  2. Find the drug standard concept IDs: Drugs are coded with a standard concept ID (corresponding to a RxNorm code). Therefore, We need to find the concept IDs linked to drugs containing the Quinapril molecule. The CTE medications_with_quinapril in the SQL query below consists of extracting all the descendants of the ingredient concept ID from step 1, which are all the drugs containing the Metropol ingredient.

  3. Find all drug concept IDs: Because some of the drugs may be coded using non-standard concept IDs, we recommend mapping the standard concept IDs identified in step 2 to obtain a comprehensive set of relevant concept IDs. This mapping is performed using the concept_relationship table of the OMOP vocabulary, where concept_id_2 is the standard concept ID(s) identified in step 1, and the relationship_id is ‘Maps to’. See the CTE all_medications_with_quinapril in the SQL query below.

  4. Calculate the number of participants who were treated with Quinapril: The next step is to extract and count the participants with at least one Quinapril drug exposure. We will use the drug_exposure table and filter only the drugs coded with a drug_concept_id corresponding to a concept ID of the previously extracted list. See the CTE nb_of_participants_treated_with_quinapril in the SQL query below.

  5. Calculate the percentage of participants who were treated with Quinapril: Finally, the last step is to calculate the percentage who were treated with Quinapril out of the total number of participants. We will use the number of participants in the person table and calculate the percentage. See the CTE nb_total_of_participants in the SQL query below.

query = """
    WITH quinapril_RxNorm_concept_id AS (
        SELECT
            concept_id
        FROM
            `concept`
        WHERE
            concept_code = "35208"
            AND vocabulary_id = "RxNorm"
    ),
    medications_with_quinapril AS (
        SELECT
            ancestor.descendant_concept_id AS concept_id
        FROM
            `concept_ancestor` AS ancestor
        INNER JOIN
            quinapril_RxNorm_concept_id AS quinapril
        ON
            ancestor.ancestor_concept_id = quinapril.concept_id
        INNER JOIN
            `concept` AS concept
        ON
            ancestor.descendant_concept_id = concept.concept_id
        WHERE
            concept.standard_concept = 'S'
    ),
    all_medications_with_quinapril AS (
        SELECT
            DISTINCT concept_id
        FROM (
            SELECT
                *
            FROM
                medications_with_quinapril
        ) UNION ALL (
            SELECT
                concept_id_1 AS concept_id
            FROM
                `concept_relationship`
            WHERE
                relationship_id = 'Maps to'
                AND concept_id_2 IN (
                    SELECT
                        concept_id
                    FROM
                        medications_with_quinapril
                )
        )
    ),
    nb_of_participants_treated_with_quinapril AS (
        SELECT
            COUNT(DISTINCT person_id) AS nb_of_participants_with_quinapril
        FROM
            `drug_exposure`
        INNER JOIN
            all_medications_with_quinapril
        ON
            drug_concept_id = concept_id
    ),
    nb_total_of_participants AS (
        SELECT
            COUNT(DISTINCT person_id) AS nb_of_participants
        FROM
            `person`
    )
    SELECT
        100*nb_of_participants_with_quinapril/nb_of_participants AS with_quinapril,
        100-(100*nb_of_participants_with_quinapril/nb_of_participants) AS without_quinapril,
    FROM
        nb_total_of_participants,
        nb_of_participants_treated_with_quinapril
"""

Execute the query

The code below will send a request to BigQuery to execute the query. The results will be stored in a Pandas dataframe.

df = client.query(query).result().to_dataframe()
df

Plot a visualization of the results

When run in a notebook, the code below uses matplotlib to plot a simple histogram of the results.

ax = df.transpose().plot.pie(y=0, autopct='%.2f%%', title='Percentage of participants who were treated with Quinapril', ylabel='', legend=False)
plt.show()

Build an interactive dashboard

This example shows how you can build an interactive dashboard to help visualize your data. The code that creates the plot is meant to be run in a JupyterLab notebook.

We first make several queries to BigQuery, store the results in dataframes, and then merge the dataframes.

race_concept_query = """
SELECT concept_id as race_concept_id, concept_name as race FROM `concept` where domain_id = "Race"
"""
race_concept_df = client.query(race_concept_query).result().to_dataframe()
race_concept_df.head(5)
gender_concept_query = """
SELECT concept_id as gender_concept_id, concept_name as gender FROM `concept` where domain_id = "Gender"
"""
gender_concept_df = client.query(gender_concept_query).result().to_dataframe()
gender_concept_df.head(5)
people_query = """
SELECT
  person_id,
  race_concept_id,
  gender_concept_id,
  year_of_birth,
  month_of_birth
FROM
  `person`
WHERE RAND() < 10000/2326856
"""

cms_syn_df = client.query(people_query).result().to_dataframe()
cms_syn_df.head(5)

Merge the dataframes:

merged_df = pd.merge(cms_syn_df, race_concept_df, on="race_concept_id")
merged_df = pd.merge(merged_df, gender_concept_df, on="gender_concept_id")
merged_df.head(5)

Build an interactive dashboard

Next, we’ll display the merged data via an interactive plot.

def plot_histogram(bins = 10, hue = 'race', palette = 'Blues', x_range_1 = (1900,2000)):
    plt.figure(dpi = 120)
    sns.histplot(data = merged_df,
                        x = 'year_of_birth',
                        palette=palette,
                        bins = bins,
                        hue = hue,
                       )
    plt.xlim(x_range_1)
_ = interact(
    plot_histogram,
    palette = widgets.Dropdown(
        options = ['pastel','husl','Set2','flare','crest','magma','icefire']
    ),
    hue = widgets.ToggleButtons(
        options = ['race','gender'],
        disabled = False,
        button_style = 'success'
    ),
    bins = widgets.IntSlider(
        value = 10,
        min = 3,
        max = 15,
        step = 1
    ),
    x_range_1 = widgets.IntRangeSlider(
        value = [1900,2000],
        min = 1900,
        max = 2000,
    ),
)

The dashboard should look like this:

Last Modified: 16 April 2024