In case you missed our last post, we released a platform of open drug data pipelines called SageRx on GitHub. It’s open source, but I know from experience that I star a bunch of repos that seem interesting at the time and never go back to look at how they work. This post attempts to summarize the tools we chose to use to create SageRx and explain how it all comes together.
Fair warning: this article is somewhat technical, so if you aren’t really interested in knowing how software works “under the hood”, you might want to skip this one. Or don’t! You might learn something and get hooked on software development!
Overall architecture
Docker is used to containerize the tools below. Spinning up SageRx is as simple as a few initial setup steps and then a couple of docker-compose up statements.
See detailed installation instructions in the GitHub repo here: https://github.com/coderxio/sagerx
Flow of data
1. Airflow
⬇️
2. PostgreSQL / pgAdmin
⬇️
3. dbt
This simplified diagram is not entirely accurate, since dbt also writes the tables created by its transformations back into PostgreSQL, but hopefully you get the general idea.
I will try to explain some highlights specific to SageRx from each of these steps in the sections that follow.
1. Airflow
Airflow is the data pipeline tool we chose for this project because it is open source, supported by Apache, and seems to be widely used across a variety of industries. All of the business logic for each raw open drug data source (NLM / CMS / FDA / etc.) - where the data lives, what format it is stored in, how often it is updated, how to unpack and manipulate it - is stored within Airflow “DAGs” (directed acyclic graphs), which are basically just Python scripts.
Airflow is used for the Extraction and Loading (EL) of the overall Extract-Load-Transform (ELT) data engineering process.
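To make that concrete, here is a minimal sketch of what an extract-and-load DAG might look like using Airflow's TaskFlow API. This is not SageRx's actual code; the DAG name, schedule, and task logic are placeholders.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@monthly", start_date=datetime(2023, 1, 1), catchup=False)
def example_drug_data_dag():
    @task
    def extract() -> str:
        # download the raw file from the data source and return its local path
        return "/opt/airflow/data/example.csv"

    @task
    def load(path: str):
        # load the file into the datasource schema in Postgres
        ...

    load(extract())

example_drug_data_dag()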
Timing of data source updates
The update timing of each data source is maintained in its DAG's schedule. Some data sources are updated weekly, others monthly, others yearly, etc. With the flip of a switch, you can automate recurring updates of any data source.
Authentication
Select data sources (RxNorm, for example) require a free license key even though the parts of the data we use from the source are open. Airflow handles the authentication steps required for these scenarios.
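The general pattern, sketched below, is to keep the license key in an Airflow Variable and attach it to the download request. The variable name, URL, and parameter here are placeholders, not the actual RxNorm download flow.

import requests
from airflow.models import Variable

# hypothetical: the license key is stored as an Airflow Variable
api_key = Variable.get("rxnorm_license_key")

# hypothetical endpoint and parameter name
response = requests.get(
    "https://example.org/rxnorm/download",
    params={"apiKey": api_key},
)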
Changing file names
Some raw data file names change with every release, often because they include a date. Other files aren't updated on the same day of the month each time and require an initial API call to determine the most recent file before downloading.
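A rough sketch of that pattern follows; the metadata URL and JSON fields are entirely hypothetical.

import requests

# hypothetical metadata endpoint that describes the most recent release
metadata = requests.get("https://example.org/api/latest-release").json()

# hypothetical JSON structure pointing at the current file
download_url = metadata["distribution"][0]["downloadURL"]
latest_file = requests.get(download_url)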
Back-loading of data
Some data sources have data going back several years. In situations where loading all of this data is beneficial (NADAC, for instance), Airflow can automate the backfill process of historic data.
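One way to do this in Airflow, assuming a DAG like the sketch above, is to set an early start_date and let catchup create a run for every past interval. The DAG name and dates are illustrative.

from datetime import datetime
from airflow.decorators import dag

# with catchup=True, Airflow schedules a run for every interval between
# start_date and now, effectively backfilling years of historic data
@dag(schedule_interval="@monthly", start_date=datetime(2013, 1, 1), catchup=True)
def nadac_backfill_example():
    ...

nadac_backfill_example()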
Unzipping of files
Some data sources use zipped files one or more layers deep. Airflow, combined with custom functions, handles unzipping these files and only loading the necessary ones into the database.
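A minimal sketch of that unzipping step (the archive and file names are placeholders):

import zipfile

# extract only the file we need from the outer archive
with zipfile.ZipFile("/opt/airflow/data/source.zip") as outer:
    outer.extract("inner.zip", path="/opt/airflow/data")

# some sources nest archives one more layer deep
with zipfile.ZipFile("/opt/airflow/data/inner.zip") as inner:
    inner.extract("products.txt", path="/opt/airflow/data")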
Loading into the database
SageRx utilizes the Postgres COPY command to load data directly from a file into the database, which we think is the most efficient approach. It can handle the vast majority of file types (CSV, TSV, RRF, etc.).
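For example, a COPY statement along these lines (the table and file names are hypothetical) loads a CSV straight into a datasource table:

COPY datasource.fda_ndc_raw
FROM '/opt/airflow/data/product.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');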
In scenarios where COPY can't handle a file type (Excel files, for instance), we use Python / pandas to transform the data into a format that can be loaded into the database.
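A sketch of that fallback, assuming a SQLAlchemy connection; the connection string, file path, and table name are hypothetical.

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string and file path
engine = create_engine("postgresql://sagerx:sagerx@localhost:5432/sagerx")
df = pd.read_excel("/opt/airflow/data/source_file.xlsx")

# write the reshaped data into the datasource schema
df.to_sql("excel_source_raw", engine, schema="datasource", if_exists="replace", index=False)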
2. PostgreSQL / pgAdmin
Currently, all data ends up in a PostgreSQL database. We chose PostgreSQL because it is open source, supported by a large community, and has excellent documentation.
We suspect that there will be a need for connectors to other types of databases (cloud, data lake, etc), but initially we wanted to keep it simple.
Schemas
The three main schemas in the PostgreSQL database are:
datasource
staging
intermediate
All of the raw data from Airflow ends up in the datasource schema in as close to its raw format as possible. There may be slight differences solely due to column type, but the intent is to leave it in raw format.
From there, it is transformed by dbt and the resulting data ends up in the staging and intermediate schemas. More on this in the dbt section (step 3) below.
Eventually, we may have some data mart data in the mart schema and possibly a separate flatfile schema for data meant to be loaded into flat files for easy download.
Custom functions
NDC formatter
For common transformations, like taking a hyphenated NDC and converting it to NDC11 format, we created custom Postgres functions.
We made this decision prior to implementing dbt, so a future optimization may be converting these functions from Postgres to dbt. However, they come in handy when writing queries directly in pgAdmin.
select ndc_to_11('1234-5678-90')
/* returns 01234567890 */
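For reference, a simplified sketch of what such a function could look like (this is not the exact SageRx implementation): it pads the three segments of a hyphenated NDC to the 5-4-2 pattern.

create or replace function ndc_to_11(ndc text) returns text as $$
    select lpad(split_part(ndc, '-', 1), 5, '0')
        || lpad(split_part(ndc, '-', 2), 4, '0')
        || lpad(split_part(ndc, '-', 3), 2, '0');
$$ language sql;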
Connecting to the database
Using pgAdmin
It is not necessary to use pgAdmin to connect to the database, but it makes it easy to run some quick queries and has (in our opinion) a great user interface.
By default pgAdmin is hosted on port 8002.
Connecting directly
Since we use Postgres as the database, connecting to a hosted version of SageRx is just like connecting to any other Postgres database.
By default, the connection information is configured as listed below.
server: <<server_address>>:5432
username: sagerx
password: sagerx
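For example, connecting from Python might look like the sketch below. The database name is an assumption here; substitute your own server address.

import psycopg2

connection = psycopg2.connect(
    host="<<server_address>>",
    port=5432,
    user="sagerx",
    password="sagerx",
    dbname="sagerx",  # assumed database name - adjust to your deployment
)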
3. dbt
Data build tool (commonly known as dbt) is where the Transformation (T) of the Extract-Load-Transform (ELT) data engineering process happens.
The steps of the dbt process line up nicely with schemas in Postgres. The two main schemas populated by dbt are staging and intermediate.
Staging
The staging step is where normalization of data generally happens (i.e. converting to NDC11, formatting date columns as DATE data types, etc).
It is also where basic data modeling begins. If the datasource schema contains raw energy and quarks, the staging schema condenses and refines this material into individual atoms we will use to build more intricate and useful structures. Some examples:
FDA NDC Directory - Class tables in this schema have one row per drug class per product. In the raw data, this information is crammed into a single cell, delimited by semicolons (see the sketch after this list).
NLM RxNorm - Major transformation of RxNorm data happens at this step. Raw RxNorm tables are converted into data models representing NDCs, clinical products, brand products, ingredients, dose forms, etc.
CMS NADAC - De-duplication of data happens at this step. Additionally, the percentage and dollar differences between each price change reported in the NADAC data are calculated.
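To make the FDA class example concrete, here is a hedged sketch of the semicolon-splitting pattern. The table and column names are placeholders, and in a real dbt model the table would be referenced via source() or ref().

select
    product_ndc,
    trim(unnest(string_to_array(pharm_classes, ';'))) as pharm_class
from datasource.fda_ndc_raw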
Intermediate
The intermediate schema uses the building blocks of the staging schema to create more complex, and sometimes more useful, data models. Continuing the raw energy → atoms analogy, the intermediate schema is where we use the atoms from the staging schema to create molecules.
Some examples of intermediate tables:
NLM RxNorm - The intermediate schema of RxNorm data contains extremely useful tables that let you go from NDC to clinical product information (ingredients, dose forms, etc.) with a single select statement (sketched after this list).
CMS NADAC - This table contains the data from the NADAC staging table, filtered to only current prices based on the current flag defined there.
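As a sketch of that RxNorm example (the table and column names here are hypothetical; the point is the single-select lookup):

select
    ndc,
    clinical_product,
    ingredient,
    dose_form
from intermediate.rxnorm_ndc_to_product
where ndc = '01234567890';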
Documentation
Documentation is a first-class citizen in dbt (and SageRx). It helps explain the transformations that happen so that, in addition to the underlying queries being open source, end users can feel confident using the resulting data.
Documentation lives in YAML files inside the folders of the models/ directory, but dbt also generates a user-friendly website (dbt docs generate) that can be hosted (dbt docs serve - pending Docker integration) to make documentation easily accessible to consumers of the data.
Seeds
Seeds are a dbt feature, and are meant to be small CSV files that can be loaded as tables in a separate seeds schema.
SageRx doesn't currently utilize seeds for much, but one potential use could be a repository of "fixes" or "flags" for incorrect or out of date open data (i.e. take the raw source data and “patch” it with corrections or modifications stored in seeds).
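A hedged sketch of that idea: a model could overlay corrections from a seed on top of the source data. The seed, table, and column names below are all hypothetical, and a real dbt model would use ref() instead of hard-coded schema names.

select
    s.ndc,
    coalesce(f.corrected_description, s.description) as description
from staging.some_source s
left join seeds.description_fixes f
    on f.ndc = s.ndc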
Macros
Macros are a dbt feature that uses Jinja2 templating to let you create SQL "functions": reusable SQL code that can be updated in one place and instantly applied to every model where it is used.
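For example, the NDC formatting logic from earlier could live in a macro like the sketch below (not existing SageRx code).

{% macro ndc_to_11(ndc_column) %}
    lpad(split_part({{ ndc_column }}, '-', 1), 5, '0')
    || lpad(split_part({{ ndc_column }}, '-', 2), 4, '0')
    || lpad(split_part({{ ndc_column }}, '-', 3), 2, '0')
{% endmacro %}

Any model could then call {{ ndc_to_11('raw_ndc') }}, and a change to the macro would flow through to every model that uses it.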
Conclusion
If you’re still reading this, it probably means you have experience or are interested in the technical building blocks used to build a data pipeline. Hopefully I did a good enough job explaining the tools we chose to make ours (there are many others out there). I’m definitely open to feedback from anyone experienced in this area if it could make SageRx more useful or easier to maintain.
I touched a little on how our data pipeline addressed things specific to open drug data sources, but I plan to dig into the drug data-specific examples in a future post - specifically about RxNorm. If you’re not subscribed already, please subscribe now to follow along and get updates in your inbox.
Thanks for reading!
🌿