MedVault, an organization offering data management services in primary healthcare, identified the need for a platform which would help their clients to really unlock the value of their data. This required a scalable, compliant, and secure analytics platform which would monitor and measure healthcare practice performance across thousands of data points, identify patient cohorts in seconds, increase reimbursement, and improve patient care.
The overall goal was to bring insights and analysis to primary healthcare clinics and consultants. The is based on a variety of data sources including various health practice applications and primary care reimbursement service datasets. In other words, to report on medical practice management both from a workflow and a financial perspective to improve patient primary care, increase reimbursement returns, and reduce needless admin.
The solution needed to support up to 200 data sources of a varying data footprint and data processing load. Key Snowflake Cloud Data Platform features led to the selection of Snowflake:
- Storage and compute scalability without downtime
- The ability to separate workloads using Snowflake Virtual Warehouses
- Always-on end to end encryption whether data is in transit or at rest
- No capital expenditure barrier to entry plus the utility billing approach whereby you only pay for what you use
- This billing approach combined with elastic scalability ensures cost effective data processing particularly as more data sources are on-boarded over time
The below diagram illustrates the overall solution which will be described in the following sections.
Snowflake Virtual Warehouse configuration
The solution uses various Snowflake virtual warehouses to separate analytical workloads from data extraction, load, and transformation workloads. A virtual warehouse is a cluster of compute resources in Snowflake. This provides the necessary CPU, memory, and temporary storage required to perform DML operations which require compute resources.
The configuration of multiple virtual warehouses ensures that the various workloads can run simultaneously without one workload impacting another workload.
From a cost management perspective, virtual warehouses auto-suspend and auto-resume settings are configured to ensure that Snowflake credits are consumed when only the virtual warehouses are being used.
The following SQL DDL demonstrates how to create a “Medium” sized virtual warehouse (four node compute cluster) , which as part of the solution is used for analytical workloads:
CREATE OR REPLACE WAREHOUSE ANALYTICS_WH
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE
COMMENT = 'Analytical workload processing.'
As indicated in the above example, if there is no activity for 5 minutes then the virtual warehouse switches to the ‘suspended’ state and stops billing. If a new SQL statement is submitted to the virtual warehouse, then that warehouse automatically resumes processing.
A variety of data source applications and suppliers are required as inputs into the solution. While it’s often the case that data can be extracted using third party applications or services which pull data from the relevant data sources, in this instance given the source security configurations, that was not a suitable option. Therefore a customized data extraction application was developed to push the data from source based on the following criteria:
- Developed using Python
- The application incorporates the Snowflake Connector for Python which is an interface for developing Python applications that can connect to Snowflake and perform all standard operations
- Bulk export data from various Microsoft SQL Server data sources
- Push those outputs to specific paths within AWS S3 buckets
- Each instance of the data extraction application is configured with its own AWS IAM policies associated with a specific S3 path.
- This ensures that one data extraction application instance cannot access data from another data extraction application instance which has been pushed to S3 and vice versa.
- Metadata is added to each row of extracted data to indicate which data source that row originated from and as a basis for row level permissions. The following screenshot illustrates data (columns 1 & 2), and associated row labelling (columns 3 & 4):
- PDF documents are another type of data source used in this solution. These documents are parsed and converted to flat file using ABBYY FineReader optical character recognition.
- The data is landed on Snowflake via logic configured using the above mentioned Snowflake Connector for Python.
Data source to AWS S3 encryption along with Snowflake’s end to end data encryption ensures that data is encrypted both in transit and at rest throughout the data pipeline.
Snowflake Role-based Access Control
Snowflake’s role based access control determines which users can access and perform operations on specific objects within Snowflake. Each solution user or process, whether landing, transforming, or reading data from data marts, is assigned minimum required permissions. In other words, access is granted on a ‘need-to-know’ basis.
Persistent Staging Area
A Persistent Staging Area (PSA) is a staging area that is not truncated between data loads. It instead contains an audit history of the data loaded from upstream data sources.
While an upstream data source typically doesn’t track it’s own history, requirements can change which can lead to new data marts / refactoring of current data marts. A PSA mitigates the risk in relation to these possible events. Also, availability of audit history can assist in relation to operational support.
Along with automatically determining and tracking the delta between the landed data and the PSA contents, current state SQL views are automatically configured in the PSA to present the latest version of the data as inputs into data mart processing.
- Each row in the PSA is labelled indicating whether the row has been inserted, updated or deleted
- An updated row results in a new row being added to the PSA and this row being treated as the latest version of that row
- If a new table is landed this is automatically picked up in the above steps
Data processing within the PSA and throughout the solution is designed to be idempotent so that if the upstream data has not changed, then there will be no change downstream.
Snowflake zero-copy cloning
Version control and release management take place using AWS Code Commit. The solution environments follow the typical Development, Test, and Production approach. Depending on the type of change being released, a clone of production may also be taken in line with solution governance using Snowflake zero-copy cloning.
The zero-copy cloning feature is used to take a “snapshot” of the relevant database / schema / table depending on the checks taking place. A clone is writable and is independent of its source (i.e. changes made to the source or clone are not reflected in the other object). This means that the creation of the cloned database is instant and likewise can be dropped once the necessary checks are complete. The following is an example of cloning syntax where a database clone called DB_CLN is created based on a database called DB:
CREATE OR REPLACE DATABASE DB_CLN CLONE DB
At the data mart layer where users run various analytical workloads, role-based access control permissions are applied also.
The data marts are accessed using AWS Quicksight. AWS Quicksight is a business intelligence service that is used to create and publish interactive dashboards which include machine learning insights. The below screenshot shows a MedVault Analytics demo dashboard which illustrates primary healthcare practice performance.
Data orchestration is managed using Apache Airflow.
Apache Airflow is an open-source workflow management platform. It is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of “configuration as code”. Developers can use Python to import libraries and classes to help them create their workflows. The workflow concept is based on directed acyclic graphs (DAGs), which make the workflow tasks and dependencies easier to manage. Separate DAGs are used in the solution for various data import steps, PSA processing, and data mart processing.
Most significantly, the Airflow configuration pushes the workload down to Snowflake which makes the most of Snowflake’s parallel processing power. The following is an example of an Apache Airflow DAG used to load OCR process outputs into Snowflake.
The MedVault Analytics secure healthcare platform allows General Practitioners to learn a lot more about their primary healthcare medical practice workflows, demographics, and clinical insights. It leads to reduced administration for clinical staff and monetary benefits in the form of increased reimbursement for work completed. This amounts to thousands of euros in savings and increased reimbursement per primary health care practice annually. The solution, based on the Snowflake Cloud Data Platform and delivered by Sonra Intelligence, allows primary care medical practices to focus on what they do best – treat patients.
The Snowflake Cloud Data Platform’s scalability, separation of workloads, end to end encryption of data whether in transit or at rest, and the ‘pay for what you use’ approach are cornerstones to a successful MedVault Analytics platform solution.
Quoting Tony Ryan, co-founder and CEO of MedVault:
“Snowflake has been fantastic in providing a robust and secure service with transparent pricing. There are no huge up front costs which allowed our team to invest more in developing great tools for doctors.
Likewise, Sonra Intelligence have been fantastic and professional to deal with. Their experience and passion for both data engineering and data architecture have been invaluable in the success of our project”.
We created the content in partnership with Snowflake.
Enjoyed this post? Have a look at the other posts on our blog.
Contact us for Snowflake professional services.
We created the content in partnership with Snowflake.