Converting CDISC XML data to Snowflake

Uli Bethke Flexter, Snowflake, XML

In this post we will guide you through the challenging process of obfuscating CDISC XML data and converting it to Snowflake. We will be using two Sonra tools: Paranoid for masking, and Flexter as a Service (FaaS) for processing and parsing. FaaS follows a subscription model. Flexter is also available as a free edition and an enterprise edition.

CDISC

The Clinical Data Interchange Standards Consortium (CDISC) is a standards developing organization (SDO) dealing with medical research data linked with healthcare, to "enable information system interoperability to improve medical research and related areas of healthcare".

CDISC standards are harmonized through a model that is also an HL7 standard and is in the process of becoming an ISO/CEN standard.

Masking CDISC XML

Now that we have introduced the tools we are using, we will start masking our XML data.

As a first step we will mask our CDISC XML files with Paranoid. You can find out how to install Paranoid in our Masking Sabre XML post (don’t worry, it only takes a couple of steps).

Paranoid masks all of the values of the XML document. Optionally, it can mask only individual elements inside the document.
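To make the idea of value masking concrete, here is a minimal Python sketch. It is not Paranoid's implementation or command line, which this post does not show. The helper names `mask_value` and `mask_xml`, the choice of a truncated SHA-256 token, and the tiny sample document are all assumptions made for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

# Made-up sample document, not real CDISC data.
SAMPLE = "<Subject id='S-001'><Name>Jane Doe</Name><Site>Dublin</Site></Subject>"


def mask_value(value):
    """Replace a value with a short, deterministic SHA-256 token (illustrative choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def mask_xml(xml_text, only_tags=None):
    """Mask attribute values and element text, keeping the XML structure intact.

    If only_tags is given, only elements whose tag name is in that set are masked.
    Text that follows a closing tag (tail text) is not handled in this sketch.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if only_tags is not None and element.tag not in only_tags:
            continue
        # Copy the items first so we never mutate the dict while iterating it.
        for name, value in list(element.attrib.items()):
            element.set(name, mask_value(value))
        if element.text and element.text.strip():
            element.text = mask_value(element.text)
    return ET.tostring(root, encoding="unicode")


masked_all = mask_xml(SAMPLE)                       # mask every value
masked_names = mask_xml(SAMPLE, only_tags={"Name"})  # mask only <Name> elements
```

Element and attribute names survive, so the masked file keeps the same shape as the original and can still be parsed and converted downstream.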

Let’s have a look at our file after masking.

Now we can go through a few more steps and convert the CDISC XML data to a relational format in Snowflake.

Snowflake

Snowflake is an analytic data warehouse provided as SaaS. All of its services run on public cloud infrastructure. The Snowflake data warehouse combines a SQL database engine with an architecture designed specifically for the cloud.

Because storage and compute are separated, Snowflake lets you scale up or down with ease and process even heavy workloads quickly. Some of the strong points of Snowflake are:

  • Uncompromising Simplicity
  • Unlimited Concurrency
  • Breathtaking Performance

Processing masked XML with FaaS API

Flexter exposes its functionality through a RESTful API. Converting XML/JSON to Snowflake can be done in a few simple steps. For more details please refer to the FaaS API documentation.

Step 1 - Authenticate

Step 2 - Define Source Connection (Upload or S3) for Source Data (JSON/XML)

Step 3 - Optionally define Source Connection (Upload or S3) for Source Schema (XSD)

Step 4 - Define your Target Connection, e.g. Snowflake, Redshift, SQL Server, Oracle etc.

Step 5 - Convert your XML/JSON from Source to Target Connection

Step 1 - Authenticate

To get an access_token you need to make a call to /oauth/token with an Authorization header and three form parameters:

  • username=YOUR_EMAIL
  • password=YOUR_PASSWORD
  • grant_type=password

You will get your username and password from Sonra when you sign up for the service.
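The call described above can be sketched in Python with the standard library. Only the /oauth/token path and the three form parameters come from this post. The base URL, the Authorization header value, the credentials, and the use of POST (the usual method for an OAuth2 password grant) are assumptions or placeholders, so please check the FaaS API documentation for the real values.

```python
import urllib.parse
import urllib.request


def build_token_request(base_url, authorization, username, password):
    """Build (but do not send) the token request for the password grant."""
    form = urllib.parse.urlencode({
        "username": username,
        "password": password,
        "grant_type": "password",
    }).encode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/oauth/token",
        data=form,                                 # form-encoded request body
        headers={"Authorization": authorization},  # value supplied by Sonra (placeholder here)
        method="POST",                             # assumption: standard for OAuth2 token calls
    )


# Placeholder values only; none of these are real FaaS settings.
token_request = build_token_request(
    base_url="https://faas.example.invalid/",
    authorization="Basic PLACEHOLDER",
    username="YOUR_EMAIL",
    password="YOUR_PASSWORD",
)
```

Sending it with `urllib.request.urlopen(token_request)` should return a response from which the access_token can be read for use in the following steps.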

Example of output

Step 2 - Define Source Connection (Upload) for Source Data (CDISC XML)

In this step we upload our CDISC XML source data.

Example of output

Step 3 - Define Target Connection (Snowflake)

Since we don’t have a Source Schema (XSD), we skip the optional step of defining one, which makes the Target Connection our third step.

We give the Target Connection a name and supply the connection parameters for the Snowflake database.

Example of output

Step 4 - Convert XML data from Source Connection (Upload) to Target Connection (Snowflake)

In the last step we convert the XML data. The data is written directly to the Snowflake Target Connection.

Example of output

Example of ER Diagram

We can create and download an ER Diagram of the model that FaaS generated by making a GET call.

Example of output

You can download the ER Diagram of our CDISC XML file here.

Next we will run a SQL query that selects subject-level information together with the most frequently recorded types of item group (ITEMGROUPOID) for the first, third, or fifth measurement (ITEMGROUPREPEATKEY).
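As a rough illustration of that kind of query, the sketch below runs one possible reading of it against an in-memory SQLite database from Python. Only the column names ITEMGROUPOID and ITEMGROUPREPEATKEY come from this post. The table name ITEMGROUPDATA, the SUBJECTKEY column, and the toy rows are made up for the example, and the tables Flexter actually generates in Snowflake will look different.

```python
import sqlite3

# Toy rows: (SUBJECTKEY, ITEMGROUPOID, ITEMGROUPREPEATKEY). Entirely made up.
ROWS = [
    ("S1", "IG.VITALS", 1),
    ("S1", "IG.VITALS", 2),
    ("S1", "IG.VITALS", 3),
    ("S1", "IG.LABS", 1),
    ("S2", "IG.VITALS", 1),
    ("S2", "IG.LABS", 2),
]

# Count item groups per subject for measurements 1, 3 and 5, most frequent first.
QUERY = """
    SELECT SUBJECTKEY, ITEMGROUPOID, COUNT(*) AS N
    FROM ITEMGROUPDATA
    WHERE ITEMGROUPREPEATKEY IN (1, 3, 5)
    GROUP BY SUBJECTKEY, ITEMGROUPOID
    ORDER BY N DESC, SUBJECTKEY, ITEMGROUPOID
"""


def most_frequent_item_groups(rows):
    """Load the toy rows into SQLite and return the query result as a list of tuples."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE ITEMGROUPDATA ("
            "SUBJECTKEY TEXT, ITEMGROUPOID TEXT, ITEMGROUPREPEATKEY INTEGER)"
        )
        connection.executemany("INSERT INTO ITEMGROUPDATA VALUES (?, ?, ?)", rows)
        return connection.execute(QUERY).fetchall()
    finally:
        connection.close()


result = most_frequent_item_groups(ROWS)
```

The same SELECT pattern (filter on the repeat key, group by subject and item group, order by the count) should carry over to Snowflake once the real table and column names from the ER Diagram are substituted.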

Conclusion

And we are finished with this “long and hard process” :-). We have managed to complete a couple of tasks in a few minutes that normally take hours or days.

You can try out the free version of Flexter online.

Our enterprise edition can be installed on a single node or, for very large volumes of XML, on a cluster of servers.

If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter or reach out to us directly.

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands-on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He has co-founded the Irish Oracle Big Data User Group.