Converting XML to Snowflake, Redshift, BigQuery, and Azure Synapse: a comparison

May 24, 2021

Today, with the rise of data and cloud-computing applications, the need to store and analyze growing volumes of semi-structured data has increased. Data originating from a wide range of sources, such as sensors, IoT devices, and application logs, uses semi-structured formats such as XML, JSON, and Avro because of their flexibility. However, it is complex and challenging to process this data and join it with structured data to generate business insights. The most common approach used until now has been to convert the semi-structured data into a relational format before processing it further.

In this document, we will compare four cloud data warehouse platforms, i.e. Snowflake, Amazon Redshift, Google BigQuery, and Azure Synapse Analytics, in terms of the functionalities they provide to store and analyze XML data. Of these, only Snowflake provides native support for handling XML data; for the other platforms, we will look at some alternative approaches to dealing with XML data.

Handling XML in Snowflake

Snowflake’s VARIANT data type is used to deal with semi-structured data formats such as JSON, Avro, Parquet, and XML. In this section, we will focus on loading XML data into Snowflake and the functions it supports for querying XML data. The XML features in Snowflake are currently in preview, but they are sufficiently stable for loading data in this file format into tables in your account and querying the data once it is loaded.

Loading XML data in Snowflake

This section outlines the steps to load XML data into the Snowflake Data Cloud. As stated earlier, the VARIANT data type in Snowflake supports native ingestion of XML data, so no transformations need to be performed before loading. The following is the structure of the input data that we will load into Snowflake.
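As a working example, assume a small catalog of books; the values are illustrative, and the later sections rely on the issue attribute on <catalog> and the id attribute on <book>:

<catalog issue="spring">
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
  </book>
</catalog>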

Steps to load the files into Snowflake:

  1. Create a table with a column of a “VARIANT” data type.
  2. Define a file format and create an external or internal stage where we can upload the file. In this case, we have created an external stage.
  3. Copy the file into the table.

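The following is a minimal SQL sketch of these steps; the stage URL, credentials, and file name are placeholders for your own values:

-- Creating a table
create or replace table catalog (xmldata variant);

-- Defining a file format
create or replace file format xml_format type = 'XML';

-- Creating an external stage (URL and credentials are placeholders)
create or replace stage xml_stage
  url = 's3://my-bucket/xml/'
  credentials = (aws_key_id = '...' aws_secret_key = '...')
  file_format = xml_format;

-- Copy the file into the table
copy into catalog from @xml_stage/books.xml;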

The XML data is loaded as is into the catalog table. Now we can run a simple SELECT query to see how Snowflake has stored it in the table.
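For example (output abridged):

select * from catalog;

-- XMLDATA
-- <catalog issue="spring"><book id="bk101"><author>Gambardella, Matthew</author> ... </catalog>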

This is how we can easily load XML data into Snowflake without worrying about the underlying structure. Even if the input structure changes in the next load, this query will work fine and the data will be loaded. The size of each XML data row should be less than 16 MB, as the VARIANT data type imposes a 16 MB (compressed) size limit on individual rows. Now that we have loaded the data, let’s see how it can be queried in the following section.

Querying XML data

In this section, we will discuss the functions provided by Snowflake to work with XML strings in the variant column. Before that, we will see some basic notations used by Snowflake to retrieve information from XML strings in the variant column.

“xmldata” is the name of the variant column where we have loaded our XML data.

  • xmldata ⇒ The XML contained in the variant field, i.e. the full XML document.
  • xmldata:”$” ⇒ The contents of the root element.
  • xmldata:”@” ⇒ The name of the root element.
  • xmldata:”@attrname” ⇒ The value of the specified attribute.

Let’s run a few very simple queries to see how we can retrieve the data.

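Here is a minimal sketch of those four queries against our catalog table; only the attribute name issue is specific to our sample document:

-- Query to retrieve the complete document
select xmldata from catalog;

-- Query to retrieve the contents of the root element of an XML document
select xmldata:"$" from catalog;

-- Query to retrieve the name of the root element
select xmldata:"@" from catalog;

-- Query to retrieve the value of an attribute, e.g. issue
select xmldata:"@issue" from catalog;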

The first query will return the same result as “select * from catalog” i.e. the complete XML string.

The $ notation is used to retrieve the contents of the root element, i.e. the element that encloses all other elements in the document. In our example, the root element is catalog, so the second query returns everything enclosed within <catalog>.

The @ notation is used to extract the name of the root element. In our case, the third query returns ‘catalog’.

The @attrname notation is used to retrieve the value of the specified attribute. In this case, the ‘issue’ attribute is specified, so the fourth query returns its value.

Thus, we can use these notations to retrieve information from XML strings without any transformations. Now that we are familiar with these notations, let’s have a look at the functions provided by Snowflake to retrieve further details from XML strings.

FLATTEN()

The powerful LATERAL FLATTEN table function is one of the most fundamental mechanisms offered by Snowflake for querying and exploring semi-structured data. This function is used to unnest the XML data: it converts the nested XML data into a relational format so we can retrieve information from it, and it lets us join the unnested rows with first-level data. The following query shows how we can use the FLATTEN() function to unnest the data.
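A minimal sketch, producing one row per <book> element of our sample document; book is the alias used in the discussion below:

select xmldata, book.seq, book.key, book.path, book.index, book.value, book.this
from catalog,
     lateral flatten(xmldata:"$") book;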

xmldata is the column from the catalog table which holds the complete XML string. The six columns SEQ, KEY, PATH, INDEX, VALUE, and THIS represent the nested data. We can access these columns using the alias name book, e.g. book.value. VALUE represents an individual XML element, and we can use functions such as GET() and XMLGET() to retrieve information from it; each entry in book.value is one <book> element from the document.

Flatten() is a powerful function supported by Snowflake which helps us to query nested data without converting it to relational table format.

XMLGET()

Syntax: xmlget(<type> , <tag_name> , [ <instance_num> ])

The XMLGET() function takes three arguments:

  • the name of a variant (either the variant column itself, or something that equates to a variant);
  • the name of the tag you’re looking for (the node or attribute name);
  • an optional instance number that is used as an “index” into the XML structure.

The following query uses xmlget() on the flattened data.
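This is a sketch reusing the flattened rows from the previous query; the tag names title and genre come from our sample document:

select xmlget(book.value, 'title'):"$"::string as title,
       xmlget(book.value, 'genre'):"$"::string as genre
from catalog,
     lateral flatten(xmldata:"$") book;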

We pass the value column, which we get after flattening the XML data, to the xmlget() function. The next argument is the name of the tag whose contents we want to retrieve. We use the :”$” syntax to access the element’s value, and the :: operator to cast the retrieved contents to a particular data type such as string or integer. This is how we can use the xmlget() function to retrieve the contents of nested elements.

The output of the above query is one row per book, with title and genre as relational columns.

If we want to retrieve data from the first-level elements, then we don’t need the XMLGET() function; it is used to retrieve content from lower-level nodes. The following XML is one record from the value column.
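With our sample data, it looks like this:

<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
</book>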

In this context, i.e. the value column, the <book> tag is the first-level node, whereas all other tags enclosed within <book></book>, such as <title>, <genre>, and <author>, are second-level nodes. To retrieve the attribute of <book>, i.e. ‘id’, we will use the GET() function explained below.

GET()

Syntax: GET( <v> , <field_name> )

This function is used to extract a value from an object, array, or variant; it returns NULL if either argument is NULL.

The following query shows how we can use the GET() function to retrieve the attributes of first-level nodes.
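A sketch combining both cases; the attribute names issue and id come from our sample document:

select get(xmldata, '@issue')::string as issue,
       get(book.value, '@id')::string as book_id
from catalog,
     lateral flatten(xmldata:"$") book;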

To retrieve the attribute of <catalog>, we pass two arguments, i.e. the column name and the attribute name, whereas to retrieve the attribute of <book>, we pass the value column obtained after flattening the xmldata column and the attribute name id. We cast the result to the string data type using the :: operator, just as with the xmlget() function.

Apart from these functions, there are others such as PARSE_XML(), TO_XML(), and CHECK_XML(). PARSE_XML() interprets an input string as an XML document and converts it into an object data type, whereas TO_XML() converts any VARIANT value to a string containing the XML representation of the value. CHECK_XML() checks the validity of an XML document. All these functions return NULL if the input is NULL.
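A minimal illustration (the XML string is made up for the example):

-- PARSE_XML turns a string into a VARIANT holding the document;
-- CHECK_XML returns NULL for valid XML and an error message otherwise.
select parse_xml('<book id="bk101"><title>Midnight Rain</title></book>') as doc,
       check_xml('<book id="bk101"><title>Midnight Rain</title></book>') as check_result;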

To summarize, Snowflake provides native support to deal with XML data. With the variant data type, we can write a simple and robust code to load XML data. Snowflake provides a set of functions which help us to query XML data without converting it to relational format. Snowflake has simplified the process of loading XML data and putting it together with the structured data for analysis.

Limitations of converting XML on Snowflake

Various limitations exist for converting XML to a relational format on Snowflake:

  • Snowflake only supports XML documents with a size of up to 16 MB. If your documents are larger, you will not be able to load them.
  • XML support is still only in public preview. This means it is not recommended for production workloads.
  • Converting complex XML is a labour-intensive process which eats up a lot of engineering time. Projects can take weeks or even months to complete.

Flexter, Sonra’s enterprise XML conversion platform, addresses all of these problems. You can automatically convert any XML to a relational format. No manual coding is required.

We have put together a video where we convert FpML XML to Snowflake.

Book a demo to find out more about Flexter and how it can help you reduce implementation costs and minimise risk of XML conversion projects.

Handling XML in Amazon Redshift

Redshift has introduced the SUPER data type to support semi-structured data formats, but it does not cover XML. As of now, Amazon Redshift does not provide any native support for handling XML data; the Redshift documentation indicates XML is unsupported both as a format and as a function. In this section, we will look at an alternative approach to loading XML data into Redshift.

As Redshift has no native support for handling XML data, we need to convert the XML data to CSV format before loading. There are a few options for converting complex XML into CSV.

Converting XML to CSV using AWS Glue

Step 1: Upload data to S3

We first need to upload our XML data files to the S3 storage.

Step 2: Crawl XML data

The next step is to load the XML structure into the Glue Data Catalog. To identify the structure of the underlying XML, we need to create an AWS Glue crawler that accesses the data store, extracts metadata, and creates table definitions in the Glue Data Catalog. The crawler infers the schema from the top level using the default classifier; alternatively, we can create a custom classifier to explicitly choose the row-level tag.

Step 3: Convert to CSV using a Glue job and PySpark transforms to flatten the data

The next step is to write a Glue ETL job to convert the XML file to CSV. A job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. When you configure a job, AWS Glue generates an Apache Spark API (PySpark) script that extracts data from sources, transforms the data, and loads it into targets. The converted CSV data can be stored in S3 or in the Glue Data Catalog. These generated scripts work well for simple XML data but fail to flatten complex XML structures. XML data is usually multilevel nested, so the crawled metadata table will contain complex data types such as structs, arrays of structs, etc. To flatten or unnest this data, we need to write custom scripts using PySpark transforms or Databricks Spark-XML, which requires carefully analyzing the structure of the underlying data.
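A minimal PySpark sketch of such a Glue job, using Glue’s Relationalize transform to flatten the nested structure; the database, table, bucket, and path names are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled XML table from the Glue Data Catalog
# (database and table names are placeholders).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="xml_db", table_name="catalog_xml")

# Relationalize flattens nested structs and arrays into flat frames.
flattened = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root")

# Write the root frame back to S3 as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/csv/"},
    format="csv")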

Step 4: Load data from S3 to Amazon Redshift

Now that we have the data in the desired format in our S3 storage, we can load it into Amazon Redshift using the COPY command.
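A sketch of the load; the table, bucket, and IAM role are placeholders:

copy catalog
from 's3://my-bucket/csv/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as csv;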

Converting XML to CSV using AWS Lambda

Step 1. Upload data to S3

We first need to upload our XML data files to the S3 storage.

Step 2: Trigger a Lambda function on each S3 object, convert it to a CSV file, and store the result in a target bucket.

The next step is to write the Lambda function which will convert each XML file to CSV. To write the code, we need to identify the structure of the underlying data.
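A minimal sketch of such a function for our sample catalog documents; the target bucket name is a placeholder and the tag names assume the book structure shown earlier:

import csv
import io
import xml.etree.ElementTree as ET

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 put event; read the uploaded XML object.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Flatten each <book> element into one CSV row.
    root = ET.fromstring(body)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "author", "title", "genre"])
    for book in root.findall("book"):
        writer.writerow([book.get("id"),
                         book.findtext("author", ""),
                         book.findtext("title", ""),
                         book.findtext("genre", "")])

    # Store the CSV in a target bucket (name is a placeholder).
    s3.put_object(Bucket="my-target-bucket",
                  Key=key.replace(".xml", ".csv"),
                  Body=out.getvalue())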

Step 3: Load data from S3 to Amazon Redshift

The CSV files in S3 can now be loaded into Amazon Redshift using the COPY command, as shown above.

To summarize, we cannot load XML data directly into Redshift as there is no native support; we need to convert XML data into CSV before loading it. As mentioned above, we can use a Glue ETL job to convert simple XML files, but for complex XML structures we need to write custom conversion code. This is not a problem for simple XML documents, but it becomes a significant overhead when the underlying XML contains multilevel nested structures. Also, if there is even a small change in the structure of the input data, the code needs to be modified to accommodate it. We can say that Amazon Redshift is not suitable for use cases that require processing complex XML files.

Our own product Flexter automatically converts XML to Redshift. Even better, it uses AI to optimise the output for a simplified target schema and easy consumption of the data. Check out the Flexter product page.

Handling XML in Google BigQuery

Google BigQuery is an enterprise data warehouse that can store large datasets and enables superfast querying using Google infrastructure. It has a data type called RECORD to store nested and repeated data, and it supports loading and querying such data from source formats that support object-based schemas, such as JSON files, Avro files, Firestore export files, and Datastore export files. If our data is in Avro, JSON, Parquet, etc., we can load it easily into BigQuery. But what if we have to load XML data? BigQuery does not provide any native support for XML. Let’s look at alternative ways to deal with XML in BigQuery.

Converting XML to CSV

The first solution is to convert our input XML files into CSV and then load them using a BigQuery load job. To convert XML to CSV, we need to write code tailored to the underlying XML structure. If the underlying structure is complex, writing the conversion code is a significant overhead, and we will have to modify the code whenever the input structure changes.

Loading XML using Apache Beam pipeline

Step 1. Convert the XML file to Python Dictionary.

In the first step, we need to convert the XML file into a Python dictionary using the ‘xmltodict’ package. This package provides a method to parse the XML structure and convert it into a Python dictionary.
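For example, with our sample catalog file from earlier (the names here assume that structure):

import xmltodict

with open("catalog.xml") as f:
    doc = xmltodict.parse(f.read())

# Attributes are prefixed with "@"; repeated elements become lists.
print(doc["catalog"]["@issue"])            # "spring"
print(doc["catalog"]["book"][0]["title"])  # "XML Developer's Guide"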

Step 2: Specify the schema of the output table in BigQuery.

In this step, we just need to define the schema of the table where we want to load the data. If a particular element has child elements, we should use the RECORD (struct) data type to store it in BigQuery.
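A sketch of such a schema for our sample catalog, in BigQuery’s JSON schema format; the nested <book> elements map to a repeated RECORD:

[
  {"name": "issue", "type": "STRING"},
  {"name": "book", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "id", "type": "STRING"},
    {"name": "author", "type": "STRING"},
    {"name": "title", "type": "STRING"},
    {"name": "genre", "type": "STRING"},
    {"name": "price", "type": "FLOAT"}
  ]}
]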

Step 3: Run an Apache Beam job to read xml and load it to the BigQuery table.

The next step is to write the Beam pipeline that reads the XML file and uses it to populate a BigQuery table. This job can be run on Google Cloud Dataflow.
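A minimal sketch of such a pipeline; the project, dataset, and table names are placeholders, and the table is assumed to already exist with the schema defined above:

import apache_beam as beam
import xmltodict

def parse_catalog(path):
    # Convert the XML file into one dictionary matching the table schema.
    with open(path) as f:
        doc = xmltodict.parse(f.read(), force_list=("book",))
    catalog = doc["catalog"]
    yield {
        "issue": catalog["@issue"],
        "book": [{"id": b["@id"],
                  "author": b["author"],
                  "title": b["title"],
                  "genre": b["genre"],
                  "price": float(b["price"])}
                 for b in catalog["book"]],
    }

with beam.Pipeline() as p:
    (p
     | beam.Create(["catalog.xml"])
     | beam.FlatMap(parse_catalog)
     | beam.io.WriteToBigQuery(
         "my-project:my_dataset.catalog",
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))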

In this approach, we are essentially converting XML to a JSON-like structure and then loading it into BigQuery, using the RECORD data type to store the nested elements. The maximum nesting depth allowed for the RECORD type is 15 levels, so we cannot use this approach to load multilevel nested data with more than 15 levels. There is also the overhead of writing code for the Apache Beam pipeline.

In short, if we want to use BigQuery as a platform for a use case where XML processing is required, the only option is to convert the XML to CSV or a JSON-like format (Python dictionary). Once we have converted the file, we can load it with BigQuery or run an Apache Beam job on Dataflow. The overhead of writing conversion code remains. To conclude, BigQuery is not a perfect choice for processing XML data.

Our own product Flexter automatically converts XML to BigQuery. Find out how ecobee convert XML to BigQuery using Flexter in our success story.

Handling XML in Azure Synapse Analytics

Azure Synapse is Microsoft’s analytics service that brings together enterprise data warehousing and Big Data analytics; it was formerly known as Azure SQL Data Warehouse. SQL Server supports the XML data type and provides methods to query XML data. However, there is no support for XML in Synapse SQL pools: the COPY statement, which is used to load data into Synapse, only supports the CSV, Parquet, and ORC formats. Thus, there is no native support to load or query XML data in Synapse, but we will look at some alternative approaches to dealing with XML data.

Converting XML to CSV

The obvious solution is to convert the XML files to CSV and then load them into Synapse. You need to use a third-party solution for this conversion or develop your own. Writing your own solution is quite tedious and time-consuming for complex and nested XML structures.

Load XML using Azure Data Factory Pipeline

Step 1. Load XML into a relational table in Azure SQL database using Azure Data Factory (ADF).

Azure Data Factory supports XML as a source but does not support it as a sink. We can create an ETL pipeline in ADF with XML as the source data type and map it to the target table columns. As XML is not supported as a sink, we cannot store the data in XML format.

Step 2: Load table from Azure SQL to Azure Synapse Analytics

Once XML is loaded in the Azure SQL table, we can create a Synapse pipeline to load the table into the Synapse SQL pool.

Creating an ADF pipeline is easy, but the problem with this approach is that Data Factory still can’t work well with complex (nested) file formats, so it will work fine only for simple XML files. To summarize, Azure Synapse Analytics cannot be used efficiently for processing complex and nested XML data sources.

Automatically convert XML to a relational format

All these platforms have limitations when converting XML data to a readable database format. Apart from Snowflake, none of the data warehouse platforms provide native support for converting XML.

  • Converting XML on Snowflake is a preview-only feature; it is not generally available and supported.
  • Individual XML documents must not exceed a certain size (16 MB).
  • Engineers still need to write code to manually convert the XML. This is not a problem for simple XML documents. However, once you have a certain complexity, deep nesting, and a large number of different XML documents, this overhead becomes significant.

For those limitations and scenarios we have created Flexter. Flexter automatically converts any XML to a readable format in seconds. The conversion process is fully automated. No need for niche skills and long development life cycles.

Flexter not only automates the conversion process, it also simplifies the output for downstream consumption. Flexter uses two optimization algorithms, which you can read about in detail in the following blog post: https://sonra.io/2021/03/22/the-secrets-to-unlocking-xml-data-in-an-optimised-format-for-downstream-users/

Rapid, seamless access to complex XML data for advanced, evidence-based analytics

  • More time – cuts hours of manual work to minutes with rapid, automated data conversion.
  • Less manpower – removes the need to hire expensive data specialists to help unlock data.
  • Lower costs – retains more of your budget to deliver real value to the business.
  • Reduced risk – lowers the threat of exceeding budgets and missing project SLAs.
  • Quicker decisions – obtains critical data for essential decision-making more rapidly.
  • Readable output – simplifies the output structure for downstream consumption.

If you are starting a new XML conversion project we have created a checklist for you.

If you want to find out more about Flexter you can have a look at the product pages and our product data sheet.

Conclusion

In this document, we have compared Snowflake, Amazon Redshift, Google BigQuery, and Azure Synapse Analytics in terms of the support they provide for processing XML data. Snowflake provides native XML support, which is currently in the preview phase, whereas the other cloud data warehouse platforms do not provide native support for XML.

A key differentiator is Snowflake’s VARIANT data type, which supports loading XML data as-is, without converting it to a relational format. Snowflake also provides functions to query and unnest (flatten) the XML data in the VARIANT column, and we can create views and join this data with other structured data. The only limitation is that individual XML documents cannot exceed 16 MB in size.

For the remaining three platforms, the only option is to convert the XML in a time-consuming manual effort. This conversion causes significant overhead if the XML files are complex or nested, and if the file structure changes, the conversion code needs to be refactored. Thus, it is difficult to use these platforms for XML processing use cases.

The best option for all four platforms is to use Flexter as it fully automates the conversion process and applies AI to optimize and simplify the target schema.