Converting XML to Snowflake vs. Redshift, BigQuery, Azure Synapse (Guide)

by abbi
May 24, 2021

With the growth of data volumes, the variety of data formats, and cloud computing, there is an increased need for storing and analysing the rising volume of semi-structured data. Data originating from a wide range of sources such as sensors, IoT devices, and application logs uses semi-structured formats such as XML, JSON, and Avro because of their flexibility.

However, these hierarchical data formats are not a good fit for data analysis. Querying this data and joining it to structured data to generate business insights can be a complex and challenging process. The most common approach to making this data readable is to convert the semi-structured data into a relational format before processing it further.

In this document, we will compare four cloud data platforms: Snowflake, Amazon Redshift, Google BigQuery, and Azure Synapse Analytics, in terms of the functionalities they provide to store, convert, and analyse XML data. Of these, only Snowflake provides native support for handling XML data; for the other platforms, we will look at some alternative approaches to deal with XML data.

We will also compare the manual approach of writing code with an automated XML conversion approach using an XML conversion tool and outline the scenarios when the automated approach is a good fit.

Key takeaways

    • Redshift, Azure Synapse, and BigQuery do not have native support for converting XML. You need to use external libraries to convert XML to a tabular format
    • XML support in Snowflake has limitations  
      • Your XML documents must be smaller than 16 MB
      • XML support is in public preview only
      • Multiple passes over the XML documents are needed for multi-branch XML
    • For complex XML, an automated conversion approach is recommended over manually writing a lot of conversion code. Complex XML can be found in industry data standards such as HL7, FpML, ISO 20022, FHIR, etc.
    • For simple XML and ad-hoc requirements, a manual approach to converting XML is recommended
    • For automated and complex XML conversions, tools like Flexter (Sonra’s XML conversion tool) come in handy and are reliable: any size, any type, any output.

Converting XML in Snowflake

Snowflake’s “VARIANT” data type is used to deal with semi-structured data formats such as JSON, Avro, Parquet, and XML. In this section, we will focus on loading XML data into Snowflake and the functions it supports to query XML data.

The XML features in Snowflake are still in preview, but the features are sufficiently stable for loading data in this file format into tables in your account and querying the data once it is loaded.

Loading XML data in Snowflake

This section outlines the steps to load XML data into the Snowflake Data Cloud. As stated earlier, the “VARIANT” data type in Snowflake supports the native ingestion of XML data, as no transformations need to be performed. The following is the structure of the input XML document that we will load into Snowflake.
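Since the exact content does not matter for the steps below, a small sample along these lines is enough (the issue attribute and the book elements are referenced in the queries later in this post; titles and values are illustrative):

    <?xml version="1.0"?>
    <catalog issue="spring">
      <book id="bk101">
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <author>Gambardella, Matthew</author>
      </book>
      <book id="bk102">
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <author>Ralls, Kim</author>
      </book>
    </catalog>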

Steps to load the files into Snowflake:

  1. Create a table with a column of a “VARIANT” data type.
  2. Define a file format and create an external or internal stage where we can upload the file. In this case, we have created an external stage.
  3. Copy the file into the table.

Creating a table
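For example (catalog and xmldata are the table and column names used throughout this post):

    CREATE OR REPLACE TABLE catalog (xmldata VARIANT);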

Defining a file format
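A minimal XML file format is enough; the name xml_format is reused below:

    CREATE OR REPLACE FILE FORMAT xml_format TYPE = 'XML';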

Creating an external stage
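For example, an S3 stage; the URL and credentials are placeholders:

    CREATE OR REPLACE STAGE xml_stage
      URL = 's3://my-bucket/xml/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = xml_format;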

Copy the XML document into the table
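With the stage and file format in place, a single COPY statement loads the document:

    COPY INTO catalog FROM @xml_stage;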

The XML data is loaded as-is into the catalog table. Now we can run a simple “SELECT” query to see how Snowflake has stored it in the table.
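The entire document is returned as a single VARIANT value:

    SELECT * FROM catalog;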

This is how we can easily load XML data into Snowflake without worrying about the underlying structure. Even if the input structure changes in the next load, this query will still work and the data will be loaded. The size of each XML data row must be less than 16 MB, as the “VARIANT” data type imposes a 16 MB (compressed) size limit. Now that we have loaded the data, let’s see how it can be queried in the following section.

Querying XML data

In this section, we will discuss the functions provided by Snowflake to work with XML strings in the Variant column. Let’s first look at some basic notations used by Snowflake to retrieve information from XML strings in the Variant column.

“xmldata” is the name of the VARIANT column where we have loaded our XML data.

Here are some basic commands that you can use to query the XML data

  • xmldata ⇒ The XML contained in the Variant field i.e. the full XML document.
  • xmldata:”$” ⇒ The value of an XML element.
  • xmldata:”@” ⇒ The name of the root element of XML.
  • xmldata:”@attrname” ⇒ The value of the attribute specified.

Let’s run a few very simple queries to see how we can retrieve the data.

Query to retrieve the complete document
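Using the catalog table loaded above:

    SELECT xmldata FROM catalog;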

The first query will return the same result as “select * from catalog” i.e. the complete XML string.

Query to retrieve the contents of the root element of an XML document
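For our sample document:

    SELECT xmldata:"$" FROM catalog;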

The $ notation is used to retrieve the contents of the root element, i.e. the element that encloses all other elements in the document. In our example, the root element is catalog, so this query returns everything enclosed within <catalog>.

Query to retrieve the name of the root element
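Again on the same table:

    SELECT xmldata:"@" FROM catalog;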

The @ notation is used to extract the name of the root element. In our case, this query returns ‘catalog’.

Query to retrieve the value of an attribute, e.g. issue
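Here we read the issue attribute defined on the <catalog> element:

    SELECT xmldata:"@issue" FROM catalog;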

The @attrname notation is used to retrieve the value of the specified attribute, in this case the ‘issue’ attribute of the root element.

Thus, we can use these notations to retrieve information from XML strings without any transformations. Now that we are familiar with these notations, let’s have a look at the functions provided by Snowflake to retrieve further details from XML documents.

Snowflake XML Functions

Snowflake has many XML functions that make it easier to work with XML data by letting you parse, access, and change it directly in SQL. In the following sections we will explain these XML functions in more detail.

FLATTEN()

The powerful LATERAL FLATTEN table-valued function is one of the most fundamental mechanisms offered by Snowflake for querying and exploring semi-structured data. It is used to unnest XML data: it converts the nested XML into a relational format so we can retrieve information from it and join it with first-level data. The following query shows how we can use the flatten() function to unnest the data.
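A sketch, using the catalog table from above; the alias book is referenced in the explanation below:

    SELECT book.*
    FROM catalog,
    LATERAL FLATTEN(input => xmldata:"$") book;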

This query returns one row per <book> element.

xmldata is the column from the catalog table where we stored the complete XML document. The six columns SEQ, KEY, PATH, INDEX, VALUE, and THIS represent the nested data. We can access these columns using the alias name book, e.g. book.value. VALUE represents an individual XML element, and we can use functions such as GET() and XMLGET() to retrieve information from it.

Flatten() is a powerful function supported by Snowflake which helps us to query nested data.

XMLGET()

Syntax: xmlget(<type> , <tag_name> , [ <instance_num> ])

The XMLGET() function takes three arguments:

  • the name of a Variant (either the Variant column itself, or something that equates to a Variant);
  • the name of the tag you’re looking for (the node or attribute name);
  • an optional instance number that is used as an “index” into the XML structure.

The following query will use xmlget() on the flattened data.
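A sketch that matches the explanation below; the tag names come from the sample document:

    SELECT
      XMLGET(book.value, 'title'):"$"::string  AS title,
      XMLGET(book.value, 'genre'):"$"::string  AS genre,
      XMLGET(book.value, 'author'):"$"::string AS author
    FROM catalog,
    LATERAL FLATTEN(input => xmldata:"$") book;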

We have passed the value column, obtained after flattening the XML data, to the xmlget() function. The next argument is the name of the tag whose contents we want to retrieve. We use the :"$" syntax to access the element’s value, and the :: operator to cast the retrieved contents to a particular data type such as string, integer, etc. This is how we can use the xmlget() function to retrieve the contents of nested elements.

Each output row contains the title, genre, and author of one book.

If we want to retrieve data from the first-level elements, we don’t need the XMLGET() function; it is used to retrieve the contents of lower-level nodes. The following XML is one record from the value column.
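For our sample data, one such record looks like this:

    <book id="bk101">
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <author>Gambardella, Matthew</author>
    </book>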

In this context, i.e. the value column, the <book> tag is the first-level node, whereas all other tags enclosed within <book></book>, such as <title>, <genre>, <author>, etc., are second-level nodes. To retrieve the attribute of book, i.e. ‘id’, we will use the GET() function explained below.

GET()

Syntax: GET( <v> , <field_name> )

This function is used to extract a value from an object, array, or Variant; it returns NULL if either of the arguments is NULL. It differs from XMLGET(), which is specifically tailored to XML documents and extracts elements by their tag names.

The following query shows how we can use the GET() function to retrieve the contents of first-level nodes.
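A sketch, again using the flattened catalog data:

    SELECT
      GET(xmldata, '@issue')::string AS issue,
      GET(book.value, '@id')::string AS book_id
    FROM catalog,
    LATERAL FLATTEN(input => xmldata:"$") book;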

To retrieve the attribute of <catalog>, we pass two arguments, i.e. the column name and the attribute name, whereas to retrieve the attribute of <book>, we pass the ‘value’ column obtained after flattening the ‘xmldata’ column together with the attribute name id. We cast the result to a string data type using the :: operator, similar to the xmlget() function.

PARSE_XML()

Syntax: PARSE_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

PARSE_XML() is a function provided by Snowflake to convert a string that contains XML data into a usable XML data type within the Snowflake environment.

The PARSE_XML() function takes a single argument, which is a string containing the XML data. This string is then parsed into an XML data type, which can be used with other XML-specific functions in Snowflake. If the input string is not a well-formed XML, the function returns an error.
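A sketch of this usage (the table name and the sample documents are illustrative; PARSE_XML() is combined with INSERT ... SELECT and UNION ALL because semi-structured functions cannot be used directly in a VALUES clause):

    CREATE OR REPLACE TABLE xml_demo (doc VARIANT);

    INSERT INTO xml_demo
    SELECT PARSE_XML('<note><to>Alice</to><from>Bob</from></note>')
    UNION ALL
    SELECT PARSE_XML('<catalog issue="spring"><book id="bk101"><title>Midnight Rain</title></book></catalog>');

    SELECT * FROM xml_demo;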

This SQL query first sets up a new table specifically for storing XML data as objects. It then fills this table with several examples of XML content, including simple and complex structures, using the PARSE_XML() function to ensure the data is properly formatted. Finally, it retrieves and displays all the entries to verify that everything has been stored correctly.

TO_XML()

Syntax: TO_XML(expression)

The TO_XML() function converts a Variant value to a VARCHAR containing XML. If the input is NULL, the result is also NULL.

The TO_XML() function takes one expression that is a Variant or can be cast to Variant. If the expression is not of a Variant type, and since all other data types can be cast to Variant, Snowflake casts it to Variant. The only exception is the GEOGRAPHY type, which must be explicitly cast to Variant.

If the value is semi-structured like JSON, it can be converted into an XML-formatted value. Snowflake generates XML-compatible tags.

The following SQL creates a table to store semi-structured JSON data, then dynamically inserts JSON objects and arrays using PARSE_JSON() combined with SELECT and UNION ALL for flexibility. After insertion, it uses TO_XML() to convert and display the data in XML format, facilitating integration with systems requiring XML.
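A sketch of that pattern (the table name and JSON samples are illustrative):

    CREATE OR REPLACE TABLE json_demo (data VARIANT);

    INSERT INTO json_demo
    SELECT PARSE_JSON('{"book": {"id": "bk101", "title": "Midnight Rain"}}')
    UNION ALL
    SELECT PARSE_JSON('["bk101", "bk102"]');

    SELECT TO_XML(data) FROM json_demo;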

CHECK_XML()

Syntax: CHECK_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

This function is used for validating XML. It is designed to check whether a string contains well-formed XML and to identify any parsing errors. If the input is valid XML or NULL, then NULL will be returned.
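For example, checking a deliberately malformed string:

    SELECT CHECK_XML('<name> Invalid </WRONG_CLOSING_TAG>');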

In this query, we use Snowflake’s CHECK_XML() function to check an XML string for errors. It identifies a problem with the XML markup <name> Invalid </WRONG_CLOSING_TAG> due to a mismatched closing tag and returns a descriptive error message. This demonstrates how CHECK_XML() effectively pinpoints issues in XML formatting.

To summarise, Snowflake provides native support for dealing with XML data. With the Variant data type, we can write robust code to load XML data, and Snowflake provides a set of functions which help us to query it.

Working with multiple XML branches

So far we have looked at simple XML documents where the hierarchical structure can be fully flattened and denormalised. This is only the case for the simplest XML documents. In the real world you have to deal with deeply nested XML with multiple branches.

When working with multiple branches you need to make sure that you do not accidentally create a cartesian product by cross joining different XML branches. This would lead to incorrect results and, as a side effect, poor performance.

Let’s go through an example. Here is an XML document with 3 branches:

Faculty > Department
Faculty > ResearchGroup
AdministrativeUnit
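A minimal sketch of such a document (the root element and values are hypothetical; only the branch structure matters):

    <University>
      <Faculty name="Science">
        <Department>Physics</Department>
        <Department>Chemistry</Department>
        <ResearchGroup>Quantum Computing</ResearchGroup>
      </Faculty>
      <AdministrativeUnit>Admissions</AdministrativeUnit>
    </University>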

In order to convert this data to a relational format you need to write three separate queries, one for each branch, as sketched below. You can refer to our post on converting XML on SQL Server to get a better understanding of how the approach works.
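As a sketch, the query for the Faculty > Department branch might look like this, assuming the document above is loaded into a university table with a VARIANT column xmldata; the other two branches get their own queries:

    SELECT
      GET(faculty.value, '@name')::string AS faculty_name,
      dept.value:"$"::string              AS department
    FROM university,
    LATERAL FLATTEN(input => xmldata:"$") faculty,
    LATERAL FLATTEN(input => faculty.value:"$") dept
    WHERE faculty.value:"@" = 'Faculty'
      AND dept.value:"@" = 'Department';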

Limitations of converting XML on Snowflake

Various limitations exist for converting XML to a relational format on Snowflake:

  • It is a manual and labour intensive process. You need to potentially write a lot of code. For complex XML schemas that are based on industry data standards it makes sense to fully automate the conversion process using an XML conversion tool. We have written elsewhere on the blog about when automated XML conversion using an XML conversion tool is the right approach.
  • Snowflake only supports XML documents with a size of up to 16 MB. If your documents are larger, you will not be able to load them
  • XML support is still only in public preview. This means it is not recommended for production workloads
  • Converting complex XML is a labour intensive process which eats up a lot of engineering time. Projects can take weeks or even months to complete.
  • Multiple passes over the data are needed for XML documents with multiple branches.

Flexter, Sonra’s enterprise XML conversion tool, addresses all of these problems. You can automatically convert any XML to a relational format. No manual coding is required.

You can try the free version of Flexter to convert your XML / XSD data to CSV.

We have put together a video where we convert complex XML based on the FpML industry data standard to Snowflake.

Book a demo to find out more about Flexter and how it can help you reduce implementation costs and minimise the risk of XML conversion projects.

Handling XML in Amazon Redshift

Redshift has introduced a ‘SUPER’ data type to support semi-structured data formats, but XML is not among them. As of now, Amazon Redshift does not provide any native support for handling XML data; the Redshift documentation lists XML as unsupported both as a data format and as a function. In this section, we will look at alternative approaches to loading XML data into Redshift.

As Redshift has no native support for handling XML data, we need to convert the XML data to CSV format before loading it. To convert complex XML into CSV, there are a few options that can be used.

Converting XML to CSV using AWS Glue

Step 1: Upload data to S3

We first need to upload our XML data files to the S3 storage.

Step 2: Crawl XML data

The next step is to load the XML structure into the Glue Data Catalog. To identify the structure of the underlying XML, we need to create an AWS Glue crawler that will access the data store, extract metadata, and create table definitions in the AWS Glue Data Catalog. The crawler will infer the schema from the top level using the default classifier; alternatively, we can create a custom classifier to explicitly choose the row-level tag.

Step 3: Convert to CSV using Glue job and use PySpark transforms to flatten the data

The next step is to write a Glue ETL job to convert the XML file to CSV. A job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. When you configure a job, AWS Glue generates an Apache Spark API (PySpark) script that extracts data from sources, transforms the data, and loads it into targets. The converted data in the CSV format can be stored in S3 storage or in the Glue Data Catalog.

These Glue-generated scripts work well for simple XML data but fail to flatten complex XML structures. XML data is usually nested over multiple levels, so the crawled metadata table will contain complex data types such as structs, arrays of structs, etc. To flatten or unnest this data, we need to write custom scripts using PySpark transforms or Databricks Spark-XML, based on a careful analysis of the structure of the underlying data.
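As a sketch, the flattening step with Spark-XML might look like this (paths, the row tag, and column names are illustrative; by default Spark-XML exposes XML attributes with a leading underscore):

    # Read XML with Spark-XML, flatten one level, and write CSV
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("xml-to-csv").getOrCreate()

    # One row per <book> element; requires the spark-xml package on the cluster
    books = (spark.read.format("xml")
             .option("rowTag", "book")
             .load("s3://my-bucket/input/catalog.xml"))

    # The id attribute appears as _id; deeper structs/arrays need explode() and further selects
    flat = books.select(
        col("_id").alias("book_id"),
        "title", "genre", "author",
    )

    (flat.write.mode("overwrite")
         .option("header", True)
         .csv("s3://my-bucket/output/books/"))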

Step 4: Load data from S3 to Amazon Redshift

Now that we have the data in the desired format in our S3 storage, we can load it into Amazon Redshift by using the COPY FROM option.
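A sketch of the COPY statement (table name, bucket, and IAM role are placeholders):

    COPY books
    FROM 's3://my-bucket/output/books/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV
    IGNOREHEADER 1;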

Converting XML to CSV using AWS Lambda

Step 1. Upload data to S3

We first need to upload our XML data files to the S3 storage.

Step 2: Trigger a Lambda function on each S3 object, convert it to a CSV file, and store the result in a target bucket.

The next step is to write the Lambda function which will convert each XML file to CSV. To write the code, we need to identify the structure of the underlying data.
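A minimal sketch of such a handler, assuming the simple catalog/book structure shown earlier (the target bucket and CSV layout are illustrative):

    # Lambda handler: triggered by S3 events, converts each XML file to CSV
    import csv
    import io
    import boto3
    import xml.etree.ElementTree as ET

    s3 = boto3.client("s3")
    TARGET_BUCKET = "my-csv-bucket"  # hypothetical target bucket

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            root = ET.fromstring(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

            out = io.StringIO()
            writer = csv.writer(out)
            writer.writerow(["id", "title", "genre", "author"])
            for book in root.findall("book"):
                writer.writerow([
                    book.get("id"),
                    book.findtext("title"),
                    book.findtext("genre"),
                    book.findtext("author"),
                ])

            s3.put_object(
                Bucket=TARGET_BUCKET,
                Key=key.rsplit(".", 1)[0] + ".csv",
                Body=out.getvalue(),
            )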

Step 3: Load data from S3 to Amazon Redshift

The CSV files in the S3 storage can now be loaded into Amazon Redshift by using the COPY FROM option, as shown above.

To summarise, we cannot load XML data directly into Redshift as there is no native support. We need to convert XML data into CSV before loading it into Redshift.

As mentioned above, we can use a Glue ETL job to convert simple XML files, but for complex XML structures we need to write custom code to convert the XML data. This is not a problem for simple XML documents, but it becomes a significant overhead when the underlying XML contains multilevel nested structures.

Also, if there is even a small change in the structure of the input data, the code needs to be modified to accommodate it. We can say that Amazon Redshift is not suitable for use cases in which there is a need to process complex XML files.

Our own product Flexter automatically converts XML to Redshift. Even better, it uses AI to optimise the output for a simplified target schema and easy consumption of the data. Check out the Flexter product page.

Handling XML in Google BigQuery

Google BigQuery is an enterprise data warehouse that can store large datasets and enables very fast querying using Google’s infrastructure. It has a data type called ‘Record’ to store nested and repeated data. It supports loading and querying such nested and repeated data from source formats that support object-based schemas, such as JSON files, Avro files, Firestore export files, and Datastore export files.

If our data is in Avro, JSON, Parquet, etc., we can load it easily into BigQuery. But what if we have to load XML data? BigQuery does not provide any native support for dealing with XML data, so let’s look at alternative ways to deal with XML in BigQuery.

Converting XML to CSV

The first solution is converting our input XML files into CSV and then loading them using BigQuery Load. To convert XML to CSV, we need to write code after analysing the underlying XML structure. If the underlying structure is complex, writing conversion code is a significant overhead. Also, we will have to modify the code if there is any change in the input structure.

Loading XML using Apache Beam pipeline

Step 1. Convert the XML file to Python Dictionary.

In the first step we need to convert the XML file into a Python dictionary using the ‘xmltodict’ package. This package provides a method to parse the XML structure and convert it to a Python dictionary.
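A quick sketch of the parsing step (the file name is a placeholder; note that xmltodict maps attributes to keys prefixed with ‘@’ and returns a list only when an element repeats):

    import xmltodict

    with open("catalog.xml", "rb") as f:
        doc = xmltodict.parse(f)

    print(doc["catalog"]["@issue"])            # attribute of the root element
    print(doc["catalog"]["book"][0]["title"])  # first <book> element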

Step 2: Specify the schema of the output table in BigQuery.

In this step, we just need to define the schema of the table where we want to load the data. If a particular element has child elements, then we should use the ‘record’ (struct) data type to store it in BigQuery.
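A sketch of a matching schema using the google-cloud-bigquery client (field names follow the sample catalog document; the table layout is illustrative):

    from google.cloud import bigquery

    # A repeated RECORD field holds the nested <book> elements
    schema = [
        bigquery.SchemaField("issue", "STRING"),
        bigquery.SchemaField(
            "book", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("id", "STRING"),
                bigquery.SchemaField("title", "STRING"),
                bigquery.SchemaField("genre", "STRING"),
                bigquery.SchemaField("author", "STRING"),
            ],
        ),
    ]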

Step 3: Run an Apache Beam job to read the XML and load it into the BigQuery table.

The next step is to write the Beam pipeline to take an XML file and use it to populate a BigQuery table. This job can be run using Google Cloud Dataflow.
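A sketch of the pipeline, assuming the target table already exists with the schema from step 2 (project, dataset, and file name are placeholders):

    import apache_beam as beam
    import xmltodict

    def catalog_to_rows(xml_string):
        # Convert one XML document into a dict matching the BigQuery schema
        doc = xmltodict.parse(xml_string)
        books = doc["catalog"]["book"]
        if not isinstance(books, list):  # a single <book> parses to a dict, not a list
            books = [books]
        yield {
            "issue": doc["catalog"].get("@issue"),
            "book": [
                {"id": b.get("@id"), "title": b.get("title"),
                 "genre": b.get("genre"), "author": b.get("author")}
                for b in books
            ],
        }

    with beam.Pipeline() as p:
        (p
         | "Read file" >> beam.Create([open("catalog.xml").read()])
         | "Parse XML" >> beam.FlatMap(catalog_to_rows)
         | "Load" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.catalog",
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))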

In this approach, we are basically converting XML to a JSON-like structure and then loading it into BigQuery, using the ‘record’ data type to store the nested elements. The maximum nesting depth allowed for the ‘record’ type is 15 levels, so we cannot use this approach to load multilevel nested data with more than 15 levels. There is also the overhead of writing code for the Apache Beam pipeline.

In short, if we want to use BigQuery as a platform for a use case where XML processing is required, the only option is converting the XML to CSV or a JSON-like format (a Python dictionary). Once we have converted the files, we can load them with BigQuery Load or run Apache Beam jobs on Dataflow. The overhead of writing the conversion code remains. To conclude, we can say that BigQuery is not a perfect choice for processing XML data.

Our own product Flexter automatically converts XML to BigQuery. Find out how ecobee converts XML to BigQuery using Flexter in our success story.

Handling XML in Azure Synapse Analytics

Azure Synapse is Microsoft’s analytics service that brings together enterprise data warehousing and Big Data analytics. Azure Synapse Analytics was formerly known as Azure SQL Data Warehouse. In Microsoft SQL Server, the XML data type is supported and there are methods to query XML data.

However, there is no support for XML in Synapse SQL pools. The COPY statement, which is used to load data into Synapse, only supports the CSV, Parquet, and ORC formats. Thus, there is no native support to load or query XML data in Synapse, but we will look at some alternative approaches to deal with XML data.

Converting XML to CSV

The obvious solution is converting XML files to CSV and then loading them into Synapse. You need to use a third-party solution for this conversion or develop your own. Writing your own solution is quite tedious and time-consuming for a complex and nested XML structure.

Load XML using Azure Data Factory Pipeline

Step 1. Load XML into a relational table in Azure SQL database using Azure Data Factory (ADF).

Azure Data Factory supports XML as a source but does not support it as a sink. We can create an ETL pipeline in ADF with XML as the source data type and map it to the target table columns. As XML is not supported as a sink, we cannot store the data in XML format.

Step 2: Load table from Azure SQL to Azure Synapse Analytics

Once XML is loaded in the Azure SQL table, we can create a Synapse pipeline to load the table into the Synapse SQL pool.

Creating an ADF pipeline is easy, but the problem with this approach is that Data Factory still can’t work well with complex (nested) file formats, so it will work fine only for simple XML files. To summarise, Azure Synapse Analytics cannot be used efficiently for processing complex and nested XML data sources.

Automatically convert XML to a relational format

All these platforms have limitations when converting XML data to a readable database format. Apart from Snowflake, none of the data warehouse platforms provide native support for converting XML.

  • Converting XML on Snowflake is a preview-only feature; it is not generally available and supported.
  • Individual XML documents must not exceed a certain size (16 MB).
  • Engineers still need to write code to manually convert the XML. This is not a problem for simple XML documents. However, once you have a certain complexity, deep nesting, and a large number of different XML documents this overhead becomes significant.

For those limitations and scenarios we have created Flexter. Flexter automatically converts any XML to a readable format in seconds. The conversion process is fully automated. No need for niche skills and long development life cycles.

Flexter can be used for a wide range of conversions of different XML data formats to Snowflake.

Flexter not only automates the conversion process, it also simplifies the output for downstream consumption. Flexter uses two optimisation algorithms, which you can read about in detail in the following blog post: https://sonra.io/2021/03/22/the-secrets-to-unlocking-xml-data-in-an-optimised-format-for-downstream-users/

If you want to find out more about Flexter you can have a look at the product pages and our product data sheet.

Choosing Between Automated and Manual XML Conversion

The advantages of using an automated conversion process have to be weighed against the additional licence costs for a tool such as Flexter.

For simple requirements or ad-hoc queries an XML automation tool is not required. XML conversion software is a good fit if one or more of the following checks applies to you:

  • Is your XML very complex? Does it use an XSD? Is it based on an industry data standard such as HL7 or FpML?
  • Do you have conversion requirements for many different types of XML?
  • Are your XML files huge?
  • Are you parsing vast volumes of XML data with tight SLAs?
  • Does your team lack in-house expertise in XML technologies, including XQuery, XSLT, and XPath?
  • Do you have tight deadlines for your XML conversion project?

If you answered ‘Yes’ to one or more of these questions, why not give the free online version of our XML converter Flexter a try? See for yourself what XML conversion software can do for you. Alternatively, talk directly to one of our XML conversion experts to discuss your use case in detail.

You also have the option to download a PDF that provides a side-by-side comparison of manual versus automated XML conversion.

Conclusion

In this document, we have compared Snowflake, Amazon Redshift, Google BigQuery, and Azure Synapse Analytics in terms of the support they provide for processing XML data. Snowflake provides native support for handling XML, currently in the preview phase, whereas the other cloud data warehouse platforms do not provide native support for dealing with XML.

A key differentiator is Snowflake’s Variant data type, which supports loading XML data as-is without converting it to a relational format. Snowflake also provides functions to query and unnest (flatten) the XML data in the Variant column, and we can create views and join this data with other structured data. One key limitation is that individual XML documents cannot exceed 16 MB in size.

For the remaining three platforms, the only option is to convert the XML in a time-consuming manual effort. This conversion causes significant overhead if the XML files are complex or nested, and if the file structure changes, the conversion code needs to be refactored. It is therefore difficult to use these platforms for XML processing use cases.

The best option for all four platforms is to use Flexter as it fully automates the conversion process and applies AI to optimise and simplify the target schema.