Home » Snowflake » How to Parse & Flatten XML in Snowflake With Examples

How to Parse & Flatten XML in Snowflake With Examples

Published on June 3, 2024
Updated on October 2, 2024

In this post we cover every detail of parsing XML in Snowflake including storing, loading, querying, and parsing XML.

Very few cloud data platforms have native support for parsing XML. Unlike Redshift or BigQuery, Snowflake stands out and provides the user with various options around working with XML. We will look at these options but also at the limitations of working with XML on Snowflake.

These limitations also need to be looked at in the light of the general limitations and downsides of manually converting XML. Instead of writing code using the Snowflake XML functions to parse complex and deeply nested XML you can fully automate XML parsing with an XML conversion tool.

What can you expect from this blog post?

As a very first step we go through a quick XML tutorial and clarify some of the core concepts you need to be aware of when working with XML and converting it to tables in a relational format.

We then outline the native options of working with XML on Snowflake

Options for storing XML in Snowflake
Options for loading XML into Snowflake
Options for querying XML via SQL (aka Schema on Read)
Options for parsing XML to a relational format (aka Schema on Write)
Options for flattening XML to a denormalised table

We then look at the various limitations of working with XML on Snowflake.

The Snowflake XML limitations also need to be looked at in the wider context of the limitations of writing code to query and convert XML. We will go through the downsides and give clear guidance under what circumstances it makes sense to use an automated approach using an XML conversion tool and when the manual coding approach is a better fit.

You will need to be aware of these limitations to make an informed decision on the best approach of working with XML on Snowflake.

Let’s dive in but before we start …

Here are the KEY TAKEAWAYS for those in a hurry

Key Takeaways

Snowflake supports XML natively, unlike Redshift, Azure Synapse, or BigQuery.

Traditional databases such as Oracle or SQL Server have much wider support for working with XML, e.g. features to work with XML schemas (XSDs) and support for standard XML tools such as XQuery.

In comparison to other semi-structured data types such as JSON, support for XML in Snowflake is limited, e.g. the XML feature is currently in public preview and has been so for the last couple of years.

Parsing and writing XML to Snowflake tables in a relational format is generally preferred over directly querying the XML as schema on read.

When working with XML on Snowflake you will inevitably run into various limitations, in particular if you are working with complex and nested XML.
- The Snowflake XML feature is only in public preview, which means that in a worst case scenario you may not get support for it.
- You can only store XML documents up to a size of 16 MB (uncompressed).
- There are no features for working with XSDs. This is a significant downside as it adds to the manual workload of converting XML.
- No support to use the infer_schema function or schema evolution
- No support for loading zipped XMLs.
- No data lineage or source to target mapping for XML elements to target table columns

When working with complex XML or many different types of XML the benefits of an automated XML conversion approach often outweigh the costs and risks associated with a manual conversion approach.

The manual approach using code is fine for simple XML conversion projects but runs into limitations for complex and deeply nested XML based on industry data standards. Having many different types of XML in an organisation also makes the manual approach costly.

Flexter is an enterprise XML converter that automatically converts XML to a Snowflake relational model. It addresses all of the Snowflake limitations and the wider limitations of manually converting XML by writing code.

Quick XML primer

Before we dive into the content let’s clarify some of the core concepts around XML.

If you are an XML pro you can jump straight to the section about working with XML on Snowflake.

XML element

An XML element is the primary building block of an XML document. It is used to structure the data into a hierarchy and can contain other elements, text, and attributes. An element consists of:

A start tag, which opens the element, such as <name>.
An end tag, which closes the element, such as </name>.
The content between the start and end tags, which can be text, other elements, or both. The content is often referred to as the value.
Optionally, attributes that provide additional information about the element.
An empty element tag is used for elements that do not need to enclose any content. This type of tag combines the opening and closing into a single self-closing tag. The syntax for an empty element tag ends with a slash / just before the closing angle bracket >, e.g. </name>

XML attribute

An XML attribute provides additional information about an element, usually in the form of a name-value pair, and is always included within the start tag of an element. Attributes are used to store data that is not specifically part of the data content, such as identifiers, properties, or values that affect the behaviour of the element.

Example

<author id=”12345”>

1	<author id=”12345”>

Difference between XML attribute and element

What are the differences between XML attribute and element? There is no fast and hard rule, just some rough guidelines.

Function: Elements structure the main data of the document and can contain other elements or text. Attributes provide additional, ancillary information about an element.
Multiplicity: An element can contain multiple other elements and text, whereas attributes are generally simple name-value pairs attached to elements.
Nesting: Elements can be nested within other elements to create a document structure. Attributes cannot contain other attributes or elements; they only exist within the start tag of an element.
Usage: Elements are used to represent the main components and complex data structures of the document. Attributes are best used for metadata or information that specifies properties of elements, such as identifiers, styles, or specific characteristics that do not constitute the primary data content.

XML Fragment

An XML fragment is a portion of an XML document that may consist of multiple elements and other content. Unlike a complete XML document, an XML fragment does not need to include the XML declaration (<?xml version=”1.0″?>) or a single root element. It can be a piece of the XML that is well-formed but not necessarily a complete document on its own.

Example of an XML Fragment

<Department id="ComputerScience">
    <Course>Introduction to Programming</Course>
    <Course>Algorithms and Data Structures</Course>
</Department>
<Department id="ElectricalEngineering">
    <Course>Circuit Analysis</Course>
    <Course>Electromagnetics</Course>
    <Course>Control Systems</Course>
</Department>

<Course>Introduction to Programming</Course>

<Course>Algorithms and Data Structures</Course>

</Department>

<Course>Circuit Analysis</Course>

<Course>Electromagnetics</Course>

<Course>Control Systems</Course>

</Department>

XML entity

XML has a small set of predefined entities, which are necessary for escaping special characters that have specific meanings in XML syntax. These include:

& for the ampersand (&)
< for the less-than sign (<)
> for the greater-than sign (>)
' for the apostrophe (‘)
" for the quotation mark (“)

CDATA

As an alternative to XML entity you can use a CDATA section

CDATA sections in XML are particularly useful in scenarios where you need to embed content that includes characters that would normally be interpreted as XML markup and would need to be escaped via predefined XML entities. You can use CDATA instead of XML entities.

A CDATA section is enclosed within <![CDATA[ and ]]> markers. Anything placed between these markers is ignored by the XML parser, which means it does not process any markup inside the CDATA section.

Here is an example. As you can see you don’t need to escape any elements via XML entities.

<content>
        <![CDATA[
            Welcome to our detailed blog post on XML! Here we discuss various aspects of XML, including:
            <ul>
                <li>Elements</li>
                <li>Attributes</li>
                <li>CDATA sections</li>
            </ul>
            Check out our <a href="http://example.com/more_xml">additional resources</a>.
        ]]>
</content>

<![CDATA[

Welcome to our detailed blog post on XML! Here we discuss various aspects of XML, including:

<ul>

<li>Elements</li>

<li>Attributes</li>

<li>CDATA sections</li>

</ul>

Check out our <a href="http://example.com/more_xml">additional resources</a>.

]]>

</content>

XML Namespace

Namespaces in XML are primarily used to avoid naming conflicts by distinguishing elements and attributes that may have the same name but different meanings, based on their context or domain. This is particularly useful when combining XML documents from different sources or when extending existing XML formats.

<?xml version="1.0" encoding="UTF-8"?>
<lib:Library xmlns:lib="http://www.example.com/library">
    <lib:Book>
        <lib:Title>Great Expectations</lib:Title>
        <lib:Author>Charles Dickens</lib:Author>
        <lib:ISBN>978-0141439563</lib:ISBN>
    </lib:Book>
    <lib:Book>
        <lib:Title>To Kill a Mockingbird</lib:Title>
        <lib:Author>Harper Lee</lib:Author>
        <lib:ISBN>978-0446310789</lib:ISBN>
    </lib:Book>
</lib:Library>

<?xml version="1.0" encoding="UTF-8"?>

<lib:Library xmlns:lib="http://www.example.com/library">

<lib:Book>

<lib:Title>Great Expectations</lib:Title>

<lib:Author>Charles Dickens</lib:Author>

<lib:ISBN>978-0141439563</lib:ISBN>

</lib:Book>

<lib:Book>

<lib:Title>To Kill a Mockingbird</lib:Title>

<lib:Author>Harper Lee</lib:Author>

<lib:ISBN>978-0446310789</lib:ISBN>

</lib:Book>

</lib:Library>

Namespace Declaration: xmlns:lib=”http://www.example.com/library” declares a namespace identified by the URI http://www.example.com/library. This URI is a unique identifier and does not need to point to an actual web resource.
Elements with Namespace Prefix: Each element within the library (like lib:Book, lib:Title, lib:Author, lib:ISBN) is prefixed with lib, associating it with the declared namespace.

XML conversions to relational format

When converting XML to a relational format you need to be aware of the following concepts and their pitfalls.

Denormalisation / Flattening

A common approach when converting XML to a relational format is to denormalise the data. Flattening is another word for denormalisation and refers to the same thing.

For each branch in the XML (more on branches in a minute), we create one fully flattened table.

Let’s go through an example.

We have the following XML document

<Company>
        <CompanyName>Tech Solutions</CompanyName>
        <Department>
            <DepartmentName>Engineering</DepartmentName>
            <Employee>
                <Name>John Doe</Name>
                <Position>Software Engineer</Position>
            </Employee>
            <Employee>
                <Name>Jane Smith</Name>
                <Position>DevOps Engineer</Position>
            </Employee>
        </Department>
        <Department>
            <DepartmentName>Marketing</DepartmentName>
            <Employee>
                <Name>Emily Johnson</Name>
                <Position>Marketing Manager</Position>
            </Employee>
            <Employee>
                <Name>Michael Brown</Name>
                <Position>Content Strategist</Position>
            </Employee>
        </Department>
    </Company>

<CompanyName>Tech Solutions</CompanyName>

<DepartmentName>Engineering</DepartmentName>

<Position>Software Engineer</Position>

</Employee>

<Name>Jane Smith</Name>

<Position>DevOps Engineer</Position>

</Employee>

</Department>

<DepartmentName>Marketing</DepartmentName>

<Name>Emily Johnson</Name>

<Position>Marketing Manager</Position>

</Employee>

<Name>Michael Brown</Name>

<Position>Content Strategist</Position>

</Employee>

</Department>

</Company>

The denormalised version looks as follows. It’s fully flattened into a single table.

CompanyName	DepartmentName	Name	Position
Tech Solutions	Engineering	John Doe	Software Engineer
Tech Solutions	Engineering	Jane Smith	DevOps Engineer
Tech Solutions	Marketing	Emily Johnson	Marketing Manager
Tech Solutions	Marketing	Michael Brown	Content Strategist

Normalisation

Having a normalised relational structure is typically preferred over a denormalised structure for the following reasons.

Data Integrity and Consistency:
- Reduced Redundancy: Normalisation reduces data redundancy by ensuring that each piece of data is stored in only one place. This minimises the risk of having inconsistent data across the database.
- Update Anomalies: When data needs to be updated, it only has to be changed in one place, reducing the risk of update anomalies and ensuring consistency.
- Referential Integrity: Normalised databases enforce referential integrity, meaning that relationships between tables are maintained accurately. This ensures that references between tables remain consistent.
- Elimination of Anomalies: Normalisation eliminates insertion, update, and deletion anomalies, ensuring that the database remains accurate and reliable.
Smaller Data Sets: By eliminating redundant data, normalised databases typically require less storage space.
Faster Updates: Updates are faster because only one copy of the data needs to be modified.
Scalability: Scalability: Normalised databases can be more easily scaled. Because data is stored in a structured way, adding new types of data or relationships typically involves adding new tables rather than modifying existing ones.

Multiple branches

Apart from simple scenarios, XML documents typically contain multiple branches. When denormalising XML documents you need to be aware that denormalising multiple branches may lead to cartesian products if not done properly. Later on we will look at an example on how to handle multiple branches in XML documents in Snowflake.

Have a look at the following XML document with two branches

<root>
    <BranchA>
        <ItemA>A1</ItemA>
        <ItemA>A2</ItemA>
    </BranchA>
    <BranchB>
        <ItemB>B1</ItemB>
        <ItemB>B2</ItemB>
        <ItemB>B3</ItemB>
    </BranchB>
</root>

<root>

</BranchA>

</BranchB>

</root>

If not done carefully the conversion to a denormalised tabular format may lead to a cartesian product as shown in the following example. It results in a multiplication of the resultset. It is definitely something you want to avoid as it will have a significant impact on performance and might even bring down your Snowflake Virtual Warehouse. More importantly it will give you the wrong results and bad data.

Cartesian table

ItemA	ItemB
A1	B1
A1	B2
A1	B3
A2	B1
A2	B2
A2	B3

You will need to create separate tables when you convert branches in an XML doc. One denormalised table for each branch.

The correct conversion of the previous examples has the following tables.

Table 1: BranchA

ItemA
A1
A2

Table 2: BranchB

ItemB
B1
B2
B3

Loading and storing XML in Snowflake

Before you can work with XML on Snowflake you first have to load and store the XML documents in the Snowflake database.

Storing XML in Snowflake

Storing XML in the Snowflake database is straightforward. While XML on Oracle and XML on MS SQL Server you have many more options of working with XML. However, understanding these options and choosing the right one can be time-consuming. Many of the options have been deprecated, e.g. Oracle EXTRACT and EXTRACTVALUE which adds to the confusion.

Snowflake keeps things simple. XML is stored in the VARIANT data type, which is also used to store other hierarchical data such as JSON.

Internally, Snowflake represents VARIANT data in an efficient, compressed format, which helps in optimising storage and query performance. Snowflake optimises queries on VARIANT data by leveraging its internal storage format and indexing mechanisms. This allows for efficient retrieval and manipulation of semi-structured data.

Important Tip

When working with XML and the VARIANT data type you will need to be knowledgeable around OBJECTs and ARRAYs. Before you start your XML journey on Snowflake make sure to read up on how OBJECTs and ARRAYs work and the various ways of working with these on Snowflake.

Loading XML

There are multiple options around loading XML into Snowflake

IIS (INSERT INTO … SELECT)

IIS can be used for ad-hoc loading of individual XML documents to a Snowflake table with a VARIANT column. It is great for quickly loading XMLs and running some test queries. For bulk loading in a production environment we recommend to use COPY INTO.

Here’s how you would structure this operation:

Create the table: First, create a table that includes a VARIANT column capable of storing semi-structured XML data.

CREATE OR REPLACE TABLE xml_data_table (
 
  xml_content VARIANT
);

CREATE OR REPLACE TABLE xml_data_table (

xml_content VARIANT

);

Insert XML data: Use the INSERT INTO … SELECT method to load an XML document.

INSERT INTO xml_data_table (xml_content)
SELECT PARSE_XML('<root><item id="1">Item 1</item><item id="2">Item 2</item></root>') AS xml_content;

1 2	INSERT INTO xml_data_table (xml_content) SELECT PARSE_XML('<root><item id="1">Item 1</item><item id="2">Item 2</item></root>') AS xml_content;

The PARSE_XML function is used to convert a string that contains XML into a VARIANT format that Snowflake can store in the xml_content column.

External Table

While it is possible to create an external table over JSON documents this is not possible for XML. As XML is not a widely used format, support for external tables was not added by Snowflake.

Copy INTO

Copy INTO is a command that is used to load files. Both tabular (e.g. CSV) and hierarchical data (JSON, XML, Parquet etc.) can be loaded into Snowflake with Copy INTO.

The command comes with a ton of options. For XML you have the following options

COMPRESSION = AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE
IGNORE_UTF8_ERRORS = TRUE | FALSE
PRESERVE_SPACE = TRUE | FALSE
STRIP_OUTER_ELEMENT = TRUE | FALSE
DISABLE_SNOWFLAKE_DATA = TRUE | FALSE
DISABLE_AUTO_CONVERT = TRUE | FALSE
REPLACE_INVALID_CHARACTERS = TRUE | FALSE
SKIP_BYTE_ORDER_MARK = TRUE | FALSE

We will look at COMPRESSION, CopyOptions, and STRIP_OUTER_ELEMENT as the more commonly used options.

COMPRESSION

If your XML comes compressed you can specify the type of compression.

From the options around COMPRESSION you can see that various algorithms are supported. We gzipped XML documents and successfully loaded our XML data into Snowflake.

As you can see, ZIP is not included in the list. ZIP is better described as an archive or file format that supports the compression and archiving of multiple files and directories. It uses various compression algorithms, e.g. DEFLATE.

When we tried to load XML in a ZIP we got an error.

COPY INTO employee_data
FROM @my_internal_xml_stage/employees.zip
FILE_FORMAT = (TYPE = 'XML');

COPY INTO employee_data

FROM @my_internal_xml_stage/employees.zip

FILE_FORMAT = (TYPE = 'XML');

Error Message:

100100 (22P02): Error parsing XML: not an XML element

1	100100 (22P02): Error parsing XML: not an XML element

This is unfortunate as XML documents are often shipped as ZIP files and first need to unzipped and then compressed using one of the support compression algorithms before they can be loaded.

Copy options

You can specify one of the following copy options:

ON_ERROR = CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT

1	ON_ERROR = CONTINUE \| SKIP_FILE \| SKIP_FILE_num \| 'SKIP_FILE_num%' \| ABORT_STATEMENT

You might be wondering about the difference between CONTINUE and SKIP_FILE

Let’s go through some test cases to show the difference in detail

Test Case: CONTINUE versus SKIP_FILE in COPY INTO Command

Objective

To evaluate the behaviour of the COPY INTO command in Snowflake when processing XML files with errors, using ON_ERROR options CONTINUE and SKIP_FILE. This test focuses on observing how Snowflake handles a file containing a malformed XML tag during data load operations.

Files Used

customer.xml: Contains three customer records; the third record has a malformed XML tag (</email1> instead of </email>).

<customer>
    <id>001</id>
    <name>John Doe</name>
    <email>john.doe@example.com</email>
    <age>30</age>
    <city>New York</city>
</customer>
<customer>
    <id>002</id>
    <name>Jane Smith</name>
    <email>jane.smith@example.com</email>
    <age>25</age>
    <city>Los Angeles</city>
</customer>
<customer>
    <id>003</id>
    <name>Bob Johnson</name>
    <email>bob.johnson@example.com</email1>//wrong tag
    <age>40</age>
    <city>Chicago</city>
</customer>

<email>john.doe@example.com</email>

</customer>

<name>Jane Smith</name>

<email>jane.smith@example.com</email>

<city>Los Angeles</city>

</customer>

<name>Bob Johnson</name>

<email>bob.johnson@example.com</email1>//wrong tag

<city>Chicago</city>

</customer>

employee.xml: Contains two well-formed employee records.

<?xml version="1.0" encoding="UTF-8"?>
<employees>
    <employee>
        <id>1</id>
        <name>John Doe</name>
        <department>Finance</department>
        <position>Manager</position>
    </employee>
    <employee>
        <id>2</id>
        <name>Jane Smith</name>
        <department>IT</department>
        <position>Developer</position>
    </employee>
</employees>

<?xml version="1.0" encoding="UTF-8"?>

<department>Finance</department>

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

</employees>

Test Execution

Loading with ON_ERROR = CONTINUE

COPY INTO customer_data
FROM @my_internal_xml_stage
FILE_FORMAT = (TYPE = 'XML')
ON_ERROR = CONTINUE;

COPY INTO customer_data

FROM @my_internal_xml_stage

FILE_FORMAT = (TYPE = 'XML')

ON_ERROR = CONTINUE;

Expected behaviour: Snowflake should load all valid XML entries and ignore the entry with the malformed XML tag. Specifically, it should load two valid customer records from customer.xml and two employee records as a single row from employee.xml.

Loading with ON_ERROR = SKIP_FILE

COPY INTO customer_data
FROM @my_internal_xml_stage
FILE_FORMAT = (TYPE = 'XML')
ON_ERROR = SKIP_FILE;

COPY INTO customer_data

FROM @my_internal_xml_stage

FILE_FORMAT = (TYPE = 'XML')

ON_ERROR = SKIP_FILE;

Expected Behaviour: Snowflake should skip the entire customer.xml file due to the malformed tag in the third record, resulting in no records from this file being loaded. It should successfully load the two employee records from employee.xml.

Observations

With CONTINUE:
- Snowflake processed the customer.xml and employee.xml files. Two records from customer.xml (those without errors) and two records from employee.xml were successfully loaded into the customer_data table.
- The record with the malformed tag in customer.xml was ignored, and no error halted the process.

With SKIP_FILE:
- Snowflake skipped the entire customer.xml file due to the error in one of the records. No data from this file was loaded into the customer_data table.
- Both records from employee.xml were successfully loaded, indicating that the error in customer.xml did not affect the processing of other files.

When you have an error in your XML file and you use CONTINUE, Snowflake will still process the file but just ignore the erroneous element / record. When using SKIP_FILE the whole document will be ignored.

STRIP_OUTER_ELEMENT

STRIP_OUTER_ELEMENT is used to strip away the outermost element in the document, which can simplify the data structure and make it easier to work with XML inside Snowflake.

If your data has an unnecessary outer wrapper, you can strip it to work directly with the contained data.
By removing the outer element, you can avoid additional querying steps or path navigation to access the core data elements.

In this simple example the outer wrapper of <customers> does not have much of a purpose and we can use STRIP_OUTER_ELEMENT to remove this element as part of the load to Snowflake.

<customers>
    <customer>
        <id>001</id>
        <name>John Doe</name>
        <email>john.doe@example.com</email>
        <age>30</age>
        <city>New York</city>
    </customer>
    <customer>
        <id>002</id>
        <name>Jane Smith</name>
        <email>jane.smith@example.com</email>
        <age>25</age>
        <city>Los Angeles</city>
    </customer>
    <customer>
        <id>003</id>
        <name>Bob Johnson</name>
        <email>bob.johnson@example.com</email>
        <age>40</age>
        <city>Chicago</city>
    </customer>
</customers>

<email>john.doe@example.com</email>

</customer>

<name>Jane Smith</name>

<email>jane.smith@example.com</email>

<city>Los Angeles</city>

</customer>

<name>Bob Johnson</name>

<email>bob.johnson@example.com</email>

<city>Chicago</city>

</customer>

</customers>

We use COPY INTO to load the XML file into a Snowflake table.

COPY INTO customer_data
FROM @my_internal_xml_stage/customers.xml
FILE_FORMAT = (TYPE = 'XML'
STRIP_OUTER_ELEMENT = TRUE));

COPY INTO customer_data

FROM @my_internal_xml_stage/customers.xml

FILE_FORMAT = (TYPE = 'XML'

STRIP_OUTER_ELEMENT = TRUE));

Next we query the table to ensure that data has been loaded.

SELECT * FROM customer_data;

1	SELECT * FROM customer_data;

Snowflake processed the XML fragment successfully, despite the absence of a root element. This indicates that Snowflake can parse and load XML data structures that are non-standard yet well formed.

While your XML needs to be well formed to be loaded into Snowflake it does not necessarily have to be valid, e.g. loading an XML fragment without a root element works.

Loading Metadata

When using COPY INTO you can also load metadata such as the filename into the target table. This is useful to trace back the XML data to the file it originated from.

Load the XML data from the staging area, capturing both the file name and the content.

COPY INTO customer_data
FROM (
	SELECT METADATA$FILENAME, t.$1
	FROM @my_internal_xml_stage/customers.xml
	(file_format => my_xml_file_format) t
);

COPY INTO customer_data

FROM (

SELECT METADATA$FILENAME, t.$1

FROM @my_internal_xml_stage/customers.xml

(file_format => my_xml_file_format) t

);

The feature INCLUDE_METADATA is not supported. Using INCLUDE_METADATA as part of COPY INTO will result in an error: 003535 (0A000)

Performance considerations for loading XML

The general recommendations for loading files into Snowflake also apply to XML. Here is a quick summary of the most important points.

File Size: Aim for file sizes between 100 MB and 1 GB (compressed) for optimal performance. Smaller files can increase overhead, while larger files may lead to timeouts and increased load times.

Splitting Large Files: If you have very large files, consider splitting them into smaller chunks within the optimal size range.

Compress Files: Use compression formats like gzip to reduce file size. Compressed files consume less storage and can be loaded faster.

Multiple Files: Load data from multiple files concurrently to leverage Snowflake’s automatic parallelism. Snowflake can parallelize the loading process to improve performance.

Querying XML

In this section we cover the options around querying XML on Snowflake in depth. We start by introducing some useful Snowflake XML functions and then show you how to traverse and flatten the XML hierarchy. We provide tips and tricks as we go.

Snowflake XML functions

Snowflake offers various XML functions to check and cast XML.

PARSE_XML

The PARSE_XML function in Snowflake interprets an input string as an XML document, producing an OBJECT value. This allows you to work with XML data within Snowflake by converting it into a format that Snowflake can manipulate and query.

Inserting XML without the PARSE_XML function would just store the XML as VARCHAR and you would not be able to use the Snowflake SQL features to query and read XML.

Syntax

PARSE_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

PARSE_XML( STR => <string_containing_xml> [ , DISABLE_AUTO_CONVERT => <disable_auto_convert> ] )

PARSE_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

PARSE_XML( STR => <string_containing_xml> [ , DISABLE_AUTO_CONVERT => <disable_auto_convert> ] )

string_containing_xml: The XML string to be parsed.
disable_auto_convert: (Optional) With this parameter you can decide whether or not to convert string values that appear to be numbers into a numerical format.

Example

You can use PARSE_XML to store XML data in a Snowflake table as follows:

CREATE TABLE xml_data_table (xml_content OBJECT);

INSERT INTO xml_data_table
	SELECT PARSE_XML('<org>
    	<company id="1">
        	<employee id="1">
            	<name>John Doe</name>
            	<position>Manager</position>
        	</employee>
        	<employee id="2">
            	<name>Jane Smith</name>
            	<position>Developer</position>
        	</employee>
    	</company>
    	<company id="2">
        	<employee id="1">
            	<name>Hans Wurscht</name>
            	<position>Hansel</position>
        	</employee>
        	<employee id="2">
            	<name>Leo A</name>
            	<position>Support</position>
        	</employee>
    	</company>
	</org>') AS xml_content;

CREATE TABLE xml_data_table (xml_content OBJECT);

INSERT INTO xml_data_table

SELECT PARSE_XML('<org>

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

</company>

<name>Hans Wurscht</name>

<position>Hansel</position>

</employee>

<position>Support</position>

</employee>

</company>

</org>') AS xml_content;

Now you can run SQL queries against the table to work on the XML data.

Select * from xml_data_table;

1	Select * from xml_data_table;

CHECK_XML

The CHECK_XML function in Snowflake checks the validity of an XML document. If the input string is NULL or a valid XML document, the function returns NULL. If there is an XML parsing error, the function returns an error message describing the issue.

Use Cases:

Validating XML Data: To ensure that XML data is correctly formatted before processing or storing it in the database.
Error Diagnostics: To diagnose and locate specific issues in XML documents that fail to parse correctly.

Syntax:

CHECK_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

CHECK_XML( STR => <string_containing_xml> [ , DISABLE_AUTO_CONVERT => <disable_auto_convert> ] )

CHECK_XML( <string_containing_xml> [ , <disable_auto_convert> ] )

CHECK_XML( STR => <string_containing_xml> [ , DISABLE_AUTO_CONVERT => <disable_auto_convert> ] )

string_containing_xml: The XML string to be checked for validity.
disable_auto_convert: (Optional) Boolean value to disable automatic conversion of numeric and boolean values, similar to the parameter in the PARSE_XML function.

Examples

Example 1: Checking Valid XML

SELECT CHECK_XML('<name> Valid </name>');

1	SELECT CHECK_XML('<name> Valid </name>');

Example 2: Checking Invalid XML

SELECT CHECK_XML('<name> Invalid </WRONG_CLOSING_TAG>');

1	SELECT CHECK_XML('<name> Invalid </WRONG_CLOSING_TAG>');

TO_XML

The TO_XML function in Snowflake converts a VARIANT to a VARCHAR containing an XML representation of the value. If the input is NULL, the result is also NULL. This function is useful for converting semi-structured data into an XML-formatted string.

Use Cases:

Generating XML Strings: To generate XML-formatted strings from VARIANT data, including JSON or other semi-structured formats.

Preserving Data Format: Ensuring that data originally formatted as XML can be converted back to its XML representation after manipulation.

Syntax:

TO_XML( <expression> )

1	TO_XML( <expression> )

Example:Converting JSON Data to XML

CREATE TABLE json_to_xml_table (json_col VARIANT);

INSERT INTO json_to_xml_table
	SELECT PARSE_JSON('{"key1": ["val1", "val2"], "key2": {"subkey": "subval"}}') AS json_col;

SELECT json_col, TO_XML(json_col)
FROM json_to_xml_table;

CREATE TABLE json_to_xml_table (json_col VARIANT);

INSERT INTO json_to_xml_table

SELECT PARSE_JSON('{"key1": ["val1", "val2"], "key2": {"subkey": "subval"}}') AS json_col;

SELECT json_col, TO_XML(json_col)

FROM json_to_xml_table;

Walking the XML hierarchy with XMLGET

You can use XMLGET to traverse the hierarchy of the elements in your XML document.

The documentation states the purpose of XMLGET clearly

XMLGET extracts an XML element object (often referred to as simply a tag) from the content of the outer XML element based on the name and instance number of the specified tag.

XMLGET returns an OBJECT. It does not return tag values or attribute values but the whole element or fragment.

Tip

If you need to read up on XML elements check the XML primer in this blog post.

Let’s go through some examples.

First we load some nested XML into a table

CREATE TABLE xml_demo (ID INTEGER, object_col OBJECT);

INSERT INTO xml_demo (id, object_col)
    SELECT 
       1,
       PARSE_XML('
<?xml version="1.0" encoding="UTF-8"?>
<level1 level1attr="this is level 1">
  #1
  <level2 level2attr="this is level 2">
    #2
    <level3 level3attr="this is level 3" level3attr2="this is level 3. another one">
      #3.1
      <level4>#4.1<level5>5</level5></level4>
      <level4>#4.2</level4>
    </level3>
    <level3 level3attr="this is also level 3">
      #3.2
    </level3>
  </level2>
</level1>
');

CREATE TABLE xml_demo (ID INTEGER, object_col OBJECT);

INSERT INTO xml_demo (id, object_col)

SELECT

PARSE_XML('

<?xml version="1.0" encoding="UTF-8"?>

#3.1

</level3>

#3.2

</level3>

</level2>

</level1>

');

Let’s run XMLGET to retrieve the level2 element

SELECT 
       XMLGET(object_col, 'level2')
    FROM xml_demo;

SELECT

XMLGET(object_col, 'level2')

FROM xml_demo;

The result is the element level2 and its child elements.

<level2 level2attr="this is level 2">
  #2
  <level3 level3attr="this is level 3" level3attr2="this is level 3. another one">
    #3.1
    <level4>
      #4.1
      <level5>5</level5>
    </level4>
    <level4>#4.2</level4>
  </level3>
  <level3 level3attr="this is also level 3">#3.2</level3>
</level2>

#3.1

#4.1

</level4>

</level3>

</level2>

Important note

You can not use XMLGET to extract the outermost element, which is level1 in our XML doc.

SELECT 
       XMLGET(object_col, 'level1')
    FROM xml_demo;

SELECT

XMLGET(object_col, 'level1')

FROM xml_demo;

Running this query for the level1 element will result in NULL being returned.

Using GET or a shorthand of XMLGET you can extract names of attributes and elements and their values. More on this topic in the next section.

We can walk the hierarchy of XML elements by using nested XMLGET calls.

Let’s move down one level in the hierarchy to element level3

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1)
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1)

FROM xml_demo;

There are multiple level3 elements and the positions of these elements will be represented as an array. If we don’t specify a position, XMLGET will default to position 0. In our example we specify position 1 in XMLGET and get the second child element inside the element level3

<level3 level1attr="this is also level 3">3.2</level3>

1	<level3 level1attr="this is also level 3">3.2</level3>

Extracting attributes, tags and values with GET

This is all great, but how do we actually get to the data itself?

You can use Operators of the GET function to extract the following information

@attrname: To extract attribute values, use GET(tag/element, ‘@attrname’)
$: To extract the content, use GET(tag/element, ‘$’)
@: To extract the tag name, use GET(tag/element, ‘@’)

Using our XML document from the previous section, let’s go through some examples.

Let’s first extract the values (content) of an element. In this case element level2

SELECT 
       GET(XMLGET(object_col, 'level2'),'
)
    FROM xml_demo;

SELECT

GET(XMLGET(object_col, 'level2'),'

)

FROM xml_demo;

GET returns an ARRAY if the XML element does have children

In this example, the level2 element contains multiple items (text and multiple nested elements). The nested tags are represented by OBJECTs (key-value pairs). The @ property contains the nested tag name and the $ property contains the nested tag contents.

Example with $

$ returns the value of the element if the element does not have any children of its own.

The second child of level 3, which is element <level3 level3attr=”this is also level 3″>#3.2</level3> does not have any children.

Let’s run GET with $ operator against this element.

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1),
       GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'
)
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1),

GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'

)

FROM xml_demo;

This returns text “#3.2”

Example with @

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1),
       GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'@')
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1),

GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'@')

FROM xml_demo;

This will return the tag name “level3”

Example with @attr

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1),
       GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'@level3attr')
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1),

GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'@level3attr')

FROM xml_demo;

With @level3attr you can get the value of the attribute level3attr. The query will return “this is also level 3”

Instead of GET you can use the following shorthand notation.

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1),
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1):"$"
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1),

XMLGET(XMLGET(object_col, 'level2'), 'level3',1):"$"

FROM xml_demo;

The same XMLGET() function is used but is followed by a single colon character, then followed by the argument.

It is the equivalent of

SELECT 
       XMLGET(XMLGET(object_col, 'level2'), 'level3',1),
       GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'$’)
    FROM xml_demo;

SELECT

XMLGET(XMLGET(object_col, 'level2'), 'level3',1),

GET(XMLGET(XMLGET(object_col, 'level2'), 'level3',1),'$’)

FROM xml_demo;

Both the queries will return “#3.2”.

Filtering XML with Higher Order Functions

Using higher-order functions in Snowflake, such as FILTER and TRANSFORM, you can apply Lambda expressions to arrays for filtering and transforming data. With these functions you can manipulate both structured and semi-structured data without the need for complex SQL operations or user-defined functions (UDFs).

Key Components of Higher-Order Functions:

Lambda Expression: A block of code that takes an argument and returns a value. In the context of Snowflake higher-order functions, Lambda expressions are anonymous functions specified directly within the higher-order function call.
Array: The data structure that contains the elements to be filtered or transformed. Arrays can be either structured or semi-structured.

Syntax of Higher-Order Functions:

The general syntax for higher-order functions like FILTER in Snowflake is as follows:

FILTER(<array>, <lambda_expression>)

1	FILTER(<array>, <lambda_expression>)

Higher Order Functions in action

The following example focuses on filtering products from a parsed XML content stored in a temporary SQL table based on their price attributes.

Data Preparation

First, we create a temporary table named xml_data. It contains XML content representing a list of products. Each product has an id and a price. We parse the XML into a structured format suitable for querying.

CREATE TEMP TABLE xml_data AS
SELECT PARSE_XML('<products>
<product>
<id>1</id>
<price>100</price>
</product>
<product>
<id>2</id>
<price>200</price>
</product>
</products>') AS xml_content;

CREATE TEMP TABLE xml_data AS

SELECT PARSE_XML('<products>

</product>

</product>

</products>') AS xml_content;

XML Structure and Conversion to Array

We access the XML data stored in xml_data using the xml_content:”$” expression. This notation converts the XML structure into an array format. This conversion is necessary because the FILTER function requires an array as input to apply the provided lambda function across each element of the array.

Output of xml_content:”$”

The xml_content:”$” notation translates the XML structure into a JSON-like array where each product in the XML becomes an object in this array. The structure of this array looks like this:

[
  {
	"$": [
  	{"$": 1, "@": "id"},
  	{"$": 100, "@": "price"}
	],
	"@": "product",
	"id": 0,
	"price": 1
  },
  {
	"$": [
  	{"$": 2, "@": "id"},
  	{"$": 200, "@": "price"}
	],
	"@": "product",
	"id": 0,
	"price": 1
  }
]

[

{

"$": [

{"$": 1, "@": "id"},

{"$": 100, "@": "price"}

"@": "product",

"id": 0,

"price": 1

{

"$": [

{"$": 2, "@": "id"},

{"$": 200, "@": "price"}

"@": "product",

"id": 0,

"price": 1

}

]

In this structure, each product is an object with a $ array holding its attributes. The attributes are also objects, where the $ denotes the value and @ the name of the attribute.

Filtering Query

The main query uses the FILTER function to select only those products whose price is greater than 100.

SELECT
	to_xml(FILTER(
    	a.xml_content:"$",
    	p ->  p:"$"[1]."$" >100
	)) AS updated_products
FROM xml_data a;

SELECT

to_xml(FILTER(

a.xml_content:"$",

p -> p:"$"[1]."$" >100

)) AS updated_products

FROM xml_data a;

Understanding the Lambda Function

The Lambda expression in Snowflake follows this structure:

<arg> [<datatype>] -> <expr>

1	<arg> [<datatype>] -> <expr>

<arg>: The argument representing each element in the array.
<datatype>: Optional. Specifies the data type of the argument.
<expr>: The expression that defines the operation to be performed on each element.

In the FILTER function, the lambda expression p -> p:”$”[1].”$” > 100 operates as follows:

p:”$”: This accesses the array of attributes and values within the product object. It effectively retrieves the array stored under the $ key of each product.
[1]: This is an array index that selects the second element of the attributes array (“$”). Given the consistent structure of each product, the second element in this array is always the price attribute.
“$”: This further drills down to extract the value of the price attribute. It selects the value of the price from the object, which looks like {“$”: 100, “@”: “price”}.
> 100: This part of the lambda function checks if the extracted price value is greater than 100.

Result

We can now convert the filtered XML back to XML to_xml(). This final XML output consists only of products that meet the filtering criteria.

Denormalising with Flatten and Lateral Flatten

We can use denormalisation and flattening techniques to transform complex, hierarchical data into simpler tabular forms. Hierarchical data, such as XML or JSON, often contains nested structures. Flattening involves combining and merging data related by parent – child relationships into a single table with rows and columns. This breaks down nested structures into flattened tables.

FLATTEN

The FLATTEN function in Snowflake converts hierarchical, semi-structured data into a tabular format by breaking down data types such as VARIANT, OBJECT, or ARRAY into multiple rows.

Input and Output:

Input: The function takes an expression of data type VARIANT, OBJECT, or ARRAY as input. Optional parameters can specify the path within the data structure, whether to include empty expansions, recursive expansion, and the mode (OBJECT, ARRAY, or BOTH).
Output: The output consists of a fixed set of columns (SEQ, KEY, PATH, INDEX, VALUE, THIS) that detail the flattened data elements.

Example of using FLATTEN for XML

SELECT
	GET(company.value, '@id') company_id,
	company.value
FROM TABLE(
	FLATTEN(
    	input => PARSE_XML('<org>
	<company id="1">
    	<employee id="1">
        	<name>John Doe</name>
        	<position>Manager</position>
    	</employee>
    	<employee id="2">
        	<name>Jane Smith</name>
        	<position>Developer</position>
    	</employee>
	</company>
	<company id="2">
    	<employee id="1">
        	<name>Hans Wurscht</name>
        	<position>Hansel</position>
    	</employee>
	</company>
</org>'):"$"
	)
) company;

SELECT

GET(company.value, '@id') company_id,

company.value

FROM TABLE(

FLATTEN(

input => PARSE_XML('<org>

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

</company>

<name>Hans Wurscht</name>

<position>Hansel</position>

</employee>

</company>

</org>'):"$"

)

) company;

Output

The output will look like this, with each of the two <company> elements being exploded into its own row:

LATERAL FLATTEN

LATERAL FLATTEN is a combination of the FLATTEN function with the LATERAL keyword. The LATERAL keyword allows the FLATTEN function to access columns from preceding tables in the FROM clause. This means that LATERAL FLATTEN can reference and operate on the data produced by the preceding FROM clause elements, making it more flexible and powerful for dealing with complex data transformations and correlations.

Example of using LATERAL FLATTEN

Step 1: Create a Table and Insert XML Data

CREATE OR REPLACE TABLE xml_data_table (
	id INTEGER AUTOINCREMENT,
	xml_content XML
);

INSERT INTO xml_data_table (xml_content)
	SELECT
   	PARSE_XML('<org>
	<company id="1">
    	<employee id="1">
        	<name>John Doe</name>
        	<position>Manager</position>
    	</employee>
    	<employee id="2">
        	<name>Jane Smith</name>
        	<position>Developer</position>
    	</employee>
	</company>
	<company id="2">
    	<employee id="1">
        	<name>Hans Wurscht</name>
        	<position>Hansel</position>
    	</employee>
    	<employee id="2">
        	<name>Leo A</name>
        	<position>Support</position>
    	</employee>
	</company>
</org>');

CREATE OR REPLACE TABLE xml_data_table (

id INTEGER AUTOINCREMENT,

xml_content XML

);

INSERT INTO xml_data_table (xml_content)

SELECT

PARSE_XML('<org>

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

</company>

<name>Hans Wurscht</name>

<position>Hansel</position>

</employee>

<position>Support</position>

</employee>

</company>

</org>');

Step 2: Use LATERAL FLATTEN to Extract Employee Data Along with Company ID

select
	GET(Company.value, '@id')::integer as Company_id,
	GET(employee.value, '@id')::integer as employee_id,
    	XMLGET( employee.value, 'name' ):"$"::string as emp_id
from
	XML_DATA_TABLE
	,  lateral FLATTEN(XML_CONTENT:"$") Company,
   LATERAL FLATTEN(Company.value:"$")  employee

select

GET(Company.value, '@id')::integer as Company_id,

GET(employee.value, '@id')::integer as employee_id,

XMLGET( employee.value, 'name' ):"$"::string as emp_id

from

XML_DATA_TABLE

, lateral FLATTEN(XML_CONTENT:"$") Company,

LATERAL FLATTEN(Company.value:"$") employee

LATERAL FLATTEN on xml_content: The first LATERAL FLATTEN function processes the company elements within the xml_content.
LATERAL FLATTEN on Company.value: The second LATERAL FLATTEN function processes the employee elements nested within each company.

Output

The output will look like this, with each row representing an employee and their associated company ID and name:

For each level of nesting you will need to add a separate LATERAL FLATTEN function. If you have branches you will need to do this for each branch in your XML. This brings us to our next topic..

Issue flattering XML with single element arrays

When we used LATERAL FLATTEN with an XML structure where a company has only one employee we ran into an issue. You will come across the problem because Snowflake’s LATERAL FLATTEN treats single element arrays differently than multi element arrays, leading to improper flattening. So for an XML like the following example:

INSERT INTO xml_data_table (xml_content)
    SELECT 
       
       PARSE_XML('<?xml version="1.0" encoding="UTF-8"?>

<org>

<company id="1">

  <employee id="1">

    <name>John Doe</name>

    <position>Manager</position>

  </employee>

  <employee id="2">

    <name>Jane Smith</name>

    <position>Developer</position>

  </employee>

  <employee id="3">

    <name>Mike Johnson</name>

    <position>Designer</position>

  </employee>

  <employee id="4">

    <name>Alice Brown</name>

    <position>Tester</position>

  </employee>

  <employee id="5">

    <name>Chris Davis</name>

    <position>Support</position>

  </employee>

</company>

<company id="2">

  <employee id="1">

    <name>Hans Wurscht</name>

    <position>Hansel</position>

  </employee>
  
  

</company>

</org>
');

INSERT INTO xml_data_table (xml_content)

SELECT

PARSE_XML('<?xml version="1.0" encoding="UTF-8"?>

<org>

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

<name>Mike Johnson</name>

<position>Designer</position>

</employee>

<name>Alice Brown</name>

<position>Tester</position>

</employee>

<name>Chris Davis</name>

<position>Support</position>

</employee>

</company>

<name>Hans Wurscht</name>

<position>Hansel</position>

</employee>

</company>

</org>

');

If you try to flatten the XML using the same query from the last section as follows:

SELECT
  typeof(emp.value) as emp_doc,
  emp.value,
  get(comp.value, '@id') as company_id,
  get(emp.value, '@id') as emp_id,
  xmlget(emp.value, 'name'):"$"::string as emp_name,
  xmlget(emp.value, 'position'):"$"::string as emp_position
FROM xml_data_table,
LATERAL FLATTEN(INPUT => xml_data_table.xml_content:"$") comp,
LATERAL FLATTEN(INPUT => comp.value:"$") emp;

SELECT

typeof(emp.value) as emp_doc,

emp.value,

get(comp.value, '@id') as company_id,

get(emp.value, '@id') as emp_id,

xmlget(emp.value, 'name'):"$"::string as emp_name,

xmlget(emp.value, 'position'):"$"::string as emp_position

FROM xml_data_table,

LATERAL FLATTEN(INPUT => xml_data_table.xml_content:"$") comp,

LATERAL FLATTEN(INPUT => comp.value:"$") emp;

The query will produce incorrect results, particularly for the company with only one employee:

As seen, the data for the second company (with only one employee) was not flattened correctly. The employee data was returned as an array instead of individual XML elements.

Solution: Using Array Conversion

To address the issue with single-element arrays, we modified the query to convert the input to an array during the flattening process:

SELECT
  typeof(emp.value) as emp_doc,
  emp.value,
  get(comp.value, '@id') as company_id,
  get(emp.value, '@id') as emp_id,
  xmlget(emp.value, 'name'):"$"::string as emp_name,
  xmlget(emp.value, 'position'):"$"::string as emp_position
FROM xml_data_table,
LATERAL FLATTEN(INPUT => xml_data_table.xml_content:"$") comp,
LATERAL FLATTEN(INPUT => comp.value:"$"::ARRAY) emp;//Converting the input to an array

SELECT

typeof(emp.value) as emp_doc,

emp.value,

get(comp.value, '@id') as company_id,

get(emp.value, '@id') as emp_id,

xmlget(emp.value, 'name'):"$"::string as emp_name,

xmlget(emp.value, 'position'):"$"::string as emp_position

FROM xml_data_table,

LATERAL FLATTEN(INPUT => xml_data_table.xml_content:"$") comp,

LATERAL FLATTEN(INPUT => comp.value:"$"::ARRAY) emp;//Converting the input to an array

The corrected query produced the expected results, properly handling single-element arrays:

By converting the input to an array during the LATERAL FLATTEN process, we were able to handle single-element arrays correctly and ensure that the XML data is parsed and flattened properly in Snowflake.

Flattening XML with multiple branches

When flattening XML that contains multiple branches we need to be very careful to avoid a cartesian product.

Let’s have a look at an XML file with multiple branches.

<University id="TUD">
    <Faculty id="Engineering">
        <Department id="ComputerScience">
         <Course>Introduction to Programming</Course>
         <Course>Algorithms and Data Structures</Course>
        </Department>
        <Department id="ElectricalEngineering">
         <Course>Circuit Analysis</Course>
            <Course>Electromagnetics</Course>
         <Course>Control Systems</Course>
        </Department>
        <ResearchGroup id="ArtificialIntelligence">
         <Project>Machine Learning Advances</Project>
         <Project>Neural Network Optimization</Project>
         <Project>AI Ethics and Society</Project>
        </ResearchGroup>
        <ResearchGroup id="Nanotechnology">
         <Project>Nano-materials Engineering</Project>
         <Project>Quantum Computing</Project>
        </ResearchGroup>
    </Faculty>
    <AdministrativeUnit id="Library">
        <Service>Book Lending</Service>
        <Service>Research Assistance</Service>
    </AdministrativeUnit>
    <AdministrativeUnit id="IT">
        <Service>Campus Network</Service>
        <Service>Helpdesk</Service>
    </AdministrativeUnit>
</University>

<Course>Introduction to Programming</Course>

<Course>Algorithms and Data Structures</Course>

</Department>

<Course>Circuit Analysis</Course>

<Course>Electromagnetics</Course>

<Course>Control Systems</Course>

</Department>

<Project>Machine Learning Advances</Project>

<Project>Neural Network Optimization</Project>

<Project>AI Ethics and Society</Project>

</ResearchGroup>

<Project>Nano-materials Engineering</Project>

<Project>Quantum Computing</Project>

</ResearchGroup>

</Faculty>

<Service>Book Lending</Service>

<Service>Research Assistance</Service>

</AdministrativeUnit>

<Service>Campus Network</Service>

<Service>Helpdesk</Service>

</AdministrativeUnit>

</University>

We can represent this XML in a tree like structure

University

├── Faculty

│ ├── Department

│ │ └── Course

│ ├── ResearchGroup

│ │ └── Project

├── AdministrativeUnit

└── Service

As we can see there are three branches

University > Faculty > Department > Course

University > Faculty > ResearchGroup > Project

University > AdministrativeUnit > Service

Challenges of flattening multiple XML branches

Let’s consider the two arrays <FACULTY> and <AdministrativeUnit> from the above example. We could use Flatten and the Lateral function to flatten the arrays. However, if we try to build a single query that flattens both arrays, we’ll get the dreaded Cartesian Product! In other words a multiplication of our result.

In the example above, there are 10 rows (i.e., subnodes) in the <Faculty> array and two rows in the <AdministrativeUnit > array. So the following query returns a total of 20 rows. This is definitely wrong and not what we want.

 select  
 Unit.value:"@id"::string::string as AdministrativeUnit_id,
 COURSE_PROJECT.value:"$"::string as Project_Course
 from
  XML_DATA_TABLE
 ,LATERAL FLATTEN(XML_CONTENT:"$") Unit
 ,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department
 , LATERAL FLATTEN(Department.value:"$") AS COURSE_PROJECT
 where GET( Unit.value, '@') = 'AdministrativeUnit';

select

Unit.value:"@id"::string::string as AdministrativeUnit_id,

COURSE_PROJECT.value:"$"::string as Project_Course

from

XML_DATA_TABLE

,LATERAL FLATTEN(XML_CONTENT:"$") Unit

,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department

, LATERAL FLATTEN(Department.value:"$") AS COURSE_PROJECT

where GET( Unit.value, '@') = 'AdministrativeUnit';

The best way to handle multiple branches is to create separate tables. One for each branch. As we have three branches we will end up with 3 tables.

Let’s do that in the following steps.

Project Branch

You can use the following code to flatten the project branches:

SELECT
GET(XML_CONTENT, '@id')::string as university_id,
XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,
GET(Department_.value, '@id')::string as Research_group_id,
COURSE_PROJECT.value:"$"::string as Project

 FROM  XML_DATA_TABLE
 ,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_
 , LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT
 where GET( COURSE_PROJECT.value, '@') = 'Project'
 ;

SELECT

GET(XML_CONTENT, '@id')::string as university_id,

XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,

GET(Department_.value, '@id')::string as Research_group_id,

COURSE_PROJECT.value:"$"::string as Project

FROM XML_DATA_TABLE

,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_

, LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT

where GET( COURSE_PROJECT.value, '@') = 'Project'

;

The flattened table looks like this:

Course branch

SELECT
GET(XML_CONTENT, '@id')::string as university_id,
XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,
GET(Department_.value, '@id')::string as dept_id,
COURSE_PROJECT.value:"$"::string as Course

 FROM  XML_DATA_TABLE
 ,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_
 , LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT
 where GET( COURSE_PROJECT.value, '@') = 'Course'
 ;

SELECT

GET(XML_CONTENT, '@id')::string as university_id,

XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,

GET(Department_.value, '@id')::string as dept_id,

COURSE_PROJECT.value:"$"::string as Course

FROM XML_DATA_TABLE

,LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_

, LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT

where GET( COURSE_PROJECT.value, '@') = 'Course'

;

AdministrativeUnit Branch

select
GET(XML_CONTENT, '@id')::string as university_id,
 Department_.value:"@id"::string::string as AdministrativeUnit_id,
admin_unit.value:"$"::string as Service
from
  XML_DATA_TABLE
 ,LATERAL FLATTEN(XML_CONTENT:"$") Department_
 ,LATERAL FLATTEN(Department_.value:"$") AS admin_unit
 where GET( Department_.value, '@') = 'AdministrativeUnit'
 ;

select

GET(XML_CONTENT, '@id')::string as university_id,

Department_.value:"@id"::string::string as AdministrativeUnit_id,

admin_unit.value:"$"::string as Service

from

XML_DATA_TABLE

,LATERAL FLATTEN(XML_CONTENT:"$") Department_

,LATERAL FLATTEN(Department_.value:"$") AS admin_unit

where GET( Department_.value, '@') = 'AdministrativeUnit'

;

Query XML with schema on read

Now let’s create views on top of the flattened branches to create our reusable schema on read. Creating database views makes the XML queries reusable and easy to use with SQL.

-- View for Research Groups and Their Projects
CREATE OR REPLACE VIEW view_research_groups AS
SELECT
	GET(XML_CONTENT, '@id')::string as university_id,
	XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,
	GET(Department_.value, '@id')::string as Research_group_id,
	COURSE_PROJECT.value:"$"::string as Project
FROM  
	XML_DATA_TABLE,
	LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_,
	LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT
WHERE
	GET(COURSE_PROJECT.value, '@') = 'Project';

-- View for Departments and Their Courses
CREATE OR REPLACE VIEW view_departments_courses AS
SELECT
	GET(XML_CONTENT, '@id')::string as university_id,
	XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,
	GET(Department_.value, '@id')::string as dept_id,
	COURSE_PROJECT.value:"$"::string as Course
FROM  
	XML_DATA_TABLE,
	LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_,
	LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT
WHERE
	GET(COURSE_PROJECT.value, '@') = 'Course';

-- View for Administrative Units and Their Services
CREATE OR REPLACE VIEW view_administrative_units AS
SELECT
	GET(XML_CONTENT, '@id')::string as university_id,
	Department_.value:"@id"::string as AdministrativeUnit_id,
	admin_unit.value:"$"::string as Service
FROM
	XML_DATA_TABLE,
	LATERAL FLATTEN(XML_CONTENT:"$") Department_,
	LATERAL FLATTEN(Department_.value:"$") AS admin_unit
WHERE
	GET(Department_.value, '@') = 'AdministrativeUnit';

-- View for Research Groups and Their Projects

CREATE OR REPLACE VIEW view_research_groups AS

SELECT

GET(XML_CONTENT, '@id')::string as university_id,

XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,

GET(Department_.value, '@id')::string as Research_group_id,

COURSE_PROJECT.value:"$"::string as Project

FROM

XML_DATA_TABLE,

LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_,

LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT

WHERE

GET(COURSE_PROJECT.value, '@') = 'Project';

-- View for Departments and Their Courses

CREATE OR REPLACE VIEW view_departments_courses AS

SELECT

GET(XML_CONTENT, '@id')::string as university_id,

XMLGET(XML_CONTENT, 'Faculty'):"@id"::string as Faculty_id,

GET(Department_.value, '@id')::string as dept_id,

COURSE_PROJECT.value:"$"::string as Course

FROM

XML_DATA_TABLE,

LATERAL FLATTEN(XMLGET(XML_CONTENT, 'Faculty'):"$") Department_,

LATERAL FLATTEN(Department_.value:"$") AS COURSE_PROJECT

WHERE

GET(COURSE_PROJECT.value, '@') = 'Course';

-- View for Administrative Units and Their Services

CREATE OR REPLACE VIEW view_administrative_units AS

SELECT

GET(XML_CONTENT, '@id')::string as university_id,

Department_.value:"@id"::string as AdministrativeUnit_id,

admin_unit.value:"$"::string as Service

FROM

XML_DATA_TABLE,

LATERAL FLATTEN(XML_CONTENT:"$") Department_,

LATERAL FLATTEN(Department_.value:"$") AS admin_unit

WHERE

GET(Department_.value, '@') = 'AdministrativeUnit';

This gives us three database views

view_research_groups
view_departments_courses
view_administrative_units

We can now give these database views to our data analysts for further querying.

Working with mixed content in XML

In XML, mixed content refers to an element that contains both text and other child elements. This means the content is a mix of text nodes and element nodes, as opposed to purely text or purely child elements.

Here is an example of mixed content.

<?xml version="1.0" encoding="UTF-8"?>
<org>
<company>1
  <employee id="1">
    <name>John Doe</name>
    <position>Manager</position>
  </employee>
  <employee id="2">
    <name>Jane Smith</name>
    <position>Developer</position>
  </employee>
  <employee id="3">
    <name>Mike Johnson</name>
    <position>Designer</position>
  </employee>
  <employee id="4">
    <name>Alice Brown</name>
    <position>Tester</position>
  </employee>
  <employee id="5">
    <name>Chris Davis</name>
    <position>Support</position>
  </employee>
</company>
<company>2
  <employee id="1">
    <name>Hans Wurscht</name>
    <position>Hansel</position>
  </employee>
<employee id="2">
    <name>Uli</name>
    <position>Hansel</position>
  </employee>
</company>
</org>

<?xml version="1.0" encoding="UTF-8"?>

<org>

<company>1

<position>Manager</position>

</employee>

<name>Jane Smith</name>

<position>Developer</position>

</employee>

<name>Mike Johnson</name>

<position>Designer</position>

</employee>

<name>Alice Brown</name>

<position>Tester</position>

</employee>

<name>Chris Davis</name>

<position>Support</position>

</employee>

</company>

<company>2

<name>Hans Wurscht</name>

<position>Hansel</position>

</employee>

<position>Hansel</position>

</employee>

</company>

</org>

The <company> element contains both text (in this case the number 1) and other child elements (<employee>)

<company>1
  <employee id="1">
    <name>John Doe</name>
    <position>Manager</position>

<company>1

<position>Manager</position>

Extracting data from XML with mixed content is not straightforward due to the mixing of text and elements. However, it’s possible to parse such content using Snowflake’s XML functions. The following query demonstrates how you can achieve this:

SELECT

GET(COMPANY.value, '
)[0] AS company_id,
GET(EMPLOYEE.value, '@id') AS employee_id,
XMLGET( EMPLOYEE.value, 'name' ):"$"::string as employee_name,
XMLGET( EMPLOYEE.value, 'position' ):"$"::string as employee_position
FROM XML_DATA_TABLE,
lateral FLATTEN(XML_CONTENT:"$") COMPANY,
lateral FLATTEN(COMPANY.VALUE:"$") EMPLOYEE
WHERE  GET(EMPLOYEE.value, '@') = 'employee' ;

SELECT

GET(COMPANY.value, '

)[0] AS company_id,

GET(EMPLOYEE.value, '@id') AS employee_id,

XMLGET( EMPLOYEE.value, 'name' ):"$"::string as employee_name,

XMLGET( EMPLOYEE.value, 'position' ):"$"::string as employee_position

FROM XML_DATA_TABLE,

lateral FLATTEN(XML_CONTENT:"$") COMPANY,

lateral FLATTEN(COMPANY.VALUE:"$") EMPLOYEE

WHERE GET(EMPLOYEE.value, '@') = 'employee' ;

Explanation

SELECT Clause:

GET(COMPANY.value, ‘$’)[0] AS COMPANY_ID: Extracts the first text node from the <company> element, which contains the company ID.
XMLGET(EMPLOYEE.value, ‘name’):”$”::string AS employee_name: Extracts the name element from each <employee> element and casts it to a string.
XMLGET(EMPLOYEE.value, ‘position’):”$”::string AS employee_position: Extracts the position element from each <employee> element and casts it to a string.

FROM Clause:

LATERAL FLATTEN(XML_CONTENT:”$”) COMPANY: Flattens the XML content at the root level to extract each <company> element.
LATERAL FLATTEN(COMPANY.VALUE:”$”) EMPLOYEE: Further flattens each <company> element to extract nested <employee> elements.

WHERE Clause:

GET(EMPLOYEE.value, ‘@’) = ’employee’: Ensures that only <employee> elements are processed. Otherwise the output will also have null rows because of text nodes present in the mixed content

The output of the query is as follows:

Parsing XML to Snowflake tables

In the previous section we covered in detail how we can query XML and create a schema on read on top of XML data stored inside Snowflake’s VARIANT data type.

We used XMLGET, GET, and LATERAL FLATTEN to walk the XML hierarchy. We then created database Views on top of these queries to make the queries reusable and expose them to data analysts.

Another approach is to parse the data to Snowflake tables and write the output to a relational format. We typically do this as part of ETL and a data pipeline. For example we can load data into Snowflake tables from the database views created previously as follows.

-- Step 1: Create the table
CREATE TABLE research_groups (
    university_id STRING,
    faculty_id STRING,
    research_group_id STRING,
    project STRING
);

-- Step 2: Insert data into the table from the view
INSERT INTO research_groups
SELECT 
    university_id,
    faculty_id,
    research_group_id,
    project
FROM 
    view_research_groups;

-- Step 1: Create the table

CREATE TABLE research_groups (

university_id STRING,

faculty_id STRING,

research_group_id STRING,

project STRING

);

-- Step 2: Insert data into the table from the view

INSERT INTO research_groups

SELECT

university_id,

faculty_id,

research_group_id,

project

FROM

view_research_groups;

While the VARIANT data type offers relatively good performance, querying Snowflake relational tables is always faster. This is particularly true for the following scenarios

You have complex and deeply nested XML
You require many joins across database views that you created as part of the schema on read.

The other significant advantage is that data analysts are very skilled using SQL and they prefer writing SQL against tables instead of semi-structured data.

Unless you have an ad hoc requirement or a very simple scenario and use case we recommend writing out your XML data to relational tables in Snowflake.

XML in data warehouse, data lake, or data lakehouse

When we convert XML to a relational format we typically do this as part of a wider data warehouse load where one of the data sources is XML.

You can check our guide on XML conversion in the context of data warehousing and data lakes for some reference architecture and what you need to take into account when building an XML conversion workflow.

You can use Flexter to automate the parsing of XML and conversion to Snowflake tables. Store your XML files on object storage, e.g. S3 or Azure Data Lake Storage (ADSL) and automatically convert the XML to Snowflake tables. No coding needed. As a bonus Flexter can also parse and convert JSON.

Parsing XML with Python or Java

Apart from the Snowflake XML functions you can also use Snowpark to parse and convert Snowflake in XML.

You can either use Python inside Snowpark to load the XML into a data frame and then use Python native features to parse the XML. A similar process applies to using Java. Both approaches require a lot of coding and are very time consuming.

Converting XML to JSON

Another option of parsing XML is to convert the XML to JSON using XSLT and then use Snowflake’s JSON functions to process the data.

Converting XML to Snowflake tables with Flexter

Converting clinical trials XML to Snowflake tables

We downloaded clinical trials data with 299,015 studies in XML format from the ClinicalTrials.gov website with thousands of XML files

Each XML document contains information about a clinical trial, e.g. facilities or sites where the study is performed, names of doctors and other study personnel etc.

We will use Flexter to convert the clinical trials data locked away in the XML. In this example we will use Flexter’s command line utilities. Flexter also ships with an API.

We will use the XML data to infer a relational schema

Step 1: Create Data Flow and Target model

In a first step we create the target data model and the mappings from the XML elements to the table columns in the relational target schema. This information is stored in Flexter’s metadata repository. We will use the XML data only to infer a relational schema. In the next section we show you how you can use an XSD in combination with XML to derive a relational schema.

xml2er -g1 /ClinicalTrials.zip

1	xml2er -g1 /ClinicalTrials.zip

The -g switch tells Flexter which optimisation algorithm to use. Flexter ships with some algorithms that make it easier to work with the target schema by simplifying the structures.

We also point Flexter to the zipped up XML data. This is used to derive the relational target schema.

As output from this step we get a schema ID. In this case 2556. We will use this ID as input for the next step.

Step 2: Convert XML data to tables

We will now convert the XML to the relational schema we created in the previous step. We pass the schema ID 2556 into the -l switch and specify the connection details

xml2er -l2556 /ClinicalTrials.zip -S o “snowflake://https://<account>.snow flakecomputing.com/?warehouse=ware&db=db&schema=CLINICAL_TRIALS” -U <username> -P <pwd>

1	xml2er -l2556 /ClinicalTrials.zip -S o “snowflake://https://<account>.snow flakecomputing.com/?warehouse=ware&db=db&schema=CLINICAL_TRIALS” -U <username> -P <pwd>

That’s it. We can now query the data with SQL in Snowflake.

We mentioned earlier that Flexter ships with a metadata catalog. The metadata can be used to generate DDL scripts, compare different schema versions, evolve the schema, generate data lineage, generate an ER diagram, etc.

Here is the ER diagram that we generated from the Flexter UI and exported to a PDF. We have made the data model of the clinical trials data available for download.

Converting HL7 XML and XSD to Snowflake

In this example we show you how you can use an XSD to parse XML to Snowflake tables using Flexter. It is also a two step process but this time we use both the XSD and XML to generate the target schema.

For this example we will use HL7 CDA data. HL7 is a widespread standard in healthcare.

Step 1: Generate relational target schema.

You already know the first command from the previous section on processing clinical trials XML. It generates a schema ID. In this case the ID is 5964.

We use this ID to enrich and complement the created schema with additional information from the XSD, e.g. constraints, data types, relationships etc.

xml2er -g1 /CDA_XML.zip

xsd2er -a5964 -g1 /CDA_XSD.zip

xml2er -g1 /CDA_XML.zip

xsd2er -a5964 -g1 /CDA_XSD.zip

The schema ID is passed to the -a switch of the xsd2er command line tool.

We also pass the CDA XSD as a zip to this Flexter command line utility.

The output is an updated schema ID 2406. This will be needed in the next step.

Step 2: Convert HL7 XML to Snowflake tables

We use the schema ID from step 1 to convert the data to Snowflake.

xml2er -l2406 /CDA_XML.zip -S o “snowflake://https://<snowflake account>.snow flakecomputing.com/?warehouse=ware&db=db&schema=CDA” -U <user> -P <pass>

1	xml2er -l2406 /CDA_XML.zip -S o “snowflake://https://<snowflake account>.snow flakecomputing.com/?warehouse=ware&db=db&schema=CDA” -U <user> -P <pass>

Now we can query the converted data with SQL on Snowflake.

Here is the documentation that Flexter generated as an ER diagram in PDF.

Limitations of working with XML in Snowflake

So far we have seen what you can do with XML in Snowflake. Now, let’s focus on what you can’t do with XML inside Snowflake.

The biggest limitation of working with XML on Snowflake is that the conversion to a readable and queryable format is a manual process that requires a lot of manual effort both in terms of analysis and coding. This is not a Snowflake specific issue but a challenge in general. We will cover this topic in more detail in a minute as a separate section about automating XML conversion.

Let’s first go through some of the more technical limitations of XML parsing on Snowflake

XML is public preview feature

The XML feature is only in public preview, which means that there is only limited support. In other words Snowflake does not guarantee that everything will work as expected. In a worst case scenario you may not get support or a fix for an issue you encounter.

16 MB limit for VARIANT

You can only store XML documents up to a size of 16 MB. This is the uncompressed size. In reality it will be slightly smaller than this as there is some internal overhead of storing the data.

ZIPped XML not working

Often you will get XML documents in a .zip file. Snowflake does not support loading zipped files directly. You will first need to unzip the file and then copy them into Snowflake.

XML Schema XSD not supported

A bigger limitation is that unlike other databases such as Oracle or MS SQL Server, Snowflake does not support XML schema. XSDs are not supported. This means that you need to manually infer the data types, relationships, and constraints from a sample of XML files. XSD No support for XSD. The data types, relationships and constraints need to be derived manually, which is very time consuming. If new information in the form of new XML becomes available, e.g. the data type is a Character rather than a NUMBER the solution needs to be refactored.

Schema evolution

We just noted that Snowflake does not have support for XSDs. Related to this is the issue that Snowflake does not have a feature to notify the user of unexpected / rogue XML elements or of changes to the XML in general. This means that the user can never be sure that they have full coverage for all possible XML elements as they have to rely on a sample of XMLs which is typically incomplete.

Let’s illustrate this with an example.

When the engineer first wrote the code to convert the XML to relational tables on Snowflake they relied on the following XML sample

<Company>
    <Name>Tech Solutions</Name>
    <Location>San Francisco</Location>
    <Employee>
        <FirstName>John</FirstName>
        <LastName>Doe</LastName>
        <Position>Software Engineer</Position>
    </Employee>
</Company>

<Name>Tech Solutions</Name>

<Location>San Francisco</Location>

<Position>Software Engineer</Position>

</Employee>

</Company>

The solution went into production and everything works seemingly fine.

However, as it turns out the sample that was used to infer the schema was too narrow and several elements were missed in the analysis.

<Company>
    <Name>Tech Solutions</Name>
    <Location>San Francisco</Location>
    <Employee>
        <FirstName>John</FirstName>
        <LastName>Doe</LastName>
        <Position>Software Engineer</Position>
       <Email>john.doe@techsolutions.com</Email>
        <Phone>555-1234</Phone>
    </Employee>
</Company>

<Name>Tech Solutions</Name>

<Location>San Francisco</Location>

<Position>Software Engineer</Position>

<Email>john.doe@techsolutions.com</Email>

</Employee>

</Company>

Email and Phone will not be included in the conversion and there is no way to get notified about this issue.

A related issue is that information will go missing in case of the renaming of an element, e.g. Position to Role

<Company>
    <Name>Tech Solutions</Name>
    <Location>San Francisco</Location>
    <Employee>
        <FirstName>John</FirstName>
        <LastName>Doe</LastName>
        <Role>Software Engineer</Role>
    </Employee>
</Company>

<Name>Tech Solutions</Name>

<Location>San Francisco</Location>

<Role>Software Engineer</Role>

</Employee>

</Company>

The conversion code will ignore this change and populate the downstream Position column with NULL values.

INFER_SCHEMA not working for XML

While Snowflake does have a feature to derive a schema from a sample of semi-structured documents, infer_schema is only supported for JSON but not for XML.

XML versus JSON support

In general support for XML is a lot less mature than support for JSON. While support for JSON is super comprehensive, XML is treated as a second class citizen (also in the documentation). As XML is not widely used this is understandable but will not help you if you use XML extensively in your organisation.

XQuery not supported

XQuery is a widely used language and standard for querying XML. Oracle or MS SQL Server have comprehensive support for XQuery and integrated it with SQL. Snowflake does not have XQuery support but has implemented its own proprietary set of functions to work with XML. Engineers with XQuery skills will need to reskill on the proprietary features.

Snowflake does not support XML schema evolution, which is supported for other semi-structured file types. Schema evolution on Snowflake is a manual process. As a prerequisite you will need to have some code that monitors the incoming XML documents for changes or unidentified XML elements. For any changes that are detected you can send out a notification to the operations or engineering team. The latter will then need to refactor the code to cater for the changes. This is not a Snowflake specific limitation but a general limitation of manual XML conversion.

No data lineage for XML

Snowflake does not provide data lineage that maps the XML elements and their XPaths to target table columns.

No schema evolution

Refactoring and schema evolution are manual processes.

Automated XML conversion on Snowflake

We have created Flexter to address the limitations of working with XML on Snowflake. As a bonus it also eliminates the risks and headaches associated with manual XML conversion projects. Flexter is an XML conversion tool aka XML converter. It automates the conversion of XML to a relational format on Snowflake and other popular data platforms and databases such as BigQuery, Databricks, Azure, Oracle, SQL Server, Teradata etc.

Converting XML to Snowflake with Flexter is a two step process. We have provided step by step walkthroughs in an earlier section when we talked about the options of parsing XML on Snowflake.

Step 1: Based on an XSD, a sample of XML files, or a combination of both you generate a relational target schema and the mappings between XML elements and the target table columns in Snowflake

Step 2: Convert XML

Yes, it’s that simple.

We have put together a brief demo video where we convert XML based on the FpML industry data standard to Snowflake. It is one of the more complex standards with tens of thousands of elements. It would be quite challenging to convert this data manually using code.

Did you know?

You can also use Flexter to automate the conversion of JSON to a relational format on Snowflake. The same simple two step process as for XML applies.

Manual versus automated XML conversion

Let’s discuss when to use an automated approach and when to manually analyse and convert XML.

We have put together a PDF that compares a manual approach of converting XML to the automated conversion process. We make the comparison along the XML conversion project life cycle

	Manual approach using code	Automated approach with Flexter
Analysis to understand structure of XML and XSD	For complex schemas this process can be super time consuming. The analysis can take more time than all the other steps combined. – Go through documentation of industry data standard if available. – Understanding XML Schema/Structure. – Pinpoint the main elements, attributes, and their hierarchies within the XML. – Understand how different elements and attributes relate to each other. – Identify any hierarchical structures, such as parent-child relationships, within the XML data. – Checking for Repeating and Optional Elements: – Identify elements that occur multiple times and understand their significance. – Note any optional elements that may or may not appear in the XML. – Understanding Namespaces – Evaluate the size of the XML files and the complexity of the structure, as this will impact the mapping process. – Evaluate XML for Anomalies or Irregularities	Running a single command covers steps 1-3 This creates a Flexter Data Flow, which analyses the XSD and/or XML, creates a relational target data model, and creates the mappings between XML source elements and target table columns Option 1: Generating a data flow from an XML. We apply the elevate and reuse optimisation algorithm (g3 switch) $ xml2er -g3 donut.xml Option 2: Generating a Data Flow from an XSD $ xsd2er -g3 donut.xsd Option 3: Generating a Data Flow from an XML and XSD to get the best of both worlds $ xml2er donut.xml $ xsd2er -a4 -g3 donut.xsd In this example we create the Data Flow from the command line. You can also use the Flexter API or Flexter UI to create a Data Flow.
Model and create target schema	Once you have completed the analysis you need to use your findings to create a relational schema for the target format. Translate between the hierarchical structure to the relational format Map data types Define and implement naming conventions Define and implement parent-child relationships Define and implement constraints	Already covered under Step 1 It is worthwhile mentioning that Flexter applies optimisation algorithms that significantly simplify the structure of the target schema
Create mapping	Manually map the XML elements to the target table columns	Already covered under Step 1
XML Conversion	This is another very time consuming step. Developers write the code to load XML to the database and convert it with extensions to SQL such as XQuery to the relational databases tables and columns. Developers build a data pipeline and schedule a time to run it. The solution is tested	A single command is run to convert the data. The switch l3 references the ID of the Data Flow that was generated in the previous step. $ xml2er -l3 donut.zip In this example we create the XML Conversion from the command line. You can also use the Flexter API or Flexter UI to convert the XML.
Error handling, logging, and alerting	In XML mapping error handling, there are two crucial checks to perform. Firstly, ensure that the XML to be converted is valid. Secondly, verify that the XML adheres to the target schema. This means it should not have any additional XML elements and must comply with the specified data types.	This is automatically handled as part of Step 4 XML Conversion
Documentation	The most important documentation artefact is a Source to Target Map (STM). The STM acts as a blueprint that defines how each piece of data is transformed and transferred from the source to the target. It specifies how each field in the XML corresponds to the target including any data type conversions.	Already covered under Step 1 When creating the Data Flow Flexter stores all of the information in its metadata catalog. The metadata can then be used to auto generate Source to Target Maps, ER diagrams, and diffs and deltas between different versions of a schema.
Refactoring for changes	If changes are made to the XML or the XML schema you need to repeat steps 1-6 to cater for the change	Flexter ships with a metadata store that semi-automates the upgrade between different versions of an XML or XSD.

When considering an automated XML conversion tool like Flexter, weigh the benefits against the additional licence costs.

For simple or occasional needs, an XML automation tool may not be necessary. However, XML conversion software is beneficial if any of the following apply:

Your XML is very complex or uses an XSD.
It is based on an industry standard like HL7 or FpML.
You need to convert many types of XML.
Your XML files are large.
You handle vast volumes of XML data with tight SLAs.
Your team lacks expertise in XML technologies (XQuery, XSLT, XPath).
You have tight deadlines for your XML conversion project.

If you answered ‘Yes’ to any of these, try the free online version of our XML converter, Flexter, or set up a free consultation with one of our XML experts to discuss your needs.

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

How to Parse & Flatten XML in Snowflake With Examples

Key Takeaways

Quick XML primer

XML element

XML attribute

Difference between XML attribute and element

XML Fragment

XML entity

CDATA

XML Namespace

XML conversions to relational format

Denormalisation / Flattening

Normalisation

Multiple branches

Loading and storing XML in Snowflake

Storing XML in Snowflake

Important Tip

Loading XML

IIS (INSERT INTO … SELECT)

External Table

Copy INTO

COMPRESSION

Copy options

STRIP_OUTER_ELEMENT

Loading Metadata

Performance considerations for loading XML

Querying XML

Snowflake XML functions

PARSE_XML

Example

CHECK_XML

TO_XML

Walking the XML hierarchy with XMLGET

Tip

Extracting attributes, tags and values with GET

Filtering XML with Higher Order Functions

Key Components of Higher-Order Functions:

Higher Order Functions in action

Denormalising with Flatten and Lateral Flatten

FLATTEN

LATERAL FLATTEN

Example of using LATERAL FLATTEN

Issue flattering XML with single element arrays

Flattening XML with multiple branches

Query XML with schema on read

Working with mixed content in XML

Parsing XML to Snowflake tables

XML in data warehouse, data lake, or data lakehouse

Parsing XML with Python or Java

Converting XML to JSON

Converting XML to Snowflake tables with Flexter

Converting clinical trials XML to Snowflake tables

Converting HL7 XML and XSD to Snowflake

Limitations of working with XML in Snowflake

Automated XML conversion on Snowflake

Did you know?

Manual versus automated XML conversion

Further reading

Snowflake data types

Snowflake XML functions

XML conversion

Flexter

Flexter Converting XML to Snowflake tables

Related Articles

CSV vs JSON vs XML – The Best Comparison Guide 2024

XSD to Database Schema Guide – Create SQL Tables Easily

Convert XML to CSV Like a Pro – Essential Tools & Methods

Cookies consent