XML,

Home » XML » The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users

The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users

by Maciek

Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.

Published on January 22, 2021
Updated on March 1, 2025

All the data in the world is not worth much unless you can use it.
That means it must be intelligible, readable, and analysable.
Organisations spend huge resources on data warehousing projects that attempt to wrestle with the data, in order to get it into a format that makes sense to those who must analyse it and make decisions based upon what it tells them.
Unfortunately, many of these projects stall while others barely get off the ground. Sometimes, the sheer volume of data or the complexities involved in conversion are major spokes in the wheel.
At best, this results in costly delays. At worst, it leads to expensive abandonments and a lack of accurate data on which to base important business decisions.

We have created a checklist that outlines the 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success.

To help solve these significant problems, the data conversion process needs to be more automated and the format of the data more optimised.
You may know already about Flexter’s in-built automation. The tool transforms data rapidly from XML in a highly automated process that cuts hours of work down to minutes.

But did you know that it also optimises the conversion so that the format in which the data is presented is more readable?

This means those who need to actually interpret the data not only get access to the data more quickly, but more efficiently, making it easier to read and use the data.
Here’s how…

Converting and optimising data

What makes sense to a data warehouse manager or a data architect may not make a great deal of sense to a “downstream” user of the data.
Converting data from XML to a database is one thing. But for that data to be intelligible for a decision-maker, it needs to be optimised into a recognisable and highly readable format.
This is where the real beauty of Flexter comes in.
A powerful in-built optimisation algorithm works to greatly simplify the auto-generated target schema so that any downstream user can query and read the necessary data.
There are actually two separate algorithms: one called Elevate and the other called Re-use. Both optimise XML data in different ways and this is explained below.

Elevate optimisation

XML files are often overly complex due to un-optimised modelling. Even some experienced modellers create artificial hierarchies that are unnecessary.
This makes the data less readable after conversion into a relational format.
When converting XML data, we create a separate table for each level in the hierarchy. The problem with this strategy is that it often leads to a huge number of obsolete tables in the target schema.
By default, the Elevate optimisation algorithm within Flexter is able to detect and fix these “anomalies”. It detects 1:1 relationships in the XML hierarchy that have been modelled as 1:N relations.

Elevate optimisation in action

The sample XML below relates to musician data. The relationship between artis_t and _artist info has been modelled as two distinct levels in the hierarchy.
This does not make sense. The tag artist info makes the XML document unnecessarily complex.

When Flexter analyses a representative sample of XML documents, it is able to detect this anomaly and elevate the child elements to the parent table.
In this example, all elements of artist info are elevated to the _artis_t tag using Flexter.
Let’s have a look at the target schema that Flexter generates with and without _Elevate_optimisation.

Target schema without optimisation

Music database schema showing artist album and song relationships

Target schema with Elevate optimisation

The Elevate optimisation algorithm cuts the number of tables down from 13 to four by elevating child attributes to parent tables, wherever possible.
Expand this example out into huge data files with thousands of artists and you can see how downstream users of the data will find it much easier to work with it in this format.
But there’s more…

Re-use optimisation

XML schemas (XSDs) can contain complex types, which can be reused multiple times inside an XML document.
With an un-optimised conversion, multiple different tables will be created for the same types.
Flexter greatly simplifies this by detecting the behaviour and consolidating the information of different type instances in the same relational entity. That means a single table for the same types.
So, an XSD developer just needs to define the complex type once, and then they can re-use it for different instances.
For instance, the complex type of address can be used as home address, shipping address, billing address, etc.
If a change is made to the complex type, it is applied to all instances inside the schema.
This is again best illustrated with an example…

Re-use optimisation in action

In the sample below again relating to musician data, we have defined audiotrack as a complex type.

As you can see below, audiotrack is reused multiple times in album, both as song_and as _bonus track.

Remember, Flexter consolidates the data from multiple instances of a complex type in the same table in the target schema.
The data for song and bonus_track will, therefore, end up in the same target table.
Let’s take a closer look at this example.

Target schema without Re-use optimisation

Target schema with Re-use optimisation

As we can see, there are fewer tables. Tables song, bonus_track, and soundtrack are now grouped under one table called audiotrack.

The power of using both optimisation algorithms

Individually, the optimisation algorithms work to simplify data. When they work in tandem, as Flexter is designed to do, the results are even more impressive.
Let’s consider the example of a clinical trial data set sourced from the clinicaltrials.gov website …

Target schema without optimisation

This is virtually unreadable. A total of 172 target tables would take a lot of time to pick through, make sense of, and arrive at decisions with.

Target schema with Elevate optimisation

As you can see, Elevate optimisation simplifies the output from 172 target tables to just 24 tables.
That’s an impressive degree of simplification but let’s look at the results when we also apply the Re-use optimisation algorithm to the data set.

Target schema with Elevate and Re-use optimisation

Data structure diagram for clinical study management and documentation

You can see that the original 172 tables are reduced again from 24 to a mere 12 tables.
This is exactly how Flexter is intended for use – data simplification, readability, manageability.
Imagine the time, resources, complexity, and headaches saved for downstream users by working with 12 tables rather than 172!
That’s where you start to see the real impact of Flexter on your organisation’s productivity.

How to convert XML data with these in-built optimisation tools

Converting XML data can be performed in three simple steps:

Step 1

In a rapid, one-time operation, we scan and traverse XML/JSON documents for information and intelligence.

Step 2

We create a logical target schema and the mappings between XML/JSON elements and the database tables and columns.

Step 3

We process and convert the XML/JSON documents each time new data arrives.

For instructions on how to complete each of these steps, installing Docker CE on Ubuntu, please refer to the final sections of this blog post.

Done-for-you data conversion

At first glance, Flexter is a powerful tool for the data conversion process.
That’s true and it’s the main reason why we created it. Our own frustration of writing manual code to convert XML documents into a data warehouse saw many projects fail or run over budget.
That’s why we built powerful automation features into the tool.
However, there is much more to Flexter. It doesn’t just make the data conversion process faster, simpler, cheaper, more predictable, and less likely to cause major disruptions in projects.
Data punched into the system also emerges from the other end of the system in a smarter and more intelligible format.
This makes the results more meaningful and valuable to an organisation where non-technical decision-makers may need to work with large volumes of data in real-time.
With the powerful, inbuilt algorithms that Flexter is shipped with your data warehousing job is made much easier with user-ready data that can be passed up or down the line as needed and without considerable extra data management work.

Enterprise edition

Can be installed on a single node or for very large volumes of XML on a cluster of servers.

About the author:

Maciek

Co-founder of Sonra

Follow Maciek:

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users

Converting and optimising data