The Secrets to Unlocking XML Data in an Optimised Format for Downstream Users
All the data in the world is not worth much unless you can use it.
That means it must be intelligible, readable, and analysable.
Organisations spend huge resources on data warehousing projects that attempt to wrestle the data into a format that makes sense to the people who must analyse it and make decisions based on what it tells them.
Unfortunately, many of these projects stall while others barely get off the ground. Sometimes, the sheer volume of data or the complexities involved in conversion are major stumbling blocks.
At best, this results in costly delays. At worst, it leads to expensive abandonments and a lack of accurate data on which to base important business decisions.
We have created a checklist that outlines the 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success.
To help solve these significant problems, the data conversion process needs to be more automated and the format of the data more optimised.
You may know already about Flexter’s in-built automation. The tool transforms data rapidly from XML in a highly automated process that cuts hours of work down to minutes.
But did you know that it also optimises the conversion so that the format in which the data is presented is more readable?
This means the people who actually interpret the data not only gain access to it more quickly, but also find it easier to read and use.
Converting and optimising data
What makes sense to a data warehouse manager or a data architect may not make a great deal of sense to a “downstream” user of the data.
Converting data from XML to a database is one thing. But for that data to be intelligible for a decision-maker, it needs to be optimised into a recognisable and highly readable format.
This is where the real beauty of Flexter comes in.
A powerful in-built optimisation algorithm works to greatly simplify the auto-generated target schema so that any downstream user can query and read the necessary data.
There are actually two separate algorithms: one called Elevate and the other called Re-use. Each optimises XML data in a different way, as explained below.
XML files are often overly complex due to un-optimised modelling. Even some experienced modellers create artificial hierarchies that are unnecessary.
This makes the data less readable after conversion into a relational format.
When converting XML data, we create a separate table for each level in the hierarchy. The problem with this strategy is that it often leads to a huge number of redundant tables in the target schema.
By default, the Elevate optimisation algorithm within Flexter is able to detect and fix these “anomalies”. It detects 1:1 relationships in the XML hierarchy that have been modelled as 1:N relations.
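To make the idea concrete, here is a minimal Python sketch of what elevating a 1:1 child means. This is not Flexter's implementation; the artist sample data, tag names, and column names are invented for illustration. A child element that only ever occurs once per parent (here, `artist_info`) contributes no extra rows, so its leaf values can be folded into the parent's row instead of a separate table:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: artist_info is a 1:1 child of artist,
# so a separate artist_info table would be redundant.
SAMPLE = """
<artists>
  <artist>
    <name>Miles Davis</name>
    <artist_info>
      <genre>Jazz</genre>
      <country>USA</country>
    </artist_info>
  </artist>
  <artist>
    <name>Nina Simone</name>
    <artist_info>
      <genre>Jazz</genre>
      <country>USA</country>
    </artist_info>
  </artist>
</artists>
"""

def artist_rows(root):
    """Build one flat row per artist, elevating artist_info's
    leaf elements into columns of the parent artist table."""
    rows = []
    for artist in root.iter("artist"):
        row = {"name": artist.findtext("name")}
        info = artist.find("artist_info")
        if info is not None:  # 1:1 child detected: elevate its leaves
            for leaf in info:
                row[leaf.tag] = leaf.text
        rows.append(row)
    return rows

rows = artist_rows(ET.fromstring(SAMPLE))
# rows[0] -> {'name': 'Miles Davis', 'genre': 'Jazz', 'country': 'USA'}
```

Flexter detects these 1:1 relationships automatically by analysing a sample of documents; the sketch above hardcodes that knowledge for brevity.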
Elevate optimisation in action
The sample XML below relates to musician data. The relationship between artist and artist_info has been modelled as two distinct levels in the hierarchy.
This does not make sense. The tag artist_info makes the XML document unnecessarily complex.
When Flexter analyses a representative sample of XML documents, it is able to detect this anomaly and elevate the child elements to the parent table.
In this example, all elements of artist_info are elevated to the artist tag using Flexter.
Let’s have a look at the target schema that Flexter generates with and without Elevate optimisation.
Target schema without optimisation
Target schema with Elevate optimisation
The Elevate optimisation algorithm cuts the number of tables down from 13 to 4 by elevating child attributes to parent tables, wherever possible.
Scale this example up to huge data files with thousands of artists and you can see how much easier downstream users will find the data to work with in this format.
But there’s more…
XML schemas (XSDs) can contain complex types, which can be reused multiple times inside an XML document.
With an un-optimised conversion, a separate table is created for each use of the same type.
Flexter greatly simplifies this by detecting the behaviour and consolidating the information of different type instances in the same relational entity. That means a single table for the same types.
So, an XSD developer just needs to define the complex type once, and then they can re-use it for different instances.
For instance, the complex type of address can be used as home address, shipping address, billing address, etc.
If a change is made to the complex type, it is applied to all instances inside the schema.
This is again best illustrated with an example…
Re-use optimisation in action
In the sample below, again relating to musician data, we have defined audiotrack as a complex type.
As you can see below, audiotrack is reused multiple times in album, both as song and as bonus_track.
Remember, Flexter consolidates the data from multiple instances of a complex type in the same table in the target schema.
The data for song and bonus_track will, therefore, end up in the same target table.
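The consolidation can be sketched in a few lines of Python. Again, this is an illustration rather than Flexter's actual code; the album XML, the `element_name` discriminator column, and the attribute names are all assumptions made for the example. Rows from every instance of the shared complex type land in one table, with a column recording which element each row came from:

```python
import xml.etree.ElementTree as ET

# Hypothetical album: song and bonus_track share the audiotrack
# complex type (same attributes), so their rows can be consolidated.
ALBUM = """
<album title="Kind of Blue">
  <song title="So What" duration="545"/>
  <song title="Blue in Green" duration="337"/>
  <bonus_track title="Flamenco Sketches (alt)" duration="563"/>
</album>
"""

def audiotrack_rows(root, instance_tags=("song", "bonus_track")):
    """Collect all instances of the shared complex type into one
    table; element_name records which instance each row came from."""
    rows = []
    for child in root:
        if child.tag in instance_tags:
            rows.append({
                "element_name": child.tag,
                "title": child.get("title"),
                "duration": int(child.get("duration")),
            })
    return rows

rows = audiotrack_rows(ET.fromstring(ALBUM))
# one audiotrack table with 3 rows instead of two separate tables
```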
Let’s take a closer look at this example.
Target schema without Re-use optimisation
Target schema with Re-use optimisation
As we can see, there are fewer tables. Tables song, bonus_track, and soundtrack are now grouped under one table called audiotrack.
The power of using both optimisation algorithms
Individually, the optimisation algorithms work to simplify data. When they work in tandem, as Flexter is designed to do, the results are even more impressive.
Let’s consider the example of a clinical trial data set sourced from the clinicaltrials.gov website …
Target schema without optimisation
This is virtually unreadable. A total of 172 target tables would take a lot of time to pick through and make sense of before any decisions could be made.
Target schema with Elevate optimisation
As you can see, Elevate optimisation simplifies the output from 172 target tables to just 24 tables.
That’s an impressive degree of simplification but let’s look at the results when we also apply the Re-use optimisation algorithm to the data set.
Target schema with Elevate and Re-use optimisation
You can see that the table count is reduced again, from 24 to a mere 12, compared with the original 172.
This is exactly what Flexter is designed to deliver: data simplification, readability, and manageability.
Imagine the time, resources, complexity, and headaches saved for downstream users by working with 12 tables rather than 172!
That’s where you start to see the real impact of Flexter on your organisation’s productivity.
How to convert XML data with these in-built optimisation tools
Converting XML data can be performed in three simple steps:
1. In a rapid, one-time operation, we scan and traverse XML/JSON documents for information and intelligence.
2. We create a logical target schema and the mappings between XML/JSON elements and the database tables and columns.
3. We process and convert the XML/JSON documents each time new data arrives.
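The three steps above can be sketched as a deliberately simplified Python pipeline. Flexter's own commands and internals differ; this just mirrors the workflow (one-off scan, schema generation, then repeated conversion) with a toy schema of one table per element that carries text:

```python
import xml.etree.ElementTree as ET

# Step 1: scan a representative sample of documents once,
# collecting statistics on which elements occur and how often.
def collect_stats(docs):
    stats = {}
    for doc in docs:
        for elem in ET.fromstring(doc).iter():
            stats[elem.tag] = stats.get(elem.tag, 0) + 1
    return stats

# Step 2: derive a logical target schema from the statistics
# (here, trivially: one single-column table per observed tag).
def generate_schema(stats):
    return {tag: {"columns": ["value"]} for tag in stats}

# Step 3: convert each new document against the fixed schema,
# repeated every time fresh data arrives.
def convert(doc, schema):
    rows = {}
    for elem in ET.fromstring(doc).iter():
        if elem.tag in schema and (elem.text or "").strip():
            rows.setdefault(elem.tag, []).append(elem.text.strip())
    return rows

schema = generate_schema(collect_stats(["<a><b>1</b></a>"]))
print(convert("<a><b>2</b></a>", schema))
```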
For instructions on how to complete each of these steps, including installing Docker CE on Ubuntu, please refer to the final sections of this blog post.
Done-for-you data conversion
At first glance, Flexter is a powerful tool for the data conversion process.
That’s true, and it’s the main reason we created it. Our own frustration with writing manual code to convert XML documents into a data warehouse saw many projects fail or run over budget.
That’s why we built powerful automation features into the tool.
However, there is much more to Flexter. It doesn’t just make the data conversion process faster, simpler, cheaper, more predictable, and less likely to cause major disruptions in projects.
Data fed into the system also emerges from the other end in a smarter and more intelligible format.
This makes the results more meaningful and valuable to an organisation where non-technical decision-makers may need to work with large volumes of data in real-time.
With the powerful algorithms that ship with Flexter, your data warehousing job is made much easier: you get user-ready data that can be passed up or down the line as needed, without considerable extra data management work.
Flexter can be installed on a single node or, for very large volumes of XML, on a cluster of servers.