Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.
Published on January 22, 2021 Updated on November 20, 2024
All the data in the world is not worth much unless you can use it. That means it must be intelligible, readable, and analysable. Organisations spend huge resources on data warehousing projects that attempt to wrestle with the data, in order to get it into a format that makes sense to those who must analyse it and make decisions based upon what it tells them. Unfortunately, many of these projects stall while others barely get off the ground. Sometimes, the sheer volume of data or the complexities involved in conversion are major spokes in the wheel. At best, this results in costly delays. At worst, it leads to expensive abandonments and a lack of accurate data on which to base important business decisions.
We have created a checklist that outlines the 6 Factors You Need to Get Right to Make Your XML Conversion Project a Success.
To help solve these significant problems, the data conversion process needs to be more automated and the format of the data more optimised. You may know already about Flexter’s in-built automation. The tool transforms data rapidly from XML in a highly automated process that cuts hours of work down to minutes.
But did you know that it also optimises the conversion so that the format in which the data is presented is more readable?
This means those who need to actually interpret the data not only get access to the data more quickly, but more efficiently, making it easier to read and use the data. Here’s how…
✕
Use Flexter to turn XML and JSON into Valuable Insights
What makes sense to a data warehouse manager or a data architect may not make a great deal of sense to a “downstream” user of the data. Converting data from XML to a database is one thing. But for that data to be intelligible for a decision-maker, it needs to be optimised into a recognisable and highly readable format. This is where the real beauty of Flexter comes in. A powerful in-built optimisation algorithm works to greatly simplify the auto-generated target schema so that any downstream user can query and read the necessary data. There are actually two separate algorithms: one called Elevate and the other called Re-use. Both optimise XML data in different ways and this is explained below.
Elevate optimisation
XML files are often overly complex due to un-optimised modelling. Even some experienced modellers create artificial hierarchies that are unnecessary. This makes the data less readable after conversion into a relational format. When converting XML data, we create a separate table for each level in the hierarchy. The problem with this strategy is that it often leads to a huge number of obsolete tables in the target schema. By default, the Elevate optimisation algorithm within Flexter is able to detect and fix these “anomalies”. It detects 1:1 relationships in the XML hierarchy that have been modelled as 1:N relations.
Elevate optimisation in action
The sample XML below relates to musician data. The relationship between artis_t and _artist info has been modelled as two distinct levels in the hierarchy. This does not make sense. The tag artist info makes the XML document unnecessarily complex. When Flexter analyses a representative sample of XML documents, it is able to detect this anomaly and elevate the child elements to the parent table. In this example, all elements of artist info are elevated to the _artis_t tag using Flexter. Let’s have a look at the target schema that Flexter generates with and without _Elevate_optimisation.
Target schema without optimisation
Target schema with Elevate optimisation
The Elevate optimisation algorithm cuts the number of tables down from 13 to four by elevating child attributes to parent tables, wherever possible. Expand this example out into huge data files with thousands of artists and you can see how downstream users of the data will find it much easier to work with it in this format. But there’s more…
Re-use optimisation
XML schemas (XSDs) can contain complex types, which can be reused multiple times inside an XML document. With an un-optimised conversion, multiple different tables will be created for the same types. Flexter greatly simplifies this by detecting the behaviour and consolidating the information of different type instances in the same relational entity. That means a single table for the same types. So, an XSD developer just needs to define the complex type once, and then they can re-use it for different instances. For instance, the complex type of address can be used as home address, shipping address, billing address, etc. If a change is made to the complex type, it is applied to all instances inside the schema. This is again best illustrated with an example…
Re-use optimisation in action
In the sample below again relating to musician data, we have defined audiotrack as a complex type. As you can see below, audiotrack is reused multiple times in album, both as song_and as _bonus track. Remember, Flexter consolidates the data from multiple instances of a complex type in the same table in the target schema. The data for song and bonus_track will, therefore, end up in the same target table. Let’s take a closer look at this example.
Target schema without Re-use optimisation
Target schema with Re-use optimisation
As we can see, there are fewer tables. Tables song, bonus_track, and soundtrack are now grouped under one table called audiotrack.
The power of using both optimisation algorithms
Individually, the optimisation algorithms work to simplify data. When they work in tandem, as Flexter is designed to do, the results are even more impressive. Let’s consider the example of a clinical trial data set sourced from the clinicaltrials.gov website …
Target schema without optimisation
This is virtually unreadable. A total of 172 target tables would take a lot of time to pick through, make sense of, and arrive at decisions with.
Target schema with Elevate optimisation
As you can see, Elevate optimisation simplifies the output from 172 target tables to just 24 tables. That’s an impressive degree of simplification but let’s look at the results when we also apply the Re-use optimisation algorithm to the data set.
Target schema with Elevate and Re-use optimisation
You can see that the original 172 tables are reduced again from 24 to a mere 12 tables. This is exactly how Flexter is intended for use – data simplification, readability, manageability. Imagine the time, resources, complexity, and headaches saved for downstream users by working with 12 tables rather than 172! That’s where you start to see the real impact of Flexter on your organisation’s productivity.
How to convert XML data with these in-built optimisation tools
Converting XML data can be performed in three simple steps:
Step 1
In a rapid, one-time operation, we scan and traverse XML/JSON documents for information and intelligence.
Step 2
We create a logical target schema and the mappings between XML/JSON elements and the database tables and columns.
Step 3
We process and convert the XML/JSON documents each time new data arrives.
For instructions on how to complete each of these steps, installing Docker CE on Ubuntu, please refer to the final sections of this blog post.
Done-for-you data conversion
At first glance, Flexter is a powerful tool for the data conversion process. That’s true and it’s the main reason why we created it. Our own frustration of writing manual code to convert XML documents into a data warehouse saw many projects fail or run over budget. That’s why we built powerful automation features into the tool. However, there is much more to Flexter. It doesn’t just make the data conversion process faster, simpler, cheaper, more predictable, and less likely to cause major disruptions in projects. Data punched into the system also emerges from the other end of the system in a smarter and more intelligible format. This makes the results more meaningful and valuable to an organisation where non-technical decision-makers may need to work with large volumes of data in real-time. With the powerful, inbuilt algorithms that Flexter is shipped with your data warehousing job is made much easier with user-ready data that can be passed up or down the line as needed and without considerable extra data management work.
Enterprise edition
Can be installed on a single node or for very large volumes of XML on a cluster of servers.
Maciek is the Co-founder of Sonra. He has a knack for turning messy semi-structured formats like XML, JSON, and XSD into readable data. With a brain wired for product and data architecture, Maciek is the magic ingredient to making sure your systems don’t just work—they shine.
Get expert advice with our free support session. Fill in the form and schedule a chat with our experts.
What will you get on the call?
Helpful advice on the weak spots in your XML Project
No sales pitches
No commitments
Thank you for providing your details. We'll get back to you asap!
Book your FREE XML support session 30 min session
Cookies consent
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept All”, you consent to the use of ALL the cookies. However, you may visit "Cookie Settings" to provide a controlled consent.
This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary cookies help make a website usable by enabling basic functions like page navigation and access to secure areas of the website. The website cannot function properly without these cookies.
Cookie
Duration
Description
__cfruid
session
Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing
1 month
This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary
1 month
This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences
1 month
This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics
1 month
This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified
1 month
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent
1 month
Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken
1 year
This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks
Statistic cookies help website owners to understand how visitors interact with websites by collecting and reporting information anonymously.
Cookie
Duration
Description
AnalyticsSyncHistory
1 month
Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie
2 years
LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie
2 years
LinkedIn sets this cookie to store performed actions on the website.
lang
session
LinkedIn sets this cookie to remember a user's language setting.
li_gc
2 years
Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc
1 day
LinkedIn sets the lidc cookie to facilitate data center selection.
mgref
1 year
This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby
1 year
This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory
1 month
LinkedIn sets this cookie for LinkedIn Ads ID syncing.
Preference cookies enable a website to remember information that changes the way the website behaves or looks, like your preferred language or the region that you are in.
Cookie
Duration
Description
G
1 year
Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID
session
This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.
Marketing cookies are used to track visitors across websites. The intention is to display ads that are relevant and engaging for the individual user and thereby more valuable for publishers and third party advertisers..
Cookie
Duration
Description
_ga
2 years
The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5
2 years
This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1
1 minute
Set by Google to distinguish users.
_gat_UA-44804396-1
1 minute
A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au
3 months
Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid
1 day
Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT
2 years
YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC
6 Months
The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie
15 minutes
The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE
5 months 27 days
A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC
session
YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices
never
YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id
never
YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId
never
This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests
never
This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.