Converting complex XML to CSV

Maciek Kocon XML

Have you ever tried to process data in XML?

Did you have to load its values into a relational database or simply convert it to plain CSV?

Before I show you a simple way of processing complex XMLs, let’s first think out loud what XML actually is used for. Note: If you have a good grasp on XML specifics you may skip the next few paragraphs.

Well, you might think “This is just another structured text file, like CSV so it should be straightforward…At the end of the day the difference is you have the values inside their own tag delimiters rather than just commas”, right?

Uhmmm, no. Not really I’m afraid.

XML can indeed accommodate flat, tabular data structures. However, that leaves those remaining 99.5% cases where XML is utilized because we need to represent some complex structure that CSV can’t handle.

XML to CSV converter myth

It is exactly those complex scenarios that all  the “XML2CSV” converters out there can’t handle.

It’s a catchy promise but unless you’re working with flat data structures, like for instance the output from Oracle’s dbms_XMLgen.getXML function you can pretty much forget about it.

Unfortunately, in the majority of cases you will probably deal with those complex scenarios, with hierarchical data structures and multiple one-to-many relationships all being stored in a single XML file.

Think of it as relational database that has been encapsulated into a single text file.

Hypothetically, if you have two <supplier> child entries from <root> and at the same time, 3 independent <customer> children of the same root – how would you flatten this out?

Those data points are different. They are independent XPaths and represent one-to-many child relationships. Extracting them together would result in a cartesian product:

company vat_id supplier customer
ACME EU123 sA cK
ACME EU123 sA cL
ACME EU123 sA cM
ACME EU123 sB cK
ACME EU123 sB cL
ACME EU123 sB cM

Notice that the company's attributes are repeated, which also creates redundant entries.

“OK. But what now? Surely there must be a better way to dump this all to a basic format like CSV”

You’re dead right. The answer is quite simple – rather than trying to squeeze everything in a single file, normalize and split it, one file per independent “branch”. A loosely defined branch could be any close combination of values and attributes that are all linked together with strict 1:1 relationship. This process is common in the database world and called normalization. In this particular example it would produce a simple tree of three files:

The company would act as a root or header file, storing one record per each processed XML. It would be accompanied with two other files, Supplier storing two records and Customer with three records, pretty much reflecting the relationships from the origin XML file.

XML processing with Flexter in Action

Now it’s time to show you the right way to convert XML files to CSV with some practical example. We will work with the full version of Flexter Data Liberator. We will demonstrate its features and also show you some of its internals. The steps to process XML files through Flexter are

  1. input/output analysis (once!)
    a) process and analyse the XSD schema
    b) calculate the target layout and generate source to target  mappings
  2. process the XML data files (repeatable)

We need some XSD schema files and some sample data now.

For the use case I have picked the Pain.001 (Payments Initiation but can be a real pain too 😉 ) schema that depicts Credit Transfer messages in XML format. This is part of a wider ISO 20022 standard that has been already put in place to process payments in the EU/SEPA (pre-Single Euro Payments Area) zone. The standard describes dozens of different financial messages and there is a pretty good chance that you come across it one day, especially if you’re in the IT department of the banking sector.
Find out more about PAIN.001.001.03 XML File Structure V 1.2

You can examine the sample XML files for SEPA yourself.

The XSD schema is accessible from ISO 20022 site.

Once we have downloaded both XML and XSD files we can start processing.

Analysing input XML layout

We first use Flexter to analyze the input schema:

The -R option tells Flexter to use the smart detection feature of root elements in the XSD (so there's no need to specify the name explicitly).

It only takes two seconds, but prints out a lot of useful information on the screen. You may access the full log here but the most interesting part is below:

After deep analysis of the XSD schema Flexter now knows how all potential source XML are about to be structured. This information is saved into Flexter's metadata for subsequent use. We can see from the above log that Flexter found only one possible root element (“Document”) in our XSD and there were no other XSD files referenced. The XML schema has been resolved into 940 possible tag elements.

flexter-src-sepa001-001-03

You can access the full layout here.

The internal Flexter ID for our input schema is 1496.

Building the target output layout

In a next step we calculate the normalized target layout and the mapping to connect source layout to it.. We pass the internal ID of our schema (here 1496) to the Flexter CalcMap tool

During run-time it prints out the log information on the particular phases of the analysis. At the end we can see short summary:

What we can observe here is target layout compaction in action. The straightforward transformation would be to treat every tag element from source schema as independent table. This would result in 1:1 mapping and resulted in the same number (940) of tables. Based on the insight Flexter gathered in the first, Schema Analysis phase it's able optimize the normalization process which results in less potential target files.This means Flexter was able to compact all possible data points from the input XML into just 58 files with a total of 780 columns and created 1024 links from the source to define mapping for the data transformation.

Processing XML data

Now we are ready to process our XML files. We pass the internal schema ID to the Flexter xml2er command line tool

This produces our output in TSV format (excerpt):

The default output mode is TSV, which in its structure is very similar to CSV but the delimiter is tab character (tab separated file). This tends to be a more universal standard as it generates less potential issues in subsequent use (commas can generally also occur in text and number formatting).  

As you might have noticed Flexter only  produced 13 files. This is how append save mode, another optimization is evident. The default operation mode of Flexter is to create or add only the tables data was present for in the input. This resulted in only 13 out of 58 files being created and populated.

Please note those 58 files were already compacted, so you can see now the full path it took: from 940 tag elements to 58 potential target files to just 13 instance files. Pretty good huh?

I hope you enjoyed this brief overview and we encourage you to try out Flexter online demo.

In the next posts we will look at more demanding schemas and also demonstrate how to integrate Flexter with other data integration tools. Stay tuned!

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.