XML,

Converting Complex XML to CSV: A Practical Guide

November 28, 2016

Have you ever tried to process data in XML?

Did you have to load its values into a relational database or simply convert it to plain CSV?

What if I told you that there is a simpler way of processing complex XMLs — including solving some misconceptions around the specifics of XML?

I’m here to help. Keep reading to find out how you can practically convert complex XML to CSV better and more efficiently.

XML to CSV converter myth

Let’s first cover what XML is actually used for.

You might think, “This is just another structured text file, like CSV so it should be straightforward… Ultimately, the difference is you have the values inside their own tag delimiters rather than just commas.” 

Right?

Not really I’m afraid.

XML can indeed accommodate flat, tabular data structures. However, that leaves out the remaining 99.5% of cases where XML is utilized because we need to represent a complex structure that CSV can’t handle.

It’s that kind of complex scenario that all the “XML2CSV” converters out there can’t handle.

It’s a catchy promise but unless you’re working with flat data structures (like the output from Oracle’s dbms_XMLgen.getXML function below) you can pretty much forget about getting much use out of those converters.

 

Unfortunately, in the majority of cases you will probably deal with complex scenarios that involve hierarchical data structures and multiple one-to-many relationships all being stored in a single XML file.


Think of it as a relational database that has been encapsulated into a single text file.
Hypothetically, if you have two <supplier> child entries from and at the same time, 3 independent <customer< span=””>> children of the same root — how would you flatten this out?</customer<>

Those data points are different. They are independent XPaths and represent one-to-many child relationships. Extracting them together would result in a cartesian product:

companyvat_idsuppliercustomer
ACMEEU123sAcK
ACMEEU123sAcL
ACMEEU123sAcM
ACMEEU123sBcK
ACMEEU123sBcL
ACMEEU123sBcM

Notice that the company’s attributes are repeated, which also creates redundant entries.

You’re likely now thinking: “OK. But what now? Surely there must be a better way to dump all of this into a basic format like CSV.”


You’re dead right. The answer is quite simple – rather than trying to squeeze everything in a single file, instead normalize and split it into one file per independent “branch.”

A loosely defined branch could be any close combination of values and attributes that are all linked together with a strict 1:1 relationship. 

This process is common in the database world and is called normalization. In this particular example, it would produce a simple tree of three files:

  1. The company would act as a root or header file, storing one record per each processed XML. It would be accompanied with two other files…. 
  2. Supplier storing two records and…
  3. Customer with three records, pretty much reflecting the relationships from the origin XML file.

XML processing with Flexter in action

Now it’s time to show you the right way to convert XML files to CSV with a practical example. We will work with the full version of Flexter Data Liberator, demonstrating its features and showing you some of its internals. The steps to process XML files through Flexter are:

  1. Input/Output analysis (just once!)
    1. Process and analyze the XSD schema
    2. Calculate the target layout and generate source to target mappings
  2. Process the XML data files (at a repeatable basis)


For this use case, I have picked the Pain.001 (Payments Initiation but can be a real pain too 😉) schema that depicts Credit Transfer messages in XML format. 

This is part of a wider ISO 20022 standard that has been already put in place to process payments in the EU/SEPA (pre-Single Euro Payments Area) zone. The standard describes dozens of different financial messages and there’s a pretty good chance that you’ll come across it one day, especially if you’re working within IT in the banking sector.
Once we have downloaded both XML and XSD files we can start processing.

Analysing input XML layout

We first use Flexter to analyze the input schema:

The -R option tells Flexter to use the smart detection feature of root elements in the XSD (so there’s no need to specify the name explicitly).

It only takes two seconds, but prints out a lot of useful information on the screen. You may access the full log here but the most interesting part is below:

After deep analysis of the XSD schema, Flexter now knows how all potential source XML are about to be structured. This information is saved into Flexter’s metadata for subsequent use. We can see from the above log that Flexter found only one possible root element (“Document”) in our XSD and there were no other XSD files referenced. The XML schema has been resolved into 940 possible tag elements.


You can access the full layout here.

The internal Flexter ID for our input schema is 1496.

Building the target output layout

In the next step, we calculate the normalized target layout and the mapping to connect the source layout to it. We pass the internal ID of our schema (here 1496) to the Flexter CalcMap tool:

During run-time, it prints out the log information on the particular phases of the analysis. At the end, we can see a short summary:

What we can observe here is target layout compaction in action. The straightforward transformation would be to treat every tag element from source schema as an independent table. This would result in 1:1 mapping and resulted in the same number (940) of tables. 

Based on the insight Flexter gathered in the first Schema Analysis phase, it can optimize the normalization process, resulting in fewer potential target files. This means Flexter was able to compact all possible data points from the input XML into just 58 files with a total of 780 columns and created 1024 links from the source to define the mapping for the data transformation.

Processing XML data

Now we are ready to process our XML files. We pass the internal schema ID to the Flexter xml2er command line tool:

This produces our output in TSV format (excerpt):

 

The default output mode is TSV which, in its structure, is very similar to CSV but the delimiter is its tab character (tab separated file). This tends to be a more universal standard as it generates fewer potential issues in subsequent use (commas can generally also occur in text and number formatting.)


As you might have noticed, Flexter only produced 13 files. This is how through append mode, another optimization is evident. The default operation mode of Flexter is to create or only add table data that was present in the input. This resulted in only 13 out of 58 files being created and populated.

Please note those 58 files were already compacted, so you can see now the full path it took: from 940 tag elements to 58 potential target files to just 13 instance files. Pretty good huh?

Suggested reading: Add to your understanding on XML with our deep dive on XML conversion: including converters, tools, and projects. 

See the benefits of Flexter for yourself

I hope you enjoyed this brief overview and I encourage you to try out Flexter in your own time.

In our next several posts, we will look at more demanding schemas and also demonstrate how to integrate Flexter with other data integration tools. Stay tuned!

Which data formats apart from XML are you struggling with and need help with? We could give you the support you need — just reach out!