Have you ever tried to process data in XML?
Did you have to load its values into a relational database or simply convert it to plain CSV?
Before I show you a simple way of processing complex XMLs, let’s first think out loud what XML actually is used for. Note: If you have a good grasp on XML specifics you may skip the next few paragraphs.
Well, you might think “This is just another structured text file, like CSV so it should be straightforward…At the end of the day the difference is you have the values inside their own tag delimiters rather than just commas”, right?
Uhmmm, no. Not really I’m afraid.
XML can indeed accommodate flat, tabular data structures. However, that leaves those remaining 99.5% cases where XML is utilized because we need to represent some complex structure that CSV can’t handle.
XML to CSV converter myth
It is exactly those complex scenarios that all the “XML2CSV” converters out there can’t handle.
It’s a catchy promise but unless you’re working with flat data structures, like for instance the output from Oracle’s dbms_XMLgen.getXML function you can pretty much forget about it.
1 |
SELECT dbms_XMLgen.getXML('SELECT * FROM HR.EMPLOYEES WHERE rownum<3') xmldoc FROM DUAL |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
<!--?xml version="1.0"?--> 100 Steven King SKING 515.123.4567 17-JUN-87 AD_PRES 24000 90 101 Neena Kochhar NKOCHHAR 515.123.4568 21-SEP-89 AD_VP 17000 100 90 |
Unfortunately, in the majority of cases you will probably deal with those complex scenarios, with hierarchical data structures and multiple one-to-many relationships all being stored in a single XML file.
Think of it as relational database that has been encapsulated into a single text file.
Hypothetically, if you have two <supplier> child entries from and at the same time, 3 independent <customer< span=””>> children of the same root – how would you flatten this out?</customer<>
1 2 3 4 5 6 7 8 9 |
<!--suppliers--> sA sB <!--/suppliers--> <!-- customers--> cK cL cM <!--/customers--> |
Those data points are different. They are independent XPaths and represent one-to-many child relationships. Extracting them together would result in a cartesian product:
company | vat_id | supplier | customer |
ACME | EU123 | sA | cK |
ACME | EU123 | sA | cL |
ACME | EU123 | sA | cM |
ACME | EU123 | sB | cK |
ACME | EU123 | sB | cL |
ACME | EU123 | sB | cM |
Notice that the company’s attributes are repeated, which also creates redundant entries.
“OK. But what now? Surely there must be a better way to dump this all to a basic format like CSV”
You’re dead right. The answer is quite simple – rather than trying to squeeze everything in a single file, normalize and split it, one file per independent “branch”. A loosely defined branch could be any close combination of values and attributes that are all linked together with strict 1:1 relationship. This process is common in the database world and called normalization. In this particular example it would produce a simple tree of three files:
The company would act as a root or header file, storing one record per each processed XML. It would be accompanied with two other files, Supplier storing two records and Customer with three records, pretty much reflecting the relationships from the origin XML file.
XML processing with Flexter in Action
Now it’s time to show you the right way to convert XML files to CSV with some practical example. We will work with the full version of Flexter Data Liberator. We will demonstrate its features and also show you some of its internals. The steps to process XML files through Flexter are
- input/output analysis (once!)
a) process and analyse the XSD schema
b) calculate the target layout and generate source to target mappings - process the XML data files (repeatable)
We need some XSD schema files and some sample data now.
For the use case I have picked the Pain.001 (Payments Initiation but can be a real pain too 😉 ) schema that depicts Credit Transfer messages in XML format. This is part of a wider ISO 20022 standard that has been already put in place to process payments in the EU/SEPA (pre-Single Euro Payments Area) zone. The standard describes dozens of different financial messages and there is a pretty good chance that you come across it one day, especially if you’re in the IT department of the banking sector.
Find out more about PAIN.001.001.03 XML File Structure V 1.2
You can examine the sample XML files for SEPA yourself.
The XSD schema is accessible from ISO 20022 site.
Once we have downloaded both XML and XSD files we can start processing.
Analysing input XML layout
We first use Flexter to analyze the input schema:
1 |
xsd2er -R /mapr/dev.sonra.io/tmp/sepa/pain.001.001.03.xsd |
The -R option tells Flexter to use the smart detection feature of root elements in the XSD (so there’s no need to specify the name explicitly).
It only takes two seconds, but prints out a lot of useful information on the screen. You may access the full log here but the most interesting part is below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
18:18:49.098 INFO Registering new job 18:18:49.352 INFO Parsing pain.001.001.03.xsd 18:18:49.580 INFO Linking pain.001.001.03.xsd 18:18:49.603 INFO Root elements: pain.001.001.03.xsd - Document 18:18:49.607 INFO Building metadata 18:18:49.686 INFO Writing metadata 18:18:50.048 INFO Registering success of job 3748 18:18:50.074 INFO Finished successfully in 1294 milliseconds # schema schemaXml: 1496 org: 1 job: 3748 elements: 940 # statistics load: 452 ms parse: 372 ms build: 83 ms write: 362 ms |
After deep analysis of the XSD schema Flexter now knows how all potential source XML are about to be structured. This information is saved into Flexter’s metadata for subsequent use. We can see from the above log that Flexter found only one possible root element (“Document”) in our XSD and there were no other XSD files referenced. The XML schema has been resolved into 940 possible tag elements.
You can access the full layout here.
The internal Flexter ID for our input schema is 1496.
Building the target output layout
In a next step we calculate the normalized target layout and the mapping to connect source layout to it.. We pass the internal ID of our schema (here 1496) to the Flexter CalcMap tool
1 |
calcmap -x 1496 {true, false} |
During run-time it prints out the log information on the particular phases of the analysis. At the end we can see short summary:
1 |
Tables/Columns/Mappings: 58/780/1024 |
What we can observe here is target layout compaction in action. The straightforward transformation would be to treat every tag element from source schema as independent table. This would result in 1:1 mapping and resulted in the same number (940) of tables. Based on the insight Flexter gathered in the first, Schema Analysis phase it’s able optimize the normalization process which results in less potential target files. This means Flexter was able to compact all possible data points from the input XML into just 58 files with a total of 780 columns and created 1024 links from the source to define mapping for the data transformation.
Processing XML data
Now we are ready to process our XML files. We pass the internal schema ID to the Flexter xml2er command line tool
1 |
gt; xml2er -x 1496 tmp/sepa/pain.001.001.03.xmls.zip |
This produces our output in TSV format (excerpt):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
# input path: maprfs:///mapr/dev.sonra.io/tmp/sepa/pain.001.001.03.xmls.zip # output (default) path: maprfs:///mapr/dev.sonra.io/tmp/sepa/out.opt/ format: tsv savemode: overwrite 18:19:48.775 INFO Registering new job 18:19:49.109 INFO Loading metadata [Stage 1:=> (7 + 1) / 200] [Stage 1:==> (10 + 1) / 200] [Stage 1:====> (17 + 1) / 200] [Stage 1:======> (23 + 1) / 200] [Stage 1:========> (30 + 1) / 200] [Stage 1:==========> (38 + 1) / 200] [Stage 1:============> (44 + 1) / 200] [Stage 1:==============> (53 + 1) / 200] [Stage 1:=================> (62 + 1) / 200] [Stage 1:===================> (71 + 1) / 200] [Stage 1:======================> (79 + 1) / 200] [Stage 1:========================> (89 + 1) / 200] [Stage 1:===========================> (99 + 1) / 200] [Stage 1:=============================> (108 + 1) / 200] [Stage 1:=================================> (120 + 1) / 200] [Stage 1:====================================> (133 + 1) / 200] [Stage 1:========================================> (146 + 1) / 200] [Stage 1:==========================================> (154 + 2) / 200] [Stage 1:=============================================> (165 + 1) / 200] [Stage 1:================================================> (178 + 1) / 200] [Stage 1:====================================================> (192 + 1) / 200] 18:20:01.750 INFO Parsing data 18:20:07.297 INFO calculating data 18:20:07.303 INFO writing tables 18:20:07.305 INFO writing table GrpHdr_InitgPty_Id_OrgId_Othr 18:20:07.763 INFO writing table GrpHdr_InitgPty_Id_PrvtId_Othr 18:20:08.001 INFO writing table Document 18:20:08.238 INFO writing table CdtTrfTxInf_Cdtr_Id_OrgId_Othr 18:20:08.407 INFO writing table CdtTrfTxInf 18:20:08.669 INFO writing table Cdtr_PstlAdr_AdrLine 18:20:08.865 INFO writing table Strd 18:20:09.050 INFO writing table Ustrd 18:20:09.214 INFO writing table UltmtCdtr_Id_OrgId_Othr 18:20:09.390 INFO writing table PmtInf_Dbtr_Id_OrgId_Othr 18:20:09.557 INFO writing table PmtInf_Dbtr_PstlAdr_AdrLine 18:20:09.722 INFO writing table PmtInf 18:20:09.913 INFO writing table PmtInf_UltmtDbtr_Id_OrgId_Othr 18:20:10.340 INFO calculating statistics 18:20:10.683 INFO writing statistics rows 18:20:11.513 INFO Registering success of job 3749 18:20:11.531 INFO Finished successfully in 47601 milliseconds # schema xml: 1496 logical: 419 org: 1 job: 3749 xpaths: 515 # statistics load: 37820 ms parse: 5547 ms write: 2775 ms stats: 1440 ms |
The default output mode is TSV, which in its structure is very similar to CSV but the delimiter is tab character (tab separated file). This tends to be a more universal standard as it generates less potential issues in subsequent use (commas can generally also occur in text and number formatting).
As you might have noticed Flexter only produced 13 files. This is how append save mode, another optimization is evident. The default operation mode of Flexter is to create or add only the tables data was present for in the input. This resulted in only 13 out of 58 files being created and populated.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
gt; → ls -l -rwxr-xr-x 1 flexter 109 Nov 17 14:08 Authstn.tsv -rwxr-xr-x 1 flexter 13279 Nov 17 19:20 CdtTrfTxInf.tsv -rwxr-xr-x 1 flexter 98 Nov 17 19:20 CdtTrfTxInf_Cdtr_Id_OrgId_Othr.tsv -rwxr-xr-x 1 flexter 283 Nov 17 14:08 CdtrSchmeId_Id_PrvtId_Othr1.tsv -rwxr-xr-x 1 flexter 123 Nov 17 19:20 Cdtr_PstlAdr_AdrLine.tsv -rwxr-xr-x 1 flexter 3215 Nov 17 19:20 Document.tsv -rwxr-xr-x 1 flexter 11336 Nov 17 14:08 DrctDbtTxInf.tsv -rwxr-xr-x 1 flexter 354 Nov 17 19:20 GrpHdr_InitgPty_Id_OrgId_Othr.tsv -rwxr-xr-x 1 flexter 91 Nov 17 19:20 GrpHdr_InitgPty_Id_PrvtId_Othr.tsv -rwxr-xr-x 1 flexter 5902 Nov 17 19:20 PmtInf.tsv -rwxr-xr-x 1 flexter 85 Nov 17 19:20 PmtInf_Dbtr_Id_OrgId_Othr.tsv -rwxr-xr-x 1 flexter 104 Nov 17 19:20 PmtInf_Dbtr_PstlAdr_AdrLine.tsv -rwxr-xr-x 1 flexter 93 Nov 17 19:20 PmtInf_UltmtDbtr_Id_OrgId_Othr.tsv -rwxr-xr-x 1 flexter 1711 Nov 17 19:20 Strd.tsv -rwxr-xr-x 1 flexter 98 Nov 17 19:20 UltmtCdtr_Id_OrgId_Othr.tsv -rwxr-xr-x 1 flexter 607 Nov 17 19:20 Ustrd.tsv |
Please note those 58 files were already compacted, so you can see now the full path it took: from 940 tag elements to 58 potential target files to just 13 instance files. Pretty good huh?
I hope you enjoyed this brief overview and we encourage you to try out Flexter.
In the next posts we will look at more demanding schemas and also demonstrate how to integrate Flexter with other data integration tools. Stay tuned!
Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.