Converting XML files to Oracle without writing a single line of code

Maciek Kocon Flexter, Oracle, XML, XSD

Parsing XML files to Oracle without writing a single line of code

In a previous post we demonstrated how to use Flexter to convert XML data into TSV files.
This was just an introduction demonstrating some basic features of Flexter including concepts around XML schema normalization and optimization.

Converting XML into an Oracle Database

This time we are going to show you a more interesting and common use case. Querying XML in the context of data analytics. Our requirement is to create some reports or dashboards over data in XML files. We could use XQuery to do this. Unfortunately, none of the BI or visualisation tools really support this. XQuery is really awkward to use and most XQuery engines don’t have support for running queries across multiple XML files. They also lack a query optimizer and as a result queries don’t perform well. Finally, very few developers possess XQuery skills. A much better approach is to convert the XML files into a relational schema in a database and use SQL, the lingua franca for data, to query it.

Loading XML into the the Oracle database using Flexter is pretty easy. Let's take the same pain.001.001.03 schema and data example we have used before. The first step is the same: analyse the XML schema and extract the metadata to create the target schema. As we've done that already in the first post I'm going to skip it.

Instead we reuse the metadata and point Flexter to our Oracle database to create the target schema.

Target schema registered.

Loading XML to the Oracle database

That's it - we are now ready to process the data. It’s as simple as running Flexter’s xml2er command line tool.

-x 1496 this is to specify what is the expected layout of the XML (we processed schema in previous post)
-l 420 this is to specify where to write the output (we've just registered the staging schema in Oracle)

As per our previous post when we were parsing XML into TSV files we have created the same 13 datasets. This time as regular Oracle tables.

flexter-tgt-subset-sepa001-001-03-opt

Pain of loading XML to a Database

Pretty simple, right? Compare this to the manual steps it takes to load the data into a database:

  • the XML needs to be manually normalized
  • the target tables need to be modelled in the database
  • table relationships need to be established and created
  • surrogate keys for parent-child relationships need to be generated.
  • keys need to be globally unique (no conflicts when loading multiple XML files)
  • each target table needs to be mapped manually
  • each mapping needs to be orchestrated into a data flow
  • each source XML file needs to be loaded into the target schema

As you can imagine this is very labour intensive and boring. It also requires a lot of time to implement.

Benefits of using Flexter to load XML to a Database

With Flexter the whole process is quick and fully automated. Developers can focus on creating real value, e.g. writing the SQL to generate reports or loading and integrating the data into a data warehouse.

What makes the analysis or downstream development even easier and faster is the fact that the number of the tables has been highly reduced by applying some of Flexter’s optimization algorithms. From 940 possible XML entities we went down to just 13 relational tables populated with data:

XSD

xsd2er

XML map

(940 entities)

calcmap

logical Layout

(58 tables)

xml2er

Target

(13 tables)

schema
analysis

layout
optimization

xml data
processing


This simply means less (a lot less!) tables to focus on when working with the output:

less tables → less joins → optimized performance

In terms of supported target platforms Flexter works with multiple RDBMS engines, e.g. Oracle, DB2, Teradata etc.. Besides TSV files, Flexter can also generate Parquet files which can be loaded into your Data Lake, and read through Hive, Impala, Spark, Drill, Flink or a similar data processing engine.

NDC. A more complex schema example

940 XML elements seems like a lot. In fact most industry XML standards contain a lot more. Their schemas are often highly modularized and with multiple XSD files that cross-reference each other. In those complex XML schemas we typically have one or two XSD files that contain all of the reusable types - reference data in other words. The other XSDs typically contain different types of transactions or messages, e.g. bookings or orders.

To illustrate this better, let's take a look at the NDC (New Distribution Capability) XML standard. This standard is developed by IATA.

The list of XSD files and modules is quite comprehensive (60!):

You can download it from here.

We can use Flexter to show us all of the root elements across the 60 XSDs. From this information we get an overview on the type of messages and data we are dealing with.  Let’s run the Flexter xsd2er tool with a couple of extra options:

    -s         Skip writing to Flexter's metadata repository (dry run)
    -R         Unreferenced elements will be considered roots
    -F         Unreferenced files will be considered root files
    For each XSD, Flexter prints out the root elements and types that were detected

    As you can see, the edist_comontypes.xsd contains a lot of different types such as Currencies or PaymentForms. These types are the reusable reference data of the standard. The output from Flexter (represented in the diagram below) visualises this information and gives us a nice overview of the standard in general. For each XSD file we get a list of the available types.

    XML Processing with the NDC Standard

    Let’s see Flexter in action. We have collected some sample data for the FlightPriceRS type, which if you remember from the output above is one of the XSD files. As usual, we analyse the schema in a first step. We will use an extra filter to only process the FlightPriceRS type as this is the one that we have sample data for.

    This resulted in 4012 entry elements. The root element filter for FlightPriceRS we applied above reduced the number of elements. Out of curiosity, we analysed the whole XSD (all 60 files and not just the FlightPriceRS XSD) and this resulted in a whopping 219k XML elements.  I won't make an attempt to visualise that here 😉

    Calculating the target schema and mapping is next on our list. We will use both of Flexter’s optimization modes

    Elevation. This optimization detects one-to-one relationships and merges them into parent entities
    Reusability. This optimization de-duplicates the occurrence of reference tables. Without this optimization we would get a separate table each time a reference type is used. (single, multi-parent table if the type referenced multiple times)

    Let’s process the sample data straight away:

    this time we will store the result into Flexter’s metadata repository.

    Let's focus on the output now. Flexter was able to reduce the target schema to just 24 tables!. Flexter’s optimizations did a great job. We now have 99.5% less tables to query, join and analyse.

    If we compare the input schema of the XSD to the final target schema in the database we see the difference straight away. Pretty impressive.

    Source Schema (if we don't apply any optimizations this would be the Target Schema as well)

    Flexter Data Lineage. From XML Source elements to Oracle table columns

    You might think that this is pretty awesome. But hold on, there is even more. If you check the staging schema generated by Flexter you will notice that tables and columns are commented. The commentary provides source to target mappings, e.g. we can see that column JOURNEY_TIME comes from /FlightPriceRS/DataLists/FlightList/Flight/Journey/Time.

    Flexter also provides full documentation on data lineage and the mappings:

    Both data lineage and the optimized target schema make it easy and fast to query, analyse, and visualise our flight data for downstream data analytics and reporting.

    One last thing to notice is the extra columns Flexter generates that allow querying of some of the metadata, like source filename, size, last modification timestamp etc… which might be very useful for XML content validation, reconciliation, troubleshooting etc…

    Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.