Tips for Reverse Engineering complex XSDs such as SEPA and NDC in ODI

Maciek Kocon Flexter, Oracle, Oracle Data Integrator (ODI), XML, XSD

In this post we will show you a few tips and ODI issues when reverse engineering complex industry XML schemas (XSD) such as NDC FlightPriceRS or SEPA Pain.001.

This is the first part in a multi part series. The second part gives you some tips and tricks when parsing XML files in ODI. The third part compares ODI to Flexter, our very own dedicated enterprise XML parser.

How does XSD Reverse Engineering in ODI work under the Hood?

Oracle Data Integrator (ODI) is a metadata driven ETL tool and to build your data flows you need to know about the source and target for the data transformation. That metadata information is stored as an ODI model inside the ODI repository. How does it get there? This is done through the process of reverse engineering, creates corresponding datastores and relationships inside an ODI model. This is true for every type of data source: Oracle, Teradata, MS SQL, file etc. XML is no exception.

ODI uses an XML JDBC driver to reverse engineer XSDs. When we use the ODI XML JDBC driver to work with XML we have two options as to where to store the intermediary data:.

(1) In-Memory engine (default) - the data is stored in the embedded HSQL in-memory database engine maintained by the XML driver

(2) External database engine - the data is stored in the external database ie. Oracle, MS SQL Server, DB2 through another JDBC connection maintained internally and transparently by the XML driver

As XML is just a file and not a database engine - the ODI XML JDBC driver acts as an intermediate layer that takes up that functionality: loads XML into memory and provides a standard JDBC interface to query it.

When ODI accesses the XML file using its XML driver for the first time, it analyses the structure of the associated schema and creates corresponding structures with tables and relationships. When we reverse engineer an XML schema, ODI queries the XML driver about those structures using JDBC API interface and creates new datastores in the ODI model based on that.

The reverse engineer process can be done from an associated XSD or DTD or by supplying an XML file, which ODI uses to first derive a DTD.

We are going to show you how this is done by using the in-memory engine.

Step by Step Instructions and Tips for Reverse Engineering a simple XSD in ODI

We will now walk you through an example of reverse engineering an XSD and give you plenty of tips along the way. We will use the same SEPA pain schema as in this post and share some tips along the way.

Just to remind you, Pain.001 (Payments Initiation) schema depicts Credit Transfer messages in XML format. This is part of a wider ISO 20022 standard that has been already put in place to process payments in the EU/SEPA (pre-Single Euro Payments Area) zone. The standard describes dozens of different financial messages and there is a pretty good chance that you come across it one day, especially if you’re in the IT department of the banking sector. Find out more about PAIN.001.001.03 XML File Structure V 1.2
You can download the sample XML files for SEPA.

The XSD schema is accessible from ISO 20022 site.

In a first step we create the physical data server. Let's call it XML_SEPA

Next we create the physical data server, also XML_SEPA.

The ODI JDBC driver comes with plenty of options in the form of driver properties. Don't feel overwhelmed, most of them are for special cases. As we will use the in-memory mode the whole External DB properties section is not relevant.

You can explore the meaning of all of the options here [http://docs.oracle.com/middleware/1221/odi/develop-connectivity-km/appendix_xml_driver.htm#CHDDFBGJ] As a rule of thumb it’s best to leave the defaults and change only what's absolutely necessary:

dtd=/home/oracle/host/sepa/001.001.03/pain.001.001.03.xsd
          this is the path to the XSD file.

root_elt=Document
          this is the name (case sensitive!) of the top level XML element to be processed.

Tip 1: You can find the value of the top level XML by examining the XSD, but the easiest method is to look up the XML data file and see what the name of the outermost element is. This would be the first tag that comes in the file, containing all other tags in it. See the highlighted part taken from an example sepa xml message:

<?xml version="1.0" encoding="utf-8" ?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>ABC/060928/CCT001</MsgId>
<CreDtTm>2008-09-28T14:07:00</CreDtTm>
<NbOfTxs>3</NbOfTxs>
<CtrlSum>2400.56</CtrlSum>
<InitgPty>
<Nm>Cobelfac</Nm>

schema=SEPA

Tip 2: make sure this is filled in. This should be short and sweet. ODI uses it internally to refer to the schema.

Tip 3: make sure compat_mode=v3 is set. This is the default setting in ODI 12c. Historically the logic to handle XML has been rewritten (breaking backward compatibility) by Oracle three times already. From our experience the latest implementation gives the best results and also works with modular schemas (schemas that are defined by not just a single but multiple XSD files)

If you are using an older version of ODI you're pretty much out of luck. Apart from lack of support for modular schemas it has problems with some of the XSD syntax that is supported only in v3 mode.

If your XSD schema is modular but only spans across just a few files files and its overall structure is not too complex you may try to merge it all into a single XSD file. You can also try other software, like Altova XML Spy to do that for you.

Interestingly, we also noticed that the structures produced by reverse engineering are quite different between v3 and other modes (v1, v2).

Tip 4: Log any action the JDBC driver performs At this stage, for an XML data source setup this is practically a mandatory step. Remember to revert back once you're finished with model. It’s a bit painful but you don’t want to go live with logging switched on.

Pass the name of the logfile (lf=) and loglevel (ll=) to the JDBC URL. For us it's going to be: 'jdbc:snps:xml?lf=/home/oracle/host/snpxml.log&ll=16'

Once we have set up our logical and physical schemas we create a model that will hold our reverse engineered XML structures

Go to Designer > Models > New Model

A new window will open

Fill in the Name and the Code fields: XML_SEPA

Pick XML as the technology

Once you do this, a drop down box with relevant logical schemas will reload. Pick XML_SEPA from it.

Now, the last thing left to do is to press the 'Reverse Engineer' button.

And pray.

Once started, the driver will parse out the schema, analyse it, and build a relational equivalent of sorts of the XSD in memory. ODI will query and save this metadata in the ODI repository as the model’s datastores. Depending on the complexity of the XSD schema, it might take anything from a few seconds to a couple of hours (so get your tea or whatever takes your fancy ready).

During the analysis phase the ODI GUI will freeze and it would be better not to disturb ODI by clicking randomly somewhere on the screen.

If you're impatient like me, it’s best to check the log from time to time (as per tip 4)

It took six minutes to reverse engineer this schema.

You can see it has created 287 tables/datastores inside the model. As you can see from the figure below we can’t navigate the hierarchy of the model.


Hence we have provided a visual representation below.

 

It's a tree structure with the 'Document' table at the top. You might have noticed that the list starts with 23 instances of ADDRLINE tables.

These are all structured the same and come from the references to the same complex type, ADDRLINE: <xsd:element type="ADDRLINE"/>

The ODI XML driver unnecessarily decided to create a separate instance for each one of them. This means that for ADDRLINE alone we get 22 redundant data structures.

That is one reason why our model now contains 287(!) tables.

The ODI developer or analyst now is in the unenviable position to make sense of this huge schema. The task is made even more difficult as ODI does not provide a visual representation of the schema and the various relationships.

Tip 5. You may follow the parent/foreign keys of created tables to recreate the full diagram manually (as we did above)

Reverse Engineering a highly complex XSD. More Tips and Tricks.

To illustrate more problems and give more tips let's look at another, more complex schema FlightPriceRS, which is part of the NDC (New Distribution Capability) XML standard. This standard is developed by IATA and aims to facilitate XML-based data exchange for the airline industry. It is intended to be widely implemented as the successor of the OTA standard by addressing its most problematic shortcomings, especially around user profiles. You can find more detailed information on the NDC standard on the IATA website.

We will basically follow the same steps as above:

Create Logical and Physical schema, configure JDBC properties, create the ODI model by reverse engineering the XSD.

The schema is highly complex and ODI failed after 40 minutes with the following constraint violation error:

For some reason, the ODI driver detects and tries to insert the same relationship information multiple times.

Tip 6. The solution is actually pretty simple. All you need to do is to log into the ODI work repository and disable the constraints. As the REV_ tables are meant to hold temporary data (only used for reverse engineering) it's pretty safe to switch them off.

NOW: try to reverse-engineer the XSD again!…

That worked! we can now re-enable the constraints:

This brings us to our final tip.

In many cases when we change any properties in the physical server (Topology), the JDBC driver does not pick up this change.

Tip 7. For all cases where you get an error during reverse engineering, before retrying it (ie. with new JDBC settings etc.) you may want to close and reopen ODI.

It seems like a drastic measure but what we've observed is that once ODI XML driver has finished parsing the the schema it caches it under the Schema name given (cf. Tip 2) and is not updated anymore.

Restarting ODI helped in each of those cases when we were scratching our heads trying to figure out what's going on while simple stuff that worked before now stopped working. Just make a mental note - it might spare you some frustration one day.

With this workaround in place ODI was able to reverse engineer the model. It took a whopping 2 hrs and 36 minutes (more tea…) and resulted in 2333 tables. Huge! And this is already after some compaction optimizations that the XML driver applies in compat_mode=v3 for us.

In this post we have covered reverse engineering of the XSD, which is only part of the overall job of loading XML data into the database in ODI. We demonstrated multiple problems or inefficiencies specific to this task, in particular:

  • ODI creates unnecessarily complex target schemas with duplicated and redundant tables and data points
  • It takes a very long time for the process of reverse engineering itself. It's negligible for simple schemas but then it grows exponentially with the XSD schema complexity.
  • Debugging is slow and painful
  • For very complex schemas the process of reverse engineering actually fails as in our example.

We can clearly see that ODI presents developers with significant problems when asked to deal with even comparatively simple XML processing and these problems multiply exponentially as the schema becomes more complex and larger until it eventually fails. What is harder to understand is the opportunity cost of this wasted resource and the business risk of being unable to simply comply with the industry standards in airline, banking and life science verticals. To eliminate these issues the clever fellows at Sonra created Flexter, the XML parser and converter with built in algorithmic optimisation that can reduce and simplify the target schema to allow both faster or real time processing without the pain.

In our next post we will cover the steps to bring the XML data into a database and give more tips and tricks.

If you are looking for a simple solution to resolve your XML headaches you should try out Flexter, our enterprise parser for XML.

Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.