In some previous posts we demonstrated how to load XML data into an Oracle database using ODI. We first looked at some of the issues when reverse engineering an XSD in ODI. Next we looked at various issues when parsing and processing XML files in ODI. We also showed that Flexter parses XML files without creating a single data flow or writing one line of code. And even better. It just scales no matter the number or size of your XML files.
ODI is a brilliant ETL tool, one of the best on the market. However, for converting XML files into a database or data warehouse you are better off to use a dedicated enterprise XML parser such as Flexter.
Let’s do a side by side comparison of Flexter and ODI
Analysing the XSD (XML schema)
|modular schema(multiple XSD files): ODI ≥ v220.127.116.11||support for modular schema: multiple XSD files
auto-import of XSDs from http(s), ftp, sftp, hdfs
support for schemas in archives (zip, gz, tgz, 7z)
|varies: seconds-hours||steady: seconds|
|low - medium||high|
|some basic optimizations||multiple options - fully configurable|
Sources: Older versions of ODI can't process schemas with multiple XSD files. Flexter supports modular XSDs including XSD URL import links - it automatically downloads and analyses externally referenced schema files.
Processing speed: The time ODI needs to reverse engineer a schema grows exponentially with the complexity of the schema.
|SEPA (pain.001.001.03).||8 mins||24 seconds|
|NDC (FlightPriceRS)||2 hrs 36 mins||1 min 4 seconds|
Support XSD Specification: The ODI XML driver claims to support a wide range of the XSD specification. However, many of the complex XSDs we tried, failed to reverse engineer. Troubleshooting proves to be quite difficult even when logging is switched on.
The various issues preventing ODI from successfully reverse engineering the schema were:
- no resolution for naming conflicts (element names vs. SQL reserved words, e.g.: AT or COMMENT, DATE etc)
- problems in handling certain namespace scenarios in XML schemas
- no support for inline or external XSD references
- problems with Unicode BOM markers or nullbyte characters
- limited support for detecting and handling recursive XPaths in the schema. This is partially due to the next issue:
- matching names of subtypes in a hierarchy are not detected, e.g. person > department > manager > person. The ODI driver wrongly assumes that the second person type in the hierarchy has the same structure as the first.
The last issue is documented:
Using different sub-elements with the same name but with different types is not supported by XML driver. An XSD with such a structure will not be processed correctly.
Flexter was purpose-built to handle large and complex schemas and has been thoroughly tested against a large number of industry standards. Usually it only fails when there are errors in the schema structure itself.
Optimizations: Both tools analyse an XSD schema to generate a relational target schema.
Flexter optimizes the target schema and significantly reduces the number of output tables. This makes it so much easier to make sense of the data and work with it downstream.
Flexter implements two main optimizations:
Elevate - include child elements into parent structure for 1:1 relationships
Reuse - create a reusable placeholder table for all references to the same XSD type (reference table in database speak)
Both optimizations can be used together or independently.
In addition, Flexter offers the ability to only create those target tables for which data exists in the XMLs. If the XSD contains a couple of hundred tables but the XML just uses five of those then only those five tables will be generated. This means that there will be a lot less tables to query, join and analyse downstream.
|SEPA (pain.001.001.03).||287 tables||13 tables||4.5%|
|NDC (FlightPriceRS)||2333 tables||24 tables||1%|
Import data to database
|Development Effort||very high||none|
|Data Lineage (source to target map)||N/A||automatic|
|XML schema visualisation||N/A||automatic|
|Run-time Error Handling||none||graceful|
|Scalability||N/A||to infinity and
Development Effort: In ODI it’s the developer's job to create the mappings between source XML and relational target schema.
Once the data is in the target database there is another problem to handle. The keys generated by the ODI driver will get reset with each new data load. Inevitably, this will lead to key clashes. Yet another problem to deal with. Your only options are to either tinker with the key generation or process the data further before performing another load.
In contrast, Flexter automatically creates an optimised target schema and populates the required tables directly in the database. The generated keys are globally unique . Not a single line of code is needed. Data can be easily appended without worrying about key conflicts. Data analysts or ODI developers can then use SQL for further processing against a highly simplified target schema.
Data Lineage/Schema Visualisation:
Flexter provides data lineage in the form of source to target maps. The user can browse and drill into the relational target schema through a visual interface. This is not something you can do in ODI.
Orchestration and Performance:
Once all of the required mappings have been built in ODI, they need to be executed in a loop for every available XML file. The file looping operation is not natively supported by ODI and needs to be manually built along with all file and error handling. Executing this operation repetitively in a loop over tables AND files makes the whole process tiresome and inefficient.
Flexter takes a URI path as input, this can be anything that points to a directory or file: local disk path, URL, FTP, SFTP or even HDFS. Flexter then automatically loads all of the available files in parallel. It also supports a streaming mode where it automatically picks up any files that have been added to source URI.
Run-time Error Handling: For cases where the encountered data values violate the XSD, Flexter tries to handle this situation, ie. by trimming the string values to the allowed length. If Flexter can't process the data, e.g. for data elements we find in the XML but have not been specified in the XSD. Flexter parks such data in an error dump. In both cases Flexter records the information in the log and continues processing.
ODI doesn't handle those scenarios and fails immediately.
ODI is limited to the resources given to single Agent machine. Multiple ODI Agents can't be used to parallelize the load (it's only for redundancy).
Flexter is built in Scala on top of Apache Spark, a distributed and highly scalable data processing framework. It can be launched on a single machine as well as a multi-node cluster. Flexter scales linearly and can handle any XML data volume. It shreds the XML on the XPath, stores the data in memory, and then loads it into the specified target. All in parallel. It runs on both Linux and Windows.
While ODI supports XML the development process can be quite tedious and slow. Complex schemas may prove very difficult to implement. Performance is also an issue as bulk loading of XMLs is not supported natively. Don't get me wrong, ODI is a brilliant ETL tool, probably the best on the market. When processing XMLs it is suitable for simple requirements with low data volumes. However, for complex XMLs and a large number of XML files you are better off using a dedicated enterprise XML parser like Flexter. This will save you time, effort, and dollars. It will also eliminate any risk from your XML projects.
In the next post we will show how these two tools can work together to ensure maximum productivity and utilisation of developers' time.
Find out for yourself how easy it is to process complex XML files with Flexter XML Parser
Which data formats apart from XML also give you the heebie jeebies and need to be liberated? Please leave a comment below or reach out to us.