Generating a relational target schema. From XSD. From XML. From XML and XSD

Uli Bethke Flexter, XML, XSD

Flexter, our data warehouse automation solution for the conversion of XML, JSON, industry data standards (HL7, FpML, ISO 20022 etc.) automatically creates Target Schema when we create a Data Flow. It allows us to see relational representation of our XML or JSON documents.

In this post we show you how easy it is to create a Target Schema from an XSD, a sample of XML documents, and from a combination of XSD and a sample of XML files. When you create a Target Schema, you can optionally apply the elevate or re-use Optimisation.

Flexter is available in three editions. An enterprise edition, which can be installed on-premise. A managed SaaS edition Flexter as a Service, which you can access by API. You can also try Flexter for free.

Creating Target Schema and Data Flow

A Data Flow is a core concept in Flexter. It maps the data points in the Source to the data points in the Target Schema.

Before we create a Conversion of XML or JSON documents with Flexter, we first have to create a Data Flow. We can then use this Data Flow in a Conversion Task. We are able to re-use the same Data Flow in one or more conversions.

When you create a Data Flow, Flexter collects Statistics (optional), creates the Target Schema and the mapping between Source and Target Schema.

Once the Data Flow has been created we can use it N number of times to convert Source Data. The conversion happens in a Conversion task.

Depending on the available type of Source (XML, JSON, XSD), the steps to create a Data Flow vary.

Creating Target Schema from XSD

Using a Source Schema (XSD) to create a Data Flow has some advantages and disadvantages:

Pros:

  • The Target Schema you generate from an XSD is comprehensive. Any XML document that conforms to the XSD can be processed, All the possible data points from the input are mapped to the output with their cardinality and datatypes known and constant. Note this remains true as long as the Source Data fully conform to the XSD definitions used.
  • The Target Schema can be built without having to rely on a representative sample of XML documents or even the complete set of XML documents. This makes it much faster to generate the Target Schema.
  • It can optimise the output by analysing shared types and creating a single table of these shared types in the target schema (Reuse Optimisation). This simplifies the Target Schema.

Cons:

  • XSDs that are used in industry data standards typically model a huge number of business processes. In reality a company typically only implements a subset of the business processes modeled in the XSD. As a result the Target Schema contains many tables and columns that are not used or populated. They are redundant. This makes the Target Schema unnecessarily complex and difficult to work with.
  • Not ideal for deep recursive XPaths. Such cases require more memory to build and store as metadata.

You create a Data Flow from a Source Schema (XSD) in a single step. This is done in a single command with optional and mandatory parameters. Flexter automatically creates the Target Schema and Mapping for you.

A Conversion task is a single command with optional and mandatory parameters. A successful Conversion task loads the Source Data into the Target Format, e.g. from an XML archive on an FTP server to a Redshift database.

We will run a command with the Elevate optimisation.

Example of output:

 

At the end of the successful XSD analysis process, Flexter prints out the ID of the Data Flow (logical). In the example above, the ID of the Data Flow is 1.

This should be noted and later used in your Conversion task.

Example of Target Schema

Creating Target Schema from XML only

It is possible to convert XML documents without a Source Schema (XSD). For JSON documents this is the only option as Flexter does not currently support JSON schema.

You need to provide a representative sample of XML/JSON documents for this to work best. Alternatively, you can provide all of the available XML or JSON documents. Flexter collects information from the data sample and stores it in the Metadata DB.

Here is a list of advantages and disadvantages when using a sample of XML documents.

Pros:

  • It makes XML processing possible when you don’t have an XSD.
  • It makes it possible to build a Target Schema for JSON files
  • It generates a set of lightweight metadata: only what is really present in the Source Data is reflected.
  • It may offer an alternative solution for scenarios where the XSD and XML samples don't match, or have different versions.
  • It's ideal for recursive XPaths.

Cons:

  • The Reuse optimisation is not applicable.
  • It requires collecting statistics over XML/JSON data before it can be translated into a Target Schema. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data to collect Statistics. This may be time consuming.

 

With Source Data as your only Source you create a Data Flow in two steps. You first collect Statistics from the Source Data. Flexter collects Statistics such as the XML elements, the data types etc.

This step can be repeated N number of times to collect Statistics incrementally. Statistics can optionally be collected incrementally each time you run a conversion process to detect if the source XML documents have changed.

Using the collected Statistics you create the Data Flow in a second step.

For this showcase we will run a command that will do both steps at the same time. We will run it with Elevate optimisation.

Example of output

This command collects Statistics (origin ID 4). At the same time it also creates a Data Flow (logical ID 2).

Example of Target Schema

Creating Target Schema from XSD and XML

Generating a Target Schema based on both a Source Schema and Source Data allows you to combine the information from both Sources. This approach tends to give you the best results as you can combine the information Flexter infers from both the Source Data and Source Schema.

Pros:

  • It generates the most optimized logical schema, based on information from both Sources.
  • The Target Schema is simplified. You can augment the information from the XSD with the intelligence collected from the Source Data documents, e.g. only generating the Data Points in the Target Schema that are in use.
  • It lets you process recursive XPaths found in XSD definitions with Statistics.
  • Reuse optimisation is applicable

Cons:

  • It does not work with JSON documents
  • It requires collecting statistics over XML/JSON data before it can be translated into target. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data. This may be time consuming.

With Source Data and a Source Schema you create a Data Flow in two steps. You first collect Statistics from the Source Data. This step can be repeated n number of times. Using the collected Statistics AND a Source Schema you create a Data Flow in a second step.

In the first step we will collect Statistics from the Source Data.

Example of output

We have generated a new set of Statistics (origin 4). We are now ready to generate the new Data Flow with the xsd2er module. We need to use the use-stats parameter (-k switch) and also the -g switch for the optimisation algorithm (elevate, reuse, or both elevate and reuse) to generate the new Data Flow.

In the second step we will use xsd2er module to generate new Data Flow.

Example of output

The generated Data Flow ID (logical 3) can be used in a Conversion task to convert XML documents to a relational Target Schema.

Example of Target Schema

We have made the files we used in this blog post available for download

Conclusion

We can create a relational target schema from an XSD, a sample of XML/JSON files, or a combination of the two. As we have outlined there are advantages and disadvantages for each approach.

In one of our next posts we will show you how we can handle changes to the Source Data in Flexter and options for evolving a target schema.

Why did we create Flexter?

Flexter was born from our own frustration of converting XML and JSON documents into a data warehouse. We saw many projects fail or run over budget. Why waste weeks of converting XML/JSON and risking failure instead of focusing on delivering real value to your business users? We think that the time spent writing manual code to convert the data to It just takes too much time of converting the data

Our customers are delighted with Flexter. Hear what they say

“This will save us weeks”

“You did in one day what we could not achieve in 3 years”

And this is what Ralph Kimball, one of the fathers of data warehousing has to say about XML conversion:

“The construction of an XML parser is a project in itself - not to be attempted by the data warehouse team”. Ralph Kimball, ETL Toolkit

We provide Flexter in three editions:

You can try out the free version of Flexter for JSON or XML.

Our enterprise edition can be installed on a single node or for very large volumes of XML on a cluster of servers.

Our SaaS edition, Flexter as a Service (FaaS), is managed by Sonra. You can use the Flexter API to convert your data.

If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter or reach out to us directly.

Last but not least we have open sourced Paranoid, our solution to obfuscate data in XML and JSON documents

Who uses Flexter?

Companies in travel, finance, healthcare, automotive, finance, and insurance industry use Flexter.

 

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a non for profit global data management organization. He has co-founded the Irish Oracle Big Data User Group.