Generating a relational target schema. From XSD. From XML. From XML and XSD
Flexter, our data warehouse automation solution for the conversion of XML, JSON, industry data standards (HL7, FpML, ISO 20022 etc.) automatically creates Target Schema when we create a Data Flow. It allows us to see relational representation of our XML or JSON documents.
In this post we show you how easy it is to create a Target Schema from an XSD, a sample of XML documents, and from a combination of XSD and a sample of XML files. When you create a Target Schema, you can optionally apply the elevate or re-use Optimisation
Table of Contents
Creating Target Schema and Data Flow
A Data Flow is a core concept in Flexter. It maps the data points in the Source to the data points in the Target Schema.
Before we create a Conversion of XML or JSON documents with Flexter, we first have to create a Data Flow. We can then use this Data Flow in a Conversion Task. We are able to re-use the same Data Flow in one or more conversions.
When you create a Data Flow, Flexter collects Statistics (optional), creates the Target Schema and the mapping between Source and Target Schema.
Once the Data Flow has been created we can use it N number of times to convert Source Data. The conversion happens in a Conversion task.
Depending on the available type of Source (XML, JSON, XSD), the steps to create a Data Flow vary.
Creating Target Schema from XSD
Using a Source Schema (XSD) to create a Data Flow has some advantages and disadvantages:
Pros:
- The Target Schema you generate from an XSD is comprehensive. Any XML document that conforms to the XSD can be processed, All the possible data points from the input are mapped to the output with their cardinality and datatypes known and constant. Note this remains true as long as the Source Data fully conform to the XSD definitions used.
- The Target Schema can be built without having to rely on a representative sample of XML documents or even the complete set of XML documents. This makes it much faster to generate the Target Schema.
- It can optimise the output by analysing shared types and creating a single table of these shared types in the target schema (Reuse Optimisation). This simplifies the Target Schema.
Cons:
- XSDs that are used in industry data standards typically model a huge number of business processes. In reality a company typically only implements a subset of the business processes modeled in the XSD. As a result the Target Schema contains many tables and columns that are not used or populated. They are redundant. This makes the Target Schema unnecessarily complex and difficult to work with.
- Not ideal for deep recursive XPaths. Such cases require more memory to build and store as metadata.
You create a Data Flow from a Source Schema (XSD) in a single step. This is done in a single command with optional and mandatory parameters. Flexter automatically creates the Target Schema and Mapping for you.
A Conversion task is a single command with optional and mandatory parameters. A successful Conversion task loads the Source Data into the Target Format, e.g. from an XML archive on an FTP server to a Redshift database.
We will run a command with the Elevate optimisation.
1 |
xsd2er -g1 <INPUTPATH> |
Example of output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
20:00:35.986 INFO Building metadata 20:00:36.064 INFO Writing metadata 20:00:36.120 INFO Generating the mapping: elevate,reuse 20:00:36.210 INFO Registering success of job 3 20:00:36.225 INFO Finished successfully in 1058 milliseconds # schema origin: 3 logical: 1 job: 3 # statistics load: 447 ms parse: 365 ms build: 83 ms write: 56 ms map: 92 ms xpaths: 16 |
At the end of the successful XSD analysis process, Flexter prints out the ID of the Data Flow (logical). In the example above, the ID of the Data Flow is 1.
This should be noted and later used in your Conversion task.
Example of Target Schema
Creating Target Schema from XML only
It is possible to convert XML documents without a Source Schema (XSD). For JSON documents this is the only option as Flexter does not currently support JSON schema.
You need to provide a representative sample of XML/JSON documents for this to work best. Alternatively, you can provide all of the available XML or JSON documents. Flexter collects information from the data sample and stores it in the Metadata DB.
Here is a list of advantages and disadvantages when using a sample of XML documents.
Pros:
- It makes XML processing possible when you don’t have an XSD.
- It makes it possible to build a Target Schema for JSON files
- It generates a set of lightweight metadata: only what is really present in the Source Data is reflected.
- It may offer an alternative solution for scenarios where the XSD and XML samples don’t match, or have different versions.
- It’s ideal for recursive XPaths.
Cons:
- The Reuse optimisation is not applicable.
- It requires collecting statistics over XML/JSON data before it can be translated into a Target Schema. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data to collect Statistics. This may be time consuming.
With Source Data as your only Source you create a Data Flow in two steps. You first collect Statistics from the Source Data. Flexter collects Statistics such as the XML elements, the data types etc.
This step can be repeated N number of times to collect Statistics incrementally. Statistics can optionally be collected incrementally each time you run a conversion process to detect if the source XML documents have changed.
Using the collected Statistics you create the Data Flow in a second step.
For this showcase we will run a command that will do both steps at the same time. We will run it with Elevate optimisation.
1 |
xml2er -g1 <INPUTPATH> |
Example of output
1 2 3 4 5 6 7 8 9 10 11 |
# schema origin: 4 logical: 2 job: 4 # statistics startup: 4107 ms parse: 6855 ms xpath stats: 436 ms doc stats: 4744 ms xpaths: 19 | map:0.0%/0 new:100.0%/19 documents: 1 | suc:100.0%/1 part:0.0%/0 fail:0.0%/0 |
This command collects Statistics (origin ID 4). At the same time it also creates a Data Flow (logical ID 2).
Example of Target Schema
Creating Target Schema from XSD and XML
Generating a Target Schema based on both a Source Schema and Source Data allows you to combine the information from both Sources. This approach tends to give you the best results as you can combine the information Flexter infers from both the Source Data and Source Schema.
Pros:
- It generates the most optimized logical schema, based on information from both Sources.
- The Target Schema is simplified. You can augment the information from the XSD with the intelligence collected from the Source Data documents, e.g. only generating the Data Points in the Target Schema that are in use.
- It lets you process recursive XPaths found in XSD definitions with Statistics.
- Reuse optimisation is applicable
Cons:
- It does not work with JSON documents
- It requires collecting statistics over XML/JSON data before it can be translated into target. If you can’t easily derive a representative sample of XML files you will require the full set of Source Data. This may be time consuming.
With Source Data and a Source Schema you create a Data Flow in two steps. You first collect Statistics from the Source Data. This step can be repeated n number of times. Using the collected Statistics AND a Source Schema you create a Data Flow in a second step.
In the first step we will collect Statistics from the Source Data.
1 |
xml2er <INPUTPATH> |
Example of output
1 2 3 4 5 6 7 8 9 10 |
# schema origin: 4 job: 4 # statistics startup: 4107 ms parse: 6855 ms xpath stats: 436 ms doc stats: 4744 ms xpaths: 19 | map:0.0%/0 new:100.0%/19 documents: 1 | suc:100.0%/1 part:0.0%/0 fail:0.0%/0 |
We have generated a new set of Statistics (origin 4). We are now ready to generate the new Data Flow with the xsd2er module. We need to use the use-stats parameter (-k switch) and also the -g switch for the optimisation algorithm (elevate, reuse, or both elevate and reuse) to generate the new Data Flow.
1 |
-k, --use-stats ID... Use Statistics to generate new data flow |
In the second step we will use xsd2er module to generate new Data Flow.
1 |
xsd2er -k4 -g1 <INPUTPATH> |
Example of output
1 2 3 4 5 6 7 8 9 10 11 12 |
# schema origin: 7 logical: 3 job: 6 # statistics load: 1608 ms stats: 40 ms parse: 368 ms build: 121 ms write: 47 ms map: 128 ms xpaths: 16 |
The generated Data Flow ID (logical 3) can be used in a Conversion task to convert XML documents to a relational Target Schema.
Example of Target Schema
We have made the files we used in this blog post available for download
Conclusion
We can create a relational target schema from an XSD, a sample of XML/JSON files, or a combination of the two. As we have outlined there are advantages and disadvantages for each approach.
In one of our next posts we will show you how we can handle changes to the Source Data in Flexter and options for evolving a target schema.
Why did we create Flexter?
Flexter was born from our own frustration of converting XML and JSON documents into a data warehouse. We saw many projects fail or run over budget. Why waste weeks of converting XML/JSON and risking failure instead of focusing on delivering real value to your business users? We think that the time spent writing manual code to convert the data to It just takes too much time of converting the data
Our customers are delighted with Flexter. Hear what they say
“This will save us weeks”
“You did in one day what we could not achieve in 3 years”
And this is what Ralph Kimball, one of the fathers of data warehousing has to say about XML conversion:
“The construction of an XML parser is a project in itself – not to be attempted by the data warehouse team”. Ralph Kimball, ETL Toolkit
We provide Flexter in Enterprise Edition:
Our enterprise edition can be installed on a single node or for very large volumes of XML on a cluster of servers.
If you have any questions please refer to the Flexter FAQ section. You can also request a demo of Flexter or reach out to us directly.
Last but not least we have open sourced Paranoid, our solution to obfuscate data in XML and JSON documents
Who uses Flexter?
Companies in travel, finance, healthcare, automotive, finance, and insurance industry use Flexter.