What is data warehouse automation?
I have been struggling to find a good article or post that explains data warehouse automation. To address this gap I have decide to write up my own post. I hope it helps. As always, feedback is welcome.
Data warehouse automation is not some sort of wizardry. It does not automagically build, model, and load a data warehouse for you. This level of automation is (for the time being at least) wishful thinking. Like with any other type of automation we identify patterns and instead of repeating them many times we use automation.
Let's look at ETL as an example. In data integration we come across the same patterns over and over again. Using data warehouse automation we can convert these design patterns into reusable templates of code. The template itself contains placeholders. These are populated from metadata when we run (instantiate) the code template.
Let me give you a simple example to illustrate the point. A very common (and simple) data integration pattern is to truncate-load a table. The truncate-load operation requires two steps.
Truncate the target table
Insert the data from one or more source tables to the target table
These steps are always the same. What changes are the instances of objects in the template: What is the name of our target table? What are the names of our source tables? What are the mappings between source and target tables? All of this information can be stored in a metadata catalog. From there it can be retrieved at runtime to populate the placeholders in our code template. With this information our engine can generate the code that implements and executes the data integration pattern.
If you are looking for a practical example of data warehouse automation and code templates in action have a look at our blog post that shows how to create a code template for loading data from S3 to Redshift.
Similar to data integration, you can also automate certain rules to auto-generate a dimensional model from a normalised model. The approach works more or less well. It will not generate some production ready model for you. It does give you a starting point though. It gives you a starting point for your model and adds agility to the process.
One crucial pillar for any data warehouse automation effort is the availability of a rich set of metadata.
By encapsulating recurring patterns in a code template and automating common tasks we gain a wide variety of benefits:
- Increased productivity. The ability to reuse the same code template over and over again increases productivity. Our engineers have to write a lot less boilerplate code.
- Using a code template will result in less bugs, a higher level of consistency, and higher quality of code.
- Changes can be rolled out at the speed of light. Let’s assume our data integration pattern needs to be modified. We need to add a third step to the truncate-load pattern, e.g. to collect table statistics after the INSERT has completed. All we need to do is to add this step to the code template and by magic it is rolled out to all of the instances where it is used.
- Using an automated approach to code generation aligns well with automated approaches to testing.
I have been working in data warehousing for the last 20 years. I have seen a lot of things come and go. Things have been hyped and then disappear or become legacy. There are however some fundamental truths. Projects that involve complex XML as a data source either fail or run over time and budget. There are many reasons for this. I have listed the main ones here:
- Many XML files are based on industry data standards. A data standard typically covers and standardises on many different business processes. The business processes themselves contain many different entities. We have seen standards with hundreds or thousands of entities. Probably without knowing you are working with one of these standards on a daily basis: Office Open XML. It is the standard that underpins Microsoft office documents. The documentation for this standard has more than 8,000 pages. We have written in detail about this standard elsewhere on this blog: Liberating data from spreadmarts and Excel. As you can imagine it requires a lot of time to understand the standard and mapping back the concepts to XML is not straightforward. As a result, data analysts spend a lot of time trying to make sense of the standard.
- Data engineers are good at working with databases, SQL, and scripting languages. They typically lack the niche skills such as XSLT or XQuery to work with XML files. They don’t have any interest or incentive to acquire these esoteric skills. Rightly so in my opinion.
- Standard data integration tools have limited support for working with XML. They typically just provide a GUI on top of XPath. The whole process is still very manual and time consuming. The approach works reasonably well for simple XML. Not so for complex data standards. Apart from the lack of automation these tools also show terrible performance from our experience. We have seen ETL running for more than 24 hours for a relatively small number of 50,000 XML documents.
- All of these issues lead to long analysis and development lifecycles, poor quality of code, badly performing data pipelines, and a lack of agility. In summary, there are significant risks to these projects.
Data warehouse automation for XML addresses all of these issues. As a result, data warehouse analysts and engineers can focus on adding value to the enterprise rather than converting XML from one format to another. Let’s look at some of the typical tasks than can be automated:
- The analysis phase can be automated. We can collect information and intelligence such as data types, relationships between elements and types, data profiles, and data distribution from the source XML files.
- The generation of the target data model to a database can be automated.
- Using a metadata layer, the documentation can be auto-generated. Think of data lineage or source to target maps for the data analysts or ER diagrams of the target data models for the data engineers.
- Logging of errors and issues can be automated.
- Rogue XML files with unexpected structures can be auto-detected and parked for inspection.
- Relationships and globally unique keys can be automatically defined.
- Last but not least, the data warehouse target schema can be auto-populated.
With automation of these steps we get the usual benefits:
- Code is consistent and of high quality
- Performance has been optimized
- Testing can be automated
- Increased productivity and less bugs. Data engineers and analysts can focus on adding real value to your company.
Our own product Flexter is a data warehouse solution for complex data standards, XML, JSON, and APIs.
We have recently released a SaaS version of the product and are looking for beta testers.
Our enterprise edition can be installed on a single node or for very large volumes of XML on a cluster of servers.