Data warehouse automation explained. Benefits and use cases

by Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.

Published on May 13, 2019
Updated on November 20, 2024

What is data warehouse automation?

I have been struggling to find a good article or post that explains data warehouse automation. To address this gap I have decided to write my own post. I hope it helps. As always, feedback is welcome.
Data warehouse automation is not some sort of wizardry. It does not automagically build, model, and load a data warehouse for you. This level of automation is (for the time being at least) wishful thinking. As with any other type of automation, we identify recurring patterns and automate them instead of repeating them by hand.
Let’s look at ETL as an example. In data integration we come across the same patterns over and over again. Using data warehouse automation we can convert these design patterns into reusable templates of code. The template itself contains placeholders. These are populated from metadata when we run (instantiate) the code template.
Let me give you a simple example to illustrate the point. A very common (and simple) data integration pattern is to truncate-load a table. The truncate-load operation requires two steps.

1. Truncate the target table

2. Insert the data from one or more source tables into the target table
These steps are always the same. What changes are the instances of objects in the template: What is the name of our target table? What are the names of our source tables? What are the mappings between source and target tables? All of this information can be stored in a metadata catalog. From there it can be retrieved at runtime to populate the placeholders in our code template. With this information our engine can generate the code that implements and executes the data integration pattern.
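To make the idea concrete, here is a minimal sketch of a truncate-load code template whose placeholders are filled from metadata. It is an illustration only, not the implementation of any particular tool: the catalog layout, the table and column names, and the `render_truncate_load` function are all hypothetical, and the sketch assumes a single source table per target.

```python
from string import Template

# Hypothetical code template for the truncate-load pattern.
# $target, $source, $target_cols and $source_cols are the placeholders.
TRUNCATE_LOAD = Template(
    "TRUNCATE TABLE $target;\n"
    "INSERT INTO $target ($target_cols)\n"
    "SELECT $source_cols\n"
    "FROM $source;"
)

# Hypothetical metadata catalog: one entry per target table, holding the
# source table and the (source column, target column) mappings.
catalog = {
    "dim_customer": {
        "source": "stg_customer",
        "mappings": [("cust_id", "customer_id"), ("cust_name", "customer_name")],
    },
}


def render_truncate_load(target: str, catalog: dict) -> str:
    """Instantiate the template for one target table using catalog metadata."""
    entry = catalog[target]
    source_cols = ", ".join(src for src, _ in entry["mappings"])
    target_cols = ", ".join(tgt for _, tgt in entry["mappings"])
    return TRUNCATE_LOAD.substitute(
        target=target,
        source=entry["source"],
        source_cols=source_cols,
        target_cols=target_cols,
    )


sql = render_truncate_load("dim_customer", catalog)
print(sql)
```

Adding another table to the load then means adding a catalog entry, not writing more SQL by hand.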
If you are looking for a practical example of data warehouse automation and code templates in action, have a look at our blog post that shows how to create a code template for loading data from S3 to Redshift.
Similar to data integration, you can also automate certain rules to auto-generate a dimensional model from a normalised model. The approach works more or less well. It will not generate a production-ready model for you, but it does give you a starting point and adds agility to the process.
One crucial pillar for any data warehouse automation effort is the availability of a rich set of metadata.

What are the benefits of data warehouse automation?

By encapsulating recurring patterns in a code template and automating common tasks we gain a wide variety of benefits:

Increased productivity. The ability to reuse the same code template over and over again increases productivity. Our engineers have to write a lot less boilerplate code.
Using a code template will result in fewer bugs, a higher level of consistency, and higher quality of code.
Changes can be rolled out at the speed of light. Let’s assume our data integration pattern needs to be modified. We need to add a third step to the truncate-load pattern, e.g. to collect table statistics after the INSERT has completed. All we need to do is to add this step to the code template and by magic it is rolled out to all of the instances where it is used.
Using an automated approach to code generation aligns well with automated approaches to testing.
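The point about rolling out changes can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the pattern is modelled as a list of statement templates, the table names are made up, and `ANALYZE {target};` stands in for whatever statement your database uses to collect table statistics.

```python
# The truncate-load pattern, modelled as an ordered list of statement templates.
truncate_load_steps = [
    "TRUNCATE TABLE {target};",
    "INSERT INTO {target} SELECT * FROM {source};",
]

# Hypothetical instances of the pattern, as they might come from a metadata catalog.
instances = [
    {"target": "dim_customer", "source": "stg_customer"},
    {"target": "dim_product", "source": "stg_product"},
]


def generate(steps, instances):
    """Render every step of the pattern for every instance."""
    return {
        inst["target"]: [step.format(**inst) for step in steps]
        for inst in instances
    }


before = generate(truncate_load_steps, instances)

# One change to the template: collect statistics after the INSERT.
# The exact statement is database-specific; ANALYZE is an assumption here.
truncate_load_steps.append("ANALYZE {target};")

after = generate(truncate_load_steps, instances)

for target, statements in after.items():
    print(target, statements)
```

The template was edited once, and every generated load picked up the third step the next time the code was generated.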

Data warehouse automation. A sample use case

I have been working in data warehousing for the last 20 years. I have seen a lot of things come and go. Things have been hyped and then disappeared or become legacy. There are, however, some fundamental truths. Projects that involve complex XML as a data source either fail or run over time and budget. There are many reasons for this. I have listed the main ones here:

Many XML files are based on industry data standards. A data standard typically covers and standardises many different business processes. The business processes themselves contain many different entities. We have seen standards with hundreds or thousands of entities. You probably work with one of these standards on a daily basis without knowing it: Office Open XML. It is the standard that underpins Microsoft Office documents. The documentation for this standard has more than 8,000 pages. We have written in detail about this standard elsewhere on this blog: Liberating data from spreadmarts and Excel. As you can imagine, it takes a lot of time to understand the standard, and mapping the concepts back to XML is not straightforward. As a result, data analysts spend a lot of time trying to make sense of the standard.
Data engineers are good at working with databases, SQL, and scripting languages. They typically lack niche skills such as XSLT or XQuery that are needed to work with XML files. They don’t have any interest or incentive to acquire these esoteric skills. Rightly so in my opinion.
Standard data integration tools have limited support for working with XML. They typically just provide a GUI on top of XPath. The whole process is still very manual and time consuming. The approach works reasonably well for simple XML. Not so for complex data standards. Apart from the lack of automation, these tools also perform very poorly in our experience. We have seen ETL running for more than 24 hours for a relatively small number of 50,000 XML documents.
All of these issues lead to long analysis and development lifecycles, poor quality of code, badly performing data pipelines, and a lack of agility. In summary, there are significant risks to these projects.

The case for data warehouse automation for industry data standards and XML

Data warehouse automation for XML addresses all of these issues. As a result, data warehouse analysts and engineers can focus on adding value to the enterprise rather than converting XML from one format to another. Let’s look at some of the typical tasks that can be automated:

The analysis phase can be automated. We can collect information and intelligence such as data types, relationships between elements and types, data profiles, and data distribution from the source XML files.
The generation of the target data model to a database can be automated.
Using a metadata layer, the documentation can be auto-generated. Think of data lineage or source to target maps for the data analysts or ER diagrams of the target data models for the data engineers.
Logging of errors and issues can be automated.
Rogue XML files with unexpected structures can be auto-detected and parked for inspection.
Relationships and globally unique keys can be automatically defined.
Last but not least, the data warehouse target schema can be auto-populated.
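Two of the tasks above, collecting structural information from source XML and parking files with unexpected structures, can be illustrated with a short sketch. This is a toy example built on Python's standard library, not a description of how any product works: the sample documents, the notion of a "known" set of element paths, and the function names are assumptions made for illustration, and malformed XML is not handled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text: str) -> Counter:
    """Count how often each element path occurs in one XML document."""
    counts = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def find_rogue(documents: dict, known_paths: set) -> dict:
    """Return {document name: unexpected paths} for documents that contain
    element paths outside the known structure."""
    rogue = {}
    for name, text in documents.items():
        unexpected = set(element_paths(text)) - known_paths
        if unexpected:
            rogue[name] = unexpected
    return rogue


# Hypothetical sample documents.
docs = {
    "a.xml": "<order><id>1</id><item><sku>A</sku></item><item><sku>B</sku></item></order>",
    "b.xml": "<order><id>2</id><item><sku>C</sku><discount>5</discount></item></order>",
}

# Profile a reference document; repeated paths hint at one-to-many relationships.
profile = element_paths(docs["a.xml"])
known = set(profile)

# Documents with paths outside the known structure are parked for inspection.
rogue = find_rogue(docs, known)
print(profile)
print(rogue)
```

Here `/order/item` occurs twice in the reference document, which suggests a child table in the target model, and `b.xml` is flagged because of its unexpected `discount` element.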

With automation of these steps we get the usual benefits:

Code is consistent and of high quality
Performance has been optimized
Testing can be automated
Increased productivity and fewer bugs. Data engineers and analysts can focus on adding real value to your company.

Our own product Flexter is a data warehouse solution for complex data standards, XML, JSON, and APIs.
You can book a demo and find answers to FAQs on our website.
Our enterprise edition can be installed on a single node or, for very large volumes of XML, on a cluster of servers.


