
To hell and back with ETL. The unstoppable rise of data warehouse automation.

May 22, 2020

A brief history of ETL

The rise of ETL and the rise of the data warehouse are tightly coupled. It’s a chicken-and-egg relationship: there would be no data warehouse without ETL and no ETL without the data warehouse. Every once in a while nonsense such as data virtualisation or NoETL pops up. However, if you want to succeed with your data warehouse you will need ETL.
You get an indication of the close-knit relationship between ETL and data warehousing by looking at the Google Trends graph below.

https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=data%20warehouse,etl
The first ETL programs were custom-developed in COBOL, C or C++. These days those languages are quite alien to data engineers. As you can imagine, using low-level languages like these was a very manual and labour-intensive process that required very specialised skills.
ETL tools started to become more and more popular in the 1990s. They took care of some of the complexity and abstracted some of the recurring patterns into a GUI. The idea was for less technical people to click together ETL pipelines in a graphical user interface, which is why these tools are often referred to as point-and-click tools.

The failure of ETL tools

Over time it became obvious that point and click ETL tools have serious limitations.

  • The abstractions that these tools offer to implement business logic are not rich enough to meet requirements. It takes more than a lookup to implement complex business logic, e.g. inter-row calculations; even moderately complex logic is impossible to implement.
  • Closely related to the issue of poor abstraction capabilities is poor productivity. Let me call it death by a thousand clicks. It is a lot quicker to write an SQL statement than to click and map something together in a point-and-click GUI. I know of many ETL tool implementations where the tool is only used for orchestration: all of the transformation logic is built in SQL and the tool is just used to build the pipeline and workflow. A very expensive way of dealing with your workflow or data orchestration requirement. 🙂
  • In many cases ETL tools offer poor performance as they use a row-based rather than a set-based approach. Some don’t have the ability to scale out across a cluster of nodes.
  • Poor transparency. The code that ETL tools generate internally is a black box. It is not transparent to the data engineer, which makes it difficult to debug and leads to long sessions with the vendor’s support team.
  • Poor integration with source control. A GUI is just a lot harder to integrate with granular version control than code.
  • Lineage. Paradoxically, ETL tools claim to offer data lineage support. However, as few people use them in the pointy-and-clicky way they are intended and instead write the business logic in SQL, the lineage does not actually exist.
  • Limited or no automation. Each data flow or layer in the data warehouse has to be clicked together.
  • While metadata may be stored somewhere it is often not accessible to the ETL engineers.
  • Fool-with-a-tool syndrome. When hiring ETL engineers, management looks for tool knowledge rather than foundational data warehouse knowledge. This is a recipe for disaster. I once interviewed an ETL engineer with many years of point and click experience who had never touched SQL. How you could create a successful data warehouse implementation with such resources is beyond me.
  • The business transformation logic and the data integration logic are tightly coupled. You will see business logic such as transformations mixed with data integration logic such as truncating a table, disabling indexes, collecting table statistics, CDC logic etc. You can’t change the business logic without having to change the data integration logic, and re-use of the data integration logic is not possible.

What is the alternative?

As a result of these limitations a new breed of tools emerged in the early noughties. About 15 years ago I had the pleasure of working with one of these tools for the first time. At the time the tool was called Sunopsis. It was later acquired by Oracle and licensed as Oracle Data Integrator. It is still widely used today and we still get thousands of hits a month on our website in relation to it.
Sunopsis was probably one of the first data warehouse automation tools on the market. It was particularly popular with Teradata implementations as it used the Teradata MPP engine to scale ETL across a cluster of nodes. It followed a push-down ELT approach: it generated SQL code and executed that SQL directly on the data warehouse platform.

It had some of the core features of what is now referred to as data warehouse automation.

  • ELT. The tool generated SQL code and executed the code directly on the data warehouse platform of choice. You don’t need separate hardware and you avoid the black-box approach. ELT can be combined with point and click, and in that incarnation it should be avoided, e.g. Oracle Warehouse Builder follows this approach.
  • Automation was achieved through re-usable code templates which are fed by metadata and generate the ETL code in SQL. You can think of a code template as a function where you pass in a data set as input and get a data set as output. Optionally you can also design the function to pass in additional parameters to complement what you have in your metadata or to override it.
  • Separation of business logic from data integration logic. You can change one without affecting the other. You can re-use both.

We will go through some examples in a minute.

Automation

But let’s take a step back. Before we look at automation in a data warehouse context, let’s look at what automation means in general and what its benefits are.
Automation, or robotisation, is the method by which a process or procedure is performed with minimal human assistance and intervention.
Before a process can be automated, a few things need to fall into place:

  • A pattern of recurring steps exists and we are able to identify it.
  • We need to perform the pattern more than just a few times. Otherwise the overhead of automation is greater than the benefits.
  • The pattern is well understood. There is no point in automating something that needs to be changed over and over again.
  • Automation is cheaper than running the process manually.
  • The type of automation does not require human domain knowledge. Some recurring patterns can not be automated because they require human intervention, e.g. driving a car still can not be fully automated. The same applies to creating an enterprise data model.
  • Even though you can automate something, it may not be desirable for ethical reasons or because of side effects, e.g. mass unemployment.

Benefits of automation

  • Increase in productivity
  • Faster time to market. No long project and development lifecycles.
  • Increase in accuracy. Humans make mistakes. By automating a process you eliminate errors and as a result increase quality.
  • Decrease in boredom and alienation from repetitive tasks. Resources can work on interesting problems that add value to the organisation.

Data warehouse automation

How does the concept of automation relate to the idea of automating a data warehouse?
First of all a word of warning. Data warehouse automation is not a silver bullet. It does not automagically build the data warehouse for you. However, there are recurring patterns when building a data warehouse that can be automated. Some others can’t.
As an example: there are some tools that use a rules-based approach to generate data models for you. This is nonsense and results in data-source-centric data models. Data modelling requires human domain knowledge and can not be automated without external context. If it can be done at all, the way to go is with semantics, ontologies, and taxonomies. My advice is to stay away from these types of tools.
Data warehouse automation follows the same principle as automation in general. It also has the same benefits.
In data warehousing and data integration we come across recurring patterns, just like in most other domains. Many of these patterns lend themselves to automation, taking into account the caveats I mentioned earlier, e.g. the overhead of automation should not exceed the benefits.
Here are some examples of recurring patterns in data integration:

  • Loading a staging area
  • Loading a persistent staging area
  • Loading a Slowly Changing Dimension
  • Loading a Hub in a Data Vault

We can automate all of these processes with the right metadata. Let’s take the loading of a staging area as an example. If you have a relational database as a source, all you need to know are the connection details and the names of the tables you want to load to fully automate the generation of your staging area. Using this input you can connect to the source database, collect metadata from the information schema, translate this information, e.g. data types, to your target technology, and pull the data across to the target data warehouse.
You can take this a step further and also automatically maintain the full history of changes in a persistent staging area. The only additional information you need is a unique key. You can either get this information from the information schema of the source database, derive it from samples, or feed it into the process from outside as a parameter.
You can take this one step further still and build some sort of simple schema evolution into the process by monitoring the columns in the information schema of the source database and checking if any columns were added or deleted. Based on this information you can dynamically adapt the structure of your staging table and the SQL you generate to load the data.
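Here is a minimal sketch of the first of these steps in Python, assuming a DB-API cursor on the source database. The type mapping and all names are illustrative, not a complete implementation:

```python
# Sketch only: generate staging DDL and an extract query from source metadata.
# Assumes a DB-API cursor on the source database (a driver that uses %s
# placeholders); the type mapping is deliberately simplistic.

TYPE_MAP = {"varchar": "VARCHAR(4000)", "int": "INTEGER", "datetime": "TIMESTAMP"}

def stage_table_statements(cursor, source_schema, table, target_schema="staging"):
    """Read column metadata from the source information schema and generate
    the CREATE TABLE for the staging area plus the extract query."""
    cursor.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
        """,
        (source_schema, table),
    )
    columns = cursor.fetchall()

    column_defs = ", ".join(
        f"{name} {TYPE_MAP.get(dtype.lower(), 'VARCHAR(4000)')}"
        for name, dtype in columns
    )
    column_list = ", ".join(name for name, _ in columns)

    create_sql = f"CREATE TABLE {target_schema}.{table} ({column_defs})"
    extract_sql = f"SELECT {column_list} FROM {source_schema}.{table}"
    # The extracted rows would then be bulk loaded into the target platform.
    return create_sql, extract_sql
```

The persistent staging area and schema evolution variants follow the same pattern: a bit more metadata in, slightly different SQL out.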
What all of these examples have in common is that they rely on a rich set of metadata being available.

Metadata is everywhere

Let’s take a common scenario as an example: Loading a fact table in overwrite mode from a couple of staging tables.
Both the source tables and the target table come with metadata, e.g. the names of the columns, the data types, information on keys and relationships, the mapping between source and target columns etc. This information can be stored in a metadata catalog, e.g. the information schema of a database is such a catalog.

We add the data integration logic and the business transformation logic to the metadata mix and can now fully automate the process.
The business transformation logic comes in the form of an SQL statement, in this particular example a very simple one.
The data integration logic comes in the form of a metadata-driven and parameterised code template (a function). In this particular example it is a simple overwrite in an Oracle database.
The advantage is that we separate the business transformation logic from the data integration logic. A change to a code template / function is applied to every instance where the template is used, and a change to the business logic does not require a change to the data integration logic.
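To make this concrete, here is a hedged illustration (the view, table, and column names are made up rather than taken from the original screenshots). The business transformation logic is plain SQL and the mappings are plain metadata, with no integration plumbing in sight:

```python
# Illustration only: the business transformation logic is plain SQL and the
# mappings are just metadata. Neither knows anything about truncates,
# statistics collection or any other data integration plumbing.

BUSINESS_LOGIC_SQL = """
CREATE OR REPLACE VIEW v_order AS
SELECT o.order_id,
       o.customer_id,
       o.order_date,
       s.status_name
FROM   stg_order o
JOIN   stg_order_status s ON s.status_id = o.status_id
"""

FACT_ORDER_METADATA = {
    "target_table": "F_ORDER",
    "source_view": "V_ORDER",
    "column_mappings": {            # target column -> source column
        "ORDER_ID": "ORDER_ID",
        "CUSTOMER_ID": "CUSTOMER_ID",
        "ORDER_DATE": "ORDER_DATE",
        "STATUS_NAME": "STATUS_NAME",
    },
}
```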
 

Let’s have a look inside the code template.
Our code template / function is pretty simple for illustration purposes; in the real world you can implement any level of complexity. It comprises three recurring steps. In the first step we truncate the target table. We get the name of the target table from the metadata.
In the second step we inject the metadata, such as the target table name, the source table name (in our case a database view), and the source-to-target column mappings, into the code template to generate the INSERT logic. Optionally we add an ORDER BY clause to pre-sort the data. The name of the sort column can be retrieved from the metadata or overridden by passing it in as a parameter.
In the last step we collect statistics on the target table for the optimizer.
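A minimal version of such a code template as a Python function, Oracle-flavoured and purely illustrative (a real template would also handle quoting, logging, and error handling):

```python
# Sketch of the overwrite code template: truncate, insert, gather statistics.
# Everything is driven by metadata; order_by is an optional override parameter.

def overwrite_load(metadata, order_by=None):
    """Generate the three statements of the overwrite load from metadata."""
    target = metadata["target_table"]
    source = metadata["source_view"]
    mappings = metadata["column_mappings"]

    target_columns = ", ".join(mappings.keys())
    source_columns = ", ".join(mappings.values())

    # Step 1: truncate the target table (table name comes from the metadata)
    truncate_sql = f"TRUNCATE TABLE {target}"

    # Step 2: generate the INSERT from the source-to-target column mappings
    insert_sql = (
        f"INSERT INTO {target} ({target_columns}) "
        f"SELECT {source_columns} FROM {source}"
    )
    if order_by:  # optional pre-sort, taken from metadata or passed in
        insert_sql += f" ORDER BY {order_by}"

    # Step 3: collect optimizer statistics on the target table (Oracle)
    stats_sql = (
        "BEGIN DBMS_STATS.GATHER_TABLE_STATS("
        f"ownname => USER, tabname => '{target}'); END;"
    )

    return [truncate_sql, insert_sql, stats_sql]
```

Calling overwrite_load(FACT_ORDER_METADATA, order_by="ORDER_DATE") returns the three statements, ready to be executed against the database.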

  • We can now re-use the code template to load other target tables, e.g. F_ORDER_ITEM instead of F_ORDER.
  • We could easily swap out the code template for an Upsert / Merge if our data integration requirement changes.
  • We can easily modify the business logic in case it changes.
  • Everything can easily be put under source control.
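Re-use then simply means calling the same template with different metadata, and swapping the integration pattern means swapping the template function rather than touching the business logic. A short, illustrative sketch (merge_load is hypothetical):

```python
# Same template, different metadata: nothing else has to change.
statements = overwrite_load(FACT_ORDER_METADATA)

FACT_ORDER_ITEM_METADATA = {
    "target_table": "F_ORDER_ITEM",
    "source_view": "V_ORDER_ITEM",
    "column_mappings": {
        "ORDER_ITEM_ID": "ORDER_ITEM_ID",
        "ORDER_ID": "ORDER_ID",
        "QUANTITY": "QUANTITY",
    },
}
statements = overwrite_load(FACT_ORDER_ITEM_METADATA)

# If the integration requirement changes to an upsert, only the template is
# swapped, e.g. a hypothetical merge_load() that generates a MERGE statement:
# statements = merge_load(FACT_ORDER_ITEM_METADATA)
```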

Here is a final example.
We can just as easily swap out the target table, e.g. F_ORDER_AGG instead of F_ORDER. Notice that the business transformation logic changes as well.
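In code terms, only the business transformation SQL and the metadata change; the overwrite template stays exactly the same (again, all names are illustrative):

```python
# Different business transformation logic (an aggregation) and a different
# target table; the overwrite template itself is untouched.

AGG_BUSINESS_LOGIC_SQL = """
CREATE OR REPLACE VIEW v_order_agg AS
SELECT customer_id,
       TRUNC(order_date, 'MM') AS order_month,
       COUNT(*)                AS order_count
FROM   v_order
GROUP BY customer_id, TRUNC(order_date, 'MM')
"""

FACT_ORDER_AGG_METADATA = {
    "target_table": "F_ORDER_AGG",
    "source_view": "V_ORDER_AGG",
    "column_mappings": {
        "CUSTOMER_ID": "CUSTOMER_ID",
        "ORDER_MONTH": "ORDER_MONTH",
        "ORDER_COUNT": "ORDER_COUNT",
    },
}
statements = overwrite_load(FACT_ORDER_AGG_METADATA)
```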

Data warehouse automation for XML and JSON

The last example uses data warehouse automation to convert XML or JSON to a relational format. For simple XML it may make sense to write some code to do the conversion by hand.
However, once your XML gets complex or you have large volumes of data, the overhead of developing and maintaining the conversion manually quickly becomes too much.
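To make the simple case concrete, here is a rough sketch using Python’s standard library (the XML shape is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A trivially simple document: one repeating element, hardly any nesting.
xml_doc = """
<orders>
  <order id="1"><customer>Acme</customer><amount>100.50</amount></order>
  <order id="2"><customer>Globex</customer><amount>75.00</amount></order>
</orders>
"""

# Flatten each <order> element into one relational row.
rows = [
    {
        "order_id": order.get("id"),
        "customer": order.findtext("customer"),
        "amount": float(order.findtext("amount")),
    }
    for order in ET.fromstring(xml_doc).findall("order")
]
print(rows)
```

As soon as the schema nests deeply or, like FpML, defines thousands of elements, this approach stops scaling.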
In the image below we have reverse-engineered the XSDs of the FpML standard in an XML editor.

As you can see, the resulting PDF spans 18 pages and contains thousands of elements. Having to deal with this manually adds a lot of overhead.
Our own tool Flexter can fully automate the conversion of XML or JSON to a relational database, to text, or to big data formats such as Parquet, ORC, and Avro.
Here is the optimised target schema that Flexter generates:

Conclusion

So what are my recommendations? Both manual coding and point and click ETL tools should be avoided. I am also not convinced by any of the existing data warehouse automation tools I have seen, apart from one. You can reach out to find out more.
At Sonra, we use Apache Airflow to build our own re-usable data warehouse automation templates. We have written a blog post that illustrates how you can leverage Airflow to automate recurring patterns in ETL. We have implemented this approach on Snowflake with many of our clients and it just works 🙂
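The sketch below is not that implementation, just a hedged illustration of the general pattern: a DAG driven by a list of table metadata that generates one load task per table from a re-usable template (the overwrite_load stub, DAG name, and table names are all made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Metadata that drives the DAG; in practice this would come from a catalog.
TABLES = [
    {"target_table": "F_ORDER", "source_view": "V_ORDER"},
    {"target_table": "F_ORDER_ITEM", "source_view": "V_ORDER_ITEM"},
]

def overwrite_load(metadata):
    """Stub of a re-usable code template: metadata in, SQL statements out."""
    return [
        f"TRUNCATE TABLE {metadata['target_table']}",
        f"INSERT INTO {metadata['target_table']} "
        f"SELECT * FROM {metadata['source_view']}",
    ]

def run_load(metadata, **_):
    # Placeholder: print instead of executing against the warehouse,
    # e.g. via a Snowflake connection in a real setup.
    for statement in overwrite_load(metadata):
        print(statement)

with DAG(
    dag_id="metadata_driven_loads",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per target table, all generated from the same template.
    for table in TABLES:
        PythonOperator(
            task_id=f"load_{table['target_table'].lower()}",
            python_callable=run_load,
            op_kwargs={"metadata": table},
        )
```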
If you need help or have any questions contact us.