
The future of ETL and the limitations of data virtualisation and NoETL.

October 11, 2018

Data has limited value if we don’t transform, integrate, model (both data modelling and building predictive models) or cleanse it. Collecting raw data for no apparent reason or business case just feeds the widespread disorder of data hoarding.
The concept of NoETL and related ideas of data federation for logically integrating data from disparate data stores (now rebranded as data virtualisation) only work for a limited set of use cases.
AI aided/ontology driven ETL looks promising but is still in its infancy.

The future of ETL looks bright

The future of ETL, and of ETL engineers, looks bright. We’ve always had ETL in data warehousing. But apart from this obvious use case, ETL is everywhere. Let me give a few examples.

ETL for data science

A huge portion of data science is actually ETL, and each data scientist should own an ETL engineer :-). For some reason data scientists call ETL data wrangling (data wrangling just sounds much cooler, I know). Anyway, data science is actually quite straightforward and simple. The difficult bits are thinking outside the box and coming up with good features, finding the data that underpins them, or coming up with correlated substitute features where data can’t be accessed or is unavailable.
Some examples: realising that the world wide web and its hyperlinks form a graph, and that graph algorithms can be applied to power a search engine, is one. Feeding the translations produced by EU and UN institutions as training data into NLP algorithms to create a working machine translation app is another. Both come from Google.
The other difficult and time consuming work of data scientists is to transform data into features that can be fed to predictive models. This is a two-step process. First, we integrate, cleanse and transform the raw data from the sources. This is classic ETL. If the data has already been integrated in the data warehouse and can be pulled from there, this step is not required. Second, features need to be generated from this high quality data. Typically this step takes a lot less time and can be automated to create variations of features, e.g. my friends at Xpanse Analytics are automating this step. As the largest amount of work in data science is actually ETL, it makes sense to have ETL engineers on your data science team and leave the work of building models to the data scientists. After all, data science is a team sport and requires a lot of different skills and roles.
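To make the second step concrete, here is a minimal sketch of feature generation in SQL, assuming a hypothetical, already cleansed transactions table (customer_id, txn_date, amount); the table and column names are made up for illustration.

-- Sketch only: derive simple per-customer features from a hypothetical,
-- already cleansed transactions table produced by the classic ETL step.
SELECT
    customer_id,
    COUNT(*)      AS txn_count_90d,
    SUM(amount)   AS spend_90d,
    AVG(amount)   AS avg_txn_amount_90d,
    MAX(txn_date) AS last_txn_date
FROM transactions
WHERE txn_date >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY customer_id;

Automating this step largely means generating many variations of queries like this one over different time windows, aggregates and entities.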

ETL for structured data

Based on my own experience, most data engineers that come to the field from a pure engineering background have very limited knowledge of relational databases and despise SQL.
You will hear them moan endlessly about SQL not being a proper language or being limited in the type of data transformations you can apply. While this is true for certain edge cases, e.g. very complex hierarchies and graph-like structures, it does not apply to 99%+ of scenarios. Using SQL, window functions, user defined aggregate functions, and some of the proprietary SQL extensions of individual vendors will get you a very long way (no stored procedures or other procedural logic allowed, please). The problem often is not SQL but the limited skills of the data engineers tasked to use it.
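To illustrate the point, here is a minimal sketch of a very common ETL task, deduplicating a staging table with a window function; the customer_staging table and its columns are hypothetical.

-- Keep only the most recent record per customer from a hypothetical staging table.
SELECT customer_id, email, loaded_at
FROM (
    SELECT
        customer_id,
        email,
        loaded_at,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
    FROM customer_staging
) t
WHERE rn = 1;

The same pattern, a window function plus a plain filter, covers a surprising share of everyday transformation logic.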
SQL is the lingua franca of data. It’s battle tested and has stood the test of time, or, as Nassim Taleb would say, SQL is anti-fragile.
If a book has been in print for forty years, I can expect it to be in print for another forty years. But, and that is the main difference, if it survives another decade, then it will be expected to be in print another fifty years. This, simply, as a rule, tells you why things that have been around for a long time are not “aging” like persons, but “aging” in reverse. Every year that passes without extinction doubles the additional life expectancy. This is an indicator of some robustness. The robustness of an item is proportional to its life! (Nassim Taleb, Anti-fragile)
Using SQL to transform structured data works very well. For unstructured data this is slightly different.

ETL for unstructured data

Most databases now have SQL extensions for JSON. As I have outlined in my post JSON: ETL or NoETL, you are better off converting the hierarchical structure of JSON and XML to a relational schema. Flexter, our ETL tool for JSON and XML, does exactly that: it automates the whole conversion process from these data formats to a relational format.
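As a hedged example of what these SQL extensions look like, the sketch below flattens a JSON document into relational rows using PostgreSQL-style jsonb functions; the orders table and the JSON layout are assumptions made for illustration.

-- PostgreSQL-style sketch: flatten a hypothetical JSON order document
-- into one relational row per order line.
SELECT
    o.order_id,
    o.payload ->> 'customer'   AS customer,
    line ->> 'sku'             AS sku,
    (line ->> 'quantity')::int AS quantity
FROM orders o
CROSS JOIN LATERAL jsonb_array_elements(o.payload -> 'lines') AS line;

Queries like this have to be written and maintained by hand for every document type, which is exactly the conversion work Flexter automates.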
It’s a very different story for truly unstructured data.
While you can store unstructured data (text, audio, video, images) in most databases in some sort of BLOB/CLOB/binary data type, this is not ideal. Similarly, it is not a good idea to perform ETL on unstructured data in a database. However, it is an excellent idea to store the output of unstructured ETL in a relational database and a data warehouse.
The term unstructured suggests that this type of data does not have structure. This is not the case. The structure of this data is just not easy to detect and extract. Hence some people refer to unstructured data as multi-structured data. I don’t think this term makes it any clearer either. Multi-structured? What’s it supposed to mean? Maybe complex structured would be better. Or just stick to unstructured. But that’s just semantics and let’s not get distracted.
So what does ETL for unstructured data look like? There are different types of ETL depending on the type of unstructured data we have at hand. Some of these techniques are applicable across all types, others are specific to a particular type of unstructured data, e.g. transcription is applicable to audio and video. Entity extraction, which allows us to detect people, locations, products etc., can be applied to all types. Creating an inverted index for keyword search only applies to text (including audio and video after transcription to text).
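To give a flavour of the inverted-index technique, here is a naive sketch in PostgreSQL-flavoured SQL; the documents table is hypothetical, and a real implementation would add proper tokenisation, stemming and stop-word handling (or simply use the database’s built-in full-text search).

-- Naive inverted index: one row per distinct term, listing the documents it appears in,
-- built from a hypothetical documents(doc_id, body) table.
SELECT
    lower(term)                AS term,
    array_agg(DISTINCT doc_id) AS doc_ids
FROM documents,
     regexp_split_to_table(body, '\s+') AS term
GROUP BY lower(term);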
As Bill Inmon outlines in his latest book Turning Text into Gold (which I highly recommend), converting text to a structured format can be a complex process. The main difficulty is dealing with ambiguities around the context of the text, which requires us to apply various techniques such as homograph resolution, proximity analysis, negativity inference etc.

The case against data virtualisation in ETL

Data virtualisation is one of those buzzwords. It can work for some edge cases, but by and large it is blown out of proportion by vendors’ marketing departments. It just does not work for ETL (in particular complex ETL). Let me explain. Data virtualisation can be defined as providing a logical view over data stored in disparate storage engines.
Let’s look at an example. We might have multiple storage engines, e.g. S3, an Oracle database, a Redshift database, and a SQL Server database. Successful data virtualisation for this scenario depends on many variables, e.g. the types of queries we run, the quality of the storage plugins (can we push down predicates/filters, joins etc.), the type of storage (columnar or row based), the data volumes involved, the quality of the network and its latency, the distance between the storage engines (are they on the same LAN etc.), the caching features of the virtualisation layer, indexes etc. As you can see, there are many factors to consider. As a result, data virtualisation by its nature can be a complex affair. For simple scenarios it works reasonably well, e.g. joining a lookup table in Oracle to a large event table on S3 should not be a problem. However, once queries get more complex, things get difficult. One such anti-pattern is to join very large tables across storage layers with high network latency and a requirement to retrieve all table columns without filtering.
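To make the contrast concrete, here is a hedged sketch in Presto/Trino-style SQL, where tables in different storage engines are addressed through catalogs; all catalog, schema and table names are assumptions. The first query is the benign case, the second is the kind of cross-storage join that tends to fall over.

-- Benign: a small Oracle lookup table joined to filtered S3 events;
-- the date predicate can be pushed down to the large event table.
SELECT c.country_name, COUNT(*) AS events
FROM s3_catalog.web.events e
JOIN oracle_catalog.ref.country c ON c.country_id = e.country_id
WHERE e.event_date = DATE '2018-10-01'
GROUP BY c.country_name;

-- Anti-pattern: all columns of two very large tables joined across storage
-- engines with no filter, forcing massive data movement over the network.
SELECT e.*, o.*
FROM s3_catalog.web.events e
JOIN sqlserver_catalog.sales.orders o ON o.session_id = e.session_id;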
Typically in ETL we run exactly these types of workloads: we select all columns in complex multi-table joins, with complex inter-row transformations and without filters. It gets more complicated. One component of ETL is data cleansing, which is near impossible to achieve without multiple passes over the data and physically setting it down. One extreme example is entity resolution.
One more argument against data federation/virtualisation is that data ingestion from multiple sources into a single storage layer can be completely automated, or is of very low complexity if done manually.
A further argument is that you can’t keep an audit trail of the modifications to the source data via change data capture.
And one very last point: constantly hammering the source systems will not make you good friends with the DBAs of those systems.

What about Logical ETL?

Logical ETL is an approach used in data warehousing to create business transformations as logical views without physically setting down the data. There are two scenarios where you can use this approach.

  • Create logical data marts on top of a raw/business data vault or, if you have a more traditional approach, an integration/foundation layer in 3NF (see the sketch after this list). This can work quite well and I would have few concerns deploying virtual marts into production. In the foundation layer/business vault our data has already been integrated and cleansed, so we don’t require complex transformations. If we run into performance issues we can convert the virtual mart to a physical one with very little effort.
  • Create logical views on top of a staging area/raw data reservoir/raw data vault. This is similar to ETL data virtualisation, though not quite as bad. Both have in common that we still need to apply more or less complex transformations on the fly. There is one significant difference though: in logical ETL we don’t pull from different storage layers but from a staging area on a single storage technology. This makes it less complex and should reduce latency. And indeed, logical ETL works very well for prototyping. For relatively simple transformation requirements it can also be deployed into production.
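As a minimal sketch of the first scenario, a virtual mart object is nothing more than a view over the already integrated foundation layer/business vault; all object names below are hypothetical, and converting the virtual mart to a physical one simply means materialising the same query as a table.

-- A logical data mart dimension defined as a view over a hypothetical,
-- already integrated and cleansed foundation-layer table.
CREATE VIEW mart.dim_customer AS
SELECT
    customer_key,
    customer_name,
    country,
    customer_since
FROM foundation.customer
WHERE is_current = 'Y';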

With logical ETL you gain agility at the cost of performance and possible redevelopment work in the future.
I think I have made my point. ETL will only increase in relevance over the coming years (data science) and its future looks brighter than ever. NoETL and data virtualisation for ETL are over-hyped and I personally don’t see any mainstream future for these technologies.
If you want to find out more about running an efficient ETL pipeline, I recommend my popular training course Big Data for Data Warehouse Professionals. You will find out how to make your ETL more agile and resilient, how to efficiently build data pipelines, about the future of ETL tools and much more.