
The future of ETL and the limitations of data virtualisation and NoETL.

October 11, 2018

Data has limited value if we don’t transform, integrate, model (both data modelling and building predictive models) or cleanse it. Collecting raw data for no apparent reason or business case just feeds the widespread disorder of data hoarding.
The concept of NoETL and related ideas of data federation for logically integrating data from disparate data stores (now rebranded as data virtualisation) only work for a limited set of use cases.
AI aided/ontology driven ETL looks promising but is still in its infancy.

The future of ETL looks bright

The future of ETL, and of ETL engineers, looks bright. We’ve always had ETL in data warehousing. But apart from this obvious use case, ETL is everywhere. Let me give a few examples.

ETL for data science

A huge portion of data science is actually ETL, and each data scientist should own an ETL engineer :-). For some reason data scientists call ETL data wrangling (data wrangling just sounds much cooler, I know). Anyway, data science is actually quite straightforward and simple. The difficult bits are thinking outside the box and coming up with good features, finding the data that underpins them, or coming up with correlated substitute features where data can’t be accessed or is unavailable.
Some examples: realising that the world wide web and its hyperlinks form a graph, and that graph algorithms can be applied to power a search engine, is one. Feeding the translations produced by EU and UN institutions as training data into NLP algorithms to create a working machine translation app is another. Both come from Google.
The other difficult and time consuming work of data scientists is to transform data into features that can be fed to predictive models. This is a two-step process. First, we integrate, cleanse and transform the raw data from the sources. This is classic ETL. If the data has already been integrated in the data warehouse and can be pulled from there, this step is not required. Second, features need to be generated from this high quality data. Typically this step takes a lot less time and can be automated to create variations of features, e.g. my friends at Xpanse Analytics are automating this step. As the largest amount of work in data science is actually ETL, it makes sense to have ETL engineers on your data science team and leave the work of building models to the data scientists. After all, data science is a team sport and requires a lot of different skills and roles.
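To make the second step concrete, here is a minimal sketch of feature generation in SQL, assuming a hypothetical, already cleansed transactions table (customer_id, txn_date, amount); the table and column names are made up for illustration.

-- Sketch only: derive simple per-customer features from a hypothetical,
-- already cleansed transactions table produced by the classic ETL step.
SELECT
    customer_id,
    COUNT(*)      AS txn_count_90d,
    SUM(amount)   AS spend_90d,
    AVG(amount)   AS avg_txn_amount_90d,
    MAX(txn_date) AS last_txn_date
FROM transactions
WHERE txn_date >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY customer_id;

Automating this step largely means generating many variations of queries like this one over different time windows, aggregates and entities.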

ETL for structured data

Based on my own experience, most data engineers that come to the field from a pure engineering background have very limited knowledge of relational databases and despise SQL.
You will hear them moan endlessly about SQL not being a proper language or being limited in the type of data transformations you can apply. While this is true for certain edge cases, e.g. very complex hierarchies and graph-like structures, it does not apply to 99%+ of scenarios. Using SQL, window functions, user defined aggregate functions, and some of the proprietary SQL extensions of individual vendors will get you a very long way (no stored procedures or other procedural logic allowed, please). The problem often is not SQL but the limited skills of the data engineers tasked to use it.
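To illustrate the point, here is a minimal sketch of a very common ETL task, deduplicating a staging table with a window function; the customer_staging table and its columns are hypothetical.

-- Keep only the most recent record per customer from a hypothetical staging table.
SELECT customer_id, email, loaded_at
FROM (
    SELECT
        customer_id,
        email,
        loaded_at,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
    FROM customer_staging
) t
WHERE rn = 1;

The same pattern, a window function plus a plain filter, covers a surprising share of everyday transformation logic.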
SQL is the lingua franca of data. It’s battle tested and has stood the test of time, or, as Nassim Taleb would say, SQL is anti-fragile.
If a book has been in print for forty years, I can expect it to be in print for another forty years. But, and that is the main difference, if it survives another decade, then it will be expected to be in print another fifty years. This, simply, as a rule, tells you why things that have been around for a long time are not “aging” like persons, but “aging” in reverse. Every year that passes without extinction doubles the additional life expectancy. This is an indicator of some robustness. The robustness of an item is proportional to its life! (Nassim Taleb, Anti-fragile)
Using SQL to transform structured data works very well. For unstructured data this is slightly different.

ETL for unstructured data

Most databases now have SQL extensions for JSON. As I have outlined in my post JSON: ETL or NoETL, you are better off converting the hierarchical structure of JSON and XML to a relational schema. Flexter, our ETL tool for JSON and XML, does exactly that: it automates the whole conversion process from these data formats to a relational format.
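As a hedged example of what these SQL extensions look like, the sketch below flattens a JSON document into relational rows using PostgreSQL-style jsonb functions; the orders table and the JSON layout are assumptions made for illustration.

-- PostgreSQL-style sketch: flatten a hypothetical JSON order document
-- into one relational row per order line.
SELECT
    o.order_id,
    o.payload ->> 'customer'   AS customer,
    line ->> 'sku'             AS sku,
    (line ->> 'quantity')::int AS quantity
FROM orders o
CROSS JOIN LATERAL jsonb_array_elements(o.payload -> 'lines') AS line;

Queries like this have to be written and maintained by hand for every document type, which is exactly the conversion work Flexter automates.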
It’s a very different story for truly unstructured data.
While you can store unstructured data (text, audio, video, images) in most databases in some sort of BLOB/CLOB/binary data type, this is not ideal. Similarly, it is not a good idea to perform ETL on unstructured data in a database. However, it is an excellent idea to store the output of unstructured ETL in a relational database and a data warehouse.
The term unstructured suggests that this type of data does not have structure. This is not the case. The structure of this data is just not easy to detect and extract. Hence some people refer to unstructured data as multi-structured data. I don’t think this term makes it any clearer either. Multi-structured? What’s it supposed to mean? Maybe complex structured would be better. Or just stick to unstructured. But that’s just semantics and let’s not get distracted.
So what does ETL for unstructured data look like? There are different types of ETL depending on the type of unstructured data we have at hand. Some of these techniques are applicable across all types, others are specific to a particular type of unstructured data, e.g. transcription is applicable to audio and video. Entity extraction, which allows us to detect people, locations, products etc., can be applied to all types. Creating an inverted index for keyword search only applies to text (including audio and video after transcription to text).
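To give a flavour of the inverted-index technique, here is a naive sketch in PostgreSQL-flavoured SQL; the documents table is hypothetical, and a real implementation would add proper tokenisation, stemming and stop-word handling (or simply use the database’s built-in full-text search).

-- Naive inverted index: one row per distinct term, listing the documents it appears in,
-- built from a hypothetical documents(doc_id, body) table.
SELECT
    lower(term)                AS term,
    array_agg(DISTINCT doc_id) AS doc_ids
FROM documents,
     regexp_split_to_table(body, '\s+') AS term
GROUP BY lower(term);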
As Bill Inmon outlines in his latest book Turning Text into Gold (which I highly recommend), converting text to a structured format can be a complex process. The main difficulty is dealing with ambiguities around the context of the text, which requires us to apply various techniques such as homograph resolution, proximity analysis, negativity inference etc.

The case against data virtualisation in ETL

Data virtualisation is one of those buzzwords. It can work for some edge cases, but by and large it is blown out of proportion by vendors’ marketing departments. It just does not work for ETL (in particular complex ETL). Let me explain. Data virtualisation can be defined as providing a logical view over data stored in disparate storage engines.
Let’s look at an example. We might have multiple storage engines, e.g. S3, an Oracle database, a Redshift database, and a SQL Server database. Successful data virtualisation for this scenario depends on many variables, e.g. the types of queries we run, the quality of the storage plugins (can we push down predicates/filters, joins etc.), the type of storage (columnar or row based), the data volumes involved, the quality of the network and its latency, the distance between the storage engines (are they on the same LAN etc.), the caching features of the virtualisation layer, indexes etc. As you can see, there are many factors to consider. As a result, data virtualisation by its nature can be a complex affair. For simple scenarios it works reasonably well, e.g. joining a lookup table in Oracle to a large event table on S3 should not be a problem. However, once queries get more complex, things get difficult. One such anti-pattern is to join very large tables across storage layers with high network latency and a requirement to retrieve all table columns without filtering.
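To make the contrast concrete, here is a hedged sketch in Presto/Trino-style SQL, where tables in different storage engines are addressed through catalogs; all catalog, schema and table names are assumptions. The first query is the benign case, the second is the kind of cross-storage join that tends to fall over.

-- Benign: a small Oracle lookup table joined to filtered S3 events;
-- the date predicate can be pushed down to the large event table.
SELECT c.country_name, COUNT(*) AS events
FROM s3_catalog.web.events e
JOIN oracle_catalog.ref.country c ON c.country_id = e.country_id
WHERE e.event_date = DATE '2018-10-01'
GROUP BY c.country_name;

-- Anti-pattern: all columns of two very large tables joined across storage
-- engines with no filter, forcing massive data movement over the network.
SELECT e.*, o.*
FROM s3_catalog.web.events e
JOIN sqlserver_catalog.sales.orders o ON o.session_id = e.session_id;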
Typically in ETL we run exactly these types of workloads: we select all columns in complex multi-table joins, with complex inter-row transformations and without filters. It gets more complicated. One component of ETL is data cleansing, which is near impossible to achieve without multiple passes over the data and physically setting it down. One extreme example is entity resolution.
One more argument against data federation/virtualisation is that data ingestion from multiple sources into a single storage layer can be completely automated, or is of very low complexity if done manually.
A further argument is that you can’t keep an audit trail of the modifications to the source data via change data capture.
And one very last point: constantly hammering the source systems will not make you good friends with the DBAs of those systems.

What about Logical ETL?

Logical ETL is an approach used in data warehousing to create business transformations as logical views without physically setting down the data. There are two scenarios where you can use this approach.

  • Create logical data marts on top of a raw/business data vault or, if you have a more traditional approach, an integration/foundation layer in 3NF (see the sketch after this list). This can work quite well and I would have few concerns deploying virtual marts into production. In the foundation layer/business vault our data has already been integrated and cleansed, so we don’t require complex transformations. If we run into performance issues we can convert the virtual mart to a physical one with very little effort.
  • Create logical views on top of a staging area/raw data reservoir/raw data vault. This is similar to ETL data virtualisation, though not quite as bad. Both have in common that we still need to apply more or less complex transformations on the fly. There is one significant difference though: in logical ETL we don’t pull from different storage layers but from a staging area on a single storage technology. This makes it less complex and should reduce latency. And indeed, logical ETL works very well for prototyping. For relatively simple transformation requirements it can also be deployed into production.
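As a minimal sketch of the first scenario, a virtual mart object is nothing more than a view over the already integrated foundation layer/business vault; all object names below are hypothetical, and converting the virtual mart to a physical one simply means materialising the same query as a table.

-- A logical data mart dimension defined as a view over a hypothetical,
-- already integrated and cleansed foundation-layer table.
CREATE VIEW mart.dim_customer AS
SELECT
    customer_key,
    customer_name,
    country,
    customer_since
FROM foundation.customer
WHERE is_current = 'Y';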

With logical ETL you gain agility at the cost of performance and possible redevelopment work in the future.
I think I have made my point. ETL will only increase in relevance over the coming years (data science) and its future looks brighter than ever. NoETL and data virtualisation for ETL are over-hyped and I personally don’t see any mainstream future for these technologies.
If you want to find out more about running an efficient ETL pipeline, I recommend my popular training course Big Data for Data Warehouse Professionals. You will find out how to make your ETL more agile and resilient, how to efficiently build data pipelines, about the future of ETL tools and much more.