You might be surprised to hear this, but Hadoop is a poor choice for a data lake

May 3, 2018

As there are a lot of definitions of what constitutes a data lake, let's first define what we actually mean by it. A data lake is similar to the staging area of a data warehouse, with a couple of core differences. Let's look at the commonality first: just like the staging area of a data warehouse, the data lake stores a 1:1 copy of the raw data from the source systems. The core differences are the following.

  • The only consumers of the staging area are downstream ETL processes. The lake has multiple other consumers, e.g. sandboxes for self-service analytics/data discovery, external partners/clients, MDM applications, etc.
  • The data lake also stores unstructured data such as images, audio, video, and text.
  • The data lake sits on cheap storage that is decoupled from compute (see the sketch after this list).
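
As a minimal sketch of what decoupled storage means in practice: landing raw data in an object-store-backed lake needs no running cluster at all. The bucket name and file names below are made up for illustration.

```python
import boto3

# Hypothetical bucket and keys; the point is that landing a 1:1 copy of
# raw source data requires no compute cluster, only an object store.
s3 = boto3.client("s3")

# Structured and unstructured extracts land side by side in the raw zone.
s3.upload_file("crm_export_2018-05-03.csv", "raw-zone", "crm/2018-05-03/export.csv")
s3.upload_file("call_recording_0042.wav", "raw-zone", "audio/0042.wav")

# Any consumer (downstream ETL, a sandbox, an external partner) reads it
# back independently, again without a cluster in between.
obj = s3.get_object(Bucket="raw-zone", Key="crm/2018-05-03/export.csv")
first_bytes = obj["Body"].read(1024)
```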

This latter point brings me to the reason why Hadoop, or more precisely HDFS, is not a good fit for a data lake: it is simply too expensive. Hadoop uses Direct-Attached Storage (DAS), which means that compute and storage are tightly coupled. A system that tightly couples compute and storage is built on the flawed assumption that storage and compute needs grow in lockstep. This is clearly incorrect. Let me give two examples. For data archival purposes I need a lot of storage but very little compute, since I access the data only infrequently. In another scenario I might have only a small amount of data but thousands or even millions of users who need to access it concurrently; I don't need much storage but an awful lot of compute.

The typical advantage of DAS is, or rather was, performance, as only small amounts of data have to be shuffled across the network (this matters more for Hadoop than for MPPs such as Teradata). But because of Gilder's law, bandwidth grows at least three times faster than compute power, and what once was a bottleneck no longer is one. Data co-locality matters less and less as data can be shuffled across the network ever faster.
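
To make the cost argument concrete, here is a back-of-the-envelope sketch of the archival scenario. All prices and node sizes are illustrative placeholders I made up for this example, not vendor quotes.

```python
# Coupled vs. decoupled scaling for the archival scenario:
# lots of storage, almost no compute. All numbers are illustrative only.

NODE_STORAGE_TB = 10          # usable storage per Hadoop worker node
NODE_COST_PER_MONTH = 500.0   # a node bundles storage AND compute
OBJECT_STORE_PER_TB = 25.0    # object storage, per TB per month

archive_tb = 500              # cold data that is rarely accessed
hdfs_replication = 3          # HDFS default replication factor

# Coupled (DAS): 500 TB at 3x replication forces us to buy 150 nodes'
# worth of capacity, including CPUs that will mostly sit idle.
raw_tb_needed = archive_tb * hdfs_replication
nodes_needed = -(-raw_tb_needed // NODE_STORAGE_TB)  # ceiling division
das_cost = nodes_needed * NODE_COST_PER_MONTH

# Decoupled: pay for storage alone; spin up compute only when needed.
object_cost = archive_tb * OBJECT_STORE_PER_TB

print(f"DAS:          {nodes_needed} nodes, ${das_cost:,.0f}/month")
print(f"Object store: ${object_cost:,.0f}/month plus compute on demand")
```

Whatever the actual prices, the structural point stands: HDFS's default 3x replication multiplies the raw capacity you have to buy, and every extra terabyte of DAS drags unused CPUs along with it.
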
In conclusion, you are better off running your data lake on object storage such as Amazon S3 or Google Cloud Storage. For on-premises requirements, you may want to look at an on-premises object storage solution.
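
As a closing sketch, assuming a hypothetical bucket named my-data-lake, this is roughly what consuming such a lake looks like from Spark: compute is provisioned only while the job runs, and the data never has to move.

```python
from pyspark.sql import SparkSession

# Compute lives and dies with the job; the lake itself is untouched.
spark = SparkSession.builder.appName("lake-reader").getOrCreate()

# Hypothetical path; reading straight from object storage via the s3a
# connector (the hadoop-aws package must be on the classpath).
raw_events = spark.read.json("s3a://my-data-lake/raw/events/2018/05/")
raw_events.printSchema()

spark.stop()
```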