Let’s have a look at the evolution of cloud data warehousing. It’s hard to believe, but just a few years ago all data warehouses were run on-premise. Some of them on expensive and highly engineered MPP appliances. Very fast. Not cheap. Smaller data warehouses would run on relational databases such as Oracle or SQL Server, which had been originally engineered for OLTP workloads. A workaround that can work reasonably well.
First of all, a big upfront cost for both hardware and software. This negatively affects the cash flow of a company. As we are dealing with large amounts of $$$, we typically need to go through a lengthy approval process.
On-premise data warehouses need to be sized for peak usage. In other words a worst case scenario. A lot of planning and needs to go into this process and our assumptions may prove to be completely wrong. If we are an e-commerce company we need to size our hardware for peak demand over the Christmas period. For the rest of the time the hardware is underutilized. It is like only having a single light switch in your house. Either lights on for all rooms or off. As a rule of thumb, on-premise data warehouses are over-provisioned by a factor of 5. It doesn’t help either in this context that hardware depreciates quicker than a car.
It is a very hard task to predict and estimate future capacity needs. Besides our assumptions may prove to be completely wrong. Paradoxically, even though we have sized for the worst case scenario we have only sized for the known unknowns, as we can’t plan for unknown unknowns.
This is the fourth major issue. We can’t react quickly to unexpected changes in the business, e.g. another company is acquired and the the user base for our data warehouse doubles.
The fifth problem is in relation to clashes of different types of workloads. Traditionally, we have run both BI and ETL workloads on our data warehouse servers. More recently data exploration and data science workloads have been added to the mix. These workloads are vastly different from each other. BI queries aggregate data and only select a very small subset of columns from a table or dimensional model. They typically run in seconds or sub-seconds. ETL queries tend to be complex and scan and write over large volumes of data. They tend to select all columns in a table. Data exploration workloads are typically quite complex and are unpredictable in nature. These workloads will clash with each other when run on the same data warehouse platform. Data exploration use cases may run riot and tend to choke BI queries. You are also running the danger of rogue queries bringing down the cluster.
The last problem is the labour intensive nature of running a data warehouse: you need ETL and data warehouse engineers and architects. These are the core resources that provide obvious value. However, you also need DBAs, network, and infrastructure resources. These are a necessary evil but don’t provide direct value. They add a significant overhead to your project.
Let’s have a look at how cloud data warehousing is solving these issues.
Amazon Redshift was the first popular cloud data warehouse. Redshift is based on technology from ParAccel, which itself is a fork of the popular Postgres open source database. The first full release of Redshift was made in 2013. There are an estimated 6,500 Redshift customers. We have implemented various projects on Redshift. It works well and we provide Redshift expert services.
However, what Redshift has in common with most other cloud data warehouse platforms is that they were not built with the cloud in mind. The MPP technology was just lifted and shifted from an on-premise implementation into the cloud.
These first generation cloud data warehouse solve some of the problems of traditional on-premise data warehousing but leave others unresolved. Hence I call them first generation cloud data warehouses.
Let’s first have a look at what problems they solve. The elastic nature of the cloud allows us to rent or lease a data warehouse cluster with our credit card. In this sense it is a bit like renting a car for the summer holidays. We don’t need a huge up-front investment. We can use OPEX rather than CAPEX and don’t have to worry too much about our cash flow. We don’t have to go through a lengthy purchasing process with budgets and sign-offs either. As a result, cloud data warehousing has made data analytics feasible for medium sized enterprises
First generation cloud data warehouses also add a certain level of agility to the development process. It now only takes a few hours to spin up and load a cluster with data.
The management burden and administration overhead is also reduced to some degree, e.g. patches are rolled out automatically. The same applies for backups.
However, let’s have a look at the fundamental flaw of the lift and shift approach of these first generation MPP systems.
The fundamental flaw of first generation MPP cloud data warehouses is that data and compute, disks and CPU are co-located (the same applies to Hadoop BTW). They are tightly coupled on the same nodes in the cluster. If I power down compute, the nodes in the cluster will be allocated to another costumer. If I power up the cluster I need to reload the data from object storage, e.g. S3 in the case of Redshift. This can take hours for a medium sized cluster. You are back to square one and need to be up and running 24*7. Your hardware will still be underutilised.
By co-locating data and compute means you force the two into a linear relationship. When we need to add more disks and data we are also forced to add more CPUs and vice versa. This is inefficient for scenarios where we need a lot of storage but very little compute. Just think of data archival. We need lots of storage but very little compute as we only run queries infrequently. We are overprovisioning CPU, under utilize our cluster and overpay. MPP databases have tried to address this issue by using data federation, e.g. with Redshift you can now use Redshift Spectrum to query data directly on S3. This approach can work quite well when used correctly. However, has various limitations, which we have outlined in this blog post. The biggest issue with the Spectrum architecture is that you will still need to have your Redshift cluster up and running 24*7. In fairness though you address the data archival scenario outlined above to a large degree. In general, data federation and virtualisation has its own limitations and is only useful for a narrow set of use cases.
Let’s have a closer look at the design of an MPP database. On an MPP database the data is evenly distributed across the nodes of a cluster. Whenever you are adding a new node you need downtime for data redistribution to complete. So while you can react more quickly to unforeseen business events you still have some limitations and require a significant amount of manual intervention. The other of even data redistribution is that you can’t scale the cluster for concurrency. All nodes take part in the processing of the cluster. Each task in a query typically takes up one thread = one virtual core. Counter-intuitively, the number of vcores on one node in the cluster limits the number of queries you can run concurrently.
While we gain some benefits in reduced management costs (automatic patches and backups) we can’t eliminate them completely with first generation cloud data warehouses.
To get good performance out of your platform you need to have a good understanding of the profile of the data and the likely usage pattern. This is needed to make good decisions on how the data is distributed across the nodes in a cluster, e.g. if you want to join two very large tables it makes sense to co-locate their keys on the same nodes in the cluster.
Workload management headaches are also carried over from the on-premise model. Different types of workloads are still competing for resources with each other. Workloads are not physically separated from each other. To some extent you can address this issue with a logical workload manager that prioritises workloads. However, you need to allocate resources statically upfront rather than dynamically at runtime. This adds to the problem of underutilisation.
Other DBA tasks such as defining partitions, compression, encoding, sort keys, stats gathering remain. In particular the so called vacuum operation on Redshift is dreaded by data warehouse teams as it is requires extensive reordering of data which is extremely resource intensive and ideally requires downtime.
In summary a significant amount of planning and manual management still needs to go into running a first generation cloud data warehouse platform. This can add a substantial overhead and should be included in your TCO calculations.
Second generation cloud data warehouse platforms such as Snowflake and BigQuery address the shortcomings of first generation cloud data warehouses. They fully deliver on the elasticity of the cloud.
These next generation platforms separate compute and storage. Disks and CPU are not co-located on the nodes of the cluster. They are separated out into their own layers. One for compute and caching. The other for data storage.
As a result you can scale storage independently from compute and vice versa. You are not forced to put the two into the strait jacket of a linear relationship If you have a lot of data you are not forced to buy expensive compute.
Let’s have a look at some of the advantages of this architecture using Snowflake as a reference.
Restarting your cluster is instantaneous. You don’t need to spend hours reloading and redistributing the data.
As a result, there is no need for 24*7 uptime. You are billed by the second. You only pay for what you use up. Just like with your electricity bill. You don’t overpay.
Scaling the cluster up and down is instantaneous. It does not require downtime. Data does not need to be redistributed across the cluster.
Scaling can happen automatically. Snowflake can scale the cluster based on workload. Whenever concurrency increases on the cluster, more compute is added. When users drop off, the cluster can be scaled down again.
Data is separate from compute and not evenly distributed across the nodes in the cluster. As a result there is no maximum limit on concurrency. You can increase parallelism by simply scaling the cluster.
Rogue queries don’t choke the cluster anymore. Different workloads can be physically separated into so called virtual warehouses, e.g. one for ETL, one for BI, one for the marketing department as they are known to run funny queries. This way separate virtual data warehouses can point to the same data but are physically separated from each other. Each virtual warehouse can be sized separately and adapt to the characteristics of the specific workload, e.g. the marketing department are known for running resource intensive queries. You may want to give them a bigger virtual warehouse.
Another interesting feature of this architecture is what Snowflake have branded as the data sharehouse (some similar stuff is available on BigQuery). You can apply row level security to your data and expose a separate virtual warehouse to a third party. The third party get their own URL and account to log in and can query away on the data they are allowed to look at. In the past you had to make XML extracts or expose complex APIs to exchange data. With this feature, data is just one query away at a time. At this stage I should also mention Flexter, which is our own tool for liberating data from data locked away in complex APIs or industry data standards based on XML.
With second generation cloud data warehouses, you pretty much eliminate the management overhead of running a data warehouse.
There are no more knobs to tune or any parameters to set. No distribution keys, indexes, manual partitions, vacuum operations, stats collection etc. On zero and to a lesser degree first generation platforms, you can literally set hundreds of parameters to tune performance.
There is no need for this on second generation cloud data warehouse platforms. Load your data and query away without any major capacity planning steps or worrying about unknown unknowns.