Note: There is an updated version of this post. You can download it for free.
In the first part of this series of blog posts on data warehousing in the era of big data we looked at the limitations of relational databases for data warehousing. In this post I will elaborate on the impact of growing data volumes and the ability of an RDBMS to cope with the data tsunami. Growth in data volumes typically goes hand in hand with an increase in ETL tool and database license costs and the need to purchase additional hardware.
RDBMS Scalability
The last decade has seen exponential growth in data volumes. The datafication of everything can be attributed to two developments: (1) the digitisation of all aspects of our lives (these days it is even possible to read people’s minds by following them on Twitter), and (2) machine- and sensor-generated data emitted by all sorts of devices such as smartphones, smart cars, smart TVs, and smart things in general. Thanks to Moore’s law, advances in distributed computing over the last two decades, and open source software, we are able to process this data efficiently and at an acceptable cost. So what about relational database technology? Does it scale to petabytes of data? It certainly does, as the Teradata implementations at Walmart, Apple, or eBay show. However, the problem is not so much technical scalability as the cost of software licenses and the engineered hardware that needs to be purchased as well. Below is an extract from a Microsoft-commissioned study on the pricing of relational MPP appliances from various vendors (costs per TB of uncompressed data, before discounts):
- EMC: $55,556
- IBM: $49,969
- Microsoft: $23,088
- Oracle: $301,778
- Teradata: $32,000
In contrast, the cost per uncompressed TB on Hadoop lies between $400 and $1,500. We at Sonra have seen compression factors of up to 25x on Hadoop ORC files.
Both Hadoop and massively parallel processing (MPP) databases in principle follow the same approach: a shared-nothing architecture (apart from Oracle Exadata, where only the storage cells are shared nothing). It should not come as a surprise that both systems also scale in a very similar way. The big difference between the two, however, is the price tag. While a data warehouse appliance scales well from a technical point of view, it does not scale so well from a business point of view. Hadoop now makes it affordable for small and medium-sized companies to analyse large volumes of data. Large enterprises with big budgets can also benefit from Hadoop by offloading some of their data warehouse workloads to a Hadoop-based solution. We will cover the opportunities for data warehouse optimization and offload in a separate post. At a high level, the following options are available:
- Cold Data Offload: Store and query infrequently accessed data in Hadoop (a minimal sketch follows this list).
- Staging Offload: Move your data warehouse staging area to Hadoop. Build it out to a data lake at a later stage.
- ETL Offload: Move your most expensive data flows and data transformations to Hadoop. Then over time migrate all of your ETL to Hadoop.
- Data Mart Offload: Store and query machine-generated data, and data from other processes that produce huge volumes, in Hadoop as an adjunct to your core warehouse.
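To give a flavour of what a cold data offload might look like in practice, here is a minimal sketch in PySpark. The paths, table layout, and column names (sales_history, order_date, amount, region) are hypothetical, and PySpark is only one of several engines that could do this job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cold_data_offload").getOrCreate()

# Read a historical extract that was exported from the warehouse.
cold = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/staging/sales_history_extract"))

# Store it as compressed ORC, partitioned by year, so infrequent queries
# only touch the partitions they actually need.
(cold.withColumn("year", F.year("order_date"))
     .write
     .partitionBy("year")
     .format("orc")
     .option("compression", "zlib")
     .save("/data/warehouse_offload/sales_history"))

# The cold data can now be queried in place with a SQL-on-Hadoop engine.
(spark.read.orc("/data/warehouse_offload/sales_history")
      .filter("year = 2012")
      .groupBy("region")
      .agg(F.sum("amount").alias("total"))
      .show())
```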
There are still some advantages of an MPP RDBMS over MPP SQL engines running on Hadoop
An MPP RDBMS typically supports transactions and updates to data. This is not something that is supported on Hadoop, and distributed transactions are generally quite difficult to achieve. However, the next-generation data architecture for data warehouses treats data as immutable (we will cover this in a separate post). With batch-oriented workloads and incremental aggregation algorithms, updating or deleting data is not really a requirement. So what initially seems to be an advantage of the RDBMS actually comes along as extra baggage of the engine that we don’t really need.
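As an illustration of the immutable approach, here is a minimal PySpark sketch with hypothetical paths and column names (/data/raw/events, event_date, product_id). Instead of issuing row-level updates, corrections are appended to the raw data and the affected aggregate is simply recomputed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("immutable_aggregation").getOrCreate()

# New events, including late corrections, are only ever appended, never updated in place.
new_events = spark.read.json("/staging/events_2015_06_01")
new_events.write.mode("append").partitionBy("event_date").parquet("/data/raw/events")

# Recompute the aggregate for the affected day from the immutable raw data and
# overwrite that day's output; no row-level UPDATE or DELETE is required.
daily = (spark.read.parquet("/data/raw/events")
         .filter(F.col("event_date") == "2015-06-01")
         .groupBy("event_date", "product_id")
         .agg(F.count("*").alias("events")))

daily.write.mode("overwrite").parquet("/data/marts/daily_counts/event_date=2015-06-01")
```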
The original implementations of SQL engines on Hadoop unsurprisingly had very poor performance: interactive workloads were shoe-horned into a batch framework. More recent implementations such as Apache Drill follow the same approach that relational databases have followed for decades. Essentially, they stream data through a pipeline of operators, e.g. the results of a multi-table join are immediately emitted for further processing. Shuffling of data is also delivered as a stream. Ilya Katsov has written an excellent article about in-stream big data processing if you want to find out more. Apache Flink takes this to the extreme. It follows the approach outlined by Ilya and uses streams for all workloads: streaming, SQL, micro-batch, and batch. Batch is just treated as a finite set of streamed data. To my mind this makes it the most sophisticated distributed open source processing engine (though not the most mature one).
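To make the idea of pipelined operators concrete, here is a deliberately simplified sketch in plain Python (it uses no Drill or Flink API): each operator is a generator that emits rows as soon as it produces them, and a batch is just a stream that happens to end.

```python
def scan(rows):
    # Source operator: emits one row at a time instead of materialising the input.
    for row in rows:
        yield row

def filter_op(stream, predicate):
    # Rows flow straight through; nothing waits for the whole input to arrive.
    for row in stream:
        if predicate(row):
            yield row

def project(stream, columns):
    for row in stream:
        yield {c: row[c] for c in columns}

batch = [{"id": 1, "amount": 120}, {"id": 2, "amount": 80}, {"id": 3, "amount": 200}]

# A "batch" query is just a finite stream drained through the operator pipeline.
pipeline = project(filter_op(scan(batch), lambda r: r["amount"] > 100), ["id"])
print(list(pipeline))  # [{'id': 1}, {'id': 3}]
```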
Where a relational MPP still has a clear advantage over these engines is in the maturity of its cost-based optimizer. RDBMS vendors have decades of experience in this area. However, the open source community is making a lot of effort to catch up, e.g. with Spark’s Catalyst optimizer or Apache Calcite (which is also an excellent SQL parser in its own right).
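If you want to see what the Catalyst optimizer does with a query, a minimal PySpark sketch looks like this (the Parquet paths and columns are hypothetical): explain(True) prints the parsed, analysed, optimised, and physical plans, where decisions such as predicate pushdown and join strategy are made, much like in an RDBMS cost-based optimizer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst_plan").getOrCreate()

orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")

joined = (orders.join(customers, "customer_id")
                .filter(orders["amount"] > 100)
                .select("customer_id", "amount"))

# Prints the logical and physical plans produced by Catalyst.
joined.explain(True)
```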
In an MPP database you can define custom data placement strategies. This is very useful when you have to join two large tables: by hashing on the join columns you can co-locate buckets of data on the same nodes, which avoids an expensive shuffle phase during the join and unnecessary movement of data over the network. This cannot easily be done on HDFS; Dmitry Tolpeko has a great post about the issues on his blog. Another use case is a multi-tenant environment where you want to place all of a tenant’s data on a particular node to avoid unnecessary shuffling of data. So how can you get around this? You could write your own HDFS data placement policy, but this is not something that can be done easily. If you need a multi-tenant system, I would recommend the MapR Hadoop distribution, as it is the only distribution that allows custom data placement on particular nodes using volumes. One strategy for joining two large tables is to avoid the join altogether by de-normalizing your data. Another option is to use Spark and its concept of RDDs: in a Spark RDD you can co-locate a set of hash-partitioned keys on the same node, as the sketch below shows.
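Here is a minimal sketch of that RDD technique in PySpark, with hypothetical file layouts and column positions: both key-value RDDs are hash partitioned with the same partitioner and the same number of partitions before the join, so records with the same key end up in the same partition and the join does not need a full shuffle.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "colocated_join")

# (customer_id, amount) pairs -- the column positions are assumptions for the sketch.
orders = (sc.textFile("/data/orders.csv")
            .map(lambda line: line.split(","))
            .map(lambda f: (f[0], float(f[2]))))

# (customer_id, region) pairs
customers = (sc.textFile("/data/customers.csv")
               .map(lambda line: line.split(","))
               .map(lambda f: (f[0], f[1])))

# Hash partition both RDDs identically and cache them; matching keys now live
# in the same partition on the same node.
orders_p = orders.partitionBy(64).cache()
customers_p = customers.partitionBy(64).cache()

# Both sides share the same partitioner, so the join avoids a full shuffle.
joined = orders_p.join(customers_p)
```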
I am a CIO/CTO or data warehouse executive. What should I do?
Hadoop-based SQL solutions will not replace data warehouses and appliances overnight. However, the golden age of the data warehouse appliance is over. All of the appliance vendors are hedging their bets, e.g. Teradata recently announced that it will throw its weight behind Presto, the SQL-on-Hadoop engine first developed by Facebook and used by Teradata customers such as Groupon.
What to do then?
- If you have a data warehouse that is bursting at the seams on a general-purpose relational database and you can’t scale up any further, migrate it opportunistically to an enterprise-ready Hadoop distribution such as the one provided by MapR. Use a non-relational MPP engine such as Apache Drill to query the data. Don’t bother looking at a data warehouse appliance.
- If you already have an appliance then gradually implement the data warehouse optimization strategies outlined above. I will cover these in more detail in one of my next posts.
- If you have a critical use case that requires the highest SLAs, the very best performance, and very low latency, then a relational MPP might be the right choice for you. I would recommend benchmarking your particular use case. In any case, I would not recommend going with a data warehouse appliance: Vertica, a columnar relational MPP engine developed by Michael Stonebraker, runs directly on top of the MapR filesystem.
- If you are only starting out in data warehousing, don’t bother implementing it on a relational MPP or data warehouse appliance. SQL MPP engines on Hadoop such as Apache Drill are cheaper alternatives that offer similar scalability and concurrency.