You might be surprised to hear this, but Hadoop is a poor choice for a data lake

May 3, 2018

As there are a lot of definitions of what constitutes a data lake, let's first define what we actually mean by it. A data lake is similar to the staging area of a data warehouse, with a couple of core differences. Let's look at the commonality first: just like the staging area of a data warehouse, the data lake stores a 1:1 copy of the raw data from the source systems. The core differences are the following.

  • The only consumers of the staging area are downstream ETL processes. The lake has multiple other consumers, e.g. sandboxes for self-service analytics/data discovery, external partners/clients, MDM applications, etc.
  • The data lake also stores unstructured data such as images, audio, video, and text.
  • The data lake sits on cheap storage that is decoupled from compute (see the sketch after this list).
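
As a minimal sketch of what decoupled storage means in practice: landing raw data in an object-store-backed lake needs no running cluster at all. The bucket name and file names below are made up for illustration.

```python
import boto3

# Hypothetical bucket and keys; the point is that landing a 1:1 copy of
# raw source data requires no compute cluster, only an object store.
s3 = boto3.client("s3")

# Structured and unstructured extracts land side by side in the raw zone.
s3.upload_file("crm_export_2018-05-03.csv", "raw-zone", "crm/2018-05-03/export.csv")
s3.upload_file("call_recording_0042.wav", "raw-zone", "audio/0042.wav")

# Any consumer (downstream ETL, a sandbox, an external partner) reads it
# back independently, again without a cluster in between.
obj = s3.get_object(Bucket="raw-zone", Key="crm/2018-05-03/export.csv")
first_bytes = obj["Body"].read(1024)
```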

This latter point brings me to the reason why Hadoop, or more precisely HDFS, is not a good fit for a data lake: it is simply too expensive. Hadoop uses Direct-Attached Storage (DAS), which means that compute and storage are tightly coupled. A system that tightly couples compute and storage is built on the flawed assumption that storage and compute needs grow in lockstep. This is clearly incorrect. Let me give two examples. For data archival purposes I need a lot of storage but very little compute, since I access the data only infrequently. In another scenario I might have only a small amount of data but thousands or even millions of users who need to access it concurrently; I don't need much storage but an awful lot of compute.

The typical advantage of DAS is, or rather was, performance, as only small amounts of data have to be shuffled across the network (this matters more for Hadoop than for MPPs such as Teradata). But because of Gilder's law, bandwidth grows at least three times faster than compute power, and what once was a bottleneck no longer is one. Data co-locality matters less and less as data can be shuffled across the network ever faster.
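
To make the cost argument concrete, here is a back-of-the-envelope sketch of the archival scenario. All prices and node sizes are illustrative placeholders I made up for this example, not vendor quotes.

```python
# Coupled vs. decoupled scaling for the archival scenario:
# lots of storage, almost no compute. All numbers are illustrative only.

NODE_STORAGE_TB = 10          # usable storage per Hadoop worker node
NODE_COST_PER_MONTH = 500.0   # a node bundles storage AND compute
OBJECT_STORE_PER_TB = 25.0    # object storage, per TB per month

archive_tb = 500              # cold data that is rarely accessed
hdfs_replication = 3          # HDFS default replication factor

# Coupled (DAS): 500 TB at 3x replication forces us to buy 150 nodes'
# worth of capacity, including CPUs that will mostly sit idle.
raw_tb_needed = archive_tb * hdfs_replication
nodes_needed = -(-raw_tb_needed // NODE_STORAGE_TB)  # ceiling division
das_cost = nodes_needed * NODE_COST_PER_MONTH

# Decoupled: pay for storage alone; spin up compute only when needed.
object_cost = archive_tb * OBJECT_STORE_PER_TB

print(f"DAS:          {nodes_needed} nodes, ${das_cost:,.0f}/month")
print(f"Object store: ${object_cost:,.0f}/month plus compute on demand")
```

Whatever the actual prices, the structural point stands: HDFS's default 3x replication multiplies the raw capacity you have to buy, and every extra terabyte of DAS drags unused CPUs along with it.
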
In conclusion, you are better off running your data lake on object storage such as Amazon S3 or Google Cloud Storage. For on-premises requirements, you may want to look at an on-premises object storage solution.
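
As a closing sketch, assuming a hypothetical bucket named my-data-lake, this is roughly what consuming such a lake looks like from Spark: compute is provisioned only while the job runs, and the data never has to move.

```python
from pyspark.sql import SparkSession

# Compute lives and dies with the job; the lake itself is untouched.
spark = SparkSession.builder.appName("lake-reader").getOrCreate()

# Hypothetical path; reading straight from object storage via the s3a
# connector (the hadoop-aws package must be on the classpath).
raw_events = spark.read.json("s3a://my-data-lake/raw/events/2018/05/")
raw_events.printSchema()

spark.stop()
```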