You might be surprised to hear this, but Hadoop is a poor choice for a data lake

Uli Bethke

As there are a lot of definitions of what constitutes a data lake, let's first define what we actually mean by it. A data lake is similar to the staging area of a data warehouse, with a couple of core differences. Let's look at the commonalities first. Just like the staging area of a data warehouse, the data lake stores a 1:1 copy of raw data from the source systems. The core differences are the following:

  • The only consumers of the staging area are downstream ETL processes. The lake has multiple other consumers, e.g. sandboxes for self-service analytics and data discovery, external partners and clients, MDM applications, etc.
  • The data lake also stores unstructured data such as images, audio, video, text.
  • The data lake sits on cheap storage that is decoupled from compute.

This latter point brings me to the reason why Hadoop, or more precisely HDFS, is not a good fit for a data lake: it is simply too expensive. Hadoop uses Direct-Attached Storage (DAS), which means that compute and storage are tightly coupled. A system that tightly couples compute and storage is based on the flawed assumption that the relationship between storage needs and compute needs is linear. This is clearly incorrect. Let me give two examples. For data archival purposes I need a lot of storage but very little compute, as I access my data infrequently. In another scenario I might have only a small amount of data but thousands or millions of users who need to access it concurrently; I don't need much storage, but I do need an awful lot of compute.

The advantage of DAS typically is, or rather was, performance: only small amounts of data have to be shuffled across the network (more so for Hadoop than for MPPs such as Teradata). But per Gilder's law, bandwidth grows roughly three times faster than compute power, and what once used to be a bottleneck no longer is. Data co-locality becomes less and less important as data can be shuffled across the network ultra fast.
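To make the two scenarios above concrete, here is a toy cost model. All prices are made up purely for illustration (they are not real vendor pricing): one DAS node bundles 32 TB of storage with 16 cores, while the decoupled option prices storage per TB and compute per core separately. The point is structural, not the exact numbers: with DAS you must buy whole nodes to satisfy whichever of the two needs is larger.

```python
import math

# Hypothetical prices, per year, for illustration only.
NODE_COST = 5000            # one DAS node: 32 TB storage + 16 cores
OBJECT_STORE_PER_TB = 60    # decoupled cold object storage, per TB
COMPUTE_PER_CORE = 200      # decoupled compute, per core

def das_cost(tb_needed, cores_needed):
    """Coupled: buy enough whole nodes to cover the larger requirement."""
    nodes = max(math.ceil(tb_needed / 32), math.ceil(cores_needed / 16))
    return nodes * NODE_COST

def decoupled_cost(tb_needed, cores_needed):
    """Decoupled: pay for storage and compute independently."""
    return tb_needed * OBJECT_STORE_PER_TB + cores_needed * COMPUTE_PER_CORE

# Archival scenario: 1000 TB of cold data, only 16 cores of occasional compute.
print(das_cost(1000, 16))        # 32 nodes needed just for storage -> 160000
print(decoupled_cost(1000, 16))  # 60000 storage + 3200 compute     -> 63200

# Hot-data scenario: 1 TB of data, 2000 cores of concurrent compute.
print(das_cost(1, 2000))         # 125 nodes needed just for cores  -> 625000
print(decoupled_cost(1, 2000))   # 60 storage + 400000 compute      -> 400060
```

In both extremes the coupled system pays for a resource it doesn't use, which is exactly the non-linearity argued above.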

In conclusion, you are better off running your data lake on object storage such as Amazon S3 or Google Cloud Storage. For on-premises requirements, you may want to look at on-premises object storage solutions.
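As a minimal sketch of what landing the 1:1 raw copy on object storage might look like: the function below builds a partitioned object key for a raw zone. The `raw/<source>/<table>/load_date=.../<file>` layout and all names are my own assumptions, a common convention rather than a standard; the actual upload call (shown in a comment) would go through a client such as boto3.

```python
from datetime import date

def raw_zone_key(source_system: str, table: str,
                 load_date: date, filename: str) -> str:
    """Build a partitioned object key for the raw (1:1 copy) zone.

    Layout (a common convention, not a standard):
      raw/<source>/<table>/load_date=YYYY-MM-DD/<filename>
    """
    return (f"raw/{source_system}/{table}/"
            f"load_date={load_date.isoformat()}/{filename}")

key = raw_zone_key("crm", "customers", date(2017, 5, 1), "customers.csv")
print(key)  # raw/crm/customers/load_date=2017-05-01/customers.csv

# The upload itself would then be something like (requires AWS credentials
# and a real bucket, so it is left as a comment here):
#   boto3.client("s3").upload_file("customers.csv", "my-lake-bucket", key)
```

Because every consumer (ETL, sandboxes, external partners) addresses the same keys, the lake stays a single shared copy while compute is attached and detached as needed.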

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years of hands-on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He has also co-founded the Irish Oracle Big Data User Group.