In two recent blog posts I outlined Sonra’s vision for a data lake architecture. In the first one I discussed various flaws in the data lake concept. In the second, I argued that Hadoop, or more precisely HDFS, is a poor choice for a data lake. Let’s briefly recap the discussion.
What is a data lake?
A data lake is similar to the staging area of a data warehouse. Let’s look at the commonalities first. Just like the staging area of a data warehouse, the data lake stores a 1:1 copy of raw data from the source systems. There are, however, some core differences:
- The only consumers of the staging area are downstream ETL processes. The lake has multiple other consumers, e.g. sandboxes for self-service analytics/data discovery, external partners/clients, MDM applications etc.
- The data lake also stores unstructured data such as images, audio, video, text.
- The data lake sits on cheap storage that is decoupled from compute.
This architecture for a data lake is very different from others that tie the data lake to a particular technology. It is also different in the way the data is consumed. In a traditional data lake architecture, compute and storage are tightly coupled and the data lake is split into different logical zones, e.g. raw, trusted etc. All zones are implemented in the same physical location and on the same cluster. This has various disadvantages: the cost of directly attached storage for raw data, the difficulty of governing such a monster, the tying of storage to a particular technology, decreased flexibility, and the difficulty of sizing such a colossal data lake.
Sonra’s data lake vision and checklist
At Sonra our vision is that data preparation and advanced analytics happen in sandboxes that are spun out from the raw data lake and the data warehouse. Each sandbox has a well-defined purpose and use case. The whole process becomes a lot more manageable and flexible. After all, small is beautiful.
These sandboxes can then be productionised, e.g. by federating insights across the sandbox or data warehouse, creating an API on top of a predictive model, or feeding the output (the learnings) into the data warehouse life cycle. If a sandbox project does not deliver any insights, it can easily be terminated.
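To make the "API on top of a predictive model" step concrete, here is a minimal, framework-agnostic sketch in Python. The model, its coefficients, and the feature names (`recency`, `frequency`) are invented for illustration; in practice the model would be trained in the sandbox and the handler would sit behind a proper web framework rather than be called directly.

```python
import json

# Hypothetical coefficients produced by a sandbox experiment (illustrative only).
MODEL = {"intercept": 0.1, "weights": {"recency": -0.4, "frequency": 0.9}}

def score(features):
    """Apply the toy linear model to a dict of feature values."""
    total = MODEL["intercept"]
    for name, weight in MODEL["weights"].items():
        total += weight * features.get(name, 0.0)
    return total

def handle_request(body):
    """JSON-in/JSON-out endpoint handler: parse features, return a score."""
    features = json.loads(body)
    return json.dumps({"score": score(features)})

# Example call, as an HTTP layer would make it with a request body:
response = handle_request('{"recency": 1.0, "frequency": 2.0}')
print(response)
```

The point of keeping the handler framework-agnostic is that the same scoring code can be promoted from the sandbox into whichever serving layer production uses.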
We have put together a list of items that you need to consider when implementing a data lake.
Just leave your e-mail address and we will send you a link for a download of the data lake implementation checklist.