In this series of blog posts we will explore how the age of Big Data is changing the data warehouse landscape for good. There are a multitude of issues that today’s data warehouse implementations face. What are the steps that you as a CIO or Data Warehouse executive can take to implement the next generation of data warehouses.
Note: There is an updated version of this post. You can download it for free.
Apples and Oranges
First things first though. I have recently come across a lot of articles that claim that Hadoop will replace the Data Warehouse. This is like comparing apples to oranges. The data warehouse is a concept. Hadoop is a collection of technologies and just one of the distributed compute frameworks (Flink or Spark just to name a few others). Recently the debate around Hadoop vs the Data Warehouse culminated in a little skirmish between Cloudera and Bill Inmon, the father of data warehousing. The discussion between the two camps was full of misunderstandings and there was a lot of confusion around the definition of core concepts.
[big_data_promotion] What is a data warehouse?
Data warehouses are central repositories of integrated and cleansed data from one or more disparate sources. They store both current and historical data. The data warehouse was built to ingest structured data from transactional systems. These systems are used in the day to day running of the business and include HR, ERP, Sales & Marketing etc. With the evolution of SaaS applications and Services Oriented Architecture (SOA) over recent years, semi-structured data in JSON or XML has also entered the EDW.
The ingestion of data happens in a nightly batch window or in more frequent micro batches. Data from the sources is landed in the staging area. The staging area is a one to one copy of source data. It is part of the data warehouse infrastructure.
From the staging area the data is loaded into an ETL platform that performs the required transformations, data integration, and cleansing tasks. Sometimes dedicated tools for data quality, master data management, and predictive analytics are used as well. The transformed data is then loaded back to the data warehouse into a physical representation of the Enterprise Data Model (EDM). The EDM in essence is a representation of all of the business processes that make up an enterprise. From the data warehouse, data is loaded into data marts or OLAP cubes. These are data models that are optimised for analytics and reporting. Mobile and desktop BI applications connect into these so called dimensional models and expose the data via dashboards, reports, and ad hoc query tools.
Data Warehousing on Relational Databases. The end of an era?
The data warehouse itself typically sits on a relational database (RDBMS). For lower volumes of data a standard RDBMS such as PostgreSQL, MS SQL Server, or Oracle is used. For larger volumes of data a specifically engineered system consisting of optimised hardware and software bundled together in an appliance such as Teradata or Exadata forms the foundation of the data warehouse. More recently massively parallel databases (MPP) such as Vertica or Greenplum have appeared that run on commodity hardware. Because of this close historic link between the data warehouse and the RDBMS, the two are used interchangeably and this is the root cause for the many misunderstandings that we witness in the Hadoop vs the Data Warehouse debate . Over the last couple of years more and more cracks have appeared in this symbiotic relationship. There are now clear signs that a general purpose relational database is not necessarily the best or the only technology that can host a data warehouse. In the next few posts we will cover the decline of the general purpose relational database as the only data warehouse technology in more detail:
- Datafication of everything and the exponential growth of data volumes challenge the cost effectiveness and scalability of the general purpose relational database for data warehousing. Think of license cost models that are based on number of CPUs and hardware costs for data warehouse appliances.
- The general purpose relational database is not fit for purpose to ingest certain data types. Think of unstructured data and complex hierarchical data in graphs or polytrees. It also struggles with complex many to many relationships.
- Dedicated data warehouse appliances perpetuate the legacy enterprise architecture of isolated workloads, which wastes compute capacity and violates the principle of data locality (bring processing to data rather than the other way round). Using data centre operating systems such as Mesosphere allow you to run your data warehouse like Google.
- The data warehouse was meant to be the central data repository for the business. Data volume growth and new data types put this claim into question. The problem that the data warehouse was meant to fix is perpetuated on a general purpose relational database.
- General purpose relational databases require a predefined schema aka schema on write. This makes it rather inflexible to quickly ingest data that has no pre-defined schema. As a result a lot of effort needs to be put into upfront modelling of data. It can take weeks or months to turn around a question by the business if the data has not yet been loaded into the data warehouse.
We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real time analytics, and data warehousing on Hadoop. We can help with your Big Data implementation. Get in touch.