The Reference Big Data Warehouse Architecture
Below is a representation of the big data warehouse architecture. I won't go into the details of the features and components here. If you want to find out more about the gory details, I recommend my training course Big Data for Data Warehouse and BI Professionals.
Comparing Big Data Warehouse Services on Azure, Google Cloud, and Amazon AWS
So how do the components of the data warehouse map to the various services and products offered by the three most popular cloud platforms: Microsoft Azure, Google Cloud Platform, and Amazon AWS? A new product or service is launched almost every week, and it can get quite daunting to keep track of what is going on. The table below maps the various cloud services against the big data warehouse architecture. I have also included a column that lists some open source components to make it easier to compare. Please note that this list is by no means exhaustive, as there are literally hundreds of open source tools that do similar things. I have just listed those that I have had some exposure to.
Update: By popular request I have added Oracle's cloud products to the mix. As a reference point, and also by popular demand, I have added Hortonworks and MapR to the matrix. For Oracle I am only covering what is on offer in the Oracle cloud. Oracle has various on-premises solutions, such as Oracle Stream Analytics, Oracle Enterprise Metadata Management (Catalog and Lineage), and Oracle EDQ for data quality, that are not (yet?) offered in the cloud. Oracle also has some unique products that none of the other vendors can offer: Oracle GoldenGate and, to some extent, Oracle Data Integrator. There is also Big Data Discovery.
Component | Open Source | Amazon AWS | Microsoft Azure | Google Cloud |
---|---|---|---|---|
Batch Ingest | Sqoop, File Transfer, Flume, StreamSets | AWS Data Transfer Services (various options) | Import/Export Service, Data Factory | Cloud Dataflow |
Streaming Ingest | Flume, StreamSets | Amazon Kinesis Firehose | Event Hubs, IoT Hub | Cloud Dataflow |
Persistent Storage | HDFS, RDBMS | S3, Glacier, RDS | Storage Blob, HDFS, SQL Database | Persistent Disk, Google Cloud Storage, Cloud SQL |
Transient Storage | Kafka | Kinesis | Event Hubs, IoT Hub, HDInsight (Kafka) | Cloud Pub/Sub, Cloud IoT Core |
Batch Processing | Hive, Flink, Spark, MapReduce, PostgreSQL | EMR Spark, EMR Hadoop, EMR Presto, AWS Batch, Redshift | Azure Batch, HDInsight (Spark/MapReduce), SQL Data Warehouse, Data Lake Analytics, Azure Functions | Cloud Dataflow (open source Apache Beam), Cloud Dataproc (Spark, Hadoop) |
Stream Processing | Flink, Spark, Beam | Amazon Kinesis Streams, Amazon Kinesis Analytics, EMR Spark | Stream Analytics, HDInsight (Storm, Spark) | Cloud Dataflow (open source Apache Beam), Cloud Dataproc (Spark, Hadoop) |
Machine Learning | scikit-learn, TensorFlow, Spark MLlib, etc. Huge number of libraries | Lex, Polly, Rekognition, Amazon Machine Learning | Azure ML, Cognitive Services | Natural Language, Speech, Translation, Vision, Video, ML Engine |
Serving Storage: Graph | JanusGraph | Neptune | Cosmos DB | N/A |
Serving Storage: BI/EDW | Impala + Kudu | Redshift, Athena | SQL Data Warehouse, Analysis Services (OLAP Cubes) | BigQuery |
Serving Storage: Search (keywords + facets) | Solr | Amazon CloudSearch, Amazon Elasticsearch | Azure Search | N/A (Marketplace, e.g. Solr) |
Serving Storage: RDBMS | PostgreSQL | RDS | SQL DB | Cloud SQL |
Serving Storage: NoSQL | HBase | DynamoDB | HDInsight (HBase), Cosmos DB | Bigtable, Spanner, Datastore |
Sandboxes: Notebook | Zeppelin | EMR Zeppelin | Azure Notebooks | Cloud Datalab |
Sandboxes: Data Science or Preparation Platform | Dataiku DSS Community Edition (not open source) | N/A (Marketplace only, e.g. Dataiku DSS) | N/A (Marketplace only, e.g. Dataiku DSS) | Cloud Dataprep (beta). Under the hood this is Trifacta. |
Clients/Data Apps | Superset (BI) | QuickSight | Power BI | Google Data Studio |
Orchestration | Airflow | AWS Data Pipeline | Data Factory | N/A (Marketplace) |
ETL Tool | N/A | AWS Glue | Data Factory | N/A (Marketplace) |
MDM Hub | N/A | N/A (Marketplace) | N/A (Marketplace) | N/A (Marketplace) |
Lineage | N/A | AWS Glue | N/A | N/A |
Catalog | N/A | AWS Glue | Data Catalog | N/A (Marketplace) |
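
If you want to query the matrix programmatically, for example to check which platforms lack a native service for a given component, a nested dictionary is probably enough. The following is only a minimal sketch: the names `SERVICE_MATRIX`, `services_for`, and `platforms_without_native` are hypothetical, and only three rows of the table above are copied in. An empty list stands for an "N/A" entry, including the marketplace-only cases.

```python
# Hypothetical lookup structure; the entries are copied from three rows of the table.
# An empty list represents "N/A" (including "Marketplace only") in the table.
SERVICE_MATRIX = {
    "Streaming Ingest": {
        "Open Source": ["Flume", "StreamSets"],
        "Amazon AWS": ["Amazon Kinesis Firehose"],
        "Microsoft Azure": ["Event Hubs", "IoT Hub"],
        "Google Cloud": ["Cloud Dataflow"],
    },
    "Orchestration": {
        "Open Source": ["Airflow"],
        "Amazon AWS": ["AWS Data Pipeline"],
        "Microsoft Azure": ["Data Factory"],
        "Google Cloud": [],
    },
    "Catalog": {
        "Open Source": [],
        "Amazon AWS": ["AWS Glue"],
        "Microsoft Azure": ["Data Catalog"],
        "Google Cloud": [],
    },
}


def services_for(component, platform):
    """Return the services listed for a component on a platform (empty if none)."""
    return SERVICE_MATRIX.get(component, {}).get(platform, [])


def platforms_without_native(component):
    """Return the platforms whose table cell for this component is N/A."""
    row = SERVICE_MATRIX.get(component, {})
    return [platform for platform, services in row.items() if not services]


if __name__ == "__main__":
    print(services_for("Streaming Ingest", "Microsoft Azure"))
    print(platforms_without_native("Catalog"))
```

Keeping the component as the outer key mirrors the row-by-row layout of the table, so adding the remaining rows is a matter of copying cells across.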