Mapping AWS, Google Cloud, Azure Services to Big Data Warehouse Architecture
The Reference Big Data Warehouse Architecture
Below is a representation of the big data warehouse architecture. I won’t go into the details of the features and components here. If you want to find out more about the gory details, I recommend my excellent training course Big Data for Data Warehouse and BI Professionals.

Comparing Big Data Warehouse Services on Azure, Google Cloud, and Amazon AWS
So how do the components of the data warehouse map to the services and products offered by the three most popular cloud platforms: Microsoft Azure, Google Cloud Platform, and Amazon AWS? A new product or service is launched almost every week, and it can get quite daunting to keep track of what is going on. The table below maps the various cloud services against the big data warehouse architecture. I have also included a column that lists some open source components to make comparison easier. Please note that this list is by no means exhaustive, as there are hundreds of open source tools that do similar things. I have only listed those that I have had some exposure to.
Download the full matrix that maps Oracle, Hortonworks, MapR, AWS, Azure, Google Cloud, Open Source to the Big Data Architecture (e-mail required).
Component | Open Source | Amazon AWS | Microsoft Azure | Google Cloud |
---|---|---|---|---|
Batch Ingest | Sqoop, File Transfer, Flume, StreamSets | AWS Data Transfer Services (various options) | Import/Export Service, Data Factory | Cloud Dataflow |
Streaming Ingest | Flume, StreamSets | Amazon Kinesis Firehose | Event Hubs, IoT Hub | Cloud Dataflow |
Persistent Storage | HDFS, RDBMS | S3, Glacier, RDS | Storage Blob, HDFS, SQL Database | Persistent Disk, Google Cloud Storage, Cloud SQL |
Transient Storage | Kafka | Kinesis | Event Hubs, IoT Hub, HDInsight (Kafka) | Cloud Pub/Sub, Cloud IoT Core |
Batch Processing | Hive, Flink, Spark, MapReduce, PostgreSQL | EMR Spark, EMR Hadoop, EMR Presto, AWS Batch, Redshift | Azure Batch, HDInsight (Spark/MapReduce), SQL Data Warehouse, Data Lake Analytics, Azure Functions | Cloud Dataflow (open source Apache Beam), Cloud Dataproc (Spark, Hadoop) |
Stream Processing | Flink, Spark, Beam | Amazon Kinesis Streams, Amazon Kinesis Analytics, EMR Spark | Stream Analytics, HDInsight (Storm, Spark) | Cloud Dataflow (open source Apache Beam), Cloud Dataproc (Spark, Hadoop) |
Machine Learning | scikit-learn, TensorFlow, Spark MLlib, etc. (a huge number of libraries) | Lex, Polly, Rekognition, Amazon Machine Learning | Azure ML, Cognitive Services | Natural Language, Speech, Translation, Vision, Video, ML Engine |
Serving Storage: Graph | JanusGraph | Neptune | Cosmos DB | N/A |
Serving Storage: BI/EDW | Impala + Kudu | Redshift, Athena | SQL Data Warehouse, Analysis Services (OLAP Cubes) | BigQuery |
Serving Storage: Search (keywords + facets) | Solr | Amazon CloudSearch, Amazon Elasticsearch | Azure Search | N/A (Marketplace only, e.g. Solr) |
Serving Storage: RDBMS | PostgreSQL | RDS | SQL DB | Cloud SQL |
Serving Storage: NoSQL | HBase | DynamoDB | HDInsight (HBase), Cosmos DB | Bigtable, Spanner, Datastore |
Sandboxes: Notebook | Zeppelin | EMR Zeppelin | Azure Notebooks | Cloud Datalab |
Sandboxes: Data Science or Preparation Platform | Dataiku DSS Community Edition (not open source) | N/A (Marketplace only, e.g. Dataiku DSS) | N/A (Marketplace only, e.g. Dataiku DSS) | Cloud Dataprep (beta); under the hood this is Trifacta |
Clients/Data Apps | Superset (BI) | QuickSight | Power BI | Google Data Studio |
Orchestration | Airflow | AWS Data Pipeline | Data Factory | N/A (Marketplace only) |
ETL Tool | N/A | AWS Glue | Data Factory | N/A (Marketplace only) |
MDM Hub | N/A | N/A (Marketplace only) | N/A (Marketplace only) | N/A (Marketplace only) |
Lineage | N/A | AWS Glue | N/A | N/A |
Catalog | N/A | AWS Glue | Data Catalog | N/A (Marketplace only) |
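
If you would rather query the matrix than scan it by eye, a few rows can be captured in a small lookup structure. The snippet below is only a minimal sketch: it covers the last four rows of the table, treats every N/A or Marketplace-only cell as an empty list, and the names `SERVICE_MATRIX`, `services_for`, and `components_covered_by` are my own illustrative choices, not part of any cloud SDK.

```python
# Illustrative subset of the matrix above (four rows only).
# All names here are hypothetical helpers, not an official API.
# N/A and Marketplace-only cells are represented as empty lists.
SERVICE_MATRIX = {
    "Orchestration": {
        "Open Source": ["Airflow"],
        "Amazon AWS": ["AWS Data Pipeline"],
        "Microsoft Azure": ["Data Factory"],
        "Google Cloud": [],
    },
    "ETL Tool": {
        "Open Source": [],
        "Amazon AWS": ["AWS Glue"],
        "Microsoft Azure": ["Data Factory"],
        "Google Cloud": [],
    },
    "Lineage": {
        "Open Source": [],
        "Amazon AWS": ["AWS Glue"],
        "Microsoft Azure": [],
        "Google Cloud": [],
    },
    "Catalog": {
        "Open Source": [],
        "Amazon AWS": ["AWS Glue"],
        "Microsoft Azure": ["Data Catalog"],
        "Google Cloud": [],
    },
}


def services_for(component, platform):
    """Return the services listed for a component on a platform.

    Unknown components or platforms yield an empty list.
    """
    return SERVICE_MATRIX.get(component, {}).get(platform, [])


def components_covered_by(service):
    """Return, sorted by name, every component row that lists the service."""
    return sorted({
        component
        for component, platforms in SERVICE_MATRIX.items()
        for names in platforms.values()
        if service in names
    })


if __name__ == "__main__":
    print(services_for("Orchestration", "Open Source"))  # ['Airflow']
    print(components_covered_by("AWS Glue"))  # ['Catalog', 'ETL Tool', 'Lineage']
```

The reverse lookup makes one pattern in the table easy to spot: on AWS, Glue spans the ETL, lineage, and catalog rows, while on Azure, Data Factory covers both orchestration and ETL.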