Mapping AWS, Google Cloud, Azure Services to Big Data Warehouse Architecture

Uli Bethke Cloud, Data Warehouse

The Reference Big Data Warehouse Architecture

Below is a representation of the big data warehouse architecture. I won’t go into the details of the features and components here. If you want to find out more about the gory details, I recommend my training course Big Data for Data Warehouse and BI Professionals.


[Figure: Big Data Warehouse Reference Architecture]

Comparing Big Data Warehouse Services on Azure, Google Cloud, and Amazon AWS

So how do the components of the big data warehouse map to the services and products offered by the three most popular cloud platforms: Microsoft Azure, Google Cloud Platform, and Amazon AWS? A new product or service is launched almost every week, and it can be daunting to keep track of what is going on. The table below maps the various cloud services against the big data warehouse architecture. I have also included a column that lists some open source components to make comparison easier. Please note that this list is by no means exhaustive: there are literally hundreds of open source tools that do similar things, and I have only listed those that I have had some exposure to.

Update: Due to popular demand I have added Oracle's cloud products to the mix. As a reference point I have also added Hortonworks and MapR to the matrix. For Oracle I am only covering what is on offer in the Oracle cloud. Oracle has various on-premise solutions, such as Oracle Stream Analytics, Oracle Enterprise Metadata Management (Catalog and Lineage), and Oracle EDQ for data quality, that are not (yet?) offered in the cloud. Oracle also has some unique products that none of the other vendors can offer: Oracle GoldenGate and, to some extent, Oracle Data Integrator. There is also Big Data Discovery.

Download the full matrix that maps Oracle, Hortonworks, MapR, AWS, Azure, Google Cloud, Open Source to the Big Data Architecture (e-mail required).

| Component | Open Source | Amazon AWS | Microsoft Azure | Google Cloud |
|---|---|---|---|---|
| Batch Ingest | Sqoop; File Transfer; Flume; StreamSets | AWS Data Transfer Services (various options) | Import/Export Service; Data Factory | Cloud Dataflow |
| Streaming Ingest | Flume; StreamSets | Amazon Kinesis Firehose | Event Hubs; IoT Hub | Cloud Dataflow |
| Persistent Storage | HDFS; RDBMS | S3, Glacier; RDS | Storage Blob; HDFS; SQL Database | Persistent Disk; Google Cloud Storage; Cloud SQL |
| Transient Storage | Kafka | Kinesis | Event Hubs; IoT Hub; HDInsight (Kafka) | Cloud Pub/Sub; Cloud IoT Core |
| Batch Processing | Hive; Flink, Spark; MapReduce; PostgreSQL | EMR Spark; EMR Hadoop; EMR Presto; AWS Batch; Redshift | Azure Batch; HDInsight (Spark/MapReduce); SQL Data Warehouse; Data Lake Analytics; Azure Functions | Cloud Dataflow (open source Apache Beam); Cloud Dataproc (Spark, Hadoop) |
| Stream Processing | Flink; Spark; Beam | Amazon Kinesis Streams; Amazon Kinesis Analytics; EMR Spark | Stream Analytics; HDInsight (Storm, Spark) | Cloud Dataflow (open source Apache Beam); Cloud Dataproc (Spark, Hadoop) |
| Machine Learning | Scikit; TensorFlow; Spark MLlib; etc. (huge number of libraries) | Lex; Polly; Rekognition; Amazon Machine Learning | Azure ML; Cognitive Services | Natural Language; Speech; Translation; Vision; Video; ML Engine |
| Serving Storage: Graph | JanusGraph | N/A (Marketplace only, e.g. OrientDB) | N/A (Marketplace only, e.g. OrientDB) | N/A |
| Serving Storage: BI/EDW | Impala + Kudu | Redshift; Athena | SQL Data Warehouse; Analysis Services (OLAP Cubes) | BigQuery |
| Serving Storage: Search (keywords + facets) | Solr | Amazon CloudSearch; Amazon Elasticsearch | Azure Search | N/A (Marketplace, e.g. Solr) |
| Serving Storage: RDBMS | PostgreSQL | RDS | SQL DB | Cloud SQL |
| Serving Storage: NoSQL | HBase | DynamoDB | HDInsight (HBase); CosmosDB | BigTable; Spanner; DataStore |
| Sandboxes: Notebook | Zeppelin | EMR Zeppelin | Azure Notebooks | Cloud Datalab |
| Sandboxes: Data Science or Preparation Platform | Dataiku DSS Community Edition (not open source) | N/A (Marketplace only, e.g. Dataiku DSS) | N/A (Marketplace only, e.g. Dataiku DSS) | Cloud DataPrep (beta). Under the hood this is Trifacta. |
| Clients/Data Apps | Superset (BI) | Quicksight | PowerBI | Google Data Studio |
| Orchestration | Airflow | AWS Data Pipeline | Data Factory | N/A (Marketplace) |
| ETL Tool | N/A | AWS Glue (beta) | Data Factory | N/A (Marketplace) |
| MDM Hub | N/A | N/A (Marketplace) | N/A (Marketplace) | N/A (Marketplace) |
| Lineage | N/A | AWS Glue (beta) | N/A | N/A |
| Catalog | N/A | AWS Glue (beta) | Data Catalog | N/A (Marketplace) |
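
If you want to query the matrix rather than read it, one option is to keep it as a small lookup structure. The following is only a rough sketch in Python: it encodes three rows of the matrix above, and the names SERVICE_MATRIX, services_for and platforms_covering are made up for illustration and not part of any vendor tool. An empty list stands for "N/A" in the matrix.

```python
# Hypothetical excerpt of the matrix: component -> platform -> list of services.
# Only three rows are encoded; values are taken from the matrix above.
# The dictionary keys and function names are illustrative assumptions.
SERVICE_MATRIX = {
    "Streaming Ingest": {
        "Open Source": ["Flume", "StreamSets"],
        "Amazon AWS": ["Amazon Kinesis Firehose"],
        "Microsoft Azure": ["Event Hubs", "IoT Hub"],
        "Google Cloud": ["Cloud Dataflow"],
    },
    "Serving Storage: BI/EDW": {
        "Open Source": ["Impala + Kudu"],
        "Amazon AWS": ["Redshift", "Athena"],
        "Microsoft Azure": ["SQL Data Warehouse", "Analysis Services (OLAP Cubes)"],
        "Google Cloud": ["BigQuery"],
    },
    "Lineage": {
        "Open Source": [],            # N/A in the matrix
        "Amazon AWS": ["AWS Glue (beta)"],
        "Microsoft Azure": [],        # N/A in the matrix
        "Google Cloud": [],           # N/A in the matrix
    },
}


def services_for(component, platform):
    """Return the services listed for a component on a platform.

    An empty list means the matrix has no entry (N/A) or the row is not encoded.
    """
    return SERVICE_MATRIX.get(component, {}).get(platform, [])


def platforms_covering(component):
    """Return the platforms that list at least one service for the component."""
    row = SERVICE_MATRIX.get(component, {})
    return [platform for platform, services in row.items() if services]


if __name__ == "__main__":
    print(services_for("Serving Storage: BI/EDW", "Google Cloud"))
    print(platforms_covering("Lineage"))
```

Extending the dictionary with the remaining rows is mechanical. A nested dictionary keeps the lookup trivial, but a spreadsheet or CSV file would do the same job if you prefer to maintain the matrix outside of code.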