Mapping AWS, Google Cloud, Azure Services to Big Data Warehouse Architecture

June 28, 2017

The Reference Big Data Warehouse Architecture

Below is a representation of the big data warehouse architecture. I won’t go into the details of its features and components here. If you want to find out more about the gory details, I recommend my excellent training course Big Data for Data Warehouse and BI Professionals.

[Figure: Big Data Warehouse Reference Architecture]

Comparing Big Data Warehouse Services on Azure, Google Cloud, and Amazon AWS

So how do the components of the data warehouse map to the various services and products offered by the three most popular cloud platforms: Microsoft Azure, Google Cloud Platform, and Amazon AWS? A new product or service is launched almost every week, and it can get quite daunting to keep track of what is going on. The table below makes it easy to map the various cloud services against the big data architecture. I have also included a column that lists some open source components to make comparison easier. Please note that this list is by no means exhaustive, as there are literally hundreds of open source tools that do similar things. I have only listed those that I have had some exposure to.

Update: By popular request, I have added Oracle’s cloud products to the mix. As a reference point, I have also added Hortonworks and MapR to the matrix. For Oracle, I am only covering what is on offer in the Oracle cloud. Oracle has various on-premises solutions, such as Oracle Stream Analytics, Oracle Enterprise Metadata Management (catalog and lineage), and Oracle EDQ for data quality, that are not (yet?) offered in the cloud. Oracle also has some unique products that none of the other vendors can match: Oracle GoldenGate and, to some extent, Oracle Data Integrator. There is also Big Data Discovery.
Download the full matrix that maps Oracle, Hortonworks, MapR, AWS, Azure, Google Cloud, Open Source to the Big Data Architecture (e-mail required).

| Component | Open Source | Amazon AWS | Microsoft Azure | Google Cloud |
|---|---|---|---|---|
| Batch Ingest | Sqoop; File Transfer | AWS Data Transfer Services (various options) | Import/Export Service; Data Factory | Cloud Dataflow |
| Streaming Ingest | Flume | Amazon Kinesis Firehose | Event Hubs | Cloud Dataflow |
| Persistent Storage | HDFS | S3, Glacier | Storage Blob; SQL Database | Persistent Disk; Google Cloud Storage; Cloud SQL |
| Transient Storage | Kafka | Kinesis | Event Hubs; HDInsight (Kafka) | Cloud Pub/Sub; Cloud IoT Core |
| Batch Processing | Hive; Flink, Spark | EMR Spark; EMR Hadoop; EMR Presto; AWS Batch | Azure Batch; HDInsight (Spark/Map Reduce); SQL Data Warehouse; Data Lake Analytics; Azure Functions | Cloud Dataflow (open source Apache Beam); Cloud Dataproc (Spark, Hadoop) |
| Stream Processing | Flink | Amazon Kinesis Streams; Amazon Kinesis Analytics; EMR Spark | Stream Analytics; HDInsight (Storm, Spark) | Cloud Dataflow (open source Apache Beam); Dataproc (Spark, Hadoop) |
| Machine Learning | Scikit; Spark MLlib; TensorFlow etc.; huge number of libraries | Amazon Machine Learning | Azure ML; Cognitive Services | Natural Language; ML Engine |
| Serving Storage: Graph | JanusGraph | Neptune | CosmosDB | N/A |
| Serving Storage: BI/EDW | Impala + Kudu | Redshift | SQL Data Warehouse; Analysis Services (OLAP Cubes) | |
| Serving Storage: Search (keywords + facets) | Solr | Amazon CloudSearch; Amazon Elasticsearch | Azure Search | N/A Marketplace, e.g. Solr |
| Serving Storage: RDBMS | PostgreSQL | RDS | SQL DB | Cloud SQL |
| Serving Storage: NoSQL | HBase | DynamoDB | HDInsight (HBase) | |
| Sandboxes: Notebook | Zeppelin | EMR Zeppelin | Azure Notebooks | Cloud Datalab |
| Sandboxes: Data Science or Preparation Platform | Dataiku DSS Community Edition (not open source) | N/A Marketplace only, e.g. Dataiku DSS | N/A Marketplace only, e.g. Dataiku DSS | Cloud DataPrep (beta). Under the hood this is Trifacta. |
| Clients/Data Apps | Superset (BI) | QuickSight | Power BI | Google Data Studio |
| Orchestration | Airflow | AWS Data Pipeline | Data Factory | N/A Marketplace |
| ETL Tool | N/A | AWS Glue | Data Factory | N/A Marketplace |
| MDM Hub | N/A | N/A Marketplace | N/A Marketplace | N/A Marketplace |
| Lineage | N/A | AWS Glue | N/A | N/A |
| Catalog | N/A | AWS Glue | Data Catalog | N/A Marketplace |
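
If you want to query the matrix programmatically, for example to see which platforms have no native offering for a given component, a nested dictionary is one simple way to hold it. The snippet below is only a minimal sketch: it copies four rows from the table above, and the names `SERVICE_MATRIX`, `services_for`, and `platforms_without` are made up for illustration. An empty list stands in for an "N/A" cell.

```python
from typing import Dict, List

# Column order as used in the table above.
PLATFORMS = ("Open Source", "Amazon AWS", "Microsoft Azure", "Google Cloud")

# Hypothetical structure: component -> platform -> list of services.
# Only a handful of rows are copied here; an empty list means "N/A".
SERVICE_MATRIX: Dict[str, Dict[str, List[str]]] = {
    "Streaming Ingest": {
        "Open Source": ["Flume"],
        "Amazon AWS": ["Amazon Kinesis Firehose"],
        "Microsoft Azure": ["Event Hubs"],
        "Google Cloud": ["Cloud Dataflow"],
    },
    "Transient Storage": {
        "Open Source": ["Kafka"],
        "Amazon AWS": ["Kinesis"],
        "Microsoft Azure": ["Event Hubs", "HDInsight (Kafka)"],
        "Google Cloud": ["Cloud Pub/Sub", "Cloud IoT Core"],
    },
    "Serving Storage RDBMS": {
        "Open Source": ["PostgreSQL"],
        "Amazon AWS": ["RDS"],
        "Microsoft Azure": ["SQL DB"],
        "Google Cloud": ["Cloud SQL"],
    },
    "Lineage": {
        "Open Source": [],
        "Amazon AWS": ["AWS Glue"],
        "Microsoft Azure": [],
        "Google Cloud": [],
    },
}


def services_for(component: str, platform: str) -> List[str]:
    """Return the services listed for one component on one platform."""
    return SERVICE_MATRIX[component][platform]


def platforms_without(component: str) -> List[str]:
    """Return the platforms that have no entry (N/A) for a component."""
    row = SERVICE_MATRIX[component]
    return [p for p in PLATFORMS if not row[p]]


if __name__ == "__main__":
    print(services_for("Transient Storage", "Microsoft Azure"))
    print(platforms_without("Lineage"))
```

Keeping each cell as a list rather than a single string preserves the cases where one platform offers several services for the same component, which is common in the processing rows.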