I often hear people refer to SQL engines running against HDFS or object storage as MPP. Strictly speaking this is incorrect. Let me first explain what an MPP database is and then explain why engines such as Presto should not be called MPP engines.
In an MPP engine we evenly distribute data across all of the nodes in a cluster. If we have a transaction table such as ORDER with 100 M rows and 10 nodes, then (ignoring data skew) each node will get 10 M rows. When we run a query on our cluster, (typically) all of the nodes take part in processing this data. If we now have a second transaction table, ORDER_ITEM, we can co-locate order items for the same order on the same node by using ORDER_ID as the distribution key. The big advantage of data co-location is that joins between ORDER and ORDER_ITEM are fast, because no data needs to travel across the network. Here is an image that should illustrate the point.
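As a minimal sketch of this idea (with hypothetical data and a deliberately simplified placement rule, not any vendor's actual algorithm), here is what key-based distribution with co-location looks like:

```python
# Toy sketch: distributing two tables by the same distribution key so that
# matching rows land on the same node. Data and placement rule are
# hypothetical; real MPP databases use more sophisticated hashing.
from collections import defaultdict

NUM_NODES = 10

def node_for(distribution_key):
    """Hash-partition: every row with this key value lands on the same node."""
    return distribution_key % NUM_NODES

orders = [{"ORDER_ID": i} for i in range(100)]
order_items = [{"ORDER_ID": i % 100, "ITEM": i} for i in range(500)]

# Place rows from both tables using ORDER_ID as the distribution key.
nodes = defaultdict(list)
for row in orders:
    nodes[node_for(row["ORDER_ID"])].append(("ORDER", row))
for row in order_items:
    nodes[node_for(row["ORDER_ID"])].append(("ORDER_ITEM", row))

# Every node holds 60 rows: 10 orders plus their 50 items. Each order and
# all of its items share a node, so the join runs locally on each node
# with no network transfer.
print({n: len(rows) for n, rows in sorted(nodes.items())})
```

Note that the distribution is even (each node holds the same number of rows) and co-located (an order never sits on a different node than its items), which is exactly the combination that makes MPP joins fast.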
The other important concept that we need to understand in relation to an MPP database is that all nodes take part in the processing when we run a query against the ORDER or ORDER_ITEM table, or join the two together.
So what happens on Hadoop, or more correctly, when we run SQL against HDFS? On HDFS we don’t distribute data at the granular row level across the cluster. Rather, we take big chunks of data and spread them around the cluster based on some internal algorithm that we typically don’t have any control over. We can’t co-locate data.
Neither is there the concept of a distribution key on HDFS. As data is not evenly distributed across all nodes in the cluster, not all of the nodes participate in processing.
SQL on Hadoop
When we use an SQL engine such as Presto or Impala against HDFS, not all of the nodes take part in processing the data. For joins of very large tables there is a high probability that data needs to travel across the network to be joined together. One exception here is running Impala against Kudu storage. Kudu follows the same paradigm as traditional MPP databases: data is spread evenly across the nodes in a cluster using a distribution key.
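The cost of not having co-location can be sketched as a shuffle (repartition) join: each row has to be sent to the node that owns its join-key value before the join can run. The placement and ownership rules below are hypothetical, chosen only to make the data movement countable:

```python
# Toy sketch of a shuffle join: when placement ignores the join key, almost
# every row must cross the network to reach the node that owns its key.

NUM_NODES = 10

def owner(key):
    """Hash-partition: the node responsible for this join-key value."""
    return key % NUM_NODES

# Hypothetical current placement: rows stored in file order, 100 rows per
# node (block-style), with no regard to ORDER_ID.
placed = [(row_id // 100, {"ORDER_ID": row_id % 50}) for row_id in range(1000)]

# Count the rows that must move to the node owning their join key.
moved = sum(1 for node, row in placed if owner(row["ORDER_ID"]) != node)
print(moved, "of", len(placed), "rows must travel across the network")  # 900 of 1000
```

In the co-located MPP layout the equivalent count is zero, which is the whole point of distribution keys.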
What about BigQuery? BigQuery is built on conceptually similar technology to SQL engines on Hadoop. It is also based on a distributed file system, in this case ColossusFS, a proprietary distributed file system developed by Google. You can think of BigQuery as Hadoop SQL on steroids.
What about Exadata? Exadata separates storage cells from compute cells. The data is stored on the storage cells and only transferred to the compute cells when data is queried. Similar to HDFS, data is not co-located. Exadata relies on storage indexes, predicate pushdown, columnar projection, and extensive caching on compute cells to achieve high performance.
I accept that there are other definitions of MPP, e.g. any system that executes SQL in parallel on a distributed system. I disagree with this definition, as it blurs the significant differences between these systems and as a result is less precise. If we want to trace back the, in my opinion, incorrect usage of MPP, we should start with the time when Hadoop vendors began to market Hadoop SQL engines that did not use MapReduce as the execution engine. They probably thought that MPP is associated with fast performance and as a result stuck the label (incorrectly) onto their very own SQL engines.