Why is concurrency overrated to measure performance of data warehouse platforms?

January 11, 2018

The difference between making a good and a bad decision often comes down to the quality of the pre-defined metrics. If the metric is poor, so will be the decision.
When comparing performance between different technologies such as Google BigQuery (based on a distributed file system – Colossus to be precise) and MPP technologies such as Redshift, people tend to list concurrency as a limitation of Redshift.
Unfortunately, concurrency is one of those metrics that really suck. In this blog post I will show you why.

Max concurrency

First things first. What do we actually mean when we talk about concurrency? Concurrency is the number of SQL queries that can run in parallel on a system. When we max out on resources (CPU, memory, and disk) on a database or on some other processing framework such as Hadoop or Spark, we have reached maximum concurrency.

Resource manager

What happens when we max out on resources depends on the resource manager. In a database, the resource manager allocates CPU, memory, and in some systems also disk to requests that users run against the database, e.g. SQL queries.
If the resource manager has been configured to process queries on a first-come, first-served basis (FIFO), then queries are serialized. In other words, once all slots are busy, new queries have to wait until the next slot becomes available. It’s similar to when you are boarding a plane or checking out at the supermarket. If all tills are serving a customer, you have to wait in the queue.

Not all databases provide a FIFO resource manager.

If the resource manager is set up to use a fair scheduler, then all of the queries that have been launched get allocated a fair share of resources. Sounds great. What’s not to like about fair?
Well, in this particular case it will lead to an overall degradation of system performance. To serve all queries at the same time, the CPU cores need to switch back and forth between queries. This generates a huge overhead. It also evicts data from the CPU caches and in some cases from DRAM. That is when things get really slow, as we have to make multiple passes over the same data.
While we have achieved higher concurrency (more queries are served at the same time), the latency of our individual queries increases (in other words, they run slower) and overall system throughput decreases (in other words, we can satisfy fewer queries per unit of time). In the worst case we have gone from a state where some users are unhappy because of long wait times to a state where everyone is unhappy.
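The trade-off between the two scheduling styles can be illustrated with a toy model. The sketch below is not how any particular database schedules work. It assumes identical queries that all arrive at the same moment and compete for a single unit of processing capacity, and the `switch_overhead` parameter is a hypothetical fraction of capacity lost to context switching and cache eviction.

```python
def fifo_latencies(n_queries: int, work_s: float) -> list[float]:
    """Completion time of each query when they run one after another."""
    return [work_s * (i + 1) for i in range(n_queries)]


def fair_share_latencies(
    n_queries: int, work_s: float, switch_overhead: float = 0.0
) -> list[float]:
    """Completion time of each query when all of them share capacity equally.

    switch_overhead is a hypothetical fraction (0 <= x < 1) of capacity
    lost to context switching and cache eviction.
    """
    effective_capacity = 1.0 - switch_overhead
    # Identical queries that progress at the same rate all finish together.
    finish = n_queries * work_s / effective_capacity
    return [finish] * n_queries


def summarize(latencies: list[float]) -> tuple[float, float]:
    """Return (average latency in seconds, throughput in queries per second)."""
    average = sum(latencies) / len(latencies)
    throughput = len(latencies) / max(latencies)
    return average, throughput


if __name__ == "__main__":
    scenarios = {
        "FIFO": fifo_latencies(4, 10.0),
        "fair share, no overhead": fair_share_latencies(4, 10.0),
        "fair share, 20% overhead": fair_share_latencies(4, 10.0, 0.2),
    }
    for name, latencies in scenarios.items():
        average, throughput = summarize(latencies)
        print(f"{name}: avg latency {average:.1f}s, throughput {throughput:.3f} q/s")
```

In this model, fair sharing never improves average latency, and any switching overhead also lowers throughput, which is the effect described above.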

Throughput, latency, concurrency

So how can we increase concurrency on a database? It’s quite simple really. We just need to decrease the latency of our queries (make them run faster). How do we do that? There are various ways:
Write better queries, e.g. by using window functions and looking at explain plans.
Model your data for performance, e.g. by de-normalizing data, using columnar storage and storage indexes, choosing good distribution and sort keys, and understanding usage patterns.
Throw money at the problem by upgrading from an SMP to an MPP database or by adding nodes to your existing MPP database (Redshift, Teradata etc.). For queries that parallelize well, doubling the number of nodes in a cluster cuts latency roughly in half.

How are latency and concurrency related?

This blog post has a great example. Let me go through it:

1800 similar SQL queries are running per hour on the database.
On average that gives us a new query every two seconds.
Each query takes 60 seconds to run on average.
On average we are running 30 concurrent queries.
The database we are using supports 15 queries to run concurrently.
We need to queue 15 queries.

Let’s see what happens to our level of concurrency once we tune our SQL queries.

We still have 1800 similar SQL queries running per hour on the database.
Now our queries execute in 10 seconds.
On average we now just have five concurrent queries.
No queries get serialized and we even have ten spare slots.
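The arithmetic in both scenarios follows Little's law: the average number of queries in flight equals the arrival rate multiplied by the average latency. Here is a minimal sketch of that calculation. The function names are made up for illustration, and it mirrors the simplified averages used above; it does not model how a queue keeps growing while demand stays above capacity.

```python
def average_concurrency(queries_per_hour: float, latency_s: float) -> float:
    """Little's law: average in-flight queries = arrival rate * average latency."""
    arrival_rate_per_s = queries_per_hour / 3600
    return arrival_rate_per_s * latency_s


def slot_balance(
    queries_per_hour: float, latency_s: float, slots: int
) -> tuple[float, float]:
    """Return (queries that must queue, free slots) on average for a slot limit."""
    demand = average_concurrency(queries_per_hour, latency_s)
    queued = max(0.0, demand - slots)
    free = max(0.0, slots - demand)
    return queued, free


if __name__ == "__main__":
    # Before tuning: 1800 queries per hour, 60 seconds each, 15 slots.
    print(average_concurrency(1800, 60), slot_balance(1800, 60, 15))
    # After tuning: same arrival rate, 10 seconds each, 15 slots.
    print(average_concurrency(1800, 10), slot_balance(1800, 10, 15))
```

The arrival rate and the slot limit are the same in both runs. Only the latency changes, and that alone decides whether queries queue.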

Conclusion

Concurrency is a metric derived from latency. You increase the concurrency a system can sustain by decreasing latency.
For Redshift that means writing better queries, adding more nodes, replacing your nodes with dense compute nodes, and using workload management to create queues that separate interactive from batch queries.
While systems that are built on distributed file systems, such as Hadoop and Google BigQuery, can achieve higher levels of concurrency as they don’t evenly split the data across the nodes in the cluster, they tend to show higher latency for a certain class of queries, e.g. joins of very large tables.
When comparing performance, always look at latency and throughput.