SQL antipatterns: SELECT DISTINCT
The DISTINCT operator
The DISTINCT operator is used to eliminate duplicates in a result set, e.g. we can use it to count the unique customers who made a purchase.
We run the following query against the sample data provided with the Snowflake Data Cloud platform.
SELECT DISTINCT SS_CUSTOMER_SK FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
This is a table from the 10 TB TPC-DS sample data that ships with Snowflake. We ran the query with an M virtual warehouse.
The query returns 65M unique customers who made a purchase. It took 220s to run, scanned ~96 GB of data, and spent 27% of its time on Processing, i.e. time spent on data processing by the CPU.
As you can see, DISTINCT is an expensive, CPU- and memory-intensive operation.
SQL DISTINCT algorithms
Under the hood, the database and its cost-based optimizer use various algorithms to identify the unique set of values in a list.
- A simple algorithm runs two nested loops. For every element, it checks if it has appeared before. If it hasn’t then we add the value to the unique list. This is very inefficient.
- We can use sorting to make the algorithm more efficient. First sort the input so that all occurrences of every element are ordered. We can then traverse the sorted input to identify each unique element.
- Another efficient algorithm uses hashing. We traverse the input, storing each value in a hash table; a lookup tells us whether the value has been seen before.
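The three strategies above can be sketched in Python. These are illustrative implementations only, not Snowflake's actual engine code:

```python
def distinct_nested_loops(values):
    """O(n^2): for every element, scan the uniques found so far."""
    unique = []
    for v in values:
        if v not in unique:  # linear scan inside a loop -> quadratic overall
            unique.append(v)
    return unique

def distinct_sort(values):
    """O(n log n): sort first, then keep each element that differs from its predecessor."""
    out = []
    for v in sorted(values):
        if not out or v != out[-1]:
            out.append(v)
    return out

def distinct_hash(values):
    """O(n) expected: a hash table (Python set) tracks values already seen."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

data = [3, 1, 3, 2, 1, 3]
print(distinct_nested_loops(data))  # [3, 1, 2]
print(distinct_sort(data))          # [1, 2, 3]
print(distinct_hash(data))          # [3, 1, 2]
```

Note that the sort-based variant returns the values in sorted order, while the other two preserve the order of first appearance.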
Let’s go through some examples to illustrate this. In this scenario we select the unique values of various columns from the STORE_SALES table:
SELECT DISTINCT <column_name> FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
Column name | Size | Unique records | Cluster Key | Query time | CPU time |
---|---|---|---|---|---|
ss_store_sk | 46 GB | 751 | N | 65s | 4% |
ss_sold_time_sk | 64 GB | 46K | N | 136s | 8% |
ss_item_sk | 53 GB | 403K | Y | 85s | 3% |
ss_hdemo_sk | 54 GB | 7.2K | N | 105s | 5% |
ss_customer_sk | 96 GB | 65M | N | 220s | 27% |
We can see the following patterns:
- The query and CPU times increase with column cardinality (more unique values), e.g. ss_sold_time_sk versus ss_hdemo_sk. For ss_customer_sk, with 65M unique values, the query spends 27% of its time on CPU processing.
- The query and CPU times increase when the column is not part of a cluster key. A cluster key in Snowflake presorts the data: ss_item_sk runs in 85s versus 105s for ss_hdemo_sk, even though ss_item_sk has 56x more unique values (403K versus 7.2K). Despite the much higher cardinality, it executes faster and uses less CPU time.
SQL DISTINCT versus SQL GROUP BY
We can rewrite a SELECT DISTINCT query as a GROUP BY.
Instead of
SELECT DISTINCT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
We can write
SELECT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES" GROUP BY ss_customer_sk;
The two queries generate the exact same explain plan and take the same amount of time to execute.
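We can verify the equivalence of the two forms on a small scale, with SQLite standing in for Snowflake. This is a minimal sketch; the table and data are made up:

```python
import sqlite3

# In-memory toy version of STORE_SALES with duplicate customer keys.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (ss_customer_sk INTEGER)")
conn.executemany("INSERT INTO store_sales VALUES (?)",
                 [(1,), (2,), (1,), (3,), (2,)])

# The DISTINCT form and the GROUP BY form of the same deduplication.
distinct_rows = conn.execute(
    "SELECT DISTINCT ss_customer_sk FROM store_sales ORDER BY 1").fetchall()
group_by_rows = conn.execute(
    "SELECT ss_customer_sk FROM store_sales "
    "GROUP BY ss_customer_sk ORDER BY 1").fetchall()

print(distinct_rows == group_by_rows)  # True
print(distinct_rows)                   # [(1,), (2,), (3,)]
```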
I recommend using the DISTINCT version, as the intent of the query is clearer: GROUP BY tells the reader to expect an aggregation, while DISTINCT tells the reader to expect a unique list of values.
SELECT DISTINCT antipatterns
Apart from being an expensive operation, DISTINCT is also widely misused by SQL developers and analysts.
From my own experience, using DISTINCT signals an issue in 99% of query scenarios, and it is typically associated with these types of problems:
- Data quality issues in the underlying data
- Data quality issues because of a badly designed data model
- Mistakes in the SQL statement, e.g. in a Join or filter
- Incorrect usage of SQL features
DISTINCT and data quality issues
You may have a duplication or multiplication of values in your tables, e.g. the same customer record exists multiple times in your Customer table.
Bad data model
You may get duplicates because the data model has been designed badly, e.g. the Customer table was denormalised to also include one or more email addresses.
CUSTOMER_ID | CUSTOMER_NAME | EMAIL |
---|---|---|
1 | Sonra | [email protected] |
2 | Sonra | [email protected] |
This table has not been normalised properly. As a result you end up with a multiplication of records when you query the customer_name column.
Other mistakes result from the SQL developer misunderstanding the data model. In both transactional and analytics data models we find tables that store an audit trail of changes. In dimensional modelling speak these tables are called Slowly Changing Dimensions Type 2.
Duplicates due to mistakes in SQL
Bad Joins
One of the most common reasons for having duplicates in our results is down to incorrect table Joins. Most of the time this happens when we need to join on a composite key. A composite key is made up of more than one column. Leaving out one of the columns in the Join condition will likely result in a multiplication of records.
For a time zone table we would need to use a composite key to make records unique, e.g. the time zone code IST stands for Irish Standard Time, Israel Standard Time and Indian Standard Time. By adding the UTC offset we can make the records in our time zone table unique.
Code | Name | Offset |
---|---|---|
IST | Indian Standard Time | UTC+05:30 |
IST | Irish Standard Time | UTC+01 |
IST | Israel Standard Time | UTC+02 |
We can join the time zone table to an event table.
SELECT tz.code, ev.event_id FROM time_zone tz JOIN events ev ON (tz.code = ev.code AND tz.offset = ev.offset);
If a SQL developer is not familiar with the data model and only uses the time zone code column in the Join, disaster will strike 🙂
SELECT tz.code, ev.event_id FROM time_zone tz JOIN events ev ON (tz.code = ev.code);
This will result in a multiplication of records and incorrect results.
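The effect is easy to reproduce with SQLite standing in for a real warehouse. A minimal sketch using the time zone table above and a single event:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_zone (code TEXT, name TEXT, offset TEXT);
INSERT INTO time_zone VALUES
  ('IST', 'Indian Standard Time', 'UTC+05:30'),
  ('IST', 'Irish Standard Time',  'UTC+01'),
  ('IST', 'Israel Standard Time', 'UTC+02');
CREATE TABLE events (event_id INTEGER, code TEXT, offset TEXT);
INSERT INTO events VALUES (100, 'IST', 'UTC+01');
""")

# Correct join on the full composite key: exactly one row per event.
good = conn.execute("""
    SELECT ev.event_id, tz.name FROM events ev
    JOIN time_zone tz ON tz.code = ev.code AND tz.offset = ev.offset
""").fetchall()

# Incomplete join on code only: the single event matches all three IST rows.
bad = conn.execute("""
    SELECT ev.event_id, tz.name FROM events ev
    JOIN time_zone tz ON tz.code = ev.code
""").fetchall()

print(len(good))  # 1
print(len(bad))   # 3
```

Adding DISTINCT to the bad query would hide the multiplication rather than fix the Join.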
Bad filters
Another problem is when SQL developers don’t have a good understanding of the data and the correct keys.
The column customer_id in the table below is not the Primary Key. The Primary Key is a compound key made up of customer_id and start_year. Leaving out the appropriate filter, e.g. when looking up the most recent state of a particular customer record, may lead to a multiplication of customer records.
CUSTOMER_ID | CUSTOMER_NAME | TOWN | START_YEAR | CURRENT_IND |
---|---|---|---|---|
1 | Bethke | Heuchlingen | 1973 | N |
1 | Bethke | Aalen | 1987 | N |
1 | Bethke | Dublin | 2001 | Y |
Avoid multiplication of records by applying the correct filter logic.
SELECT customer_name, customer_id FROM customer WHERE current_ind = 'Y';
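A small SQLite sketch of the same Slowly Changing Dimension table shows the difference the filter makes (illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
  customer_id INTEGER, customer_name TEXT, town TEXT,
  start_year INTEGER, current_ind TEXT);
INSERT INTO customer VALUES
  (1, 'Bethke', 'Heuchlingen', 1973, 'N'),
  (1, 'Bethke', 'Aalen',       1987, 'N'),
  (1, 'Bethke', 'Dublin',      2001, 'Y');
""")

# Without a filter, customer 1 comes back three times (one row per history entry).
unfiltered = conn.execute(
    "SELECT customer_name, customer_id FROM customer").fetchall()

# Filtering on the current-record flag returns exactly one row per customer.
filtered = conn.execute(
    "SELECT customer_name, customer_id FROM customer "
    "WHERE current_ind = 'Y'").fetchall()

print(len(unfiltered))  # 3
print(filtered)         # [('Bethke', 1)]
```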
Bad understanding of SQL features
Another common source of multiplication is not correctly applying SQL features.
The window functions row_number, rank and dense_rank produce different results. Using rank or dense_rank instead of row_number may generate a different result with pseudo duplicates. Applying the correct window function would resolve the issue without the need for DISTINCT.
V | ROW_NUMBER | RANK | DENSE_RANK |
---|---|---|---|
a | 1 | 1 | 1 |
a | 2 | 1 | 1 |
a | 3 | 1 | 1 |
b | 4 | 4 | 2 |
c | 5 | 5 | 3 |
c | 6 | 5 | 3 |
d | 7 | 7 | 4 |
e | 8 | 8 | 5 |
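The table above can be reproduced with a small SQLite example. This requires a Python build bundling SQLite 3.25 or later, which is when SQLite gained window functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [('a',), ('a',), ('a',), ('b',), ('c',), ('c',), ('d',), ('e',)])

# ROW_NUMBER numbers every row; RANK leaves gaps after ties; DENSE_RANK does not.
rows = conn.execute("""
    SELECT v,
           ROW_NUMBER() OVER (ORDER BY v) AS rn,
           RANK()       OVER (ORDER BY v) AS rk,
           DENSE_RANK() OVER (ORDER BY v) AS drk
    FROM t
    ORDER BY rn
""").fetchall()

for r in rows:
    print(r)
# ('a', 1, 1, 1) ... ('b', 4, 4, 2) ... ('e', 8, 8, 5)
```

To keep exactly one row per value of v, filter on `rn = 1` (or `rk = 1`) per partition instead of wrapping the result in DISTINCT.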
How do you detect the anti pattern?
You will need to identify SQL statements that use the DISTINCT operator with multiple columns.
Most databases log the history of SQL statements that were run over time in an information schema. You can use our product FlowHigh to parse these SQL statements and identify those queries that use DISTINCT operations with multiple columns.
With FlowHigh you can parse, visualise, optimise, and format SQL. You can also detect SQL anti patterns that violate SQL best practices such as self joins, implicit cross joins etc. FlowHigh comes with a UI and SDK for programmatic access and automation.
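For a rough, standalone check you could scan the query history yourself. The sketch below uses a naive regular expression, a hypothetical helper that is nowhere near as robust as a real SQL parser, to flag statements that apply DISTINCT to more than one column:

```python
import re

# Capture the select list between SELECT DISTINCT and FROM.
# This only handles the simple top-level case: subqueries, CTEs and
# function calls containing commas would need a proper SQL parser.
PATTERN = re.compile(r"SELECT\s+DISTINCT\s+(.*?)\s+FROM\s",
                     re.IGNORECASE | re.DOTALL)

def uses_multi_column_distinct(sql: str) -> bool:
    """Rough heuristic: True if SELECT DISTINCT covers more than one column."""
    m = PATTERN.search(sql)
    if not m:
        return False
    # Count top-level commas as a proxy for the number of columns.
    return m.group(1).count(",") > 0

print(uses_multi_column_distinct("SELECT DISTINCT a FROM t"))     # False
print(uses_multi_column_distinct("SELECT DISTINCT a, b FROM t"))  # True
```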
SELECT DISTINCT best practices
All of the scenarios in the previous section have in common that they produce duplicates in the result set. Whenever you run into duplicate values, resist the urge to reach for DISTINCT. First check that you understand the data model and the data itself:
- Make sure to check your Joins.
- Make sure to apply the correct filters.
- Make sure that you understand the data model and possible data quality limitations.
There are legitimate scenarios for the use of DISTINCT to check the unique values in a particular column. These are mainly SQL queries against a transaction or event table without any joins.