SQL antipatterns:  SELECT DISTINCT

August 9, 2022

The DISTINCT operator

The DISTINCT operator is used to eliminate duplicates in a resultset, e.g. we can use it to identify the unique number of customers who made a purchase.

We run the following query against the Sample Data provided with Snowflake data cloud platform.

This is a table from the 10 TB TPC-DS sample data that ships with Snowflake. We ran the query with an M virtual warehouse.

The query returns 65M unique customers who made a purchase

It took 220s to run this query. It returned ~96GB and it spent 27% of its time on Processing. In other words, time spent on data processing by the CPU.

As you can see DISTINCT is an expensive CPU (and also memory) intensive operation.

SQL DISTINCT algorithms

Under the hood the database and the Cost Based Optimizer use various algorithms to identify a unique set of values in a list.

  • A simple algorithm runs two nested loops. For every element, it checks if it has appeared before. If it hasn’t then we add the value to the unique list. This is very inefficient.
  • We can use sorting to make the algorithm more efficient. First sort the input so that all occurrences of every element are ordered. We can then traverse the sorted input to identify each unique element.
  • Another efficient algorithm would be to use hashing. We can traverse the input and store the unique values in a hash table and then use a lookup.

Let’s go through some examples to illustrate this

In this scenario we select the unique number of various attributes from the STORE_SALES table

Column nameSizeUnique recordsCluster KeyQuery timeCPU time
ss_store_sk46 GB751N65s4%
ss_sold_time_sk64 GB46KN136s8%
ss_item_sk53 GB403KY85s3%
ss_hdemo_sk54 GB7.2KN105s5%
ss_customer_sk96 GB65MN220s27%

We can see the following patterns:

  • The query and CPU time increases for queries where the column cardinality is high (more unique values), e.g. ss_sold_time_sk versus ss_hdemo_sk. For ss_customer_sk where we have 65M unique values the query spends 27% of time on CPU processing.
  • The query and CPU time increases for queries where the column is not part of a cluster key. A cluster key in Snowflake presorts the result, e.g. ss_item_sk runs in 85s versus 105s for ss_hdemo_sk. ss_item_sk has 56x more unique values than ss_hdemo_sk. So despite having a multiple of unique values it executes faster and uses less CPU time.

SQL DISTINCT versus SQL GROUP BY

We can rewrite a SELECT DISTINCT query as GROUP BY.

Instead of

We can write

The two queries generate the exact same explain plan and take the same amount of time to execute.

I recommend using the DISTINCT version as the intent of the query is clearer and more obvious. GROUP BY tells the user to expect an aggregation. DISTINCT tells the user to expect a unique list of values.

SELECT DISTINCT anti patterns.

Apart from being an expensive operation DISTINCT is also widely used incorrectly by SQL developers and analysts.

From my own experience, using DISTINCT is an issue in 99% of query scenarios and typically associated with these types of problems:

  • Data quality issues in the underlying data
  • Data quality issues because of a badly designed data model
  • Mistake in the SQL statement, e.g. Join or Filtering
  • Incorrect usage of SQL features

DISTINCT and data quality issues

You may have a duplication or multiplication of values in your tables, e.g. the same customer record exists multiple times in your Customer table.

Bad data model

You may get duplicates because the data model has been designed badly, e.g. the Customer table was denormalised to also include one or more email addresses.

CUSTOMER_IDCUSTOMER_NAMEEMAIL
1Sonra[email protected]
2Sonra[email protected]

This table has not been normalised properly. As a result you end up with a multiplication of records when you query the customer_name column.

Other mistakes result from the SQL developer misunderstanding the data model. In both transactional and analytics data models we find tables that store an audit trail of changes. In dimensional modelling speak these tables are called Slowly Changing Dimensions Type 2.

Duplicates due to mistakes in SQL

Bad Joins

One of the most common reasons for having duplicates in our results is down to incorrect table Joins. Most of the time this happens when we need to join on a composite key. A composite key is made up of more than one column. Leaving out one of the columns in the Join condition will likely result in a multiplication of records.

For a time zone table we would need to use a composite key to make records unique, e.g. the time zone code IST stands for Irish Summer Time, Israel Standard Time and Indian Standard Time. By adding the UTC offset we can make the records in our time zone table unique.

CodeNameOffset
ISTIndian Standard TimeUTC+05:30
ISTIrish Standard TimeUTC+01
ISTIsrael Standard TimeUTC+02

We can join the time zone table to an event table.

If a SQL developer is not familiar with the data model and only uses the time zone code column in the Join disaster will strike 🙂

This will result in a multiplication of records and incorrect results.

Bad filters

Another problem is when SQL developers don’t have a good understanding of the data and the correct keys.

The column customer_id in the table below is not the Primary Key. The Primary Key is a compound key made up of columns customer_id and start_year. Leaving out a filter on start_year, e.g. to find the most recent state of a particular customer record may lead to a multiplication of customer records.

CUSTOMER_IDCUSTOMER_NAMETOWNSTART_YEARCURRENT_IND
1BethkeHeuchlingen1973N
1BethkeAalen1987N
1BethkeDublin2001Y

Avoid multiplication of records by applying the correct filter logic.

Bad understanding of SQL features

Another common source of multiplication is not correctly applying SQL features.

The window functions row_number, rank and dense_rank produce different results. Using rank or dense_rank instead of row_number may generate a different result with pseudo duplicates. Applying the correct window function would resolve the issue without the need for DISTINCT.

VROW_NUMBERRANKDENSE_RANK
a111
a211
a311
b442
c553
c653
d774
e885

How do you detect the anti pattern?

You will need to identify SQL statements that use the DISTINCT operator with multiple columns.

Most databases log the history of SQL statements that were run over time in an information schema. You can use our product FlowHigh to parse these SQL statements and identify those queries that use DISTINCT operations with multiple columns.

With FlowHigh you can parse, visualise, optimise, and format SQL. You can also detect SQL anti patterns that violate SQL best practices such as self joins, implicit cross joins etc. FlowHigh comes with a UI and SDK for programmatic access and automation.

SELECT DISTINCT best practices

All of the scenarios in the previous section have in common that they produce duplicates in the result set. Whenever you have a situation where you have duplicate values resist the urge to use DISTINCT.

Rather than always using DISTINCT whenever you come across duplicates first check that you understand the data model and the data itself.

  • Make sure to check your Joins.
  • Make sure to apply the correct filters.
  • Make sure that you understand the data model and possible data quality limitations

There are legitimate scenarios for the use of DISTINCT to check the unique values in a particular column. These are mainly SQL queries against a transaction or event table without any joins.