SQL antipatterns: SELECT DISTINCT
The DISTINCT operator
The DISTINCT operator is used to eliminate duplicates in a result set; for example, we can use it to find the number of unique customers who made a purchase.
We run the following query against the sample data provided with the Snowflake Data Cloud platform.
```sql
SELECT DISTINCT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
```
This is a table from the 10 TB TPC-DS sample data set that ships with Snowflake. We ran the query on a Medium (M) virtual warehouse.
The query returns 65M unique customers who made a purchase.
It took 220s to run, processed ~96 GB of data, and spent 27% of its time on Processing, i.e. time spent on data processing by the CPU.
As you can see, DISTINCT is an expensive, CPU and memory intensive operation.
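As an aside, if all you need is the number of unique customers rather than the list itself, you can aggregate instead of returning 65M rows. Snowflake also offers an approximate variant based on HyperLogLog; the exact savings will vary, so treat this as a sketch:

```sql
-- Exact count: still deduplicates, but avoids returning 65M rows to the client
SELECT COUNT(DISTINCT ss_customer_sk)
FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";

-- Approximate count via HyperLogLog, typically much lighter on CPU and memory
SELECT APPROX_COUNT_DISTINCT(ss_customer_sk)
FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
```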
SQL DISTINCT algorithms
Under the hood, the database and its cost-based optimizer use various algorithms to identify the unique set of values in a list.
- A simple algorithm runs two nested loops: for every element it checks whether the value has appeared before, and only if it hasn't is the value added to the unique list. This is very inefficient (quadratic in the input size).
- We can use sorting to make the algorithm more efficient. First sort the input so that all occurrences of every element are adjacent. We can then traverse the sorted input and emit an element only when it differs from its predecessor (see the sketch after this list).
- Another efficient algorithm is hashing. We traverse the input, store each value in a hash table, and use a lookup to detect values we have already seen.
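To make the sort-based idea concrete, here is a purely illustrative sketch that emulates it in SQL: sort the input, then keep a row only when its value differs from the previous row. It skips NULLs for simplicity; in practice the optimizer chooses the strategy for you and you would simply write DISTINCT.

```sql
-- Illustrative only: emulate sort-based deduplication with a window function
SELECT ss_customer_sk
FROM (
    SELECT ss_customer_sk,
           LAG(ss_customer_sk) OVER (ORDER BY ss_customer_sk) AS prev_sk
    FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES"
    WHERE ss_customer_sk IS NOT NULL  -- keep the sketch simple: ignore NULLs
) s
WHERE s.prev_sk IS NULL               -- first row in sort order
   OR s.prev_sk <> s.ss_customer_sk;  -- value changed: new unique element
```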
Let's go through some examples to illustrate this.
In this scenario we select the distinct values of various columns from the STORE_SALES table.
```sql
SELECT DISTINCT <column_name> FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
```
| Column name | Size | Unique records | Cluster key | Query time | CPU time |
| --- | --- | --- | --- | --- | --- |
| ss_store_sk | 46 GB | 751 | N | 65s | 4% |
| ss_sold_time_sk | 64 GB | 46K | N | 136s | 8% |
| ss_item_sk | 53 GB | 403K | Y | 85s | 3% |
| ss_hdemo_sk | 54 GB | 7.2K | N | 105s | 5% |
| ss_customer_sk | 96 GB | 65M | N | 220s | 27% |
We can see the following patterns:
- Query and CPU time increase with column cardinality (more unique values), e.g. ss_sold_time_sk versus ss_hdemo_sk. For ss_customer_sk, with 65M unique values, the query spends 27% of its time on CPU processing.
- Query and CPU time increase when the column is not part of a cluster key. A cluster key in Snowflake presorts the data: ss_item_sk runs in 85s versus 105s for ss_hdemo_sk even though it has 56x more unique values, and it also uses less CPU time.
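If you want to check how well a table is clustered on a given column, Snowflake exposes this through a system function (whether you can run it against the shared sample database depends on your privileges):

```sql
-- Inspect clustering quality for a candidate column
SELECT SYSTEM$CLUSTERING_INFORMATION(
    'SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES', '(ss_item_sk)');
```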
SQL DISTINCT versus SQL GROUP BY
We can rewrite a SELECT DISTINCT query as GROUP BY.
Instead of
```sql
SELECT DISTINCT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";
```
We can write
```sql
SELECT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES" GROUP BY ss_customer_sk;
```
The two queries generate the exact same explain plan and take the same amount of time to execute.
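You can verify this yourself by comparing the plans, e.g. in Snowflake:

```sql
-- Both statements should produce the same Aggregate node in the plan
EXPLAIN USING TEXT
SELECT DISTINCT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES";

EXPLAIN USING TEXT
SELECT ss_customer_sk FROM "SAMPLE_DATA"."TPCDS_SF10TCL"."STORE_SALES" GROUP BY ss_customer_sk;
```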
I recommend using the DISTINCT version, as the intent of the query is clearer: GROUP BY tells the reader to expect an aggregation, while DISTINCT tells the reader to expect a unique list of values.
SELECT DISTINCT antipatterns
Apart from being an expensive operation, DISTINCT is also widely misused by SQL developers and analysts.
In my experience, using DISTINCT is an issue in 99% of query scenarios, and it is typically associated with these types of problems:
- Data quality issues in the underlying data
- Data quality issues because of a badly designed data model
- Mistakes in the SQL statement, e.g. in a Join or Filter
- Incorrect usage of SQL features
DISTINCT and data quality issues
You may have a duplication or multiplication of values in your tables, e.g. the same customer record exists multiple times in your Customer table.
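Before reaching for DISTINCT, measure the problem. A quick check against a hypothetical customer table (names are illustrative) looks like this:

```sql
-- Find customer records that appear more than once
SELECT customer_id, COUNT(*) AS record_count
FROM customer
GROUP BY customer_id
HAVING COUNT(*) > 1;
```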
Bad data model
You may get duplicates because the data model has been designed badly, e.g. the Customer table was denormalised to also include one or more email addresses.
| CUSTOMER_ID | CUSTOMER_NAME | EMAIL |
| --- | --- | --- |
| 1 | Sonra | [email protected] |
| 2 | Sonra | [email protected] |
This table has not been normalised properly. As a result you end up with a multiplication of records when you query the customer_name column.
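A sketch of a properly normalised version (table and column names are illustrative): move the email addresses into their own table so that the customer table holds exactly one row per customer.

```sql
-- One row per customer
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR
);

-- One row per email address, potentially many per customer
CREATE TABLE customer_email (
    customer_id INTEGER REFERENCES customer (customer_id),
    email       VARCHAR
);
```

With this design, querying customer_name no longer multiplies records and no DISTINCT is needed.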
Other mistakes result from the SQL developer misunderstanding the data model. In both transactional and analytics data models we find tables that store an audit trail of changes. In dimensional modelling speak these tables are called Slowly Changing Dimensions Type 2.
Duplicates due to mistakes in SQL
Bad Joins
One of the most common reasons for having duplicates in our results is down to incorrect table Joins. Most of the time this happens when we need to join on a composite key. A composite key is made up of more than one column. Leaving out one of the columns in the Join condition will likely result in a multiplication of records.
For a time zone table we would need to use a composite key to make records unique, e.g. the time zone code IST stands for Indian Standard Time, Irish Standard Time and Israel Standard Time. By adding the UTC offset we can make the records in our time zone table unique.
| Code | Name | Offset |
| --- | --- | --- |
| IST | Indian Standard Time | UTC+05:30 |
| IST | Irish Standard Time | UTC+01:00 |
| IST | Israel Standard Time | UTC+02:00 |
We can join the time zone table to an event table.
```sql
SELECT tz.code, ev.event_id
FROM time_zone tz
JOIN events ev
  ON (tz.code = ev.code AND tz.offset = ev.offset);
```
If a SQL developer is not familiar with the data model and only uses the time zone code column in the Join, disaster will strike 🙂
```sql
-- The tz.offset = ev.offset condition is missing from the Join
SELECT tz.code, ev.event_id
FROM time_zone tz
JOIN events ev
  ON (tz.code = ev.code);
```
This will result in a multiplication of records and incorrect results: every event with code IST now matches all three IST rows in the time zone table.
Bad filters
Another problem arises when SQL developers don't have a good understanding of the data and the correct keys.
The column customer_id in the table below is not the Primary Key. The Primary Key is a compound key made up of the columns customer_id and start_year. Leaving out a filter on start_year or current_ind, e.g. when looking for the most recent state of a particular customer record, may lead to a multiplication of customer records.
| CUSTOMER_ID | CUSTOMER_NAME | TOWN | START_YEAR | CURRENT_IND |
| --- | --- | --- | --- | --- |
| 1 | Bethke | Heuchlingen | 1973 | N |
| 1 | Bethke | Aalen | 1987 | N |
| 1 | Bethke | Dublin | 2001 | Y |
Avoid multiplication of records by applying the correct filter logic.
```sql
SELECT customer_name, customer_id
FROM customer
WHERE current_ind = 'Y';
```
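If the table had no current_ind flag, you could still select the latest record per customer explicitly, e.g. with Snowflake's QUALIFY clause (a sketch, assuming start_year identifies the latest version):

```sql
-- Keep only the most recent record per customer
SELECT customer_name, customer_id
FROM customer
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY start_year DESC) = 1;
```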
Bad understanding of SQL features
Another common source of multiplication is incorrectly applied SQL features.
The window functions row_number, rank and dense_rank produce different results. Using rank or dense_rank where row_number is intended may generate a different result with pseudo-duplicates. Applying the correct window function resolves the issue without the need for DISTINCT, as the example after the table shows.
| V | ROW_NUMBER | RANK | DENSE_RANK |
| --- | --- | --- | --- |
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
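The table above can be reproduced with a query along these lines, assuming a hypothetical single-column table t(v) containing the values a, a, a, b, c, c, d, e:

```sql
-- Compare the three ranking functions side by side
SELECT v,
       ROW_NUMBER() OVER (ORDER BY v) AS row_num,    -- unique sequence: 1..8
       RANK()       OVER (ORDER BY v) AS rnk,        -- ties share a rank, gaps follow
       DENSE_RANK() OVER (ORDER BY v) AS dense_rnk   -- ties share a rank, no gaps
FROM t
ORDER BY v;
```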
How do you detect the antipattern?
You will need to identify SQL statements that use the DISTINCT operator with multiple columns.
Most databases log the history of SQL statements that were run over time in an information schema. You can use our product FlowHigh to parse these SQL statements and identify the queries that use DISTINCT operations with multiple columns.
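A crude first pass without any parsing is to search the query history for the keyword, e.g. in Snowflake (assuming you have access to the ACCOUNT_USAGE share):

```sql
-- Rough keyword match; a SQL parser such as FlowHigh gives far fewer false positives
SELECT query_id, query_text
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%select distinct%'
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP());
```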
With FlowHigh you can parse, visualise, optimise, and format SQL. You can also detect SQL antipatterns that violate SQL best practices, such as self joins, implicit cross joins, etc. FlowHigh comes with a UI and an SDK for programmatic access and automation.
SELECT DISTINCT best practices
All of the scenarios in the previous section have one thing in common: they produce duplicates in the result set. Whenever you run into duplicate values, resist the urge to reach for DISTINCT.
Instead, first check that you understand the data model and the data itself.
- Make sure to check your Joins.
- Make sure to apply the correct filters.
- Make sure that you understand the data model and possible data quality limitations.
There are legitimate scenarios for the use of DISTINCT to check the unique values in a particular column. These are mainly SQL queries against a transaction or event table without any joins.